28 terms

cases

objects described by a set of data;

examples: customers, companies, subjects in a study, units in an experiment, etc.

label

special variable used in some data sets to distinguish the different cases

variable

a characteristic of a case

Different cases can have different ______ of the variables.

values

What does the distribution of a variable tell us?

what values it takes and how often it takes these values

spreadsheet

very useful for doing simple computations;

-> you can type in a formula and have the same computation performed for each row

The key characteristics of a data set answer the questions: ______ , _______ and _______.

Who? What? Why?

two basic strategies of exploratory data analysis

1. examine each variable by itself; then move on to study the relationships among the variables

2. begin with a graph; then add numerical summaries of specific aspects of the data

distribution of a categorical variable

lists the categories and gives either the

- count

- or the percent

of cases that fall in each category
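
A minimal Python sketch of this definition. The data and the function name categorical_distribution are made up for illustration and are not from the source:

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, 100.0 * k / n) for cat, k in counts.items()}

# Hypothetical data: the color recorded for each of 8 cases.
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
dist = categorical_distribution(colors)

for category, (count, percent) in sorted(dist.items()):
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

The percents sum to 100 because every case falls in exactly one category.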

tails

the extreme values of a distribution;

high values are in the upper, or right, tail

low values are in the lower, or left tail

How to examine a distribution:

1. look for the overall pattern and for striking deviations from that pattern

2. describe the overall pattern of a distribution by its shape, center and spread

3. look for outliers

time plot

plots each observation against the time at which it was measured

resistant measure

a measure that resists the influence of extreme observations;

-> its value does not respond strongly to changes in a few observations

Why is the mean a weak measure of center?

mean is sensitive to the influence of extreme observations;

it is not a resistant measure
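
A small Python sketch of this point, using made-up observations: replacing one value with an extreme one moves the mean a lot but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical observations; the second list swaps the largest value for an outlier.
data = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # the mean is pulled toward the outlier
print("median:", median_before, "->", median_after)  # the median does not change
```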

robust measure

synonym for resistant measure

How to find the median:

1. arrange all observations in order of size, from smallest to largest

2. if the number of observations n is odd, the median M is the center observation in the ordered list

-> find the location of the median by counting (n+1)/2 observations up from the bottom of the list

3. if the number of observations n is even, the median M is the mean of the two center observations in the ordered list
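
The steps above can be sketched in Python as follows. The function name find_median and the sample lists are illustrative assumptions, not part of the source:

```python
def find_median(observations):
    """Median by the sort-and-count rule: center value, or mean of the two center values."""
    ordered = sorted(observations)          # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    if n % 2 == 1:
        # step 2: n is odd, so the median sits at location (n+1)/2, counted from 1
        location = (n + 1) // 2
        return ordered[location - 1]        # convert to a 0-based index
    # step 3: n is even, so average the two center observations
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(find_median([5, 1, 3]))       # odd n: the center observation
print(find_median([4, 1, 3, 2]))    # even n: mean of the two center observations
```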

Name the two most common measures of the center of a distribution.

mean and median

Compare the mean and the median:

- of a symmetric distribution

- of a skewed distribution

symmetric: the mean and median are close together (exactly equal if the distribution is exactly symmetric)

skewed: the mean is farther out on the long tail than is the median

variance s^2

the average of the squares of the deviations of the observations from their mean, where the sum of squared deviations is divided by n-1 rather than n

standard deviation s

the square root of the variance s^2

degrees of freedom

the number n-1, the divisor used when computing the variance s^2 and the standard deviation s
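
A minimal Python sketch of the variance and standard deviation, dividing by the degrees of freedom n-1. The function names and the data are illustrative assumptions:

```python
import math

def sample_variance(xs):
    """s^2: sum of squared deviations from the mean, divided by n-1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """s: the square root of the variance s^2."""
    return math.sqrt(sample_variance(xs))

# Hypothetical observations with mean 5; the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))   # 32 divided by n-1 = 7
print(sample_std(data))
print(sample_std([3, 3, 3]))   # no spread, so s is 0
```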

properties of the standard deviation

- s measures spread about the mean and should be used only when the mean is chosen as the measure of center

- s=0 only when there is no spread

-> happens only when all observations have the same value

-> as the observations become more spread about their mean, s gets larger

- s, like the mean, is not resistant

-> a few outliers can make s very large

linear transformation

changes the original variable x into the new variable x_new given by an equation of the form:

x_new = a + bx

-> adding the constant a shifts all values of x upward or downward by the same amount

-> such a shift changes the origin (zero point) of the variable

-> multiplying by the positive constant b changes the size of the unit of measurement
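
A short Python sketch of a linear transformation, using the Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) as an assumed example with made-up temperatures. It also checks numerically that the mean is shifted and rescaled while the standard deviation is only rescaled by b:

```python
from statistics import mean, stdev

def linear_transform(xs, a, b):
    """Apply x_new = a + b*x to every observation."""
    return [a + b * x for x in xs]

# Hypothetical temperatures in degrees Celsius.
celsius = [0.0, 10.0, 20.0, 30.0]
fahrenheit = linear_transform(celsius, a=32.0, b=1.8)

print(fahrenheit)
print(mean(celsius), "->", mean(fahrenheit))     # new mean is a + b * old mean
print(stdev(celsius), "->", stdev(fahrenheit))   # new s is b * old s; a has no effect
```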

density curve

a curve that:

- is always on or above the horizontal axis

- has area exactly 1 underneath it

-> describes the overall pattern of a distribution

-> the area under the curve and above any range of values is the proportion of all observations that fall in that range
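
A Python sketch of the two defining properties, assuming a made-up triangular density f(x) = x/2 on the interval from 0 to 2. The helper area_under is a simple trapezoid-rule approximation written for this example, not a library function:

```python
def triangular_density(x):
    """Hypothetical density: rises linearly from 0 at x=0 to 1 at x=2, and is 0 elsewhere."""
    return x / 2 if 0 <= x <= 2 else 0.0

def area_under(curve, low, high, steps=1000):
    """Approximate the area under curve between low and high with the trapezoid rule."""
    width = (high - low) / steps
    total = 0.5 * (curve(low) + curve(high))
    for i in range(1, steps):
        total += curve(low + i * width)
    return total * width

total_area = area_under(triangular_density, 0, 2)          # the whole curve
proportion_1_to_2 = area_under(triangular_density, 1, 2)   # share of observations between 1 and 2

print(total_area)
print(proportion_1_to_2)
```

The curve never drops below the horizontal axis, the total area comes out at 1, and the area above the range 1 to 2 is the proportion of observations in that range (0.75 for this curve).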

Why are Normal distributions so important for statistics?

1. they are good descriptions for some distributions of real data

2. Normal distributions are good approximations to the results of many kinds of chance outcomes (such as tossing a coin many times)

3. many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions

68-95-99.7 rule

in the Normal distribution with mean µ and standard deviation σ:

- approx. 68% of the observations fall within one standard deviation of µ

- approx. 95% fall within two standard deviations of µ

- approx. 99.7% fall within three standard deviations of µ
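
The rule can be checked numerically with Python's standard-library NormalDist class. The helper name proportion_within and the example values µ = 100, σ = 15 are assumptions made for illustration:

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # The proportions do not depend on which mu and sigma are chosen.
    print(k, round(proportion_within(k, mu=100.0, sigma=15.0), 4))
```

The three proportions round to roughly 0.68, 0.95 and 0.997, which is where the rule gets its name.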

z-score or standardized value

if x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is:

z = (x - µ) / σ
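
A one-function Python sketch of the formula. The function name z_score and the example values (µ = 100, σ = 15) are illustrative assumptions:

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
print(z_score(130, mu=100, sigma=15))   # two standard deviations above the mean
print(z_score(85, mu=100, sigma=15))    # one standard deviation below the mean
```

A positive z-score means the observation is above the mean, a negative one means it is below.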

standard Normal distribution

Normal distribution N(0,1) with mean 0 and standard deviation 1