Numerically summarizing data

### 3.1 Measures of Central Tendency: Mean

- The average.

- Used when data are quantitative and the frequency distribution is roughly symmetric.

- The sum of all the values of the variable (in the data set), divided by the number of values / observations. (Another way of saying it: add up all the observations and divide by the number of observations)

- Population mean (μ, "mu") = a parameter

- Sample mean (x-bar) = a statistic
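
The "add up and divide" definition above can be sketched in a few lines of Python. The function name `mean` and the `scores` data are made up for illustration; the arithmetic is the same whether the values are a population (giving μ) or a sample (giving x-bar).

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)


scores = [82, 77, 90, 71, 80]  # hypothetical data, not from the textbook
print(mean(scores))  # 400 / 5 = 80.0
```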

Calculate Mean using calculator:

- Stat, Select #1, Enter

- Enter data in L1

- Press Stat, highlight Calc, and select #1: 1-Var Stats, Enter

- Press 2nd then 1 to insert L1 under "List" and press Enter

- The mean is shown as x-bar, whether the data are a population or a sample

### 3.1 Measures of Central Tendency: Median

- The data value "in the middle"

- Used when the data are quantitative and the frequency distribution is skewed left or skewed right.

- The median divides the bottom 50% of the data from the top 50%

- Computation: Arrange data in ascending order and divide the data set in half

- If number of observations is odd, then the median is the data value exactly in the middle.

- If the number of observations is even, then the median is the mean of the two middle observations in the data set. (add the two middle observations together and divide by 2)
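
The odd/even rule above can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up data, not a replacement for the calculator steps.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)  # arrange the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd n: the value exactly in the middle
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values


print(median([5, 1, 9]))     # sorted: 1, 5, 9 -> 5
print(median([5, 1, 9, 3]))  # sorted: 1, 3, 5, 9 -> (3 + 5) / 2 = 4.0
```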

### 3.1 Measures of Central Tendency: Mode

- The most frequently occurring value

- Can be applied to both quantitative and qualitative data

- A data set can have no mode, one mode, or more than one mode.
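
A rough sketch of finding the mode(s) by counting. It assumes the common convention that a data set in which no value repeats has no mode; the function name `modes` and the data are illustrative only.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency (empty list if no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([2, 3, 3, 5, 5, 7]))        # two modes: [3, 5]
print(modes(["red", "blue", "red"]))    # works for qualitative data: ['red']
print(modes([1, 2, 3]))                 # no mode: []
```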

### 3.1 Measures of Central Tendency: Resistant Statistic

A resistant statistic is a numerical summary (statistic or parameter) whose value is not substantially affected by extreme values.

- The Mean is not resistant since it is affected by extreme values / outliers

- The Median is resistant since it is not substantially affected by extreme values / outliers

- We tend to use the mean as the best measure of central tendency when the distribution is symmetric

- We tend to use the median for skewed data (left or right)
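
To see resistance in action, the sketch below changes one value in a small made-up data set to an extreme value and compares the two measures, using Python's standard `statistics` module.

```python
import statistics

typical = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]  # same data, but the largest value is now extreme

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))

# The mean jumps from 14 to 50, while the median stays at 13.
```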

### 3.2 Measures of dispersion

dispersion = the degree to which the data are spread out.

Two data sets might have the same mean, but one might be more dispersed / spread out than the other.

- Numerical measures of dispersion quantify the spread of the data:

1. Range

2. Variance

3. Standard deviation

### 3.2 Measures of dispersion: 1. Range

The range (R) of a variable is the difference between the largest data value and the smallest data value.

R = largest data value - smallest data value

Range is not resistant since it is affected by extreme values. It only uses 2 values in the data set.
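
A minimal sketch of the range calculation, with made-up data. The second call shows why the range is not resistant: changing only the largest value changes R.

```python
def data_range(values):
    """Range R = largest data value - smallest data value."""
    if not values:
        raise ValueError("range requires at least one observation")
    return max(values) - min(values)


print(data_range([4, 8, 15, 16, 23, 42]))   # 42 - 4 = 38
print(data_range([4, 8, 15, 16, 23, 420]))  # 420 - 4 = 416
```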

### 3.2 Measures of dispersion: 2. Variance

The variance is a measure of the variability (dispersion) of the data. It is found by taking the difference between each value and the mean, squaring those differences, and averaging the squared differences (dividing by N for a population, or by n - 1 for a sample).

- How far, on average, each observation is from the mean. Variance is based on the deviation about the mean.

- The sum of all deviations about the mean = 0. Because the sum of deviations about the mean is zero, we cannot use the average deviation about the mean as a measure of spread, thus we use the average squared deviation.

- Population variance: see flashcard formula

- Sample variance: see flashcard formula
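
The flashcard formulas are not reproduced in these notes. For reference, the standard textbook definitions are usually written as below, where N is the population size, n is the sample size, μ is the population mean, and x-bar is the sample mean. The notation on the flashcards may differ slightly.

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$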

### 3.2 Measures of dispersion: 3. Standard deviation

- Population standard deviation (sigma): square root of the population variance

- Sample standard deviation (s): square root of the sample variance

- A larger standard deviation means the data is more spread out.

- Standard deviation is not resistant

- The standard deviation is used in conjunction with the mean to numerically describe distributions that are bell shaped and symmetric. The mean measures the center of the distribution, while the standard deviation measures the spread of the distribution.
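
The sketch below computes the variance and standard deviation both ways, assuming the usual definitions (divide by N for a population, by n - 1 for a sample). The function names and data are illustrative.

```python
import math


def population_variance(values):
    """Average squared deviation about the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data; mean is 5, squared deviations sum to 32

pop_var = population_variance(data)
samp_var = sample_variance(data)
print(pop_var, math.sqrt(pop_var))    # 32 / 8 = 4.0, standard deviation 2.0
print(samp_var, math.sqrt(samp_var))  # 32 / 7 is about 4.57, standard deviation about 2.14
```

The standard deviation is just the square root of the matching variance, so it is reported in the same units as the data.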

### 3.2 Measures of dispersion: The Empirical Rule

For an approximately bell-shaped (normal) distribution, a known approximate percentage of the data lies within 1, 2, and 3 standard deviations of the mean.

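Within 1 standard deviation: 34% on either side of the mean, for a total of about 68%

Within 2 standard deviations: a further 13.5% on either side, for a total of about 95%

Within 3 standard deviations: a further 2.35% on either side, for a total of about 99.7%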

See chart on p139
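
As a quick illustration, the sketch below turns a mean and standard deviation into the three Empirical Rule intervals, using the usual approximate totals of 68%, 95%, and 99.7%. The function name and the example values (mean 100, standard deviation 15) are made up, and the rule only applies if the distribution is roughly bell shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower bound, upper bound, approximate % of data inside)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}


for k, (low, high, pct) in empirical_rule_intervals(100, 15).items():
    print(f"within {k} sd: {low} to {high}, about {pct}% of the data")
# within 1 sd: 85 to 115; within 2 sd: 70 to 130; within 3 sd: 55 to 145
```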

### 3.3 Measures of central tendency and dispersion from grouped data

Grouped data: data that is aggregated into classes.

Measures of central tendency and dispersion for grouped data:

Mean, variance, standard deviation and weighted average.

### 3.3 Measures of central tendency and dispersion from grouped data: Mean

Mean for grouped data: take the midpoint of each class (as an approximation of the values in that class) and multiply it by the class frequency. Sum these products and divide by the sum of the frequencies.

Calculate each class midpoint by adding the lower class limits of two consecutive classes and dividing the result by 2.

In calculator:

- Stat - #1 (Edit)

- Enter the class midpoints in L1

- Enter the class frequencies in L2

- Stat - Calc - 1-Var Stats

- List: L1

- FreqList: L2

- Enter
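
The same calculation can be sketched in Python. Here `lower_limits` is assumed to hold one more entry than `frequencies`, because the midpoint of the last class needs the lower limit of the class that would follow it. The names and the frequency table are made up for illustration.

```python
def grouped_mean(lower_limits, frequencies):
    """Approximate the mean of grouped data from class midpoints and frequencies."""
    if len(lower_limits) != len(frequencies) + 1:
        raise ValueError("need one more lower class limit than frequencies")
    # Midpoint = (lower limit of this class + lower limit of the next class) / 2
    midpoints = [(a + b) / 2 for a, b in zip(lower_limits, lower_limits[1:])]
    total = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total


# Classes 0-9, 10-19, 20-29, 30-39 with frequencies 3, 7, 8, 2
print(grouped_mean([0, 10, 20, 30, 40], [3, 7, 8, 2]))
# midpoints 5, 15, 25, 35 -> (15 + 105 + 200 + 70) / 20 = 19.5
```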

### 3.3 Measures of central tendency and dispersion from grouped data: Weighted Average / Weighted Mean

Weighted average/mean (eg. GPA) is like a distribution of grouped data but the frequencies are replaced by "weights"

Multiply each value of the variable by its corresponding weight, sum these products, and divide the result by the sum of the weights.

GPA: A=4, B=3, C=2, D=1, F=0

GPA = [(credits) x (grade point value) + (credits) x (grade point value) + ...] divided by the sum of all credits
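
A small sketch of the GPA calculation as a weighted mean, where the grade point values are the data and the credits are the weights. The course list is hypothetical.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical semester: (letter grade, credits)
courses = [("A", 3), ("B", 4), ("C", 3)]
points = [GRADE_POINTS[grade] for grade, _ in courses]
credit_hours = [hours for _, hours in courses]

print(weighted_mean(points, credit_hours))  # (4*3 + 3*4 + 2*3) / 10 = 3.0
```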

### 3.4 Measures of position and outliers: z-score

We use the z-score (NB for test) as a measure of how many standard deviations a data point lies from the mean.

Subtract the mean from the data value, and divide the result by the standard deviation.

There is a population z-score (using mu and sigma) and a sample z-score (using x-bar and s).

The z-score gives us a relative measure of how far a data value is from the mean. If a data value is larger than the mean, the z-score will be positive. If a data value is smaller than the mean, the z-score will be negative. If the data value equals the mean, the z-score will be 0.
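
A minimal sketch of the z-score calculation. The same function works for a population (pass μ and sigma) or a sample (pass x-bar and s); the example numbers are made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


print(z_score(130, 100, 15))  # above the mean: 2.0
print(z_score(85, 100, 15))   # below the mean: -1.0
print(z_score(100, 100, 15))  # equal to the mean: 0.0
```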

### 3.4 Measures of position and outliers: Percentile

The kth percentile is the data value such that k% of the data are less than or equal to that value.

See example 2 on p156

### 3.4 Measures of position and outliers: Quartiles

Quartiles divide data sets into fourths, or four equal parts.

- Q1 divides the bottom 25% from the top 75% (Q1 = 25%ile)

- Q2 divides lower half from upper half (don't need to know since we use median)

- Q3 divides bottom 75% from the top 25% (Q3 = 75%ile)

### 3.4 Measures of position and outliers: Interquartile range (IQR)

Interquartile range (IQR) is another measure of dispersion. IQR is the range of the middle 50% of the observations in the data set. The more spread out the data is, the higher the interquartile range will be.

IQR = Q3-Q1

eg. Interpretation: IQR = 256.9

256.9 crimes per 100,000 population; the middle 50% of all observations have a range of 256.9 crimes per 100,000 population

### 3.4 Measures of position and outliers: lower fence & upper fence

Lower fence = Q1 - 1.5(IQR)

Upper fence = Q3 + 1.5(IQR)

The fences are used to determine if there are outliers. An outlier is a data point to the left of the lower fence (less than the lower fence) or to the right of the upper fence (more than the upper fence).
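
The fence check can be sketched as below. Quartile conventions vary between textbooks and software; this sketch assumes the "median of each half" method (the overall median is left out of both halves when n is odd), which may not match every calculator or package. The function names and data are made up.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return _median(lower), _median(upper)


def find_outliers(values):
    """Return (lower fence, upper fence, list of values outside the fences)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


data = [1, 3, 4, 5, 6, 7, 8, 9, 30]
print(quartiles(data))      # Q1 = 3.5, Q3 = 8.5, so IQR = 5.0
print(find_outliers(data))  # fences -4.0 and 16.0; 30 is an outlier
```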