Statistics Test #4
Describe the location of an individual within a distribution (percentile/relative frequency graph/z score/density curves/normal curves)
The pth percentile of a distribution is the value with p percent of the observations less than it.
*Don't count the observation itself
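A minimal Python sketch of this definition. The function name `percentile_of` and the scores are hypothetical, made up for illustration.

```python
def percentile_of(value, data):
    """Percent of observations strictly less than value (the value itself is not counted)."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical test scores
scores = [62, 70, 74, 74, 79, 81, 85, 88, 90, 95]

print(percentile_of(85, scores))  # 60.0 -> a score of 85 is at the 60th percentile
```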
Relative frequency
Divide the count in each class by the total and multiply by 100 to convert to a percent
Cumulative frequency
Add the counts in the frequency column for the current class and all classes with smaller values of the variable
Cumulative relative frequency
Divide the cumulative frequency counts by the total and multiply by 100 to convert to a percent
Cumulative relative frequency graph
Group observations into equal-width classes. The x-axis shows the values of the variable and the y-axis shows the cumulative relative frequencies. Plot a point corresponding to the cumulative relative frequency in each class at the smallest value of the next class. Start the graph at a height of 0% at the smallest value of the first class. The last point we plot should be at a height of 100%. Connect consecutive points with line segments to form the graph.
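The frequency calculations and the points of the graph can be sketched in Python as follows. The classes and counts are hypothetical, and the variable names are chosen only for this illustration.

```python
from itertools import accumulate

# Hypothetical equal-width classes: (smallest value of class, count)
classes = [(40, 2), (50, 5), (60, 8), (70, 4), (80, 1)]
width = 10

counts = [count for _, count in classes]
total = sum(counts)

rel_freq = [100 * c / total for c in counts]        # relative frequency (%)
cum_freq = list(accumulate(counts))                 # cumulative frequency
cum_rel_freq = [100 * c / total for c in cum_freq]  # cumulative relative frequency (%)

# Start at 0% at the smallest value of the first class, then plot each class's
# cumulative percent at the smallest value of the next class.
points = [(classes[0][0], 0.0)] + [
    (low + width, pct) for (low, _), pct in zip(classes, cum_rel_freq)
]

print(cum_rel_freq)  # [10.0, 35.0, 75.0, 95.0, 100.0]
print(points[-1])    # (90, 100.0)
```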
Uses of cumulative relative frequency graph
1.) Describe the position of an individual within a distribution
2.) Locate the specified percentile of the distribution
Relationship between percentiles and quartiles
Median (second quartile) corresponds to 50th percentile. Q1 (first quartile) corresponds to 25th percentile. Q3 (third quartile) corresponds to 75th percentile.
Standardized value/z score (How to calculate)
If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is: z = (x - mean)/standard deviation
Standardized value/z score (Meaning)
A z-score tells us how many standard deviations from the mean an observation falls, and in what direction. Observations larger than the mean have positive z-scores. Observations smaller than the mean have negative z-scores.
Z-scores for comparisons
We can use z-scores to compare the position of individuals in different distributions. We standardize observations to express them on a common scale. (relative standing)
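A short Python sketch of a z-score comparison. The two tests and their means and standard deviations are hypothetical numbers chosen for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean (sign gives direction)."""
    return (x - mean) / sd

# Hypothetical: 86 on a statistics test (mean 80, sd 6)
# versus 82 on a chemistry test (mean 76, sd 4)
z_stats = z_score(86, 80, 6)
z_chem = z_score(82, 76, 4)

print(z_stats)  # 1.0
print(z_chem)   # 1.5 -> better relative standing, despite the lower raw score
```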
What effect transformations (adding, subtracting, multiplying, dividing) have on the shape, center and spread of the entire distribution.
Effect of adding or subtracting a constant
Adding or subtracting the same number a to each observation
- Adds/subtracts a to measures of center and location (mean, median, quartiles, percentiles)
- Does not change the shape of the distribution or measures of spread (range, IQR, standard deviation)
*Percentile values not percents
Effect of multiplying or dividing by a constant
Multiplying or dividing each observation by the same number b
- Multiplies/divides measures of center and location (mean, median, quartiles, percentiles) by b
- Multiplies/divides measures of spread (range, IQR, standard deviation) by |b| but does not change the shape of the distribution
- Cannot have negative amount of variability
*Percentile values not percents
Effect of transformations on location in distribution
Adding or subtracting a constant, or multiplying or dividing by a positive constant, does not change an individual data value's location within a distribution (it stays at the same percentile/quartile)
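The effects of adding a constant a and multiplying by a constant b can be checked on a small data set. This is only a sketch: the data and the constants (a = 10, b = -3) are hypothetical.

```python
from math import isclose
from statistics import mean, median, pstdev

data = [2, 4, 4, 6, 9]                # hypothetical observations
shifted = [x + 10 for x in data]      # add a = 10 to each observation
scaled = [x * -3 for x in data]       # multiply each observation by b = -3

def data_range(values):
    return max(values) - min(values)

# Adding a constant moves center and location but leaves spread alone
print(mean(data), mean(shifted))              # 5 15
print(data_range(data), data_range(shifted))  # 7 7

# Multiplying by b multiplies the center by b and the spread by |b|
print(mean(scaled))                           # -15
print(data_range(scaled))                     # 21
print(isclose(pstdev(scaled), 3 * pstdev(data)))  # True
```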
Connecting transformations and z-scores; how does standardizing every data value affect distribution (shape/center/spread)
- Shape: The shape of the distribution does not change
- Center: The mean is now 0
- Spread: The standard deviation is now 1
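A quick numerical check of the center and spread claims, using hypothetical data. The sketch assumes the population standard deviation (`pstdev`) is used both to standardize and to measure the result.

```python
from math import isclose
from statistics import mean, pstdev

data = [2, 4, 4, 6, 9]  # hypothetical observations
mu, sigma = mean(data), pstdev(data)

# Standardize every value: subtract the mean, then divide by the standard deviation
z_scores = [(x - mu) / sigma for x in data]

print(isclose(mean(z_scores), 0, abs_tol=1e-12))  # True: center is now 0
print(isclose(pstdev(z_scores), 1))               # True: spread is now 1
```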
How to explore quantitative data (one variable; numerically and graphically)
1. Always plot your data; make a graph, usually a histogram, dot plot, stem plot or box plot
2. Look for the overall pattern (shape, center, spread) and for striking departures (outliers/gaps)
3. Calculate a numerical summary to describe center and spread
4. Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve.
Density curve
A curve that is always on or above the horizontal axis and has area exactly 1 underneath it.
- It describes the overall pattern of a distribution
- The area under the curve and above any interval of values on the horizontal axis is the proportion of observations that fall in that interval (areas under a density curve represent proportions of the total number of observations)
Problems with density curves
They come in many shapes; they are often a good description of the overall pattern of a distribution. However, outliers (departures from the overall pattern) are not described by the curve
No real set of data is exactly described by a density curve. The curve is an approximation that is easy to use and accurate enough for practical use.
Area of histogram
The total area of the bars in the histogram is 100% (a proportion of 1), since all of the observations are represented.
Distinguishing the median and mean of a density curve
*Mean/median can be used to describe real observations and density curves
- The median of a density curve is the equal-areas point, the point that divides the area under the curve in half (half of the observations on either side)
- The mean of a density curve is the balance point, at which the curve would balance if made of solid material
- Because density curves are idealized patterns, a symmetric density curve is exactly symmetric
- The mean and median are the same for a symmetric density curve. They both lie at the center of the curve.
- The mean of a skewed curve is pulled away from the median in the direction of the long tail.
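The same pull toward the long tail shows up in real observations. A small sketch with hypothetical right-skewed data:

```python
from statistics import mean, median

# Hypothetical right-skewed data: one large value forms the long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

print(median(right_skewed))  # 3.0
print(mean(right_skewed))    # 4.75 -> pulled toward the long right tail
```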
Notation of mean and standard deviation for density curves
Because a density curve is an idealized description of a distribution of data, we need to distinguish between the mean and standard deviation of the density curve and the mean x̄ (x bar) and standard deviation s_x computed from actual observations.
- Mean of density curve is μ (Greek letter mu)
- Standard deviation of density curve is σ (Greek letter sigma)
*Can locate μ of density curve by eye but not σ
*Same notation for population
What information do z-scores and percentiles provide?
1. Individual's location within a distribution
2. Compare relative standings of individuals within different distributions
Be careful w/ Normal curves
1. Normal curves are not usual/typical. They are special.
2. No set of real data is exactly described by a normal curve. Normal curves are approximations.
Normal distribution and Normal curve
Normal distribution is described by a normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean μ and standard deviation σ. The mean of a Normal distribution is at the center of the symmetric Normal curve. The standard deviation is the distance from the center to the change-of-curvature (inflection) points on either side
How to abbreviate Normal distributions
We abbreviate the Normal distribution with mean μ and standard deviation σ as N(μ, σ).
Shape of Normal curves
All Normal curves have the same overall shape: symmetric, single-peaked, and bell-shaped.
Mean of Normal distribution
Located at the center of the symmetric curve and is the same as the median. Changing the mean without changing the standard deviation moves the Normal curve along the horizontal axis without changing its spread.
Standard deviation of Normal distribution
Controls the spread of a Normal curve. Curves with larger standard deviations are more spread out.
*We can locate σ by eye on a Normal curve
Why mean and standard deviation of Normal curves are special
μ and σ alone don't specify the shape of most distributions. The shape of density curves in general does not reveal σ. These are special properties of Normal distributions.
Why are Normal distributions important in statistics?
1. They are good descriptions for SOME distributions of real data.
2. Good approximations to the results of many kinds of chance outcomes like the number of heads in many tosses of a fair coin
3. Statistical inference procedures are based on Normal distributions
Distributions that are often close to Normal include:
1. Scores on tests taken by many people (SAT, ACT, IQ)
2. Repeated careful measurements of the same quantity (diameter of a tennis ball)
3. Characteristics of biological populations (lengths of crickets or yields of corn)
Distributions that are not normal
Many sets of data follow a Normal distribution, though many do not. For example, most income distributions are skewed to the right and are not Normal. Some distributions are symmetric but not Normal. Non-Normal data are more common than Normal data.
The 68-95-99.7 rule
In the Normal distribution with mean μ and standard deviation σ:
- Approx. 68% of the observations fall within σ of μ.
- Approx. 95% of the observations fall within 2σ of μ.
- Approx. 99.7% of the observations fall within 3σ of μ.
*See figure 2.12
*Applies only to Normal distributions
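The three percents can be checked against the standard Normal curve. This sketch uses `NormalDist` from Python's standard `statistics` module; the variable names are chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(k, round(100 * area, 1))  # roughly 68, 95, and 99.7 percent
```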
Use of the 68-95-99.7 rule
Estimate percent of observations in a specified interval
Is there a rule that applies to any distribution (not just normal)
Chebyshev's inequality: says that in any distribution, the proportion of observations falling within k standard deviations of the mean is at least 1 - 1/k^2
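A minimal sketch of the bound. The function name `chebyshev_lower_bound` is hypothetical, and the bound is only informative for k greater than 1.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    return 1 - 1 / k**2

print(chebyshev_lower_bound(2))  # 0.75 -> at least 75% within 2 standard deviations
print(round(chebyshev_lower_bound(3), 3))  # 0.889
```

For a Normal distribution the 68-95-99.7 rule gives about 95% within 2 standard deviations, so Chebyshev's guarantee of 75% is much weaker; that is the price of applying to any distribution.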
All models are wrong but some are useful
68-95-99.7 rule describes distributions that are exactly normal. However, real data is never exactly Normal. We use a Normal distribution because it's a good approximation.
Normal distributions are best at describing:
Often describe real data better in the center of the distribution than in the extreme high and low tails
Standard Normal distribution
Normal distribution with mean 0 and standard deviation 1. If variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable z = (x - μ)/σ has the standard Normal distribution
How to find the percent of observations that fall on an interval that is not 1/2/3 standard deviations away from mean on Normal distribution
An area under a density curve is a proportion of the observations in a distribution. We can't always use the 68-95-99.7 rule. Because all Normal distributions are the same when we standardize, we can use Table A.
Table A: the standard Normal table
A table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
*This is for standardized values
*Can also be used to find z-score that corresponds to a particular area so use Table A backwards. Find given proportion in body of table, read corresponding z from left column and top row, then "unstandardize" to get the observed value.
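Software can stand in for Table A in both directions. The sketch below uses `NormalDist` from Python's standard `statistics` module in place of the printed table; the z-value 1.25 and the N(100, 15) distribution are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Forward: area to the LEFT of z, which is what a Table A entry gives
area_left = Z.cdf(1.25)
area_right = 1 - area_left  # total area under the curve is 1

# Backward: z-score with a given area to its left (the 90th percentile)
z_90 = Z.inv_cdf(0.90)

# "Unstandardize" for a hypothetical N(100, 15) distribution: x = mean + z * sd
x_90 = 100 + z_90 * 15

print(round(area_left, 4))   # about 0.8944
print(round(z_90, 2))        # about 1.28
```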
Common mistake with Table A
A common student mistake is to look up a z-value in Table A and report the entry corresponding to that z-value, regardless of whether the problem asks for the area to the left or to the right of that z-value. To prevent making this mistake, always sketch the standard Normal curve, mark the z-value, and shade the area of interest. And before you finish, make sure your answer is reasonable in the context of the problem.
4 step process: how to find the proportion of the distribution in any region/how to solve problems involving Normal distributions
*We can answer questions about proportions of observations in any Normal distribution by standardizing and using the standard Normal table
*Must standardize/must be approximately normal
State: express problems in terms of observed variable x
Plan: Draw a picture of the distribution and shade the area of interest under the curve
Do: Perform calculations
- Standardize x to restate problem in terms of standard Normal variable z
- Use table A and fact that total area under the curve is 1 to find required area under the standard Normal curve
Conclude: write your conclusion in the context of the problem
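The "Do" step can be sketched in Python. The variable and its distribution, N(64, 2.5), are hypothetical, and `NormalDist` from the standard `statistics` module stands in for Table A.

```python
from statistics import NormalDist

# State: hypothetical variable x is N(64, 2.5); what proportion has 60 <= x <= 68?
mu, sigma = 64, 2.5

# Do: standardize both endpoints
z_low = (60 - mu) / sigma    # -1.6
z_high = (68 - mu) / sigma   # 1.6

# Area to the left of each z (as in Table A), then subtract to get the area between
Z = NormalDist()
proportion = Z.cdf(z_high) - Z.cdf(z_low)

print(round(proportion, 2))  # about 0.89

# Conclude: about 89% of observations fall between 60 and 68
```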
Greater than vs. greater than/equal to and less than vs. less than/equal to for Normal distributions
In a Normal distribution, the proportion of observations with x ≥ a is the same as the proportion with x > a, since there is no area under the curve exactly above the single point a (the areas are the same). The Normal distribution is an easy-to-use approximation, not a description of every detail in the actual data.
Good strategy for Normal calculations
Sketch the area you want, then match the area with the area the table gives you.
Z values that are more extreme than those appearing in Table A
The values in Table A leave only area .0002 in each tail unaccounted for. For practical purposes, we can act as if there is zero area outside the range of Table A.
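A quick check of the tail area, assuming the table runs from z = -3.49 to z = 3.49 (the usual range for this kind of table):

```python
from statistics import NormalDist

# Area to the left of z = -3.49 under the standard Normal curve
tail_area = NormalDist().cdf(-3.49)

print(round(tail_area, 4))  # 0.0002
```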
Calculator and Normal distributions
- Calculator can help perform Normal distribution calculations: scratch pad, 6, 5, 2 (value to percent) or 3 (percent to value)
- Must communicate (include certain information)
- Don't use calculator speak when showing work
Examples of distributions that are skewed and not Normal
Examples include economic variables such as personal income and total sales of business firms, the survival times of cancer patients after treatment, and the lifetime of electronic devices.
Assessing Normality: how to determine if a distribution is Normal
Must use 2 of 3 methods
1.) Look at the shape of a graph
- Symmetrical (NOT skewed)
- Unimodal (NOT multimodal/multiple peaks)
2.) Check the 68-95-99.7 rule
- Find the mean and standard deviation
- Find intervals (1, 2, 3 standard deviations from mean)
- Find observations in each interval and divide by total
- Compare these percents with 68%, 95%, and 99.7%
3.) Graph a Normal probability plot
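Method 2 above can be sketched as a small Python function. The name `percent_within` is hypothetical, and the sketch assumes the sample standard deviation (`stdev`) is the one wanted.

```python
from statistics import mean, stdev

def percent_within(data, k):
    """Percent of observations within k standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    inside = sum(1 for x in data if m - k * s <= x <= m + k * s)
    return 100 * inside / len(data)

# Tiny hypothetical data set, just to show the mechanics
data = [1, 2, 3, 4, 5]

for k in (1, 2, 3):
    print(k, percent_within(data, k))  # compare with 68, 95, 99.7
```

With only five observations the percents are too coarse to judge Normality; the check is meaningful for larger data sets.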
Making a Normal probability plot
1. Arrange observed data values from smallest to largest
2. Record percentile corresponding to each observation
3. Use standard Normal distribution (Table A) or calculator to find z-scores at these percentiles
4. Plot each observation x against its z-score: x values on the x axis and z-scores on the y axis
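The steps above can be sketched in Python. The function name `normal_plot_points` is hypothetical, and the percentile assigned to the i-th smallest of n observations, (i - 0.5)/n, is an assumption of this sketch (one common convention that avoids percentiles of exactly 0% and 100%); a textbook or calculator may use a slightly different one.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (x, z) pairs for a Normal probability plot."""
    xs = sorted(data)                      # arrange from smallest to largest
    n = len(xs)
    std_normal = NormalDist()
    # Assumed percentile for the i-th smallest value: (i - 0.5) / n
    zs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, zs))               # x on the x axis, z on the y axis

# Hypothetical observations
points = normal_plot_points([12, 7, 9, 15, 10])
for x, z in points:
    print(x, round(z, 2))
```

If the points are plotted and fall close to a straight line, the data are consistent with a Normal distribution.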
Interpreting Normal probability plots
If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are approximately Normal.
Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot (mention these).
*Real data almost always show some departure from normality. Look for shapes that show clear departures from Normality. Don't overreact to minor wiggles.
How can we determine the shape of a distribution from a Normal probability plot
- If observations fall systematically to the right of the line, they are larger than expected based on percentiles and corresponding z-scores
- If observations fall systematically to the left of the line, they are smaller than expected
- In a right-skewed distribution, the largest observations fall distinctly to the right of the line drawn through main body of points
- In a left-skewed distribution, the smallest observations fall distinctly to the left of the line drawn through the main body of points
The variable is _ standard deviations larger/smaller than the mean. The person's _ is above/below average.
Interpreting curve of normal probability plot
Discuss whether linear