Terms in this set (33)
A categorical variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations make sense.
Exploratory data analysis
Is the process of using statistical tools and ideas to examine data in order to describe their main features.
The distribution of a variable tells us what values it takes and how often it takes these values.
The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.
Pie charts show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories. A pie chart shows parts of a whole, so it must include all categories and the slices must total 100%.
Bar graphs represent each category as a bar whose height shows the category count or percent. With bar graphs the emphasis is on comparing the sizes of the categories rather than on percents of a whole.
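As a rough illustration of the counts and percents that a pie chart or bar graph would display, here is a minimal Python sketch. The `colors` data and the helper name `categorical_distribution` are made up for this example and are not part of the study set.

```python
from collections import Counter

# Hypothetical sample: favorite color reported by 10 individuals (made-up data).
colors = ["red", "blue", "blue", "green", "red",
          "blue", "green", "blue", "red", "blue"]

def categorical_distribution(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, 100.0 * k / n) for cat, k in counts.items()}

dist = categorical_distribution(colors)
for cat, (count, pct) in sorted(dist.items()):
    # A bar graph would plot `count`; a pie chart would use `pct`.
    print(f"{cat:<6} count={count}  percent={pct:.1f}%")
```

The percents sum to 100 because every individual falls into exactly one category, which is the condition a pie chart needs.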
The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
For quantitative variables that take many values and/or large datasets.
Divide the possible values into classes (equal widths).
Count how many observations fall into each interval (may change to percents).
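Draw a picture representing the distribution: bar heights equal the number (or percent) of observations in each interval.
The vertical axis shows the frequency (count or percent) of each class, not the units of the variable.
Unlike a bar graph, a histogram has no gaps between adjacent bars, because the classes cover a continuous range of values.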
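The histogram steps above (equal-width classes, then a count per class) can be sketched in Python as follows. The `scores` data, the class boundaries, and the function name `histogram_counts` are assumptions chosen for illustration.

```python
def histogram_counts(data, start, width, num_classes):
    """Count observations in equal-width classes
    [start + i*width, start + (i+1)*width) for i = 0 .. num_classes-1."""
    counts = [0] * num_classes
    for x in data:
        i = int((x - start) // width)
        if 0 <= i < num_classes:      # ignore values outside the classes
            counts[i] += 1
    return counts

# Hypothetical exam scores (made-up data).
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 79, 82, 85, 91]
counts = histogram_counts(scores, start=50, width=10, num_classes=5)

# Text rendering: one row per class, one star per observation.
for i, c in enumerate(counts):
    lo = 50 + 10 * i
    print(f"{lo}-{lo + 9}: {'*' * c}")
```

Each observation lands in exactly one class, so the counts add up to the number of observations.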
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
For quantitative variables.
Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
Write the stems in a vertical column; draw a vertical line to the right of the stems.
Write each leaf in the row to the right of its stem; order leaves if desired.
When the leaves are ordered, the smaller values go on the left, closest to the stem.
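The stemplot steps above can be sketched in Python. This is a minimal version that assumes non-negative whole numbers, uses the final digit as the leaf and everything before it as the stem, and orders the leaves; the data and the function name `stemplot` are hypothetical.

```python
from collections import defaultdict

def stemplot(data):
    """Return stemplot lines for non-negative integers.
    Stem = all digits except the last; leaf = the last digit."""
    rows = defaultdict(list)
    for x in sorted(data):            # sorting puts leaves in increasing order
        rows[x // 10].append(x % 10)
    lines = []
    # Include stems with no leaves so the shape of the distribution is kept.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical observations (made-up data).
values = [38, 23, 57, 31, 25, 38]
lines = stemplot(values)
print("\n".join(lines))
```

Because each leaf is an actual digit of an observation, the original values can be read back off the plot.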
A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.
To find the mean $\bar{x}$ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:
$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or in more compact notation
$$\bar{x} = \frac{1}{n}\sum x_i$$
Because the mean cannot resist the influence of extreme observations, it is not a resistant measure of center.
The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:
Arrange all observations from smallest to largest.
If the number of observations n is odd, the median M is the center observation in the ordered list.
If the number of observations n is even, the median M is the average of the two center observations in the ordered list. The median changes only slightly when an extreme observation is added or changed (it is a resistant measure), while the mean can change dramatically.
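To make the resistance point concrete, here is a small Python sketch of the mean and median rules above, applied to a made-up data set before and after one value is replaced by an extreme observation. The function names and data are illustrative only.

```python
def mean(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle observation (n odd) or average of the two middle ones (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]   # largest value replaced by an extreme one

print(mean(data), median(data))                  # 4.0 4
print(mean(with_outlier), median(with_outlier))  # 14.8 4
```

The mean moves from 4.0 to 14.8 while the median stays at 4, which is what "resistant" means here.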
Mean vs. Median
The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
Measuring Spread: Quartiles
To calculate the quartiles:
Arrange the observations in increasing order and locate the median M.
The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 - Q1
If the number of observations is odd, the median is the middle observation. Under the definition above, Q1 and Q3 are the medians of the observations strictly to its left and right. Some courses instead include the middle observation in both halves, which can give slightly different quartiles.
If the number of observations is even, the median is the average of the two middle observations, so it is not itself one of the data values. Split the ordered list into a lower half and an upper half and take the median of each half to get Q1 and Q3. The IQR is the distance between Q1 and Q3. An observation is a suspected outlier only if it lies more than 1.5 x IQR below Q1 or above Q3, not merely because it falls outside the interval from Q1 to Q3.
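The quartile procedure can be sketched in Python as below. This sketch assumes the convention in the definition of Q1 and Q3 given above, where the median is left out of both halves when n is odd; other conventions (including the one built into many software packages) can return slightly different values. The function names and data are hypothetical, and at least two observations are assumed.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3), excluding the median from both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # observations left of the median
    upper = ordered[half + (n % 2):]    # observations right of the median
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([1, 3, 4, 6, 7, 8, 10])   # odd n
print(q1, m, q3, "IQR =", q3 - q1)              # 3 6 8 IQR = 5

print(quartiles([1, 3, 4, 6, 7, 8]))            # even n: (3, 5.0, 7)
```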
The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution.
To get a quick summary of both center and spread, combine all five numbers.
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum Q1 M Q3 Maximum
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
Suspected Outliers: The 1.5 x IQR Rule
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Q1 - 1.5 x IQR gives the lower cutoff and Q3 + 1.5 x IQR gives the upper cutoff. Together they define a range; an observation that does not fall within that range is a suspected outlier.
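Here is a minimal Python sketch that combines the five-number summary with the 1.5 x IQR rule. It assumes the quartile convention that excludes the median from both halves when n is odd; the data and function names are made up for illustration.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def suspected_outliers(values):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

data = [10, 12, 13, 15, 16, 18, 19, 21, 45]   # hypothetical data
summary = five_number_summary(data)
print(summary)                    # (10, 12.5, 16, 20.0, 45)
print(suspected_outliers(data))   # [45]
```

In this example the IQR is 7.5, so the cutoffs are 1.25 and 31.25, and only 45 falls outside them.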
Measuring Spread: Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances (dividing by n - 1 rather than n) and then taking the square root. This average squared distance is called the variance.
$$s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
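A short Python sketch of this calculation follows. It assumes the usual sample convention of dividing the sum of squared distances by n - 1; the card above only says "an average", so that divisor is an assumption here. The data and function names are illustrative.

```python
import math

def variance(values):
    """Sample variance: sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def std_dev(values):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

data = [1, 2, 3, 4, 5]             # mean is 3
print(variance(data))              # squared distances 4+1+0+1+4 = 10, over 4 -> 2.5
print(round(std_dev(data), 4))     # 1.5811
```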
Center vs. Spread
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don't have outliers.
If the scale is adjusted so the total area under the curve is exactly 1, then this curve is called a density curve.
A density curve is a curve that:
is always on or above the horizontal axis
has an area of exactly 1 underneath it
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
The median of a density curve is the equal-areas point, the point that divides the area under the curve in half.
The mean of a density curve is the balance point, at which the curve would balance if made of solid material.
The median and the mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
The mean and standard deviation computed from actual observations (data) are denoted by x̄ and s, respectively.
The mean and standard deviation of the actual distribution represented by the density curve are denoted by µ ("mu") and σ ("sigma"), respectively.
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean µ and standard deviation σ.
The mean of a Normal distribution is the center of the symmetric Normal curve.
The standard deviation is the distance from the center to the change-of-curvature points on either side.
We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).
The 68-95-99.7 Rule
In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
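The three percentages in the rule can be checked numerically. The sketch below relies on the standard fact that, for a Normal distribution, the proportion within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `proportion_within` is made up for this example.

```python
import math

def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * proportion_within(k):.2f}%")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```

The result does not depend on µ or σ, which is why one rule covers every Normal distribution.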
The Standard Normal Distribution
The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(µ,σ) with mean µ and standard deviation σ, then the standardized variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard Normal distribution, N(0,1).
The value z tells how many standard deviations x lies from the mean, and in which direction.
The standard Normal table is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
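As a sketch of how standardizing and the table lookup fit together, the Python below computes z for one value and then the area to its left under the standard Normal curve. The area is computed from the error function rather than read from a printed table, and the numbers (heights that are N(64.5, 2.5)) are hypothetical.

```python
import math

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

def area_left_of(z):
    """Area under the standard Normal curve to the left of z (a table entry)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 64.5, 2.5      # hypothetical Normal distribution N(64.5, 2.5)
x = 67.0
z = z_score(x, mu, sigma)
p = area_left_of(z)
print(z)                   # 1.0
print(round(p, 4))         # 0.8413
```

So about 84% of observations from this hypothetical distribution fall below 67, matching the table entry for z = 1.00.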
A scatterplot is the most useful graph for displaying the relationship between two quantitative variables. It shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point on the graph.
As in any graph of data, look for the overall pattern and for striking departures from that pattern.
You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.
Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.
Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.
The correlation r measures the strength of the linear relationship between two quantitative variables.
r is always a number between -1 and 1.
r > 0 indicates a positive association.
r < 0 indicates a negative association.
Values of r near 0 indicate a very weak linear relationship.
The strength of the linear relationship increases as r moves away from 0 toward -1 or 1.
The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
Facts About Correlation
Correlation makes no distinction between explanatory and response variables.
r has no units and does not change when we change the units of measurement of x, y, or both.
Positive r indicates positive association between the variables, and negative r indicates negative association.
The correlation r is always a number between -1 and 1.
Correlation requires that both variables be quantitative.
Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
Correlation is not resistant. r is strongly affected by a few outlying observations.
Correlation is not a complete summary of two-variable data.
Removing an outlier will generally change r, but the direction of the change is not obvious in advance. Without recomputing, we cannot tell whether r will move closer to 1, -1, or 0.
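The cards above do not give a formula for r. The Python sketch below uses the standard definition, the sum of products of standardized x and y values divided by n - 1, to illustrate three of the facts listed: r ignores units, r treats x and y symmetrically, and r is not resistant. All data and names are made up.

```python
import math

def correlation(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_units = correlation([2.54 * x for x in xs], ys)   # change units of x
r_swapped = correlation(ys, xs)                     # swap the two variables
r_outlier = correlation(xs + [6], ys + [-10])       # add one outlying point

print(round(r, 4))          # 0.7746
print(round(r_units, 4))    # 0.7746
print(round(r_swapped, 4))  # 0.7746
print(round(r_outlier, 4))  # changes sign: a single point reverses the association
```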