AP Statistics Summary
Terms in this set (399)
Population
The entire group of individuals being studied
Census
When we ask every member of the entire population to gain information about it
Parameter
Number used to describe the population (Ex. 96% of WLN students went to college last year; the parameter is the 96%).
Sample
A subset of the population that we use to gain information about the whole population
Statistic
Number used to describe the sample (Ex. Out of 45 parents, 60% agree that children are a good idea; the statistic is the 60%)
A statistic describes a _______
Sample
A parameter describes a ________
Population
Observational Study
We observe individuals and measure variables without imposing treatment or attempting to influence the response in any way
Experiment
Impose a treatment, then measure and observe the response variables; we evaluate the effects of treatments imposed on experimental units
Survey (Sample Survey)
We ask questions to sample groups to make inferences about the population
What is another word for a Survey?
Sample Survey
When referring to Surveys, what does the term "inference" imply?
A guess as to what the population would be like if a census were done
What three topics must we think of before preparing a survey?
1). What population do we want to describe?
2). What are we measuring?
3). How will we measure it?
When conducting a survey, you'll most likely get ___________ (define)
Sampling variability - We're not likely to get the exact same, identical result with each sample
What can go wrong with sample surveys?
There will always be error due to sampling variability, so statistic will usually not be identical to the parameter.
Bias
Using a method that will consistently over/underestimate the wanted value. It favors an outcome. Whenever we have bias, we must elaborate whether we think we are overestimating or underestimating
Sampling Error
Errors from the way we pick a sample
Sampling Frame
The original LIST that we pull the sample from (ideally should be the population)
Give three examples of sampling frames
1). Yellow Pages
2). Classroom
3). Registered members of the Commerce Library Book Club
What are the types of sampling errors?
1). Convenience
2). Voluntary
3). Undercoverage
Convenience Sampling
When we only use people to whom we have easy access, so they usually have similar thoughts to ours
What is usually true about the experimental units of a convenience sample?
They usually have similar thoughts as the experimenter does
Voluntary Sampling
Self-selection; it allows people to choose to take part in the experiment; they put themselves in the sample. They usually have very strong opinions
What is usually true about the experimental units of a voluntary sample?
They typically have strong opinions, because they went out of their way to ensure that their voice was heard
Undercoverage
Leaving people out of the sampling frame
Give an example of Undercoverage
Using the yellow pages to call (excludes unlisted people, people without phones, prison inmates, homeless, people who only use cell phones, people in college)
Non-Sampling Error
An error that does not draw from the way we pick a sample (The sample was picked well, but other problems afterward caused bias)
Nonresponse Bias
People can't be reached or refuse to answer
Response Bias
Subjects give unreliable answers because they are uncomfortable telling the truth to the experimenter
Wording Bias
The wording of the question(s) the experimenter asks is confusing or leading; the order of questions can also skew the answers
SRS
Simple Random Sample; "Of size n consists of n individuals from population chosen in such a way that every set of n individuals has an equal chance to be in the sample"; every individual and every sample has an equal chance of being chosen; like pulling names from a hat
How do you create an SRS?
1). Label - Assign a number to every individual in the population (sampling frame)
2). Table - Randomly pick the sample (numbers in a hat, Random Digit Generator, Random Digit Table)
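The label-then-draw steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the roster, the function name `simple_random_sample`, and the seed are all hypothetical, and `random.sample` stands in for the hat or random digit table.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Label every individual, then draw n labels without replacement."""
    rng = random.Random(seed)
    labels = list(range(len(population)))   # step 1: label each individual
    chosen = rng.sample(labels, n)          # step 2: random draw, no repeats
    return [population[i] for i in chosen]

# Hypothetical sampling frame of 30 students.
students = [f"student_{i:02d}" for i in range(1, 31)]
sample = simple_random_sample(students, 5, seed=1)
print(sample)
```

Because `random.sample` draws without replacement, every set of 5 students is equally likely to be chosen, which is the defining property of an SRS.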
Is an SRS always realistic?
No, so we use other methods
Stratified Random Sample
The population is divided into strata, and we pull an SRS from each stratum
Strata
Similar groups (s. stratum)
Cluster Sample
Split population into groups and use SRS to pick an entire group and interview every member of that group
Give an example of a Cluster Sample
Use SRS to pick a grade and ask each member of that grade their opinion on taking Finals (each grade is a cluster)
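The difference between the stratified and cluster cards above can be sketched as follows. The school, its grade sizes, and the sample sizes are made-up values for illustration only.

```python
import random

rng = random.Random(0)

# Hypothetical school: 4 grades with 20 students each.
school = {grade: [f"g{grade}_s{i}" for i in range(20)] for grade in (9, 10, 11, 12)}

# Stratified: take an SRS of 3 students from EVERY grade (each grade is a stratum).
stratified = [s for grade in school for s in rng.sample(school[grade], 3)]

# Cluster: pick ONE grade at random, then include everyone in that grade.
chosen_grade = rng.choice(list(school))
cluster = list(school[chosen_grade])

print(len(stratified), "students from all grades")
print(len(cluster), "students, all from grade", chosen_grade)
```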
Systematic Sample
Use some sort of fixed sampling method starting at a random point. If the first point is self-picked (not by chance) that is a self-selection sampling error; The first item is selected at random from the first k items in the frame, and then every kth term is included in the sample
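A systematic sample as described above might look like this in Python. The frame of 100 numbered individuals and the choice k = 10 are assumptions for the example.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k items, then take every kth item."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # random index in 0..k-1, chosen by chance
    return frame[start::k]

frame = list(range(1, 101))       # hypothetical frame: individuals numbered 1 to 100
picked = systematic_sample(frame, k=10, seed=3)
print(picked)
```

The random start is what keeps this from being a self-selection error; the spacing afterward is fixed.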
How does Stratifying reduce sampling variability?
It groups individuals by similar characteristics (strata), so responses within each stratum vary less than responses across the whole population
Why do we sample?
To get an overall feel for the population (keep in mind, though, that the results most likely WILL NOT be exact)
Which give more accurate results - larger or smaller samples?
Larger samples
Why do we sample instead of conducting a census?
It would be too expensive, take too long, and be too impractical to conduct a whole census every time we want information, so we use samples to infer about the population
Double-Blind Experiment
Neither the test subjects nor the experimenter knows which treatment was given, so as to avoid bias
Statistically Significant
An observed effect too great to be plausibly caused by chance alone
Matched Pairs Design
Experimental units are paired in twos based on a blocking variable, and then treatments are randomly assigned between the two individuals in each pair
Blocking
Using groups of similar individuals in an experiment. Can help to avoid confounding variables
Stratifying is to sampling as ___________ is to experimenting
Blocking
Placebo Effect
A response to a dummy treatment caused by the subject's expectations rather than by the treatment itself
Confounding
When we cannot tell whether the response variable y was affected by the explanatory variable x or by another explanatory/lurking variable z
Experimental Units
The individuals (people, animals, or objects) to which treatments are applied; when the units are human, they are usually called subjects
Treatment
Something imposed on test subjects for evaluation; Combination of all factors and each level; What is done to units
Factor
Otherwise known as an Explanatory Variable, it is the cause of the outcome
What are the three Principles of Experimental Design?
1). Control - Control the environment to make conditions as similar as possible
2). Randomization - Make groups roughly equivalent by spreading out uncontrollable things
3). Replication - Use enough units in each group to show the same results (don't trust if only done once)
Completely Randomized Experiment
Treatments assigned to all experimental units by chance (tend to use a control group as the baseline of data)
How does utilizing randomization in samples differ from randomization in experiments?
Sample: Randomly pick people to be involved
Experiment: Pick people based on certain characteristics and randomly assign treatment (self-pick units and randomly give treatments)
How would a diagram to chart Experimental Research be Drawn?
Randomly Assign individuals into groups, each with different treatments, and compare the responses at the end
When describing an experiment's design in writing, be sure to include:
• How you would assign groups
• How you would assign treatments
• Description of each treatment
• Always compare at the end
Blinding
Not telling people who had what treatment (one or both parties don't know who has what treatment)
Single-Blind Experiment
One class (those who influence results and those who evaluate) doesn't know which group of individuals has which treatment (doesn't matter as to which class doesn't know)
The goal of an experiment is to show:
Cause/Effect (Causation)
For an experiment's result to be called statistically significant, how small must the probability be that the observed response arose by chance alone?
<5%
The best experiments have these qualities:
• Randomized
• Results are compared
• Double-Blind
• Placebo-Controlled
True or False: Only an Experiment can show Causation
True; only an experiment can show causation (in order to show cause/effect, there must be an experiment at play)
What do you know about the Blinding of an Experiment with a Placebo?
It is at least Single-Blind (because the people don't know they're given a placebo), but can be Double-Blind
What can spread out lurking variables?
Randomization
Is Replication a good practice in experimental designs because it eliminates chance variation, or because it allows for chance variation to be estimated?
It allows for chance variation to be estimated
Does randomization make the treatment groups as similar as possible, make the treatment groups as different as possible, or reduce variability within treatment groups?
Makes the treatment groups as similar as possible
Keep in mind, randomization spreads out lurking variables, and by doing so, you are making the test groups more similar. This way, when causation is shown, the exact cause can be pinpointed
True or False: It is key that an experimenter randomly assigns treatments in an experiment. If not, then it is an Observational Study
True
Are different sample sizes a potential issue?
No
Which method: Blocking or Matched-Pairs will be the best guard against confounding variables?
Blocking because you want to separate the large characteristics (ex. gender) before you conduct an experiment, as those can prove to be lurking variables
If a sample survey is simply asking questions to sample groups without attempting to get a feel for the population those sample groups represent, is that still a sample survey?
No, it is an Observational Study
Does random mean haphazard?
No
Explain the quote: "Chance behavior is unpredictable in the short run, but has a predictable, regular pattern in the long run"
Things are random if individual outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions
Probability
Long-run relative frequency of events; proportion of times the outcome would occur in a very long series of repetitions
In order for an event to be random, it must be ______________
Independent
Law of Large Numbers
Long run frequency of repeated events gets closer and closer to the true probability as the number of trials increases
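A quick simulation can illustrate the Law of Large Numbers. This sketch assumes a fair coin (true probability 0.5) and a fixed seed so the run is repeatable; the function name is hypothetical.

```python
import random

def running_proportion(num_flips, seed=0):
    """Return the proportion of heads after each flip of a simulated fair coin."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for flip in range(1, num_flips + 1):
        heads += rng.random() < 0.5        # True counts as 1
        proportions.append(heads / flip)
    return proportions

props = running_proportion(100_000)
for n in (10, 100, 1_000, 100_000):
    print(n, round(props[n - 1], 4))
```

Early proportions can wander far from 0.5, but the proportion after many flips settles close to it. Note that this says nothing about any single upcoming flip, which is why the "Law of Averages" is a mistake.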
Law of Averages
BAD! STAY AWAY!
Thinking an outcome will happen ("it's due") based on the past
Sample Space
SET of all possible outcomes
{H, T} ----------> Sample Space of Flipping a Coin
Event
An outcome or set of outcomes of a random phenomenon; what we are looking for to happen
Probability Model
Mathematical description of a random phenomenon containing two parts:
+ Sample Space
+ Way of Assigning Probabilities to Outcomes
P(A)
Probability of Event A
Multiplication Principle
If you can do one event n number of ways, and another m number of ways, then you can do both in mn number of ways
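As a small check of the Multiplication Principle, the outfits below are enumerated directly. The shirts and pants are made-up items.

```python
from itertools import product

shirts = ["red", "blue", "green"]   # n = 3 ways
pants = ["jeans", "khakis"]         # m = 2 ways

# Every (shirt, pants) pairing is one way of doing both.
outfits = list(product(shirts, pants))
print(len(outfits))                 # n * m = 6
```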
What are the five rules of Probability?
1). P(an event) is always between 0 and 1, inclusive. The event either never happens (0), always happens (1), or falls somewhere in between
2). P(S) = 1
(The probability of any outcome in a sample space occurring is always equal to 1)
3). The complement of A (written A', or sometimes A with a superscript c) is the event that A doesn't occur. Its probability can be found by:
P(A') = 1 - P(A)
4). Addition Rule (of Disjoint Events) - If two events A and B are disjoint, then the probability of their union, P(A ∪ B), is the sum of their probabilities:
P(A ∪ B) = P(A) + P(B)
5). Multiplication Rule (of Independent Events) - If two events A and B are independent, then the probability of their intersection, P(A ∩ B), is the product of their probabilities:
P(A ∩ B) = P(A)P(B)
The Union of two events uses addition or multiplication?
Addition
The Intersection of two events uses addition or multiplication?
Multiplication
Union means AND or OR?
Or
Intersection means AND or OR?
And
Independent
Knowledge of Event A gives no information about B
Disjoint
Events A and B can't happen at the same time
The Addition Rule is used for _____________ Events
Disjoint
The Multiplication Rule is used for _______________ Events
Independent
If A and B are independent, then so are ______________, but not ____________
A and B'
A' and B
A' and B'
NOT A and A'
Two-Way Table
A table that describes two categorical variables
Marginal Distribution
The total in the margins
Joint Frequency
Each entry in the table (main part, not the margins)
Conditional Distribution
Finding the Conditional Probability of a Two-Way Table:
Joint Frequency / Marginal Distribution
Is the General Addition Rule used exclusively for disjoint events? What is the General Addition Rule?
No! That is the Addition Rule!
The General Addition Rule states:
For any two events A and B, P(A ∪ B) can be found by:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
The last part can be ignored in disjoint cases because for disjoint events, P(A ∩ B) = 0
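The complement, addition, and general addition rules can be checked on a single roll of a fair six-sided die. The events A, B, and C below are chosen only for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = set(range(1, 7))          # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}     # roll is even
B = {4, 5, 6}     # roll is greater than 3
C = {1}           # roll is 1 (disjoint from A)

# General Addition Rule: subtract the overlap so it is not counted twice.
general = prob(A) + prob(B) - prob(A & B)
print(prob(A | B), general)              # both 2/3

# Disjoint case: the overlap has probability 0, so plain addition works.
print(prob(A | C), prob(A) + prob(C))    # both 2/3

# Complement Rule.
print(prob(sample_space - A), 1 - prob(A))   # both 1/2
```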
Is the General Multiplication Rule used exclusively for independent events? What is the General Multiplication Rule?
No! That is the Multiplication Rule!
The General Multiplication Rule states:
For any two events A and B, P(A ∩ B) can be found by:
P(A ∩ B) = P(A)P(B | A)
OR
P(A ∩ B) = P(B)P(A | B)
The conditional part can be simplified in independent cases because, for independent events, A has no effect on B and vice versa. P(B) is equal to P(B | A) because A doesn't affect B. This is the basis for the equation used to check whether two events are independent
Conditional Probability
Probability that one event happens given another event
P(B | A) = P(A ∩ B) / P(A)
We get this by modifying the General Multiplication Rule [Divide both sides by P(A) to isolate P(B | A)]
Can disjoint events ever be independent?
NO!
If we know that A and B are disjoint, then we know that A happening affects B happening in the sense that B cannot occur
What is the equation to find out if events are independent or not?
P(B | A) = P(B)
Because if A and B are independent, then A should have no effect on B
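The conditional probability formula and the independence check can be tried on a two-way table. The table of counts below is entirely hypothetical, and `p` is a helper written just for this sketch.

```python
from fractions import Fraction

# Hypothetical two-way table of counts: (grade level, job status).
table = {
    ("junior", "job"): 30, ("junior", "no job"): 70,
    ("senior", "job"): 45, ("senior", "no job"): 55,
}
total = sum(table.values())

def p(row=None, col=None):
    """Probability of a row, a column, or a single cell (joint) of the table."""
    count = sum(v for (r, c), v in table.items()
                if (row is None or r == row) and (col is None or c == col))
    return Fraction(count, total)

p_job = p(col="job")                        # marginal: 75/200
p_senior = p(row="senior")                  # marginal: 100/200
p_job_and_senior = p("senior", "job")       # joint: 45/200

# Conditional probability: joint divided by the marginal of the given event.
p_job_given_senior = p_job_and_senior / p_senior

# Independence check: does knowing "senior" change the chance of "job"?
independent = p_job_given_senior == p_job
print(p_job_given_senior, p_job, independent)
```

In this made-up table the two probabilities differ, so having a job and being a senior are not independent.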
When do you use a Venn Diagram?
When two events can overlap (share outcomes)
When do you use a Tree Diagram?
When there are multiple steps involved
Pie Chart
Used with categorical data and it must add up to 100%
Very rarely used (only to compare relation as a whole)
For Categorical Data (I.e.: the colors of different socks)
Bar Graph
+ Used with Categorical Data
+ Usually Shows Frequency (Count)
+ Bars are free standing (Don't Touch)
+ Bars can be in any order
+ Used for categorical data (I.e.: the colors of different socks)
Quantitative Data
Numerical data (counts or measurements) for which arithmetic such as averaging makes sense; it can come from surveys, experiments, and observational studies
What makes a Good Graph?
+ Title
+ Axes Labeled
+ Constant Scale
Dot Plot (How is One Made)?
+ Draw a Horizontal Line and Scale Based on the Numbers
+ Title the Dot Plot and Label the Axes
+ Place One Dot Above the Appropriate Value for each data point
What Four Attributes are Used to Interpret the Distribution of Quantitative Data
+ Shape
+ Center
+ Spread
+ Outliers or Not
What are the Four Shapes of a Quantitative Data Graph?
+ Bell
+ Uniform
+ Left - Skewed
+ Right - Skewed
Shape of a Bell Graph
Bell-shaped: symmetric with a single peak in the middle and tails that fall off on both sides
Uniform Graph
Shaped like a table, flat across the top:
 ____________________
|                    |
Remember that "Uniform" means every value (or interval) occurs with about the same frequency, not that all the data values are identical
Left - Skewed
The tail extends to the left; the few low values pull the mean to the left of the median
Right - Skewed
The tail extends to the right; the few high values pull the mean to the right of the median
What Two Concepts are Associated in Determining the Center of a Graph?
The median/mean
Spread of a Graph
How much variability from the first number to the last number
Written: Lowest to Highest Intervals (Ex. 4 - 23)
Range
A way of determining spread, how many numbers the data ranges:
Highest Value - Lowest Value = Range
(Ex. 23 - 4 = 19)
IS NOT RESISTANT TO OUTLIERS (it is computed from the two most extreme values)
Outlier
Anything outside the norm of other points
Stemplot (Stem-and-Leaf Plot)
Quick picture or shape of distribution while including actual values
How to Create a Stemplot
1). Create a stem (all numbers in the tens place written vertically with a vertical line to the right)
2). Write a leaf in the appropriate row, with the ones digits increasing as they extend out from the stem
3). Include a key (and always include a title)
Splitting Stems
Take one stem and break it into two when too much data (refer to Assignment 31 - Chapter 1: Data Set #1)
What can Splitting the Stems help to identify?
If there are any outliers
When do we Split the Stem?
When there is too much data, or data is overcrowded
Back-to-Back Stemplot
Two sets of data share 1 stem (refer to Assignment #31 - Chapter 1: Data Set #1)
Histogram
Breaks values into classes/intervals of equal width and displays frequency/percent of each class. The bars touch because of the continuous interval.
If there is a large sample size, a Histogram is the best choice for displaying the data (if, of course, it is quantitative)
What is the Only Downside to a Histogram?
It only tells the frequency/percent, you can't see specific data points
When a Curve Appears to be a Bell, but is not Exact, What is the Shape Referred to As?
Approximately Symmetric (Refer to Assignment #31 - Chapter 1: Data Set #1)
Ogive
Also known as Relative Cumulative Frequency Graph, an Ogive is used when we need to know how one value compares to others.
An Ogive shows percents
In the Left-Most Set of a Back-to-Back Stemplot, where is the "Right Side" of the Data?
The bottom
So right-skewed would mean that most of the data is found near the top (AKA, the "Left Side")
Mean
Average Value
The fair share; the amount everyone would get if they all had the same amount; the balancing point
When referring to Quantitative Data Distribution Graphs, what does n mean?
n = the number of data entries
Mean of the Population
μ
Mean of the Sample
x̄
Mean Formula
"The sum of observations / n"
Is Mean Resistant to Outliers?
No, because one number can make the mean much larger/smaller than it actually is
Median
Typical Value
Midpoint of distribution; half the observations are below this point and half are above
How to Find Median
+ Arrange all Numbers Smallest to Largest
+ In cases where n is odd, the median is the middle number
+ In cases where n is even, the median is the average of the two middle numbers
Is Median Resistant to Outliers?
Yes, because we are only interested in the positions of numbers, not their values
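The resistance of the median, and the lack of resistance of the mean, can be seen with a small made-up data set. The salary values below are hypothetical.

```python
from statistics import mean, median

salaries = [30, 32, 35, 36, 40]        # hypothetical salaries, in thousands
with_outlier = salaries + [400]        # add one extreme value

print(mean(salaries), median(salaries))            # 34.6 35
print(mean(with_outlier), median(with_outlier))    # 95.5 35.5
```

One extreme value moves the mean by more than 60 but moves the median by only 0.5.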
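Mean, Median, and Mode Locations of a Symmetrical or Approximately Symmetrical Graph
Mean ≈ Median ≈ Mode (all at or near the center)
Mean, Median, and Mode Locations of a Left - Skewed Graph
Typically Mean < Median < Mode (the mean is pulled toward the left tail)
Mean, Median, and Mode Locations of a Right - Skewed Graph
Typically Mode < Median < Mean (the mean is pulled toward the right tail)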
When is Mean Used?
When a graph is symmetric or approximately symmetric (because it is affected by outliers)
When is Median Used?
When a graph is skewed or has outliers (because outliers don't affect it)
Spread
The amount of variability in the distribution
What Can We Use to Improve the Distribution of Spread?
The IQR (middle 50% of data)
Quartile 1
Q1
Median of the lower half of data
Quartile 3
Q3
Median of upper half of data
IQR
Q3 - Q1
Represents the middle 50% of data
Are IQR and Quartiles Resistant to Outliers?
Yes, because they are based on the median
Boundaries for Outliers
[Anything beyond these values is an outlier]
Q1 - 1.5IQR
Q3 + 1.5IQR
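The 1.5 × IQR boundaries can be computed as in the sketch below. It assumes the quartile convention used in these cards (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median when n is odd); calculators and software sometimes use slightly different conventions. The data values and function names are made up.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]   # skip the middle value if n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def find_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [4, 7, 8, 9, 10, 11, 12, 13, 23]
print(five_number_summary(data))   # (4, 7.5, 10, 12.5, 23)
print(find_outliers(data))         # [23]
```

Here IQR = 12.5 - 7.5 = 5, so the boundaries are 0 and 20, and only 23 falls outside them.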
If you know that an error occurred in your gathering of data, and in turn this creates an outlier, what can you do?
Drop the value from the data set
HOWEVER
If there is an outlier that wasn't caused by an error, then you MUST include it in the data set
Five Number Summary
Minimum, Quartile 1, Median, Quartile 3, and Maximum
Box Plot
Shows the Five Number Summary and Any Outliers
+ Box from Q1 to Q3
+ Vertical Line inside the Box is the Median
+ Horizontal Lines Extend to Minimum and Maximum (Not Including Outliers)
+ Outliers are Dots Outside of Lines
How to Make a Box Plot and Find the Five Number Summary with a Calculator (TI - 84)
+ Insert Data
+ 2nd - Stat Plot (Hit 2nd - y=)
+ Choose Plot 1
+ Turn on Plot 1
+ Go to the Fourth Option (Box Plot with the Dots)
+ Zoom 9
+ Trace to Jump Between Values of the Five Number Summary
Percentile
Measure of relative location
Finds what percent of observations lie at or below a certain value:
# of values at or below a certain number / # of values
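The percentile formula above translates directly into code. The scores are hypothetical and `percentile_of` is a name chosen for this sketch.

```python
def percentile_of(value, data):
    """Percent of observations at or below the given value."""
    at_or_below = sum(x <= value for x in data)
    return 100 * at_or_below / len(data)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_of(80, scores))   # 6 of 10 values are at or below 80 -> 60.0
```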
Most Common Measure of Spread
Standard Deviation
Standard Deviation
Mean based; Looks at how far, on average, each observation is from the mean
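Variance (in a Population)
σ² = ∑(x - μ)² / n
Variance (in a Sample)
Sx² = ∑(x - x̄)² / (n - 1)
Standard Deviation (in a Population)
σ = √( ∑(x - μ)² / n )
Standard Deviation (in a Sample)
Sx = √( ∑(x - x̄)² / (n - 1) )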
When determining Standard Deviation (or Variance), we divide by n in the case of:
Populations
When determining Standard Deviation (or Variance), we divide by (n-1) in the case of:
Samples
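The n versus n - 1 distinction can be seen with Python's `statistics` module, where `pvariance`/`pstdev` divide by n and `variance`/`stdev` divide by n - 1. The data set is made up so that the arithmetic comes out cleanly.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = mean(data)                                  # 5
sum_sq = sum((x - x_bar) ** 2 for x in data)        # 32

print(sum_sq / len(data), pvariance(data))          # population: divide by n
print(sum_sq / (len(data) - 1), variance(data))     # sample: divide by n - 1
print(pstdev(data), stdev(data))                    # square roots of the above
```

Dividing by n - 1 always gives a slightly larger result than dividing by n, and the gap shrinks as n grows.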
When is the ONLY time we use Standard Deviation?
When the mean is used (data is roughly symmetric with no outliers)
What is always true about the value of Standard Deviation?
Sx ≥ 0
Sx = 0 when there is no variability in the data set at all (all the observations are the same)
When does Sx get larger?
The more spread out the data is
Why do we take the square root when finding Standard Deviation?
So that Sx has the same unit of measure as the original observations (variance is in squared units)
How to Find Mean and Standard Deviation in the Calculator:
+ Put Data into a List
+ Stat. ---> Calc. ---> 1 Variable Statistics
+ (Data is in the List you Put it In)
Meanings of the Different Symbols You'll Encounter when Finding Sx in a Calculator
x̄ = Mean
∑x = Sum of x
∑x^2 = Square each data point, then find the sum
Sx = Standard Deviation of Sample (Divide by n-1)
σx = Standard Deviation of Population (Divide by n)
Can you ever cross between using Mean and IQR, or Median and Standard Deviation?
No!
Median and IQR are used for one distribution
Mean and Sx are used for another
How to Describe the Distribution of an Addition/Subtraction Error on a Calculator:
+ Put Data into a List
+ Go One List Over and Do all of the First List's Values ± the Correct Number (For example, L2 - 13 will go into the L3 column)
How will Adding or Subtracting the same #a to each observation affect the distribution?
+ Add/Subtract #a to each measure of center (Mean, Median, Quartiles) and to each measure of location (Percentiles)
+ Shape is unaffected
+ Spread is unaffected
How to Describe the Distribution of a Multiplication/Division Error on a Calculator:
+ Put Data into a List (L1)
+ Go One List Over (L2) and Do all of L1's Values Multiplied by the Change (L1 * 23 information will go into L2)
How will Multiplying or Dividing the same #b to each observation affect the distribution?
+ Multiply/Divide #b to each measure of center/location (Mean, Median, Quartiles, Percentiles)
+ Multiply/Divide each measure of spread (IQR, Range, Sx) by |b|
+ Shape is never affected
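The effects of adding a constant and of multiplying by a constant can be verified numerically. The data and the constants (add 5, multiply by 2) are arbitrary choices for illustration.

```python
from statistics import mean, median, stdev

original = [10, 15, 20, 25, 30]
shifted = [x + 5 for x in original]    # add a = 5 to every observation
scaled = [x * 2 for x in original]     # multiply every observation by b = 2

for name, values in (("original", original), ("shifted", shifted), ("scaled", scaled)):
    print(name, mean(values), median(values), round(stdev(values), 3))
```

Adding 5 moves the mean and median up by 5 and leaves Sx alone; multiplying by 2 doubles the mean, the median, and Sx.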
Is a Percentile affected by both a change in the addition/subtraction of a distribution as well as a multiplication/division change?
Yes
Is a Measure of Spread affected by both a change in the addition/subtraction of a distribution as well as a multiplication/division change?
No, only a multiplication/division change
Is a Measure of Center affected by both a change in the addition/subtraction of a distribution as well as a multiplication/division change?
Yes
Is Shape affected by both a change in the addition/subtraction of a distribution as well as a multiplication/division change?
No, shape isn't affected by either
How would you go about increasing x by 5%?
1.05x
Univariate Statistics
Single Variable Statistics (How is the Data Distributed?)
Bivariate Statistics
Two Variable Statistics (To Examine the Relationship Between the Two Variables)
Explanatory Variable
Explains or influences change in the response variable (explanatory variable is on the x - axis)
Response Variable
Measures the outcome of a study (on the y - axis)
Do We Always have Explanatory and Response Variables?
No
Scatterplot
Most useful graph for plotting two quantitative variables measured on the SAME individuals
Each appears as a single point on the graph
Does Association Imply Causation?
No, only an experiment can show causation
How is a Scatterplot Interpreted?
Direction, Form, and Strength
What are the Three Different Measures of Direction in a Scatterplot?
Positive (aimed up), Negative (aimed down), None (no direction)
What are the Three Different Measures of Form in a Scatterplot?
Linear, Curved, and Clustered
What are the Three Different Measures of Strength in a Scatterplot?
Strong, Moderate, Weak
Strength
How closely the points follow a clear form
Correlation Coefficient (r)
Measures the direction and strength of a linear relationship between two quantitative variables
What are the Seven Concepts of a Correlation Coefficient?
-1 ≤ r ≤ 1
r > 0 if positive, r < 0 if negative
If r is close to 0, there is a weak linear relationship
If r = 1 or r = -1, there is a perfectly linear relationship
Makes no distinction between explanatory and response variables
r doesn't change if we change units, because...
The correlation coefficient has no units
Can the Correlation Coefficient be Used for a Scatterplot of Any Form?
No, only a linear form
Cautions about Calculating the Correlation Coefficient
+ Both variables MUST be quantitative
+ Only strength and direction of a linear form
+ 0 does not mean "no relationship", it means "no LINEAR relationship"
+ r is NOT RESISTANT to outliers (because mean and standard deviation are in the formula)
+ r cannot say that the explanatory variable caused the response variable, it only says there is SOME sort of relationship
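The properties of r listed above can be checked with a short calculation. This sketch uses the standard formula r = (1 / (n - 1)) ∑ z_x z_y, which is not spelled out in these cards; the hours and scores are made-up data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r as the average product of z-scores, dividing by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    z_products = (((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in zip(xs, ys))
    return sum(z_products) / (len(xs) - 1)

hours = [1, 2, 3, 4, 5]               # hypothetical hours studied
scores = [52, 60, 61, 70, 77]         # hypothetical test scores

r = correlation(hours, scores)
r_minutes = correlation([h * 60 for h in hours], scores)   # change units of x
r_swapped = correlation(scores, hours)                     # swap x and y
print(round(r, 4), round(r_minutes, 4), round(r_swapped, 4))
```

All three values agree: changing units or swapping the explanatory and response variables leaves r unchanged.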
Correlation
Measures direction and strength of a linear relationship
Regression Line (Line of Best Fit)
A line that describes how a response variable, y, changes with an explanatory variable, x
Regression Line Formula
ŷ = a + bx
ŷ
Predicted value of a response variable for a given value of x
a (in the Regression Formula)
The y-intercept: the predicted value of y when x = 0. Only meaningful if x-values near 0 make sense for the data
b (in the Regression Formula)
Slope; the amount by which y is predicted to change when x is increased by 1 unit
Extrapolation
Use of a Regression Line for predicting values outside the whole interval of x - values. (If the x - values range from 10 - 20, you couldn't use a Regression Line outside of that range)
Residual
(Error) Difference between actual and predicted values
Residual = actual y - ŷ
If a Residual is Negative, did the Line Over or Under Predict?
Over predicted
If ŷ = 2, but y = 1, then:
Residual = -1
This means that the predicted value over predicted because ŷ > y
What are the Seven Concepts of a Least - Squares Regression Line?
+ No line is perfect (in most cases) but aim to be as close as possible
+ Due to the fact that we will not pass every point and the distance is as small as possible, there will be error (residual)
+ The goal of our line is to make the vertical distances (residuals) from the line to the actual y-values as small as possible
+ Always passes through the point (mean of x, mean of y)
+ If all the residuals were added together, it would be equal to 0, because it passes through the balancing point (mean of x, mean of y). Remember that the mean is the balancing point
+ Because positive and negative residuals cancel out, we square the residuals before finding the sum
+ The goal of the Least - Squares Regression Line is to make the sum of the squares as small as possible
Slope Formula Given the Standard Deviation of x, Standard Deviation of y, and the r
b = r(Sy / Sx)
Y - Intercept Formula Given the Mean of y and the Slope (Found in Another Formula)
a = (mean of y) - b(mean of x)
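The slope and y-intercept formulas above can be worked through on a small made-up data set. The function name and data are hypothetical; r is computed from the usual sample formula.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r) for y-hat = a + b*x using b = r(Sy/Sx), a = y-bar - b*x-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = y_bar - b * x_bar
    return a, b, r

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares_line(xs, ys)

# Residual = actual y - predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 3), round(b, 3), round(r ** 2, 3))   # 2.2 0.6 0.6
print(round(sum(residuals), 10))                    # residuals sum to (about) 0
```

The line passes through (mean of x, mean of y) = (3, 4), since 2.2 + 0.6(3) = 4.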
How to Interpret the Slope when Finding ŷ?
___________ is predicted to _____________ (increase/decrease based on whether or not b is positive or negative) by ____________ (|b|) for each additional _____________________ (x)
Residual Plot
A graphical tool to tell how well a model fits the data and if it is appropriate to be used
Should show no obvious pattern, shape, or direction and the residuals should be relatively small (line that comes close to most of the points)
How to Make a Residual Plot in a TI - 84 Calculator
Turn on a Scatterplot ---> 2nd Stat ---> 7 (Residual)
Standard Deviation of Residuals
Average distance residuals are from the regression line, the closer to 0, the better
Coefficient of Determination
r^2
A numerical value that tells how well a Least - Squares Regression Line does at predicting y-values. This value can be given as a decimal or as a percent
How do you Interpret the Coefficient of Determination?
_______% of the variation in [the response variable y] can be accounted for by the linear relationship with [explanatory variable x]
When Given a Computer Output, How do you Find r and r^2?
r^2 is the number next to "R squared" NOT "R squared (adjusted)"
r is the square root of this number, with the same sign as the slope b
When Given a Computer Output, How do you Find the Response Variable?
It is given at the top, listed as the "Dependent Variable"
When Given a Computer Output, How do you Find the Standard Deviation of the Residuals?
It is the number equal to "s"
When Given a Computer Output, How do you Find the whole Regression Line Formula?
ŷ is the "Dependent Variable" hat
a is the first number listed under "Coefficient" (next to "Constant")
b is the second number listed under "Coefficient" (next to the explanatory variable)
x is the second number listed under "Variable" (under constant)
Outlier in a Scatterplot
Still lies outside of the overall pattern; points that are outliers in the y direction but not the x have large residuals
Influential Points
Points that are outliers in the direction of the x, but not the y (removing it would greatly change the Least - Squares Regression Line)
What Happens to the y-Intercept and the Slope of a Least - Squares Regression Line when an Influential Point is Present?
The y - intercept increases, because the slope decreases (think of it on a spring)
What are the Three Concepts you Use to Determine if a Relationship is Linear or Not?
+ Does the Scatterplot Look Linear?
+ Does the Residual Plot Look Random?
+ Does r^2 Indicate that a Linear Model is Appropriate? In Other Words, is it a High Percentage
When do we Re-Express Data?
When we want to represent the data, but we can't because it doesn't look linear
Exponential Model
A model of the form y = a·b^x, used to re-express non-linear data so that it can be interpreted with a linear fit
An exponential model changes the original y vs. x scatterplot and makes it:
log(y) vs. x
Only the log(y) is taken
Power Model
A model of the form y = a·x^b, used to re-express non-linear data so that it can be interpreted with a linear fit
A power model changes the original y vs. x scatterplot and makes it:
log(y) vs. log(x)
Both the log(y) and the log(x) are taken
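Re-expression with an exponential model can be sketched as follows: take log(y), fit a line to log(y) vs. x, then undo the log to predict on the original scale. The data are generated from y = 3·2^x purely for illustration, and `fit_line` is a helper written for this example. A power model would work the same way except that log(x) is taken as well.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

xs = [0, 1, 2, 3, 4]
ys = [3 * 2 ** x for x in xs]              # made-up exponential data: 3, 6, 12, 24, 48

# Re-express: only log(y) is taken, x is left alone.
a, b = fit_line(xs, [math.log10(y) for y in ys])

def predict(x):
    """Back-transform: log(y-hat) = a + b*x, so y-hat = 10^(a + b*x)."""
    return 10 ** (a + b * x)

print(round(10 ** a, 4), round(10 ** b, 4))   # recovers roughly 3 and 2
print(round(predict(5), 4))                   # roughly 96
```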
What happens to the correlation coefficient if all of the points change in the same way?
It DOES NOT CHANGE!
If you add 90 to all the x-values, multiply by 56 to all the y-values, then switch the x- and y-values, r will stay the same!
Can points be spread differently if they have the same regression line?
Yes! Absolutely!
Addition Rule
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) aids in computing the chances of one of several events occurring at a given time.
Alpha (α)
The probability of a Type I error. See significance level.
Alternative Hypothesis
The hypothesis stating what the researcher is seeking evidence of. A statement of inequality. It can be written looking for the difference or change in one direction from the null hypothesis or both.
Association
Relationship between or among variables.
Back-Transform
The process by which values are substituted into a model of transformed data, and then reversing the transforming process to obtain the predicted value or model for nontransformed data.
Bar Chart
A graphical display used with categorical data, where frequencies for each category are shown in vertical bars.
Bell-Shaped
Often used to describe the normal distribution. See mound-shaped.
Beta (β)
The probability of a Type II error. See power.
Bias
The term for systematic deviation from the truth (parameter), caused by systematically favoring some outcomes over others.
Biased
A sampling method is biased if it tends to produce samples that do not represent the population.
Bimodal
A distribution with two clear peaks.
Binomial Distribution
The probability distribution of a binomial random variable.
Binomial Random Variable
A random variable x (a) that has a fixed number of trials of a random phenomenon n, (b) that has only two possible outcomes on each trial, (c) for which the probability of a success is constant for each trial, and (d) for which each trial is independent of other trials.
Bins
The intervals that define the "bars" of a histogram.
Bivariate Data
Consists of two variables, an explanatory and a response variable, usually quantitative.
Blinding
Practice of denying knowledge to subjects about which treatment is imposed upon them.
Blocks
Subgroups of the experimental units that are separated by some characteristic before treatments are assigned because they may respond differently to the treatments.
Box-And-Whisker Plot/Boxplot
A graphical display of the five-number summary of a set of data, which also shows outliers.
Categorical Variable
A variable recorded as labels, names, or other non-numerical outcomes.
Census
A study that observes, or attempts to observe, every individual in a population.
Central Limit Theorem
As the size n of a simple random sample increases, the shape of the sampling distribution of x̄ tends toward being normally distributed.
Chance Device
A mechanism used to determine random outcomes.
Cluster Sample
A sample in which a simple random sample of heterogeneous subgroups of a population is selected.
Clusters
Heterogeneous subgroups of a population.
Coefficient of Determination (r²)
Percent of variation in the response variable explained by its linear relationship with the explanatory variable.
Complement
The complement of an event is that event not occurring.
Completely Randomized Design
One in which all experimental units are assigned treatments solely by chance.
Conditional Distribution
See conditional frequencies.
Conditional Frequencies
Relative frequencies for each cell in a two-way table relative to one variable.
Conditional Probability
The probability of an event occurring given that another has occurred. The probability of A given that B has occurred is denoted as P(A|B).
Confidence Intervals
Give an estimated range that is likely to contain an unknown population parameter.
Confidence Level
The level of certainty that a population parameter exists in the calculated confidence interval.
Confounding
The situation where the effects of two or more explanatory variables on the response variable cannot be separated.
Confounding Variable
A variable whose effect on the response variable cannot be untangled from the effects of the treatment.
Contingency Table
See two-way table.
Continuous Random Variables
Those typically found by measuring, such as heights or temperatures.
Control Group
A baseline group that may be given no treatment, a faux treatment like a placebo, or an accepted treatment that is to be compared to another.
Control
The principle that potential sources of variation due to variables not under consideration must be reduced.
Convenience Sample
Composed of individuals who are easily accessed or contacted.
Correlation Coefficient (r)
A measure of the strength of a linear relationship,
r=(1/(n-1))Σ((xi-x̄)/sx)((yi-ȳ)/sy).
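The formula can be sketched directly: standardize each x and each y, multiply the pairs, add them up, and divide by n - 1. The data below are made up for illustration.

```python
import statistics

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample SDs (divide by n - 1)

# r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

print(f"r = {r:.4f}")
```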
Critical Value
The value that the test statistic must exceed in order to reject the null hypothesis. When computing a confidence interval, the value of t* (or z*) where ±t* (or ±z*) bounds the central C% of the t (or z) distribution.
Cumulative Frequency
The sums of the frequencies of the data values from smallest to largest.
Data Set
Collection of observations from a sample or population.
Dependent Events
Two events are called dependent when they are related and the fact that one event has occurred changes the probability that the second event occurs.
Discrete Random Variables
Those usually obtained by counting.
Disjoint Events
Events that cannot occur simultaneously.
Distribution
Frequencies of values in a data set.
Dotplot
A graphical display used with univariate data. Each data point is shown as a dot located above its numerical value on the horizontal axis.
Double-Blind
When both the subjects and data gatherers are ignorant about which treatment a subject received.
Empirical Rule (68-95-99.7 Rule)
Gives benchmarks for understanding how probability is distributed under a normal curve. In the normal distribution, about 68% of the observations are within one standard deviation of the mean, about 95% are within two standard deviations of the mean, and about 99.7% are within three standard deviations of the mean.
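The three benchmarks can be checked numerically. This sketch uses the standard normal curve and Python's `math.erf`; the helper name `prob_within` is an assumption for illustration.

```python
import math

def prob_within(k):
    """P(-k < Z < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} SD of the mean: {p:.4f}")
```

The results are roughly 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.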
Estimation
The process of determining the value of a population parameter from a sample statistic.
Expected Value
The mean of a probability distribution.
Experiment
A study where the researcher deliberately influences individuals by imposing conditions and determining the individuals' responses to those conditions.
Experimental Units
Individuals (a person, a plot of land, a machine, or any single material unit) in an experiment.
Explanatory Variable
Explains the response variable, sometimes known as the treatment variable.
Exponential Model
A model of the form y = abˣ.
Extrapolation
Using a model to predict values far outside the range of the explanatory variable, which is prone to creating unreasonable predictions.
Factors
One or more explanatory variables in an experiment.
First Quartile
Symbolized Q1, represents the median of the lower 50% of a data set.
Five-Number Summary
The minimum, first quartile (Q1), median, third quartile (Q3), and maximum values in a data set.
Frequency Table
A display organizing categorical or numerical data and how often each occurs.
Geometric Distribution
The probability distribution of a geometric random variable X. All possible outcomes of X before the first success is seen and their associated probabilities.
Geometric Random Variable
A random variable X (a) that has two possible outcomes of each trial, (b) for which the probability of a success is constant for each trial, and (c) for which each trial is independent of the other trials.
Graphical Display
A visual representation of a distribution.
Histogram
Used with univariate data, frequencies are shown on the vertical axis, and intervals or bins define the values on the horizontal axis.
Independent Events
Two events are called independent when knowing that one event has occurred does not change the probability that the second event occurs.
Independent Random Variables
If the values of one random variable have no association with the values of another, the two variables are called independent random variables.
Influential Point
An extreme value whose removal would drastically change the slope of the least-squares regression model.
Interquartile Range
Describes the spread of middle 50% of a data set, IQR = Q3 - Q1.
Joint Distribution
See joint frequencies.
Joint Frequencies
Frequencies for each cell in a two-way table relative to the total number of data.
Law of Large Numbers
The long-term relative frequency of an event gets closer to the true relative frequency as the number of trials of a random phenomenon increases.
Least-Squares Regression Line (LSRL)
The "best-fit" line that is calculated by minimizing the sum of the squares of the differences between the observed and predicted values of the response variable. The LSRL has the equation ŷ = b₀ + b₁x.
Levels
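A minimal sketch of computing the line by hand, with made-up data. It assumes the standard least-squares formulas: slope = Σ(xi-x̄)(yi-ȳ) / Σ(xi-x̄)², and an intercept chosen so the line passes through (x̄, ȳ).

```python
# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
# Intercept: the line passes through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed minus predicted
sum_sq_residuals = sum(e ** 2 for e in residuals)           # the quantity being minimized

print(f"y-hat = {b0:.2f} + {b1:.2f}x, sum of squared residuals = {sum_sq_residuals:.2f}")
```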
The different quantities or categories of a factor in an experiment.
Linear Regression
A method of finding the best model for a linear relationship between the explanatory and response variable.
Logarithmic Transformation
Procedure that changes a variable by taking the logarithm of each of its values.
Lurking Variable
A variable that has an effect on the outcome of a study but was not part of the investigation.
Margin of Error
A range of values to the left and right of a point estimate.
Marginal Distribution
See marginal frequencies.
Marginal Frequencies
Row totals and column totals in a two-way table.
Matched-Pairs Design
The design of a study where experimental units are naturally paired by a common characteristic, or with themselves in a before-after type of study.
Maximum
The largest numerical value in a data set.
Mean
The arithmetic average of a data set; the sum of all the values divided by the number of values, x̄ = (Σxi)/n.
Mean of a Binomial Random Variable X
μx = np.
Mean of a Discrete Random Variable
μx = Σ from i=1 to n of xiP(xi).
Mean of a Geometric Random Variable
μx=1/p.
Measures of Center
These locate the middle of a distribution. The mean and median are measures of center.
Median
The middle value of a data set; the equal areas point, where 50% of the data are at or below this value, and 50% of the data are at or above this value.
Minimum
The smallest numerical value in a data set.
Mound-Shaped
Resembles a hill or mound; a distribution that is symmetric and unimodal.
Multiplication Rule
P(A ∩ B) = P(A) * P(B|A) is used when we are interested in the probability of two events occurring simultaneously, or in succession.
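A worked example of the multiplication rule for events in succession: drawing two aces in a row from a standard 52-card deck without replacement. The scenario is chosen only for illustration.

```python
from fractions import Fraction

# A = first card is an ace; B = second card is an ace.
p_first_ace = Fraction(4, 52)                 # P(A): 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)    # P(B|A): 3 aces left among 51 cards

p_both_aces = p_first_ace * p_second_ace_given_first   # P(A ∩ B)

print(p_both_aces)
```

The second factor is a conditional probability because the first draw changes what is left in the deck.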
Multistage Sample
A sample resulting from multiple applications of cluster, stratified, and/or simple random sampling.
Mutually Exclusive Events
See disjoint events.
Nonresponse Bias
The situation where an individual selected to be in the sample is unwilling, or unable, to provide data.
Normal Distribution
A continuous probability distribution that appears in many situations, both natural and man-made. It has a bell-shape and the area under the normal density curve is always equal to 1.
Null Hypothesis
The hypothesis of no difference, no change, and no association. A statement of equality, usually written in the form H₀: parameter = hypothesized value.
Observational Study
Attempts to determine relationships between variables, but the researcher imposes no conditions as in an experiment.
Observed Values
Actual outcomes or data from a study or an experiment.
One-Way Table
A frequency table of one variable.
Outlier
An extreme value in a data set. Quantified by being less than Q1 - 1.5 × IQR or more than Q3 + 1.5 × IQR.
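A sketch of the 1.5 × IQR outlier check with made-up data. It assumes the quartile convention used in this set (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when n is odd); calculators and software may use slightly different conventions.

```python
import statistics

data = sorted([2, 5, 6, 7, 8, 9, 10, 12, 30])  # hypothetical data set
n = len(data)
half = n // 2

# Split into halves, excluding the middle value when n is odd.
lower_half = data[:half]
upper_half = data[half + 1:] if n % 2 else data[half:]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, outliers = {outliers}")
```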
Percentiles
Divide the data set into 100 equal parts. An observation at the Pth percentile is higher than P percent of all observations.
Placebo
A faux treatment given in an experiment that resembles the real treatment under consideration.
Placebo Effect
A phenomenon where subjects show a response to a treatment merely because the treatment is imposed regardless of its actual effect.
Point Estimate
An approximate value that has been calculated for the unknown parameter.
Population
The collection of all individuals under consideration in a study.
Population Parameter
A characteristic or measure of a population.
Position
Location of a data value relative to the population.
Power
The probability of correctly rejecting the null hypothesis when it is in fact false. Equal to 1 - β. See beta and Type II error.
Power Model
A function in the form of y = axᵇ.
Predicted Value
The value of the response variable predicted by a model for a given explanatory variable.
Probability
Describes the chance that a certain outcome of a random phenomenon will occur.
Probability Distribution
The probability distribution of a discrete random variable X is a function of all n possible outcomes of the random variable (xi) and their associated probabilities P(xi).
Probability Sample
Composed of individuals selected by chance.
P-Value
The probability of observing a test statistic as extreme as, or more extreme than, the statistic obtained from a sample, under the assumption that the null hypothesis is true.
Quantitative
A variable whose values are counts or measurements.
Random Digit Table
A chance device that is used to select experimental units or conduct simulations.
Random Phenomena
Those outcomes that are unpredictable in the short term, but nevertheless, have a long-term pattern.
Random Sample
A sample composed of individuals selected by chance.
Random Variables
Numerical outcome of a random phenomenon.
Randomization
The process by which treatments are assigned by a chance mechanism to the experimental units.
Randomized Block Design
First, units are sorted into subgroups or blocks, and then treatments are randomly assigned within the blocks.
Range
Calculated as the maximum value minus the minimum value in a data set.
Relative Frequency
Percentage or proportion of the whole number of data.
Replication
The practice of reducing chance variation by assigning each treatment to many experimental units.
Residual
Observed value minus predicted value of the response variable.
Response Bias
Because of the manner in which an interview is conducted, because of the phrasing of questions, or because of the attitude of the respondent, inaccurate data are collected.
Response Variable
Measures the outcomes that have been observed.
Sample
A selected subset of a population from which data are gathered.
Sample Statistic
Result of a sample used to estimate a parameter.
Sample Survey
A study that collects information from a sample of a population in order to determine one or more characteristics of the population.
Sampling Distribution
The probability distribution of a sample statistic when a sample is drawn from a population.
Sampling Distribution of the Sample Mean (x̄)
The distribution of sample means from all possible simple random samples of size n taken from a population.
Sampling Distribution of a Sample Proportion p̂
The distribution of sample proportions from all possible simple random samples of size n taken from a population.
Sampling Error
See sampling variability.
Sampling Variability
Natural variability due to the sampling process. Each possible random sample from a population will generate a different sample statistic.
Scatterplots
Used to visualize bivariate data. The explanatory variable is shown on the horizontal axis and the response variable is shown on the vertical axis.
Significance Level
The probability of a Type I error. A benchmark against which the P-value is compared to determine if the null hypothesis will be rejected. See also alpha.
Simple Random Sample (SRS)
A sample where n individuals are selected from a population in a way that every possible combination of n individuals is equally likely.
Simulation
A method of modeling chance behavior that accurately mimics the situation being considered.
Skewed
A unimodal, asymmetric distribution that tends to slant: most of the data are clustered on one side of the distribution, which "tails" off on the other side.
Standard Deviation of a Binomial Random Variable X
σₓ=√(np(1-p)).
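A check that the binomial shortcut formulas (μx = np and σₓ = √(np(1-p))) agree with the general definitions for a discrete random variable. The values n = 20 and p = 0.3 are assumptions chosen for illustration.

```python
import math

n, p = 20, 0.3  # hypothetical number of trials and success probability

# Binomial probabilities P(X = k) for k = 0..n.
pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# General definitions: mean and variance of a discrete random variable.
mean_from_pmf = sum(k * pk for k, pk in enumerate(pmf))
var_from_pmf = sum((k - mean_from_pmf) ** 2 * pk for k, pk in enumerate(pmf))
sd_from_pmf = math.sqrt(var_from_pmf)

# Binomial shortcut formulas.
mean_formula = n * p
sd_formula = math.sqrt(n * p * (1 - p))

print(mean_from_pmf, mean_formula)
print(sd_from_pmf, sd_formula)
```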
Standard Deviation of a Discrete Random Variable X
σₓ=√(σ²ₓ).
Standard Deviation
Used to measure variability of a data set. It is calculated as the square root of the variance of a set of data,
s = √(Σ(xi-x̄)²/(n-1)).
Standard Error
An estimate of the standard deviation of the sampling distribution of a statistic.
Standard Normal Probabilities
The probabilities calculated from values of the standard normal distribution.
Standardized Score
The number of standard deviations an observation lies from the mean,
z = (observation - mean) / (standard deviation).
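A short sketch of the standardized score formula. The test scores, mean of 70, and standard deviation of 10 are made-up values for illustration.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations an observation lies from the mean."""
    return (observation - mean) / standard_deviation

z_high = z_score(85, 70, 10)   # above the mean, so z is positive
z_low = z_score(62, 70, 10)    # below the mean, so z is negative

print(z_high, z_low)
```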
Statistically Significant
When a sample statistic is shown to be far from a hypothesized parameter. When the P-value is less than the significance level.
Stemplot
Also called a stem-and-leaf plot. Data are separated into a stem and leaf by place value and organized in the form of a histogram.
Strata
Subgroups of a population that are similar or homogeneous.
Stratification
Part of the sampling process where units of the study are separated into strata.
Stratified Random Sample
A sample in which simple random samples are selected from each of several homogeneous subgroups of the population, known as strata.
Subjects
Individuals in an experiment that are people.
Symmetric
The distribution that resembles a mirror image on either side of the center.
Systematic Random Sample
A sample where every kth individual is selected from a list or queue.
Test Statistic
The number of standard deviations (standard errors) that a sample statistic lies from a hypothesized population parameter.
Third Quartile
Symbolized Q3, represents the median of the upper 50% of a data set.
Transformation
Changing the values of a data set using a mathematical operation.
Treatments
Combinations of different levels of the factors in an experiment.
Two-Way Table
A frequency table that displays two categorical variables.
Type I Error
Rejecting a null hypothesis when it is in fact true.
Type II Error
Failing to reject a null hypothesis when it is in fact false.
Undercoverage
When some individuals of a population are not included in the sampling process.
Uniform
All data values in the distribution have similar frequencies.
Unimodal
A distribution with a single, clearly defined, peak.
Univariate
One-variable data.
Variables
Characteristics of the individuals under study.
Variability
The spread in a data set.
Variance
Used to measure variability, the average of the squared deviations from the mean,
s²ₓ = Σ(xi-x̄)²/(n-1).
Variance of a Binomial Random Variable X
σ²ₓ = np(1-p).
Variance of a Discrete Random Variable X
σ²ₓ = Σ from i=1 to n of (xi-μₓ)²·P(xi).
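A worked example of the mean, variance, and standard deviation of a discrete random variable. The probability distribution below is made up for illustration.

```python
import math

# Hypothetical probability distribution of X.
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]   # must sum to 1

# Mean: each value weighted by its probability.
mean = sum(x * p for x, p in zip(values, probabilities))

# Variance: squared deviations from the mean, weighted by probability.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probabilities))

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {sd:.2f}")
```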
Venn Diagram
Graphical representation of sets or outcomes and how they intersect.
Voluntary Response Bias
Bias due to the manner in which people choose to respond to voluntary surveys.
Voluntary Response Sample
Composed of individuals who choose to respond to a survey because of interest in the subject.
Z-Score
See standardized score.