histogram x-axis has
bins
histogram y-axis has
frequency
a histogram bin will contain
values greater than the previous bin's upper limit and less than or equal to its own upper limit
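As a rough illustration of the bin rule above, here is a minimal Python sketch. The bin limits and values are made up, and the extra overflow slot for values above the last limit is an assumption of this sketch, not something stated in the cards.

```python
from bisect import bisect_left

# Hypothetical bin upper limits and data values, chosen only for illustration.
bin_limits = [10, 20, 30]
values = [5, 10, 11, 20, 30]

# counts[i] holds the values v with bin_limits[i-1] < v <= bin_limits[i].
# The extra last slot catches anything above the final limit.
counts = [0] * (len(bin_limits) + 1)
for v in values:
    # bisect_left finds the first limit that is >= v, i.e. the bin v falls in.
    counts[bisect_left(bin_limits, v)] += 1

print(counts)  # [2, 2, 1, 0]
```

Note that 10 lands in the first bin and 20 in the second, because each bin includes its own upper limit.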
Skewness
measures the degree of a graph's asymmetry
right-tailed histogram
right tail is longer
left-tailed histogram
left tail is longer
outliers
Data points that fall far from the rest of the data
things to do with an outlier
• leave it as is,
• change it to a corrected value,
• or, very rarely, remove it from the data set.
Q1
the 25th percentile (25% of all observations fall below Q1)
Q3
the 75th percentile (75% of all observations fall below Q3)
IQR
interquartile range IQR=Q3-Q1
Is an outlier if...
A data point is less than Q1-1.5(IQR) or greater than Q3+1.5(IQR).
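The outlier rule can be sketched in Python as follows. The data set is invented, and the use of `statistics.quantiles` with `method="inclusive"` is a choice made for this sketch; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 40]  # made-up sample with one extreme value

# n=4 returns the three quartile cut points; the "inclusive" method treats
# the smallest and largest data points as the 0th and 100th percentiles.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 3.0 7.0 4.0
print(lower_fence, upper_fence)  # -3.0 13.0
print(outliers)                  # [40]
```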
Before gathering and analyzing data
clearly articulate the question we wish to answer
descriptive statistics
describe the data in a more concise way, with just one or two numbers
Mean
average (x̄, an x with a bar over it, for a sample; the Greek letter μ for a population)
Median
The middle value of a data set.
Mode
The value that occurs most frequently in a data set
bimodal
A distribution with two clearly defined peaks
EXCEL Median
=MEDIAN(number 1, [number 2], ...)
EXCEL Mean
=AVERAGE(number 1, [number 2], ...)
EXCEL Mode
=MODE.SNGL(number 1, [number 2], ...)
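For comparison with the three Excel functions above, a minimal Python sketch using the standard `statistics` module. The data values are made up for illustration.

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 7]  # hypothetical observations

data_mean = mean(data)      # sum of the values divided by their count
data_median = median(data)  # middle value of the sorted data
data_mode = mode(data)      # most frequently occurring value

print(data_mean, data_median, data_mode)  # 3 2 2
```

The single large value 7 pulls the mean above the median, which is typical of right-tailed data.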
conditional mean
The mean of a specific subset of data
EXCEL conditional mean
=AVERAGEIF(range, criteria, [average_range])
• range contains the one or more cells to which we want to apply the criteria or condition.
• criteria is the condition that is to be applied to the range.
• [average_range] is the range of cells containing the data we wish to average.
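A conditional mean can be sketched in Python as below. The two lists stand in for the `range` and `average_range` arguments; the survey answers and amounts are invented for this example.

```python
from statistics import mean

# Hypothetical survey: one answer and one spending amount per respondent.
answers = ["yes", "no", "no", "yes"]  # plays the role of `range`
spend = [10, 20, 40, 30]              # plays the role of `average_range`

# Keep only the amounts whose matching answer meets the criterion "no".
no_spend = [s for a, s in zip(answers, spend) if a == "no"]
conditional_mean = mean(no_spend)

print(conditional_mean)  # 30
```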
Percentile
the value beneath which a certain percentage of the data lie
EXCEL Percentile
=PERCENTILE.INC(array, k)
• array is the range of data for which we want to calculate a given percentile.
• k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.
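One common way to compute such a percentile is linear interpolation between the sorted values. The sketch below assumes that approach, with the minimum as the 0th percentile and the maximum as the 100th; the function name `percentile_inc` is hypothetical and is not claimed to reproduce Excel's behavior in every edge case.

```python
import math

def percentile_inc(data, k):
    """Return the k-th percentile (0 <= k <= 1) by linear interpolation."""
    if not data or not 0 <= k <= 1:
        raise ValueError("data must be non-empty and k must be in [0, 1]")
    ordered = sorted(data)
    rank = k * (len(ordered) - 1)  # fractional position in the sorted data
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

print(percentile_inc([1, 2, 3, 4], 0.25))    # 1.75
print(percentile_inc([1, 2, 3, 4, 5], 0.5))  # 3.0
```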
range
the simplest measure of variability, or spread
Range=Maximum value-Minimum value
why variance
To gain more insight into the spread of the distribution and how the data behave between the two extremes of a range
the variance measures
how far the data points lie from the mean, based on their squared deviations
why standard deviation
To convert the variance to have the same units as the data points, we take the square root of the variance
A small standard deviation indicates
that the data points are close to the mean.
A large standard deviation indicates
a broader spread.
EXCEL variance
=VAR.S(number 1, [number 2], ...)
• number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
• [number 2],... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional.
EXCEL standard deviation
=STDEV.S(number 1, [number 2], ...)
• number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
• [number 2],... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional.
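The sample variance and standard deviation can be sketched in Python with the standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 (the sample versions). The data values are made up.

```python
import math
from statistics import variance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample; its mean is 5

# Squared deviations from the mean sum to 32, and n - 1 = 7.
sample_variance = variance(data)
sample_sd = stdev(data)

print(round(sample_variance, 4))  # 4.5714
print(math.isclose(sample_sd, math.sqrt(sample_variance)))  # True
```

The last line checks the relationship the cards describe: the standard deviation is the square root of the variance.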
Why use coefficient of variation (CV)
To compare variation in two data sets
what is the coefficient of variation
the ratio of the standard deviation to the mean
CV equation
CV=Standard Dev./mean
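A small Python sketch of the CV formula, comparing two invented data sets that have the same standard deviation but different means. The function name and the data are assumptions made for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return stdev(data) / mean(data)

prices_a = [10, 20, 30]     # hypothetical: sd 10, mean 20
prices_b = [100, 110, 120]  # hypothetical: sd 10, mean 110

cv_a = coefficient_of_variation(prices_a)
cv_b = coefficient_of_variation(prices_b)

print(round(cv_a, 3), round(cv_b, 3))  # 0.5 0.091
```

Both sets have the same spread in absolute terms, but relative to its mean the first set varies far more.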
60% of the observations are
less than or equal to the 60th percentile
The standard deviation is equal to
the square root of the variance
EXCEL square root
=SQRT(number)
EXCEL Minimum
=MIN(number 1, [number 2], ...)
EXCEL Maximum
=MAX(number 1, [number 2], ...)
Why scatter plot
To visualize the relationship between two variables
Why use the correlation coefficient
to measure the strength of the linear relationship between two variables
correlation coefficient scale
on a scale from -1 to +1. Even when the correlation coefficient is 0, a relationship between two variables might exist—just not a linear one.
correlation calculation is
strongly influenced by outliers.
EXCEL correlation calculation
=CORREL(array 1, array 2)
• array 1 is a set of numerical variables or cell references containing data for one variable of interest.
• array 2 is a set of numerical variables or cell references containing data for the other variable of interest.
• Note that the number of observations in array 1 must be equal to the number in array 2.
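The correlation coefficient can be sketched by hand in Python as below. The function name `correl` simply mirrors the Excel function and the data are made up. The second example shows a perfect but nonlinear relationship whose correlation is 0.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("both arrays must have the same number of observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

linear = correl([1, 2, 3, 4], [2, 4, 6, 8])          # y = 2x
curved = correl([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared

print(linear, curved)  # 1.0 0.0
```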
Time Series
Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years).
Cross-Sectional
Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period.
numerical properties of a population
are called parameters
numerical properties of a sample
are called statistics
population mean symbol
µ
population standard deviation symbol
σ
sample mean symbol
x with a bar over it
sample standard deviation symbol
s
EXCEL assign Random number
=RAND()
normal distribution is used
to help us create a range around the sample mean that is very likely to contain the true population mean.
the probability that a normally distributed value falls below its mean is
always 50%
normal distribution curve width depends
solely on the distribution's standard deviation.
About 68% of the probability is contained in the range reaching
one standard deviation away from the mean on either side
P(μ−σ≤x≤μ+σ)≈68%
About 95% of the probability is contained in the range reaching
two standard deviations (1.96 to be exact) away from the mean on either side:
P(μ−2σ≤x≤μ+2σ)≈95%
About 99.7% of the probability is contained in the range reaching
three standard deviations away from the mean on either side:
P(μ−3σ≤x≤μ+3σ)≈99.7%
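The three rules of thumb above can be checked numerically with Python's standard `statistics.NormalDist`. The mean of 100 and standard deviation of 15 are arbitrary values chosen for this sketch; any normal distribution gives the same probabilities.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

def prob_within(k):
    """Probability of falling within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

within_1 = prob_within(1)       # about 0.683
within_2 = prob_within(2)       # about 0.954
within_196 = prob_within(1.96)  # about 0.950
within_3 = prob_within(3)       # about 0.997

print(round(within_1, 3), round(within_2, 3), round(within_3, 3))
```

Exactly two standard deviations cover slightly more than 95%, which is why 1.96 is quoted as the precise value.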
The z-values for any normal distribution can be calculated using the formula,
z=(x−μ)/σ
z-value means
standardized value
cumulative probability
the probability of being less than a specified value on a normal curve
EXCEL cumulative probability
=NORM.DIST(x, mean, standard_dev, cumulative)
• x is the value at which you want to evaluate the distribution function.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.
• cumulative is an argument that specifies the type of probability we wish to calculate. We insert "TRUE" to indicate that we wish to find the cumulative probability, that is, the probability of being less than or equal to the x-value.
EXCEL returning the z-value
=STANDARDIZE(x, mean, standard_dev)
• x is the value to be standardized.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.
Alternate EXCEL cumulative probability function
=NORM.S.DIST(z, cumulative)
The "S" in this function indicates it applies to a standard normal curve.
z is the value (the z-value) at which we want to evaluate the standard normal distribution function.
cumulative is an argument that specifies the type of probability we wish to calculate. We will insert "TRUE".
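To show how the standardized value ties the two cumulative probability functions together, here is a minimal Python sketch. The mean, standard deviation, and x-value are invented, and `statistics.NormalDist` stands in for the Excel functions.

```python
from statistics import NormalDist

# Hypothetical distribution and value of interest.
mean, standard_dev, x = 100, 15, 130

# Standardize: how many standard deviations x lies above the mean.
z = (x - mean) / standard_dev

# Cumulative probability computed directly from x ...
p_from_x = NormalDist(mean, standard_dev).cdf(x)
# ... and from z on the standard normal curve (mean 0, standard deviation 1).
p_from_z = NormalDist().cdf(z)

print(z)                   # 2.0
print(round(p_from_x, 4))  # 0.9772
print(round(p_from_z, 4))  # 0.9772
```

Both routes give the same probability, which is why a single standard normal curve is enough for any normal distribution.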
The mean is
equal to the sum of all data points in the set divided by the number of data points: x̄=(∑ᵢ₌₁ⁿ xᵢ)/n
the conditional mean
is the mean of a subset of the data
range, variance, and standard deviation
measure the spread of the data.
use the correlation coefficient to
measure the strength of the linear relationship between two variables
when the correlation coefficient is 0
a relationship between two variables might exist
correlation calculation gives more weight to points that are further from the mean, so it is
strongly influenced by outliers.
Time Series:
Time series data contain data about a given subject in temporal order, measured at regular time intervals
The value of the correlation coefficient ranges
between -1 and +1.
A correlation coefficient near zero indicates
a weak or nonexistent linear relationship.
=AVERAGEIF(range, criteria, [average_range])
Returns the conditional mean, or average of the cells in a specified range that meet the given criteria (e.g. the average for the people who answered "no").
=PERCENTILE.INC(array, k)
Returns the k-th percentile of the values in the specified array. For example, if we want to know the 95th percentile for an array of data, k would be 0.95.
What happens to the sample mean and standard deviation as you take new samples of equal size?
The sample mean and standard deviation vary but remain fairly close to the population mean and standard deviation
use the RAND function to generate random numbers between any two specified values
If we wanted to generate random numbers between 0 and 10, we would multiply the function by 10 and enter =RAND()*10. If we wanted numbers between 5 and 15, we would enter =5+RAND()*10.
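The same scaling and shifting idea can be sketched in Python, where `random.random()` plays the role of =RAND() and returns a value from 0 up to (but not including) 1. The seed and the number of draws are arbitrary choices for this example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Scale by 10 for numbers between 0 and 10.
zero_to_ten = [random.random() * 10 for _ in range(5)]

# Scale by the width (15 - 5 = 10), then shift by the lower bound 5.
five_to_fifteen = [5 + random.random() * 10 for _ in range(5)]

print(zero_to_ten)
print(five_to_fifteen)
```

In general, a value between a and b comes from a + (b - a) times the random number.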
What happens to the sample mean and standard deviation as you increase the sample size?
The sample mean and standard deviation generally become closer to the population mean and standard deviation
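A small simulation can illustrate this tendency. The sketch below assumes an artificial population of uniform random numbers between 0 and 10; the population, seed, and sample sizes are all invented, and any single run may deviate from the general pattern.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 values between 0 and 10.
population = [random.uniform(0, 10) for _ in range(100_000)]
population_mean = mean(population)
population_sd = pstdev(population)  # population standard deviation

# Draw samples of increasing size and record each sample's mean and sd.
results = {}
for n in (10, 100, 10_000):
    sample = random.sample(population, n)
    results[n] = (mean(sample), stdev(sample))

print("population", round(population_mean, 3), round(population_sd, 3))
for n, (m, s) in results.items():
    print(n, round(m, 3), round(s, 3))
```

With larger samples, the printed mean and standard deviation usually sit closer to the population values.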
a normal distribution is symmetric
so its mean and median are the same.
how wide or narrow a normal curve is
depends solely on the distribution's standard deviation.
