78 terms

# 700 exam

mean
This is the most widely used measure of
location.
It is overly sensitive to extreme values.
It is a poor measure of central location
when there are outliers in the data or if
the data are skewed.
measures of location
•Arithmetic Mean
• Median
• Mode
The sample median is defined as follows:
1. If n is odd then the median is the ((n+1)/2)th item of the ordered sample
2. If n is even then the median is the average of the (n/2)th and (n/2 + 1)th items of the ordered sample
Primary weakness of the median is that it is
determined primarily by the middle values.
Primary strengths include the fact that it is less
sensitive to the actual numerical values of the
remaining data points, and it is insensitive to
extreme values.
mode
Most frequent value: you can have multiple modes in a distribution
of data.
You can also have no mode among all of the
observations in a sample if all values occur
only once.
The primary weakness of the mode is that it is
not a useful measure of central location if
there is a large number of possible values,
each of which occurs infrequently. In most
cases, it is inferior to the arithmetic mean.
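A minimal Python sketch of the three measures of location, using the standard-library `statistics` module. The sample values are made up for illustration; the last one is an outlier, which shows why the mean is called overly sensitive to extreme values while the median is not.

```python
import statistics

# Hypothetical sample; 40 is an outlier.
data = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(data)      # pulled upward by the outlier
median_value = statistics.median(data)  # middle item of the ordered sample
modes = statistics.multimode(data)      # list of all most frequent values

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

Note that `multimode` returns every value when each occurs only once, which the mode card describes as a sample with no mode.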
measures of dispersion
• Range
• Quantiles/Percentiles
• Mean Deviation
• Sample Variance
• Sample Standard Deviation
• Coefficient of Variation
range
Main advantage - It is very easy to
compute.
Major weaknesses - It is very sensitive to
extreme values and it depends on the
sample size.
quantile percentile
Quantiles/Percentiles
Intuitively the pth percentile is the value Vp such that
p percent of the distribution is less than or equal to
Vp.
Fifty percent of the distribution is less than or equal
to the median. The median is the 50th percentile.

The pth percentile is defined as
1. the (k+1)th largest item in the sample (counting up from the smallest) if np/100 is not an
integer (where k is the largest integer less than
np/100).
2. the average of the (np/100)th and the (np/100+1)th
largest items in the sample (counting the same way) if np/100 is an integer.
Percentiles give an overall impression of the shape of the distribution.
Main advantages - They are less
sensitive to outliers and not greatly
affected by sample size.
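The two-case percentile rule can be sketched in Python as follows. This is an illustration only: the function name `percentile` is hypothetical, p is assumed to be an integer with 0 < p < 100, and items are counted from the smallest value upward.

```python
def percentile(sample, p):
    """p-th percentile by the two-case rule (p an integer, 0 < p < 100)."""
    ordered = sorted(sample)
    n = len(ordered)
    # q is the integer part of np/100, r is nonzero when np/100 is not an integer.
    q, r = divmod(n * p, 100)
    if r != 0:
        # np/100 is not an integer: take the (q+1)th item (0-based index q).
        return ordered[q]
    # np/100 is an integer: average the (np/100)th and (np/100 + 1)th items.
    return (ordered[q - 1] + ordered[q]) / 2


data = list(range(1, 11))    # hypothetical sample 1..10
print(percentile(data, 50))  # the median
print(percentile(data, 25))
```

Integer arithmetic with `divmod` is used so that the "is np/100 an integer" check is not affected by floating-point rounding.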
variance
s² = SS/DF = Σ(xᵢ - xbar)² / (n - 1)
coefficient of variation
CV = (standard deviation / mean) × 100%
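A short Python sketch of the sample variance, standard deviation, and coefficient of variation. The data are hypothetical, and the manual sum of squares is compared with `statistics.variance`, which also divides by n - 1.

```python
import statistics
from math import sqrt

data = [4, 8, 6, 5, 7]  # hypothetical sample
n = len(data)
mean = sum(data) / n

# Sum of squared deviations (SS) divided by degrees of freedom (n - 1).
ss = sum((x - mean) ** 2 for x in data)
sample_variance = ss / (n - 1)
sample_sd = sqrt(sample_variance)
cv_percent = sample_sd / mean * 100

print("variance:", sample_variance)
print("library variance:", statistics.variance(data))
print("sd:", round(sample_sd, 4), "CV %:", round(cv_percent, 2))
```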
graphical methods
Bar Graphs
Histograms
Stem and Leaf Plots
Box Plots
Scatterplots
box plot - inputs needed
Graphical technique that can be used to compare the
arithmetic mean and median of a distribution.
It uses the relationship between the median, 25th percentile,
and 75th percentile to describe the skewness of the
distribution.
They can be used to give a feel for the spread of the data
They can be useful in identifying outlying values, or values
that seem inconsistent with the rest of the sample.
population
Population - the set of all possible
outcomes for a random variable
random variable
Random variable - numerical quantity
that takes on different values depending
on chance
event
Event - an outcome or set of outcomes
for a random variable
sample space
Sample space - collection of all possible
outcomes/events
If outcomes A and B are two events that
cannot both happen at the same time,
then P(A or B occurs) =
P(A) + P(B).
AUC
Area under the ROC curve (AUC) - summary of
the overall diagnostic accuracy of the test = probability of
correctly distinguishing a normal from an
abnormal subject.
The accuracy of the test depends on how
well the test separates the group being
tested into those with and without the
disease in question.
Area = 1 represents a perfect test
Area = 0.5 represents a worthless test
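The "probability of correctly distinguishing a normal from an abnormal subject" reading of the AUC can be sketched by comparing every abnormal/normal pair. This is an illustration under stated assumptions: the function name `auc` and the scores are hypothetical, higher scores are assumed to indicate abnormality, and ties count as half.

```python
def auc(normal_scores, abnormal_scores):
    """Fraction of (abnormal, normal) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for a in abnormal_scores:
        for b in normal_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))


perfect = auc([1, 2, 3], [4, 5, 6])        # groups fully separated
worthless = auc([1, 2], [1, 2])            # identical score distributions
partial = auc([1, 2, 3, 4], [3, 5, 6, 7])  # some overlap
print(perfect, worthless, partial)
```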
ROC
Plot of sensitivity vs. 1 - specificity. It shows the tradeoff between sensitivity and
specificity (any increase in sensitivity will be
accompanied by a decrease in specificity).
The closer the curve follows the left hand
border and then the top border of the ROC
space, the more accurate the test.
The closer the curve comes to the 45 degree
diagonal of the ROC space, the less accurate
the test.
PMF
A probability mass function is a
mathematical relationship, or rule, that
assigns to any possible value r of a
discrete random variable X the probability
P(X=r).
discrete expectation
E(X) = μ = Σ xᵢ P(X = xᵢ)
discrete variance
Var(X) = σ² = E(X - μ)² = Σ (xᵢ - μ)² P(X = xᵢ)
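As a sketch, the expectation and variance of a discrete random variable can be computed directly from its probability mass function. The PMF below is a made-up example (number of heads in two fair coin tosses).

```python
# Hypothetical PMF: value -> probability.
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

# E(X) = sum of x * P(X = x)
expectation = sum(x * prob for x, prob in pmf.items())

# Var(X) = sum of (x - mu)^2 * P(X = x)
variance = sum((x - expectation) ** 2 * prob for x, prob in pmf.items())

print("E(X) =", expectation)
print("Var(X) =", variance)
```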
statistics
Similar properties of samples from populations are called
statistics.
parameters
These population values are sometimes called parameters.
They are properties of a population.
cdf
The cumulative distribution function (cdf)
of a discrete random variable X is denoted
by F(x) and is defined by F(x) = P(X ≤ x).
Binomial distribution
The binomial distribution assumes the following:
A fixed number of n trials.
The trials are independent, that is the outcome of any
particular trial is not affected by the outcome of any other
trial.
Only two possible mutually exclusive outcomes
("success" or "failure")
The probability of success (denoted by p) at each trial is
fixed.
The probability of a failure at each trial is 1 - p.
The random variable of interest is the number of successes
within the n trials.
Mean: np. Variance: npq, where q = 1 - p.
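A small Python check of the binomial mean and variance, computed from the probability mass function rather than from the shortcut formulas. The values n = 10 and p = 0.3 are arbitrary, and `binomial_pmf` is a helper written for this sketch.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3  # arbitrary illustration values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("mean:", round(mean, 6), "expected np =", n * p)
print("variance:", round(variance, 6), "expected npq =", n * p * (1 - p))
```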
poisson process
Poisson process - the discrete random
variable of the number of occurrences of an
event in a continuous interval of time or
space.
• Events occur one at a time; two or more events do
not occur precisely at the same moment or location
• The occurrence of an event in a given period is
independent of the occurrence of an event in any
previous or later nonoverlapping period
• The expected number of events (denoted by μ)
during any one period is the same as during any
other period.
Poisson approx for binomial
Another important application of the Poisson
distribution is that under certain conditions it
is a very good approximation to the binomial
distribution where μ = np.
What are these conditions?
- n must be large (at least 100)
- p must be small (no greater than 0.01)
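To see how close the approximation is at the boundary of these conditions, the sketch below compares binomial and Poisson probabilities for n = 100 and p = 0.01 (so μ = np = 1). The helper names are hypothetical.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    return exp(-mu) * mu**k / factorial(k)


n, p = 100, 0.01
mu = n * p  # Poisson parameter matched to the binomial mean

differences = []
for k in range(6):
    b = binomial_pmf(k, n, p)
    q = poisson_pmf(k, mu)
    differences.append(abs(b - q))
    print(k, round(b, 5), round(q, 5))

max_difference = max(differences)
```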
Expectation and variance for Poisson (different formulation)
The Poisson random variable's probability
mass function is characterized by one
parameter, μ = λt, the expected number of
events over time period t.
E(X) = μ = λt
Var(X) = μ = λt
Why is Gaussian used so frequently to analyze data?
Some random variables are Normal, some
approximately Normal, some can be transformed to
approximate Normality, and the sampling distribution of
the mean tends toward Normality even for non-Normal
populations.
properties of std normal distribution
For the standard normal distribution
P(-1 < X < 1) ≈ 0.68
P(-2 < X < 2) ≈ 0.95
P(-2.5 < X < 2.5) ≈ 0.99
symmetry properties of std normal distribution
Φ(-x) = P(X ≤ -x )
= P( X ≥ x)
= 1 - P(X ≤ x)
= 1 - Φ(x).
important percentiles in std normal dist
z0.95 = 1.645
z0.05 = -1.645
z0.975 = 1.96
z0.025 = -1.96
z0.995 = 2.576
z0.005 = -2.576
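The coverage, symmetry, and percentile values on the standard normal cards can be checked with `statistics.NormalDist` from the Python standard library. This is only a numerical check; the value x = 1.5 is arbitrary.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Symmetry: Phi(-x) = 1 - Phi(x)
x = 1.5
left_tail = std_normal.cdf(-x)
right_tail = 1 - std_normal.cdf(x)

# Upper percentiles; the lower ones are their negatives by symmetry.
percentiles = {q: std_normal.inv_cdf(q) for q in (0.95, 0.975, 0.995)}

# Probability within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 2.5)}

print(left_tail, right_tail)
print(percentiles)
print(coverage)
```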
when is a continuity correction used for the ~N approx for the binomial?
The normal approximation can be used when a) np ≥ 5 and
b) n(1-p) ≥ 5; the continuity correction then expands each end point by 0.5
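A sketch of the continuity correction in Python, comparing the exact binomial probability P(15 ≤ X ≤ 25) with the normal approximation with and without the 0.5 adjustment. The values n = 50 and p = 0.4 are arbitrary choices that satisfy both conditions.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.4      # np = 20 and n(1-p) = 30, both at least 5
low, high = 15, 25  # want P(low <= X <= high)

exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(low, high + 1))

# Normal distribution with the binomial mean np and variance npq.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
corrected = approx.cdf(high + 0.5) - approx.cdf(low - 0.5)  # end points expanded by 0.5
uncorrected = approx.cdf(high) - approx.cdf(low)

print("exact:", round(exact, 4))
print("with correction:", round(corrected, 4))
print("without correction:", round(uncorrected, 4))
```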
when is ~N approx for poisson used?
This approximation is used for μ ≥ 10. Use continuity correction
Statistical inference
Statistical inference is inference about a
population from a random sample drawn
from it or, more generally, about a random
process from its observed behavior during a
finite period of time. It includes:
• estimation: point estimation or interval estimation
• hypothesis testing
• prediction
sampling distribution.
The probability distribution that characterizes
some aspect of the sampling variability,
usually the mean but not always, is called a
sampling distribution.
Sampling distributions allow us to make inferences about
parameters without measuring every object in
the population.
Randomness
Randomness is a lack of order, purpose,
cause, or predictability. A random process is
a repeating process whose outcomes follow
no describable deterministic pattern, but follow
a probability distribution.
The mean of the sampling distribution of the sample mean
is the population mean µ, so the sample mean is an unbiased estimator of the population mean
true or false: there can be many unbiased estimators of a
parameter.
true
derive the standard error of the mean
SE = sqrt(Var(xbar)). For independent observations, Var(xbar) = Var((1/n) Σ xᵢ) = (1/n²) × nσ² = σ²/n, so SE = σ/√n
CLT
If n is sufficiently large, then the distribution of the sample mean xbar is
approximately normal, i.e.
xbar ∼ N(μ, σ²/n), even if the underlying population from which you
are sampling is not normal.
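A small simulation sketch of the central limit theorem. It draws repeated samples from a skewed exponential population (mean 1, standard deviation 1) and looks at the mean and spread of the sample means. The sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

n = 40       # size of each sample
reps = 2000  # number of samples drawn

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1 / sqrt(n)  # sigma / sqrt(n) with sigma = 1

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:", round(sd_of_means, 3))
print("sigma / sqrt(n):", round(theoretical_se, 3))
```

The mean of the sample means should sit close to the population mean, and their standard deviation close to σ/√n, even though the population itself is far from normal.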
1. We should not say that the probability is 95% that the
true population mean falls in any particular
confidence interval.
2. Any individual confidence interval may or may not
contain the true population mean.
3. We can say that over a large number of exercises
(draw a sample then compute a 95% confidence
interval) 95% of intervals computed in this way will
contain the true population mean.
three factors that affect the length of
the confidence interval
Length = 2 × z(1-α/2) × σ/√n
1. The sample size n
(as n increases the length decreases).
2. The standard deviation σ
(as σ increases the length increases).
3. The confidence coefficient 100% x (1-α)
(as the confidence coefficient increases the length
increases).
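The three factors can be seen by computing the length of a two-sided confidence interval for the mean with known σ, which is 2 × z(1-α/2) × σ/√n. The function name `ci_length` and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_length(sigma, n, confidence=0.95):
    """Length of a two-sided CI for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.96 for 95%
    return 2 * z * sigma / sqrt(n)


base = ci_length(sigma=10, n=100)
larger_n = ci_length(sigma=10, n=400)                            # shorter interval
larger_sigma = ci_length(sigma=20, n=100)                        # longer interval
higher_confidence = ci_length(sigma=10, n=100, confidence=0.99)  # longer interval

print(round(base, 3), round(larger_n, 3), round(larger_sigma, 3), round(higher_confidence, 3))
```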
If population variance is unknown:
t = (xbar - µ)/(s/√n)
follows a t distribution with n-1 DF
how are normal and T distributions different
t has longer tails and a lower peak, depending on df; df=1 has the lowest peak and the longest tails. R code:
# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
  lines(x, dt(x, degf[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
       legend=labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
Note that as the degrees of freedom increase, the t distribution
looks more like the unit normal distribution.
The sample variance s² is an unbiased estimator of
the population variance σ².
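What distribution is used to characterize the sampling distribution of the sample variance?
Chi-squared with n-1 degrees of freedom: (n-1)s²/σ² ∼ χ²(n-1). A chi-squared variable with k degrees of freedom is a sum of k squared independent standard normal variables.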
Can solve the inequality for population variance to get a CI
Caution: For this to be valid the underlying
distribution must be normal. This is
different from what we said regarding the
confidence intervals for the population
mean.
Like the t distribution, the chi-square distribution is a
family of distributions that depend on the degrees of
freedom.
For n ≥ 3 the chi-squared distribution has a
mode greater than zero and is skewed to the
right.
So why not make α very, very small?
This may be the solution in some cases;
however, for a fixed sample size, a reduction in the α level
increases the probability of a Type II error.
define type I and II errors
Type I error: H0 is true but is rejected. α = P(reject H0 | H0 true)
Type II error: H0 is false but is not rejected. β = P(do not reject H0 | H0 false)
Power: 1 - β = P(reject H0 | H0 false) =
P(reject H0 | H1 true)
why is null hypothesis never proven?
Because we have type I and II error and one is
potentially possible in any decision, we NEVER say
that we have proved that H0 is true or that H0 is
false.
two definitions for p value
The p-value is the probability of observing
something as extreme or even more extreme
given that the null hypothesis is true.
• It is also described as the smallest
significance level at which the current test
would result in rejection.
• Another way to define the p-value is that it is
the significance level at which we would be
indifferent about rejecting or failing to reject
the null hypothesis.
The investigator must distinguish between
results that are statistically significant and
results that are clinically significant.
• Statistical significance does not imply clinical
significance.
given α and the true population mean, and xbar, calculate β
see slide 7/17
So what happens when we
don't know the population
variance?
use t distribution and s²
Factors influencing Sample Size:
1. The sample size increases as σ² increases.
2. The sample size increases as the significance
level is made smaller (α decreases).
3. The sample size increases as the probability
of the type II error (β) decreases (or
equivalently as the power 1 - β increases).
4. The sample size increases as the absolute
value of the difference between the null and
the alternative means decreases.
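The four factors can be illustrated with the standard sample size formula for a one-sided, one-sample z test with known σ: n = σ²(z(1-α) + z(1-β))² / (µ0 - µ1)². This formula is a common textbook form and is an assumption here, since the cards do not state it; the function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Assumed formula: one-sided one-sample z test, known sigma, delta = |mu0 - mu1|."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return ceil(sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


base = sample_size(sigma=10, delta=5)
more_variance = sample_size(sigma=15, delta=5)              # factor 1
smaller_alpha = sample_size(sigma=10, delta=5, alpha=0.01)  # factor 2
more_power = sample_size(sigma=10, delta=5, power=0.90)     # factor 3
smaller_delta = sample_size(sigma=10, delta=2)              # factor 4

print(base, more_variance, smaller_alpha, more_power, smaller_delta)
```

Each variation changes one input in the direction named in the list, and each should give a larger n than the base case.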
hypothesis testing for
a population
variance?
must use chi squared test
are chi square tests one and two sided?
yes
when are pvalues in chi square test multiplied by two?
for two sided tests - see slide 7/27
how are proportion hypothesis tests set up?
Use the asymptotic normal distribution for the binomial: z = (p̂ - p0)/SE,
where SE = sqrt(p0 q0/n) under H0
calculating p values when testing proportions
If p̂ < p0:
p-value = 2 × Φ(z) = twice the area to the left
of z under a N(0, 1) curve. See slide 7/29
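A sketch of the normal-theory test for one proportion in Python. It assumes the standard error is computed under the null value p0 and that n is large enough for the normal approximation. The function name and the counts are hypothetical. Using -|z| gives twice the left-tail area when p̂ < p0 and twice the right-tail area otherwise.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_test(successes, n, p0):
    """Two-sided z test of H0: p = p0, with the standard error taken under H0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # twice the tail area beyond z
    return z, p_value


z, p_value = one_sample_proportion_test(successes=40, n=100, p0=0.5)
print("z =", round(z, 3), "p-value =", round(p_value, 4))
```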
Longitudinal studies are said to use the
paired-sample design.
Cross-sectional studies are said to use the
independent-samples design.
Computation of the p-value for the
Paired t Test
If t < 0 then
p-value = twice the area to the left of the
observed value of the test statistic under a t
distribution with n-1 degrees of freedom.
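find the expectation and variance of the difference of two sample means for two populations
xbar₁ - xbar₂ ∼ N(µ₁-µ₂, σ²₁/n₁ + σ²₂/n₂), so the expectation is µ₁-µ₂ and the variance is σ²₁/n₁ + σ²₂/n₂ (independent samples)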
How do we check for equality of
variances?
The variance ratio s²₁/s²₂ was studied by R. A. Fisher
and G. Snedecor. They determined that, for normal populations
with equal variances, the variance ratio has an F distribution
with n₁-1 and n₂-1 degrees of freedom.
What this really means is that you can always put
the larger variance in the numerator.
Performing F test of equality of variance
Null hypothesis when testing for equality of variances
variances are equal
what if the variances are not equal (null hypothesis rejected)
We need to compare the means of two normal
distributions where we know that the variances
are unequal.
Called the Behrens-Fisher problem.
Two-Sample t Test for Independent Samples with Unequal Variances:
how are degrees of freedom determined for two sample t test with unequal variances?
approximating degrees of freedom
d' where we round d' down to the nearest integer d''.
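The card does not reproduce the formula for d'. A commonly used choice is the Satterthwaite approximation, sketched below on the assumption that this is the formula meant: d' = (s²₁/n₁ + s²₂/n₂)² / [(s²₁/n₁)²/(n₁-1) + (s²₂/n₂)²/(n₂-1)]. The function name and the sample values are hypothetical.

```python
from math import floor


def satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Approximate df (assumed Satterthwaite formula); returns (d', d'' rounded down)."""
    a = s1_sq / n1
    b = s2_sq / n2
    d_prime = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return d_prime, floor(d_prime)


# Hypothetical sample variances and sizes.
d_prime, d_double_prime = satterthwaite_df(s1_sq=4, n1=10, s2_sq=16, n2=12)
print("d' =", round(d_prime, 3), "d'' =", d_double_prime)
```

Under this formula d' always lies between min(n₁-1, n₂-1) and n₁+n₂-2.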
cardinal data
Cardinal data are data that are measured on a
scale where it is meaningful to measure the
distance between two possible data values.
For cardinal data, if the zero point is arbitrary,
the data are on an interval scale.
If the zero point is fixed, then the data are on
a ratio scale.
Ordinal scale
data are data that can be ordered
but do not have specific numerical values.
Nominal scale data
are data that have different
values but the values cannot be ordered.
Nonparametric Statistical Tests
Sign Test:
Wilcoxon Signed-rank Test
Wilcoxon Rank-Sum Test
Each of these has Normal Theory Methods and Exact Methods
Sign Test Summary:
Notice that the sign test uses only the sign of
the difference.
The magnitude of the difference is not used.
One would think that this could be throwing
away a lot of information.
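A sketch of the exact (binomial) version of the sign test in Python. Under H0 the number of positive differences follows a binomial distribution with p = 0.5. Dropping zero differences is a common convention and is an assumption here; the function name and data are hypothetical.

```python
from math import comb


def sign_test_exact(differences):
    """Two-sided exact sign test p-value; zero differences are dropped (assumed convention)."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


diffs = [1, 2, 3, 4, 5, 6, 7, 8, -1, 0]  # hypothetical paired differences
p_value = sign_test_exact(diffs)
scaled_p_value = sign_test_exact([100 * d for d in diffs])
print(p_value, scaled_p_value)
```

Scaling every difference by 100 leaves the p-value unchanged, which is the sense in which the test ignores magnitude.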
Comparison of Two Independent Binomial
Proportions
Two-sample binomial test (normal theory method)
Confidence Interval for difference between two population
proportions
Sample size and power formulas
2 x 2 Contingency Tables
equivalent to two-sample binomial test
Fisher's Exact Test
R x C contingency tables
Test of association or independence
Test of homogeneity
Comparison of Two Paired Binomial Proportions
McNemar's Test for correlated proportions