700 exam

Created by rcccward 

78 terms

mean

This is the most widely used measure of location.
• It is overly sensitive to extreme values.
• It is a poor measure of central location when there are outliers in the data or if the data are skewed.

measures of location

• Arithmetic Mean
• Median
• Mode

median define and advantage/disadvantages

The sample median is defined as follows:
1. If n is odd, the median is the ((n+1)/2)th largest item.
2. If n is even, the median is the average of the (n/2)th and (n/2 + 1)th largest items.
The primary weakness of the median is that it is determined only by the middle value(s) of the sample.
Its primary strengths are that it is less sensitive to the actual numerical values of the remaining data points and that it is insensitive to extreme values.

mode

Most frequent value: you can have multiple modes in a distribution
of data.
You can also have no mode among all of the
observations in a sample if all values occur
only once.
The primary weakness of the mode is that it is not a useful measure of central location if there is a large number of possible values, each of which occurs infrequently. In most cases, it is inferior to the arithmetic mean.
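
To illustrate these three measures of location, here is a minimal R sketch (the sample values are made up); note how the mean is pulled toward the single extreme value while the median and mode are not:

# Made-up sample with one extreme value (40)
x <- c(2, 3, 3, 4, 5, 6, 40)
mean(x)                                  # 9: pulled toward the outlier
median(x)                                # 4: insensitive to the outlier
as.numeric(names(which.max(table(x))))   # 3: the mode (most frequent value)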

measures of dispersion

• Range
• Quantiles/Percentiles
• Mean Deviation
• Sample Variance
• Sample Standard Deviation
• Coefficient of Variation

range

• Main advantage - It is very easy to compute.
• Major weaknesses - It is very sensitive to extreme values and it depends on the sample size.

quantile percentile

Intuitively, the pth percentile is the value Vp such that p percent of the distribution is less than or equal to Vp. Fifty percent of the distribution is less than or equal to the median, so the median is the 50th percentile.

The pth percentile is defined as
1. the (k+1)th largest item in the sample if np/100 is not an integer (where k is the largest integer less than np/100);
2. the average of the (np/100)th and the (np/100 + 1)th largest items in the sample if np/100 is an integer.
Percentiles give an overall impression of the shape and spread of the distribution.
• Main advantages - They are less sensitive to outliers and not greatly affected by sample size.
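
A minimal R sketch of this percentile definition (the data values and the pctile helper are made up for illustration; base R's quantile() uses a slightly different interpolation rule):

# Percentile by the definition above, for 0 < p < 100
pctile <- function(x, p) {
  n <- length(x); xs <- sort(x); m <- n * p / 100
  if (m == floor(m)) (xs[m] + xs[m + 1]) / 2   # np/100 an integer: average two items
  else xs[floor(m) + 1]                        # otherwise the (k+1)th largest item
}
x <- c(7, 2, 9, 4, 12, 5, 8, 1, 10, 6)
pctile(x, 50)    # the 50th percentile; compare with median(x)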

variance

s² = SS/df = Σ(xᵢ - x̄)² / (n - 1)

coefficient of variation

(standard deviation / mean) × 100%
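
A short R sketch tying the dispersion formulas together (the data are made up):

x <- c(10, 12, 9, 11, 14, 8)
s2 <- sum((x - mean(x))^2) / (length(x) - 1)   # sample variance, same as var(x)
cv <- sd(x) / mean(x) * 100                    # coefficient of variation in percent
c(variance = s2, sd = sqrt(s2), cv_percent = cv)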

graphical methods

• Bar Graphs
• Histograms
• Stem and Leaf Plots
• Box Plots
• Scatterplots

box plot - inputs needed

Graphical technique that can be used to compare the arithmetic mean and median of a distribution.
• It uses the relationship between the median, 25th percentile, and 75th percentile to describe the skewness of the distribution.
• Box plots give a feel for the spread of the data.
• They are useful for identifying outlying values, i.e., values that seem inconsistent with the rest of the sample.
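
A minimal R sketch (made-up data; the value 25 is included as a likely outlier):

x <- c(3, 5, 6, 6, 7, 8, 9, 10, 11, 25)
boxplot(x, horizontal = TRUE, main = "Box plot")   # box spans the quartiles, line at the median
quantile(x, c(0.25, 0.5, 0.75))                    # the three values that define the box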

population

Population - the set of all possible
outcomes for a random variable

random variable

Random variable - numerical quantity
that takes on different values depending
on chance

event

Event - an outcome or set of outcomes
for a random variable

sample space

Sample space - collection of all possible
outcomes/events

If outcomes A and B are two events that
cannot both happen at the same time,
then P(A or B occurs) =

P(A) + P(B).

AUC

Area under the ROC curve (AUC) - a summary of the overall diagnostic accuracy of the test; it equals the probability of correctly distinguishing a normal from an abnormal subject. The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question.
• Area = 1 represents a perfect test.
• Area = 0.5 represents a worthless test.

ROC

A plot of sensitivity versus 1 - specificity. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
• The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
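
A hedged R sketch of building an ROC curve and a trapezoidal AUC by hand (the scores and disease labels are invented for illustration):

# Hypothetical test scores for diseased (1) and non-diseased (0) subjects
scores  <- c(2.1, 3.4, 1.8, 4.0, 2.9, 1.2, 0.9, 2.5, 3.8, 1.5)
disease <- c(1,   1,   0,   1,   1,   0,   0,   0,   1,   0)
thr  <- sort(unique(scores), decreasing = TRUE)
sens <- sapply(thr, function(cut) mean(scores[disease == 1] >= cut))
spec <- sapply(thr, function(cut) mean(scores[disease == 0] <  cut))
fpr <- c(0, 1 - spec, 1); tpr <- c(0, sens, 1)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # trapezoidal rule
plot(fpr, tpr, type = "b", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # 45-degree diagonal = worthless test
auc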

PMF

A probability mass function is a
mathematical relationship, or rule, that
assigns to any possible value r of a
discrete random variable X the probability
P(X=r).

discrete expectation

E(X) = Σ x·P(X = x) = μ

discrete variance

Var(X) = Σ (xᵢ - μ)² P(X = xᵢ) = E[(X - μ)²]
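
A small R sketch of these two formulas on a made-up probability mass function:

x  <- c(0, 1, 2, 3)
px <- c(0.2, 0.4, 0.3, 0.1)       # P(X = x); must sum to 1
mu <- sum(x * px)                  # E(X)
sigma2 <- sum((x - mu)^2 * px)     # Var(X) = E[(X - mu)^2]
c(mean = mu, variance = sigma2)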

statistics

Properties computed from samples drawn from a population are called statistics.

parameters

These population values are sometimes called parameters.
• They are properties of a population.

cdf

The cumulative distribution function of a discrete random variable X is denoted by F(x) and is defined by F(x) = P(X ≤ x).

Binomial distribution

The binomial distribution assumes the following:
• A fixed number of n trials.
• The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.
• Only two possible mutually exclusive outcomes ("success" or "failure").
• The probability of success (denoted by p) at each trial is fixed.
• The probability of a failure at each trial is 1 - p.
The random variable of interest is the number of successes within the n trials.
Mean: np; Variance: npq = np(1 - p)
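
A short R sketch using base R's binomial functions (n, p, and r are made-up values):

n <- 10; p <- 0.3; r <- 4
dbinom(r, size = n, prob = p)                 # P(X = r) = choose(n, r) * p^r * (1 - p)^(n - r)
c(mean = n * p, variance = n * p * (1 - p))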

poisson process

Poisson process - the discrete random variable of the number of occurrences of an event in a continuous interval of time or space.
• Events occur one at a time; two or more events do not occur precisely at the same moment or location.
• The occurrence of an event in a given period is independent of the occurrence of an event in any previous or later nonoverlapping period.
• The expected number of events (denoted by μ) during any one period is the same as during any other period.

Poisson approx for binomial

Another important application of the Poisson distribution is that under certain conditions it is a very good approximation to the binomial distribution, with μ = np.
• What are these conditions?
- n must be large (at least 100)
- p must be small (no greater than 0.01)
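
A quick R sketch comparing the two distributions under these conditions (n = 200 and p = 0.01 are illustrative choices):

n <- 200; p <- 0.01; mu <- n * p; r <- 0:6
round(cbind(binomial = dbinom(r, n, p), poisson = dpois(r, mu)), 4)   # nearly identical columns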

Expectation and variance for Poisson (different formulation)

The Poisson random variable's probability mass function is characterized by one parameter, μ = λt, the expected number of events over time period t.
• E(X) = μ = λt
• Var(X) = μ = λt

Why is Gaussian used so frequently to analyze data?

Why? Some random variables are Normal, some are approximately Normal, some can be transformed to approximate Normality, and the sampling distribution of the mean tends toward Normality even for non-Normal populations.

properties of std normal distribution

For the standard normal distribution
P(-1 < X < 1) ≈ 0.68
P(-2 < X < 2) ≈ 0.95
P(-2.576 < X < 2.576) ≈ 0.99

symmetry properties of std normal distribution

Φ(-x) = P(X ≤ -x) = P(X ≥ x) = 1 - P(X ≤ x) = 1 - Φ(x)

important percentiles in std normal dist

z0.95 = 1.645
z0.05 = -1.645
z0.975 = 1.96
z0.025 = -1.96
z0.995 = 2.576
z0.005 = -2.576

when is a continuity correction used for the ~N approx for the binomial?

The normal approximation can be used when (a) np ≥ 5 and (b) n(1 - p) ≥ 5; apply a continuity correction by expanding the endpoints by 0.5.
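
A minimal R sketch of the approximation with the 0.5 continuity correction (n = 25, p = 0.4, and the cutoff 12 are made-up values; np and n(1 - p) are both at least 5):

n <- 25; p <- 0.4
exact  <- pbinom(12, n, p)                                            # P(X <= 12), exact
approx <- pnorm(12 + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))   # endpoint expanded by 0.5
c(exact = exact, normal_approx = approx)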

when is ~N approx for poisson used?

This approximation is used for μ ≥ 10, with a continuity correction.

Statistical inference

Statistical inference is inference about a population from a random sample drawn from it or, more generally, about a random process from its observed behavior during a finite period of time. It includes:
• estimation: point estimation or interval estimation
• hypothesis testing
• prediction

sampling distribution.

The probability distribution that characterizes some aspect of the sampling variability, usually of the mean but not always, is called a sampling distribution.
• Sampling distributions allow us to make objective statements about population parameters without measuring every object in the population.

Randomness

Randomness is a lack of order, purpose,
cause, or predictability. A random process is
a repeating process whose outcomes follow
no describable deterministic pattern, but follow
a probability distribution.

The mean of the sampling distribution of the sample mean

is the population mean µ; the sample mean is therefore an unbiased estimator of the population mean.

true or false: there can be many unbiased estimators of a
parameter.

true

derive the standard error of the mean

SE = sqrt(Var(x̄)); since Var(x̄) = Var((1/n)Σxᵢ) = (1/n²)·n·σ² = σ²/n, the standard error of the mean is σ/√n.

CLT

If n is sufficiently large, then the distribution of the sample mean is approximately normal, i.e., x̄ ∼ N(μ, σ²/n), even if the underlying population from which you are sampling is not normal.
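
A small simulation sketch in R (the exponential population, seed, and sample size are arbitrary choices) showing the sampling distribution of the mean approaching normality even though the population is skewed:

set.seed(1)
n <- 30
means <- replicate(5000, mean(rexp(n, rate = 1)))   # population mean = 1, variance = 1
hist(means, breaks = 40, main = "Sampling distribution of the mean")
c(mean(means), var(means))                          # roughly mu = 1 and sigma^2/n = 1/30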

warnings about CI

1. We should not say that the probability is 95% that the true population mean falls in any particular confidence interval.
2. Any individual confidence interval may or may not contain the true population mean.
3. We can say that over a large number of exercises (draw a sample, then compute a 95% confidence interval) 95% of intervals computed in this way will contain the true population mean.

three factors that affect the length of
the confidence interval

Length = 2 · z(1-α/2) · σ/√n
1. The sample size n (as n increases the length decreases).
2. The standard deviation σ (as σ increases the length increases).
3. The confidence coefficient 100% × (1 - α) (as the confidence coefficient increases the length increases).
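
A short R sketch computing a 95% interval and its length (all inputs are hypothetical):

xbar <- 50; sigma <- 8; n <- 36; alpha <- 0.05
half <- qnorm(1 - alpha/2) * sigma / sqrt(n)          # z(1 - alpha/2) * sigma / sqrt(n)
c(lower = xbar - half, upper = xbar + half, length = 2 * half)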

If population variance is unknown:

t = (x̄ - µ)/(s/√n), which follows a t distribution with n - 1 degrees of freedom.

how are normal and T distributions different

The t distribution has heavier (longer) tails and a lower peak than the standard normal; the fewer the degrees of freedom, the lower the peak and the heavier the tails (df = 1 gives the lowest peak and longest tails). R code:
# Display the Student's t distributions with various
# degrees of freedom and compare to the normal distribution

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
  lines(x, dt(x, degf[i]), lwd=2, col=colors[i])
}

legend("topright", inset=.05, title="Distributions",
       labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)

Note that as the degrees of freedom increase, the t distribution

looks more like the unit normal distribution.

The sample variance s² is an unbiased estimator of

the population variance σ².

what distr is used characterize the sampling distribution variance?

The chi-squared distribution (a sum of squared standard normal variables): (n - 1)s²/σ² follows a chi-squared distribution with n - 1 degrees of freedom.
The inequality can be solved for the population variance to get a confidence interval.
Caution: for this to be valid, the underlying distribution must be normal. This is different from what we said regarding confidence intervals for the population mean.
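
A minimal R sketch of the resulting confidence interval for σ² (the data are made up):

x <- c(12.1, 9.8, 11.4, 10.3, 12.9, 10.8, 11.7)
n <- length(x); s2 <- var(x); alpha <- 0.05
lower <- (n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1)   # divide by the upper chi-square point
upper <- (n - 1) * s2 / qchisq(alpha/2, df = n - 1)       # divide by the lower chi-square point
c(s2 = s2, lower = lower, upper = upper)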

Like the t distribution, the chi-square distribution is a
family of distributions that depend on

the degrees of freedom.

For n ≥ 3 the chi-squared distribution

has a
mode greater than zero and is skewed to the
right.

So why not make α very, very small?

This may be the solution in some cases; however, reducing the α level always increases the probability of a Type II error.

define type I and II errors

Type I error: α = P(reject H0 | H0 true).
Type II error: occurs when H0 is false but is not rejected; β = P(fail to reject H0 | H0 false).
Power: 1 - β = P(reject H0 | H0 false) = P(reject H0 | H1 true).

why is null hypothesis never proven?

Because Type I and Type II errors are always possible in any decision, we NEVER say that we have proved that H0 is true or that H0 is false.

two definitions for p value

The p-value is the probability of observing a result as extreme as or more extreme than the one observed, given that the null hypothesis is true.
• It is also described as the smallest significance level at which the current test would result in rejection.
• Another way to define the p-value is that it is the significance level at which we would be indifferent between rejecting and failing to reject the null hypothesis.

The investigator must distinguish between
results that are statistically significant and

results that are clinically significant.
• Statistical significance does not imply clinical significance.

given α and the true population mean, and xbar, calculate β

see slide 7/17
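
As a hedged sketch of the usual normal-theory calculation for a one-sided test (all numbers below are hypothetical):

mu0 <- 120; mu1 <- 125; sigma <- 10; n <- 25; alpha <- 0.05
crit  <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)   # reject H0 if xbar exceeds this cutoff
beta  <- pnorm((crit - mu1) / (sigma / sqrt(n)))    # P(fail to reject H0 | true mean = mu1)
power <- 1 - beta
c(critical_value = crit, beta = beta, power = power)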

So what happens when we
don't know the population
variance?

use t distribution and s²

Factors influencing Sample Size:

1. The sample size increases as σ² increases.
2. The sample size increases as the significance level is made smaller (α decreases).
3. The sample size increases as the probability of the type II error (β) decreases (or equivalently as the power 1 - β increases).
4. The sample size increases as the absolute value of the difference between the null and the alternative means decreases.
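
A minimal R sketch of the usual one-sample, two-sided normal-theory sample size formula in which these factors appear (the inputs are hypothetical):

sigma <- 10; delta <- 5; alpha <- 0.05; power <- 0.80   # delta = |null mean - alternative mean|
n <- ((qnorm(1 - alpha/2) + qnorm(power)) * sigma / delta)^2
ceiling(n)   # round up to the next whole subject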

What about
hypothesis testing for
a population
variance?

must use chi squared test

are chi square tests one and two sided?

yes

when are pvalues in chi square test multiplied by two?

for two sided tests - see slide 7/27

how are proportion hypothesis tests set up?

Use the asymptotic normal distribution for the binomial: z = (p̂ - p₀) / SE, where under H0 the SE is √(p₀(1 - p₀)/n).

calculating p values when testing proportions

If p̂ < p₀,
p-value = 2 × Φ(z) = twice the area to the left of z under a N(0, 1) curve (see slide 7/29); otherwise, use twice the area to the right of z.
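
A short R sketch of the whole calculation (the counts are made up):

x <- 18; n <- 50; p0 <- 0.5
phat <- x / n
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)                    # SE uses p0 under H0
pval <- if (phat < p0) 2 * pnorm(z) else 2 * (1 - pnorm(z))   # two-sided p-value
c(z = z, p_value = pval)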

Longitudinal studies are said to use the
Cross-sectional studies are said to use the

paired-sample design.
independent-samples design.

Computation of the p-value for the
Paired t Test

If t < 0, then
p-value = twice the area to the left of the observed value of the test statistic under a t distribution with n - 1 degrees of freedom.
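
A minimal R sketch using base R's t.test (the before/after measurements are invented):

before <- c(140, 132, 128, 150, 145, 138)
after  <- c(135, 130, 129, 142, 139, 134)
t.test(after, before, paired = TRUE)   # equivalent to a one-sample t test on the differences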

find the expectation and variance of the difference two sample means for two populations

x̄₁ - x̄₂ ~ N(µ₁ - µ₂, σ₁²/n₁ + σ₂²/n₂)

How do we check for equality of
variances?

The variance ratio s₁²/s₂² was studied by R. A. Fisher and G. Snedecor. They determined that the variance ratio has an F distribution with n₁ - 1 and n₂ - 1 degrees of freedom.

What this really means is that you can always put the larger of the two variances in the numerator.

Performing F test of equality of variance
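
A minimal R sketch of carrying out this test with base R's var.test (the samples x and y are made up):

x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3)
y <- c(6.2, 4.1, 7.0, 3.9, 6.8, 5.0, 4.4, 6.5)
var.test(x, y)   # F = s_x^2 / s_y^2 with (n_x - 1, n_y - 1) degrees of freedom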

Null hypothesis when testing for equality of variances

variances are equal

what if the variances are not equal (null hypothesis rejected)

We need to compare the means of two normal distributions when we know that the variances are unequal.
• This is called the Behrens-Fisher problem; it is handled with the two-sample t test for independent samples with unequal variances.

how are degrees of freedom determined for two sample t test with unequal variances?

By approximating the degrees of freedom with d' = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1)], then rounding d' down to the nearest integer d''.
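
A hedged R sketch of the approximation (the samples are made up; note that base R's t.test reports the unrounded d'):

x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3)
y <- c(6.2, 4.1, 7.0, 3.9, 6.8, 5.0, 4.4, 6.5)
v1 <- var(x) / length(x); v2 <- var(y) / length(y)
dprime <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
dprime                            # Satterthwaite degrees of freedom before rounding down
t.test(x, y, var.equal = FALSE)   # two-sample t test with unequal variances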

cardinal data

Cardinal data are data that are measured on a scale where it is meaningful to measure the distance between two possible data values.
• For cardinal data, if the zero point is arbitrary, the data are on an interval scale.
• If the zero point is fixed, then the data are on a ratio scale.

Ordinal scale

data are data that can be ordered
but do not have specific numerical values.

Nominal scale data

are data that have different
values but the values cannot be ordered.

Nonparametric Statistical Tests

• Sign Test
• Wilcoxon Signed-Rank Test
• Wilcoxon Rank-Sum Test
Each of these has normal theory methods and exact methods.

Sign Test Summary:

Notice that the sign test uses only the sign of the difference.
• The magnitude of the difference is not used.
• One would think that this could be throwing away a lot of information.
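
A minimal R sketch of the sign test as an exact binomial test on the signs of the paired differences (the data are invented):

before <- c(140, 132, 128, 150, 145, 138, 141, 136)
after  <- c(135, 130, 129, 142, 139, 134, 138, 137)
d <- after - before
binom.test(sum(d > 0), sum(d != 0), p = 0.5)   # exact two-sided sign test; magnitudes ignored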

Comparison of Two Independent Binomial
Proportions

• Two-sample binomial test (normal theory method)
• Confidence interval for the difference between two population proportions
• Sample size and power formulas
• 2 x 2 contingency tables
  - equivalent to the two-sample binomial test
• Fisher's exact test
• R x C contingency tables
  - test of association or independence
  - test of homogeneity
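
A short R sketch of the 2 x 2 approaches with made-up counts (chisq.test and fisher.test are base R):

tab <- matrix(c(30, 20,    # group 1: successes, failures
                18, 32),   # group 2: successes, failures
              nrow = 2, byrow = TRUE)
chisq.test(tab)    # equivalent to the two-sample binomial (normal theory) test
fisher.test(tab)   # exact test, preferred when expected cell counts are small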

Comparison of Two Paired Binomial Proportions

• McNemar's test for correlated proportions
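
A minimal R sketch with hypothetical paired counts:

tab <- matrix(c(40, 15,    # rows: outcome under condition A (+ / -)
                 5, 40),   # columns: outcome under condition B (+ / -)
              nrow = 2, byrow = TRUE)
mcnemar.test(tab)   # uses only the discordant pairs (15 and 5)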
