78 terms

mean

This is the most widely used measure of

location.

It is overly sensitive to extreme values.

It is a poor measure of central location

when there are outliers in the data or if

the data are skewed.

measures of location

• Arithmetic Mean

• Median

• Mode

median: definition and advantages/disadvantages

The sample median is defined as follows:

1. If n is odd, the median is the ((n+1)/2)th item of the ordered sample.

2. If n is even, the median is the average of the (n/2)th and (n/2 + 1)th items of the ordered sample.

The primary weakness of the median is that it is

determined mainly by the middle values and is less

sensitive to the actual numerical values of the

remaining data points.

Its primary strength is that it is insensitive to

extreme values.
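
The odd/even rule for the median can be sketched in R as follows. This is a minimal illustration, assuming the even case averages the (n/2)th and (n/2 + 1)th ordered values; `sample_median` and the data are hypothetical names chosen for this example.

```r
# Sample median by the odd/even rule (illustrative helper, not a library function).
sample_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                   # odd n: the middle ordered value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2    # even n: average of the two middle values
  }
}

odd  <- c(7, 1, 3, 9, 5)
even <- c(7, 1, 3, 9, 5, 100)        # same data plus one extreme value

sample_median(odd)    # 5
sample_median(even)   # 6
mean(even)            # pulled far upward by the single extreme value
```

The extreme value 100 moves the median only from 5 to 6, while the mean moves much further, which is the insensitivity described on this card.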

mode

Most frequent value: you can have multiple modes in a distribution

of data.

You can also have no mode among all of the

observations in a sample if all values occur

only once.

The primary weakness of the mode is that it is

not a useful measure of central location if

there is a large number of possible values,

each of which occurs infrequently. In most

cases, it is inferior to the arithmetic mean.

measures of dispersion

• Range

• Quantiles/Percentiles

• Mean Deviation

• Sample Variance

• Sample Standard Deviation

• Coefficient of Variation

range

Main advantage - It is very easy to

compute.

Major weaknesses - It is very sensitive to

extreme values and it depends on the

sample size.

quantile percentile

Quantiles/Percentiles

Intuitively the pth percentile is the value Vp such that

p percent of the distribution is less than or equal to

Vp.

Fifty percent of the distribution is less than or equal

to the median. The median is the 50th percentile.

The pth percentile is defined as

1. the (k+1)th item of the ordered sample (sorted from smallest to largest) if np/100 is not an

integer (where k is the largest integer less than

np/100).

2. the average of the (np/100)th and the (np/100+1)th

items of the ordered sample if np/100 is an integer.

Gives an overall impression of the shape

and spread of the distribution

Main advantages - They are less

sensitive to outliers and not greatly

affected by sample size.
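
The two-case percentile definition can be sketched in R. This is an illustration under the assumption that items are counted in ascending order and 0 < p < 100; `percentile` is a hypothetical helper. Base R's `quantile(..., type = 2)` implements the same averaging rule.

```r
# pth percentile by the two-case rule (illustrative helper; assumes 0 < p < 100).
percentile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- n * p / 100
  if (isTRUE(all.equal(h, round(h)))) {
    k <- round(h)
    (x[k] + x[k + 1]) / 2    # np/100 is an integer: average two adjacent items
  } else {
    x[floor(h) + 1]          # otherwise: the (k+1)th ordered item
  }
}

x <- c(12, 15, 11, 19, 14, 18, 13, 17, 16, 20)   # n = 10, sorted: 11, ..., 20

percentile(x, 50)   # np/100 = 5   -> (15 + 16) / 2 = 15.5
percentile(x, 25)   # np/100 = 2.5 -> 3rd ordered item = 13
percentile(x, 90)   # np/100 = 9   -> (19 + 20) / 2 = 19.5
```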

variance

s² = SS/DF = Σ(xᵢ - x̄)² / (n - 1)

coefficient of variation

(std deviation / mean)x100%
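
A short R sketch of the sample variance, standard deviation, and coefficient of variation, using made-up data. The variable names are illustrative only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up data, mean = 5
n <- length(x)

ss <- sum((x - mean(x))^2)        # sum of squared deviations (SS) = 32
s2 <- ss / (n - 1)                # sample variance = SS / DF
s  <- sqrt(s2)                    # sample standard deviation
cv <- 100 * s / mean(x)           # coefficient of variation, in percent

c(SS = ss, variance = s2, sd = s, cv_percent = cv)
```

The hand computation agrees with R's built-in `var(x)` and `sd(x)`, which also divide by n - 1.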

graphical methods

Bar Graphs

Histograms

Stem and Leaf Plots

Box Plots

Scatterplots

box plot - inputs needed

Graphical technique that, like comparing the

arithmetic mean and median, can be used to assess the skewness of a distribution.

It uses the relationship between the median, 25th percentile,

and 75th percentile to describe the skewness of the

distribution.

They can be used to give a feel for the spread of the data

They can be useful in identifying outlying values, or values

that seem inconsistent with the rest of the sample.

population

Population - the set of all possible

outcomes for a random variable

random variable

Random variable - numerical quantity

that takes on different values depending

on chance

event

Event - an outcome or set of outcomes

for a random variable

sample space

Sample space - collection of all possible

outcomes/events

If outcomes A and B are two events that

cannot both happen at the same time,

then P(A or B occurs) =

P(A) + P(B).

AUC

Area under the ROC curve (AUC) - summary of

the overall diagnostic accuracy of the test; it equals the probability of

correctly distinguishing a normal from an

abnormal subject.

The accuracy of the test depends on how

well the test separates the group being

tested into those with and without the

disease in question.

Area = 1 represents a perfect test

Area = 0.5 represents a worthless test
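
The "probability of correctly distinguishing" reading of the AUC can be sketched in R by comparing every abnormal subject with every normal subject. This is an illustration, assuming higher test scores indicate disease and counting ties as one half; `auc_by_pairs` and the scores are hypothetical.

```r
# AUC as the fraction of (abnormal, normal) pairs ranked correctly by the test.
# Assumption: a higher score means "more likely abnormal"; ties count 1/2.
auc_by_pairs <- function(abnormal, normal) {
  cmp <- outer(abnormal, normal, function(a, h) (a > h) + 0.5 * (a == h))
  mean(cmp)
}

abnormal <- c(0.90, 0.80, 0.70, 0.55)   # made-up scores, diseased subjects
normal   <- c(0.60, 0.40, 0.30, 0.20)   # made-up scores, healthy subjects

auc <- auc_by_pairs(abnormal, normal)
auc   # 15 of 16 pairs are ordered correctly: 0.9375
```

Complete separation of the two groups gives 1 (a perfect test), and scores that carry no information give 0.5.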

ROC

Plot of sensitivity vs. 1 - specificity. It shows the tradeoff between sensitivity and

specificity (any increase in sensitivity will be

accompanied by a decrease in specificity).

The closer the curve follows the left hand

border and then the top border of the ROC

space, the more accurate the test.

The closer the curve comes to the 45 degree

diagonal of the ROC space, the less accurate

the test.

PMF

A probability mass function is a

mathematical relationship, or rule, that

assigns to any possible value r of a

discrete random variable X the probability

P(X=r).

discrete expectation

E(X) = μ = Σ xᵢ P(X = xᵢ)

discrete variance

Var(X) = σ² = E(X - μ)² = Σ (xᵢ - μ)² P(X = xᵢ)
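
The expectation and variance of a discrete random variable can be computed directly from its probability mass function. The PMF below is made up for illustration.

```r
r <- 0:3                          # possible values of X
p <- c(0.1, 0.3, 0.4, 0.2)        # P(X = r), a made-up PMF
stopifnot(abs(sum(p) - 1) < 1e-12)

mu     <- sum(r * p)              # E(X)   = 0 + 0.3 + 0.8 + 0.6 = 1.7
sigma2 <- sum((r - mu)^2 * p)     # Var(X) = E(X - mu)^2 = 0.81

c(mean = mu, variance = sigma2)
```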

statistics

Properties of samples drawn from populations (the sample counterparts of parameters) are called

statistics.

parameters

Numerical values that describe a population (for example μ and σ²) are sometimes called parameters.

They are properties of a population.

cdf

The cumulative distribution function of a discrete random

variable X is denoted by F(x) and is defined by F(x) = P(X ≤ x).

Binomial distribution

The binomial distribution assumes the following:

A fixed number of n trials.

The trials are independent, that is the outcome of any

particular trial is not affected by the outcome of any other

trial.

Only two possible mutually exclusive outcomes

("success" or "failure")

The probability of success (denoted by p) at each trial is

fixed.

The probability of a failure at each trial is 1 - p.

The random variable of interest is the number of successes

within the n trials.

mean: np; variance: npq, where q = 1 - p
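
The binomial mean np and variance npq (with q = 1 - p) can be checked numerically against the PMF. The values n = 10 and p = 0.3 are arbitrary choices for illustration.

```r
n <- 10
p <- 0.3
q <- 1 - p
k <- 0:n

pmf <- dbinom(k, size = n, prob = p)     # P(X = k) for k = 0, ..., n

total    <- sum(pmf)                     # probabilities sum to 1
mean_x   <- sum(k * pmf)                 # equals n * p = 3
var_x    <- sum((k - mean_x)^2 * pmf)    # equals n * p * q = 2.1
by_hand2 <- choose(n, 2) * p^2 * q^(n - 2)   # P(X = 2) from the binomial formula

c(total = total, mean = mean_x, variance = var_x, P_X_eq_2 = by_hand2)
```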

poisson process

Poisson process - describes the discrete random

variable giving the number of occurrences of an

event in a continuous interval of time or

space. Assumptions:

• Events occur one at a time; two or more events do

not occur at precisely the same moment or location.

• The occurrence of an event in a given period is

independent of the occurrence of an event in any

previous or later nonoverlapping period.

• The expected number of events (denoted by μ)

during any one period is the same as during any

other period of the same length.

Poisson approx for binomial

Another important application of the Poisson

distribution is that under certain conditions it

is a very good approximation to the binomial

distribution where μ = np.

What are these conditions?

- n must be large (at least 100)

- p must be small (no greater than 0.01)
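
A quick numerical check of the Poisson approximation in R, using n = 1000 and p = 0.005 (values chosen here only to satisfy both conditions).

```r
n  <- 1000
p  <- 0.005
mu <- n * p                      # Poisson parameter, mu = np = 5
stopifnot(n >= 100, p <= 0.01)   # the two conditions on this card

k     <- 0:15
binom <- dbinom(k, size = n, prob = p)
pois  <- dpois(k, lambda = mu)

max_diff <- max(abs(binom - pois))   # largest pointwise discrepancy
round(cbind(k, binomial = binom, poisson = pois), 4)
```

The two columns agree closely, so `max_diff` is small relative to the probabilities themselves.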

Expectation and variance for Poisson (different formulation)

The Poisson random variable's probability

mass function is characterized by one

parameter, μ = λt, the expected number of

events over time period t.

E(X) = μ = λt

Var(X) = μ = λt

Why is Gaussian used so frequently to analyze data?

Some random variables are Normal, some are

approximately Normal, some can be transformed to

approximate Normality, and the sampling distribution of

the mean tends toward Normality even for non-Normal

populations.

properties of std normal distribution

For the standard normal distribution

P(-1 < X < 1) = 0.68

P(-2 < X < 2) = 0.95

P(-2.5 < X < 2.5) = 0.99
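
These three probabilities are rounded values. They can be reproduced in R with `pnorm`, the standard normal CDF.

```r
# P(-c < X < c) for a standard normal X, via the CDF
central_prob <- function(c) pnorm(c) - pnorm(-c)

p1  <- central_prob(1)      # about 0.6827
p2  <- central_prob(2)      # about 0.9545
p25 <- central_prob(2.5)    # about 0.9876

round(c(p1, p2, p25), 4)
```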

symmetry properties of std normal distribution

Φ(-x) = P(X ≤ -x )

= P( X ≥ x)

= 1 - P(X ≤ x)

= 1 - Φ(x).

important percentiles in std normal dist

z0.95 = 1.645

z0.05 = -1.645

z0.975 = 1.96

z0.025 = -1.96

z0.995 = 2.576

z0.005 = -2.576
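
The percentiles above, and the symmetry that pairs them, can be checked with `qnorm` (the inverse of the standard normal CDF) and `pnorm`.

```r
upper <- qnorm(c(0.95, 0.975, 0.995))   # z0.95, z0.975, z0.995
lower <- qnorm(c(0.05, 0.025, 0.005))   # z0.05, z0.025, z0.005

round(upper, 3)    # 1.645 1.960 2.576
round(lower, 3)    # the same values with a negative sign

# Symmetry of the standard normal: Phi(-x) = 1 - Phi(x)
x <- 1.3
sym_gap <- pnorm(-x) - (1 - pnorm(x))   # zero up to rounding error
```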

When is a continuity correction used for the normal approximation to the binomial?

The approximation can be used when a) np ≥ 5 and

b) n(1-p) ≥ 5; the continuity correction then widens each end point by 0.5.
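
The effect of widening the end points by 0.5 can be seen by comparing the exact binomial probability with the normal approximation, with and without the correction. The values n = 50, p = 0.4 and the interval 15 to 25 are arbitrary choices for this sketch.

```r
n <- 50
p <- 0.4
q <- 1 - p
stopifnot(n * p >= 5, n * q >= 5)    # conditions for using the approximation

mu    <- n * p                        # mean of the approximating normal
sigma <- sqrt(n * p * q)              # its standard deviation
a <- 15
b <- 25

exact      <- sum(dbinom(a:b, size = n, prob = p))                      # P(15 <= X <= 25)
approx_cc  <- pnorm(b + 0.5, mu, sigma) - pnorm(a - 0.5, mu, sigma)     # with correction
approx_raw <- pnorm(b, mu, sigma) - pnorm(a, mu, sigma)                 # without correction

c(exact = exact, corrected = approx_cc, uncorrected = approx_raw)
```

The corrected value sits much closer to the exact probability than the uncorrected one.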

when is ~N approx for poisson used?

This approximation is used for μ ≥ 10. Use a continuity correction.

Statistical inference

Statistical inference is inference about a

population from a random sample drawn

from it or, more generally, about a random

process from its observed behavior during a

finite period of time. It includes:

estimation: point estimation or interval estimation

hypothesis testing

prediction

sampling distribution.

The probability distribution that characterizes

some aspect of the sampling variability,

usually the mean but not always, is called a

sampling distribution.

Sampling distributions allow us to make

objective statements about population

parameters without measuring every object in

the population.

Randomness

Randomness is a lack of order, purpose,

cause, or predictability. A random process is

a repeating process whose outcomes follow

no describable deterministic pattern, but follow

a probability distribution.

The mean of the sampling distribution of the sample mean

is the population mean µ, so the sample mean is an unbiased estimator of the population mean

true or false: there can be many unbiased estimators of a

parameter.

true

derive the standard error of the mean

SE = sqrt(Var(x̄)). For independent observations with common variance σ², Var(x̄) = Var((1/n) Σ Xᵢ) = (1/n²) Σ Var(Xᵢ) = nσ²/n² = σ²/n, so SE = σ/√n.
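
A small simulation can make the result σ/√n concrete. This sketch assumes an exponential population (chosen only because it is clearly non-normal); the seed, sample size, and number of replications are arbitrary.

```r
set.seed(1)
sigma <- 2      # an exponential with rate 1/sigma has mean sigma and sd sigma
n     <- 25
reps  <- 20000

# Draw many samples of size n and keep each sample mean
xbars <- replicate(reps, mean(rexp(n, rate = 1 / sigma)))

theoretical_se <- sigma / sqrt(n)   # 0.4
empirical_se   <- sd(xbars)         # close to 0.4
center         <- mean(xbars)       # close to the population mean, 2
```

The spread of the simulated sample means matches σ/√n, and their average sits at the population mean, even though the population itself is skewed.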

CLT

If n is sufficiently large, then the sample mean X̄ is

approximately normally distributed, i.e.

X̄ ∼ N(μ, σ²/n), even if the underlying population from which you

are sampling is not normal.

warnings about CI

1. We should not say that the probability is 95% that the

true population mean falls in any particular

confidence interval.

2. Any individual confidence interval may or may not

contain the true population mean.

3. We can say that over a large number of exercises

(draw a sample then compute a 95% confidence

interval) 95% of intervals computed in this way will

contain the true population mean.

three factors that affect the length of

the confidence interval

Length = 2 × z(1-α/2) × σ/√n

1. The sample size n

(as n increases the length decreases).

2. The standard deviation σ

(as σ increases the length increases).

3. The confidence coefficient 100% x (1-α)

(as the confidence coefficient increases the length

increases).
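
The three effects can be checked with a small R function for the length of a two-sided interval with known σ. It assumes the usual critical value z(1-α/2); `ci_length` is a hypothetical helper name.

```r
# Length of a two-sided confidence interval for the mean, sigma known.
ci_length <- function(sigma, n, conf = 0.95) {
  alpha <- 1 - conf
  2 * qnorm(1 - alpha / 2) * sigma / sqrt(n)
}

base <- ci_length(sigma = 10, n = 25)        # 2 * 1.96 * 10 / 5, about 7.84

ci_length(sigma = 10, n = 100)               # larger n      -> shorter interval
ci_length(sigma = 20, n = 25)                # larger sigma  -> longer interval
ci_length(sigma = 10, n = 25, conf = 0.99)   # higher confidence -> longer interval
```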

If population variance is unknown:

t = (x̄ - µ)/(s/√n), which follows a

t distribution with n-1 DF

How are the normal and t distributions different?

The t distribution has longer tails and a lower peak than the normal; how much depends on the df, and df = 1 has the lowest peak and the longest tails. R code:

# Display Student's t distributions with various
# degrees of freedom and compare them to the standard normal distribution
x <- seq(-4, 4, length.out = 100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
# Standard normal density as a dashed reference curve
plot(x, hx, type = "l", lty = 2, xlab = "x value",
     ylab = "Density", main = "Comparison of t Distributions")
# Overlay one solid curve per degrees-of-freedom value
for (i in seq_along(degf)) {
  lines(x, dt(x, degf[i]), lwd = 2, col = colors[i])
}
legend("topright", inset = .05, title = "Distributions",
       labels, lwd = 2, lty = c(1, 1, 1, 1, 2), col = colors)

Note that as the degrees of freedom increase, the t distribution

looks more like the unit normal distribution.

The sample variance s² is an unbiased estimator of

the population variance σ².

What distribution is used to characterize the sampling distribution of the variance?

Chi-squared with n-1 degrees of freedom: (n-1)s²/σ² ~ χ²(n-1), where a χ² variable is a sum of squared independent standard normal variables

Can solve the inequality for population variance to get a CI

Caution: For this to be valid the underlying

distribution must be normal. This is

different from what we said regarding the

confidence intervals for the population

mean.
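
Solving the chi-squared inequality for σ² gives the interval ((n-1)s²/χ²(1-α/2), (n-1)s²/χ²(α/2)), with both quantiles taken at n-1 degrees of freedom. A minimal R sketch, assuming normally distributed data; `var_ci` and the data are hypothetical.

```r
# Confidence interval for a population variance (valid only for normal data).
var_ci <- function(x, conf = 0.95) {
  n     <- length(x)
  s2    <- var(x)
  alpha <- 1 - conf
  c(lower = (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
    upper = (n - 1) * s2 / qchisq(alpha / 2, df = n - 1))
}

x  <- c(9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3)   # made-up measurements
ci <- var_ci(x)
ci
```

The interval is not symmetric around s² because the chi-squared distribution is skewed to the right.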

Like the t distribution, the chi-square distribution is a

family of distributions that depend on

the degrees of

freedom

For n ≥ 3 degrees of freedom, the chi-squared distribution has a

mode greater than zero and is skewed to the

right.

So why not make α very, very small?

This may be the solution in some cases;

however, for a fixed sample size, reducing the α level always

increases the probability of a Type II error.

define type I and II errors

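Type I error occurs when H0 is true and is rejected:

α = P(reject H0 | H0 true)

Type II error occurs when H0 is false and is not rejected:

β = P(do not reject H0 | H0 false)

Power = 1 - β = P(reject H0 | H0 false) = P(reject H0 | H1 true)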

why is null hypothesis never proven?

Because we have type I and II error and one is

potentially possible in any decision, we NEVER say

that we have proved that H0 is true or that H0 is

false.

two definitions for p value

• The p-value is the probability of observing

a result as extreme as, or more extreme than, the one actually observed,

given that the null hypothesis is true.

• It is also described as the smallest

significance level at which the current test

would result in rejection.

• Another way to define the p-value is that it is

the significance level at which we would be

indifferent about rejecting or failing to reject

the null hypothesis.

The investigator must distinguish between

results that are statistically significant and

The investigator must distinguish between

results that are statistically significant and

results that are clinically significant.

• Statistical significance does not imply clinical

significance.

given α and the true population mean, and xbar, calculate β

see slide 7/17
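
The slide itself is not reproduced here. As a rough sketch of one common version of this calculation, the R code below computes β for a one-sided z test of H0: µ = µ0 against a true mean µ1 > µ0 with σ known. The setup, the function name `beta_one_sided`, and the numbers are all assumptions made for illustration.

```r
# Type II error for a one-sided z test (H1: mu > mu0), sigma known.
# Assumed setup for illustration; not taken from the slide.
beta_one_sided <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  se   <- sigma / sqrt(n)
  crit <- mu0 + qnorm(1 - alpha) * se    # reject H0 when xbar > crit
  pnorm(crit, mean = mu1, sd = se)       # P(xbar <= crit | true mean mu1)
}

b <- beta_one_sided(mu0 = 100, mu1 = 105, sigma = 15, n = 36)
b        # about 0.36
1 - b    # power, about 0.64
```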

So what happens when we don't know the population variance?

use t distribution and s²

Factors influencing Sample Size:

1. The sample size increases as σ² increases.

2. The sample size increases as the significance

level is made smaller (α decreases).

3. The sample size increases as the probability

of the type II error (β) decreases (or

equivalently as the power 1 - β increases).

4. The sample size increases as the absolute

value of the difference between the null and

the alternative means decreases.
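
All four effects follow from the usual sample-size formula for a two-sided one-sample z test, n = σ²(z(1-α/2) + z(1-β))²/(µ0 - µ1)². The sketch below assumes that formula; `sample_size` is a hypothetical helper and the numbers are made up.

```r
# Sample size for a two-sided one-sample z test with known sigma (assumed formula).
sample_size <- function(mu0, mu1, sigma, alpha = 0.05, power = 0.80) {
  z_a <- qnorm(1 - alpha / 2)
  z_b <- qnorm(power)                    # z(1 - beta)
  ceiling(sigma^2 * (z_a + z_b)^2 / (mu0 - mu1)^2)
}

base <- sample_size(mu0 = 100, mu1 = 105, sigma = 15)   # 71

sample_size(100, 105, sigma = 20)                 # larger variance      -> larger n
sample_size(100, 105, sigma = 15, alpha = 0.01)   # smaller alpha        -> larger n
sample_size(100, 105, sigma = 15, power = 0.90)   # smaller beta         -> larger n
sample_size(100, 102, sigma = 15)                 # smaller difference   -> larger n
```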

What about hypothesis testing for a population variance?

must use chi squared test

are chi square tests one and two sided?

yes

when are pvalues in chi square test multiplied by two?

for two sided tests - see slide 7/27

how are proportion hypothesis tests set up?

Use the asymptotic normal distribution for the binomial: z = (p̂ - p0)/SE, where

SE = sqrt(p0 q0 / n) and q0 = 1 - p0

calculating p values when testing proportions

If p̂ < p0:

p-value = 2 × Φ(z) = twice the area to the left

of z under a N(0, 1) curve (see slide 7/29).
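
The one-sample proportion test and its two-sided p-value can be sketched in R. The code assumes the standard error is computed under H0 as sqrt(p0(1 - p0)/n), and it mirrors the left-tail rule by using the right tail when p̂ ≥ p0. `prop_z_test` and the counts are hypothetical.

```r
# Two-sided one-sample z test for a proportion (normal-theory method).
prop_z_test <- function(x, n, p0) {
  p_hat <- x / n
  se    <- sqrt(p0 * (1 - p0) / n)     # standard error under H0
  z     <- (p_hat - p0) / se
  p_value <- if (z < 0) 2 * pnorm(z) else 2 * (1 - pnorm(z))
  c(z = z, p_value = p_value)
}

res <- prop_z_test(x = 40, n = 100, p0 = 0.5)
res    # z = -2, p-value = 2 * pnorm(-2), about 0.0455
```

Base R's `prop.test(40, 100, p = 0.5, correct = FALSE)` reports the same p-value, since its chi-squared statistic is z².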

Longitudinal studies are said to use the

paired-sample design.

Cross-sectional studies are said to use the

independent-samples design.

Computation of the p-value for the

Paired t Test

If t < 0 then

p-value = twice the area to the left of the

observed value of the test statistic under a t(n-1)

distribution.
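
A paired t test computed by hand in R, following the rule above (and using the right tail instead when t ≥ 0). The before/after values are made up.

```r
before <- c(120, 132, 128, 141, 135, 126)   # made-up paired measurements
after  <- c(115, 130, 129, 133, 131, 120)

d <- after - before                         # within-pair differences
n <- length(d)

t_stat <- mean(d) / (sd(d) / sqrt(n))       # test statistic, n - 1 df
p_value <- if (t_stat < 0) {
  2 * pt(t_stat, df = n - 1)                # twice the area to the left of t
} else {
  2 * (1 - pt(t_stat, df = n - 1))          # twice the area to the right of t
}

c(t = t_stat, p_value = p_value)
```

The result matches `t.test(after, before, paired = TRUE)`.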

Find the expectation and variance of the difference of two sample means for two populations

x̄₁ - x̄₂ ~ N(µ₁-µ₂, σ²₁/n₁ + σ²₂/n₂) for independent samples: expectation µ₁-µ₂, variance σ²₁/n₁ + σ²₂/n₂

How do we check for equality of

variances?

The variance ratio s²₁/s²₂ was studied by R. A. Fisher

and G. Snedecor. They determined that the variance

ratio had an F-distribution with n₁-1 and n₂-1 degrees

of freedom.

What this really means is that you can always put

the larger variance in the numerator.

Performing F test of equality of variance

Null hypothesis when testing for equality of variances

variances are equal
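
A sketch of the two-sided F test for equality of two variances, assuming both samples come from normal populations. `f_test` is a hypothetical helper and the data are made up; base R's `var.test` performs the same test.

```r
# Two-sided F test of H0: the two population variances are equal.
f_test <- function(x, y) {
  f   <- var(x) / var(y)               # variance ratio s1^2 / s2^2
  df1 <- length(x) - 1
  df2 <- length(y) - 1
  lower_tail <- pf(f, df1, df2)
  c(F = f, p_value = 2 * min(lower_tail, 1 - lower_tail))
}

x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)          # made-up sample 1
y <- c(5.0, 5.2, 5.1, 4.9, 5.3, 5.0, 5.1)     # made-up sample 2

res <- f_test(x, y)
res
```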

what if the variances are not equal (null hypothesis rejected)

We need to compare the means of two normal

distributions where we know that the variances

are unequal.

This is called the Behrens-Fisher problem. It is addressed with the two-sample t test for independent samples with unequal variances.

how are degrees of freedom determined for two sample t test with unequal variances?

approximating degrees of freedom

d' where we round d' down to the nearest integer d''.
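
The card does not show the formula for d'. The sketch below assumes it is the usual Satterthwaite approximation, d' = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁-1) + (s₂²/n₂)²/(n₂-1)]. `welch_df` and the data are hypothetical.

```r
# Approximate degrees of freedom for the unequal-variance two-sample t test.
# Assumption: d' is the Satterthwaite approximation; d'' rounds it down.
welch_df <- function(x, y) {
  a <- var(x) / length(x)
  b <- var(y) / length(y)
  d_prime <- (a + b)^2 / (a^2 / (length(x) - 1) + b^2 / (length(y) - 1))
  c(d_prime = d_prime, d_double_prime = floor(d_prime))
}

x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)          # made-up sample 1
y <- c(5.0, 5.2, 5.1, 4.9, 5.3, 5.0, 5.1)     # made-up sample 2

dfs <- welch_df(x, y)
dfs
```

R's `t.test(x, y)` uses the same d' by default but keeps it unrounded.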

cardinal data

Cardinal data are data that are measured on a

scale where it is meaningful to measure the

distance between two possible data values.

For cardinal data, if the zero point is arbitrary,

the data are on an interval scale.

If the zero point is fixed, then the data are on

a ratio scale.

Ordinal scale

data are data that can be ordered

but do not have specific numerical values.

Nominal scale data

are data that have different

values but the values cannot be ordered.

Nonparametric Statistical Tests

Sign Test

Wilcoxon Signed-rank Test

Wilcoxon Rank-Sum Test

Each of these has Normal Theory Methods and Exact Methods

Sign Test Summary:

Notice that the sign test uses only the sign of

the difference.

The magnitude of the difference is not used.

One would think that this could be throwing

away a lot of information.
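
An exact sign test can be sketched in R with `binom.test`: under H0, the number of positive differences among the nonzero ones is binomial with p = 0.5. `sign_test` and the differences are hypothetical.

```r
# Exact two-sided sign test on paired differences (zeros are dropped).
sign_test <- function(d) {
  d   <- d[d != 0]
  pos <- sum(d > 0)                    # only the signs are used
  binom.test(pos, length(d), p = 0.5)$p.value
}

d <- c(-5, -2, 1, -8, -4, -6, -3, -7, 2, -1)   # made-up differences: 2 positive of 10

p_value <- sign_test(d)
p_value               # 2 * P(X <= 2) for X ~ Binomial(10, 0.5) = 112/1024

sign_test(100 * d)    # rescaling the magnitudes leaves the p-value unchanged
```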

Comparison of Two Independent Binomial

Proportions

Two-sample binomial test (normal theory method)

Confidence Interval for difference between two population

proportions

Sample size and power formulas

2 x 2 Contingency Tables

equivalent to two-sample binomial test

Fisher's Exact Test

R x C contingency tables

Test of association or independence

Test of homogeneity

Comparison of Two Paired Binomial Proportions

McNemar's Test for correlated proportions