Measure of effect - Binary data

Odds Ratio

Measure of effect - Incidence rates data

Incidence Rate Ratio

Measure of Effect - Survival Data

Hazard Ratio (HR)

Kaplan-Meier Survival Curve

Stratified Analysis - Binary data

Stratum-specific ORs (modification)

Unadjusted and adjusted ORs (confounding)

Stratified Analysis - Incidence rates data

Stratum-specific IRRs (modification)

Unadjusted and adjusted IRRs (confounding)

Stratified Analysis - Survival Data

Plot Kaplan-Meier survival curves

compare with log-rank test

Typical Multivariate Model - Binary Data

Logistic regression

Typical Multivariate Model - Incidence Rates Data

Poisson regression

Typical Multivariate Model - Survival Data

Proportional hazard regression

Residual variation (i.e., the assumed distribution of the outcome) - Binary data

binomial

Residual variation - Incidence Rates Data

Poisson

Residual variation - Survival Data

Unspecified (that's the whole thing, it doesn't force a structure on the data) --> SEMI-parametric

Standard Deviation

-describes the spread (variability) of individual observations around the population mean

-independent of sample size

Standard Error

-describes the variability of an ESTIMATOR of some population mean (or other parameter of interest)

-dependent on: sample size and standard deviation

=standard deviation/sqrt(n)

Weird: the standard error of the mean is the standard deviation of the sampling distribution of the mean (the spread of the sample means you would get from repeated samples of size n)
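A minimal sketch of the SD vs. SE distinction, using only the formula on this card (SE = SD / sqrt(n)). The data values and the function name are made up for illustration.

```python
import math
import statistics

def sd_and_se(values):
    """Return (sd, se) for a sample, where se = sd / sqrt(n)."""
    n = len(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return sd, se

# Hypothetical measurements, for illustration only
small = [4.0, 6.0, 5.0, 7.0, 3.0]
large = small * 20                  # same spread, 20x the sample size

sd_small, se_small = sd_and_se(small)
sd_large, se_large = sd_and_se(large)

print(f"n={len(small)}: SD={sd_small:.3f}, SE={se_small:.3f}")
print(f"n={len(large)}: SD={sd_large:.3f}, SE={se_large:.3f}")
```

The SD stays in the same ballpark as n grows, while the SE shrinks with sqrt(n).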

Central Limit Theorem

Even if the underlying population distribution is severely non-normal, the sampling distribution of the estimator (e.g. the sample mean) will be approximately normal, if the sample size is large enough (WHACK!)
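A small simulation sketch of the idea above. It assumes an exponential population (strongly right-skewed) and uses skewness as a rough stand-in for "how non-normal"; the function names, sample sizes, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def sample_means(n, n_samples=2000, seed=0):
    """Means of n_samples samples of size n from an exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(n_samples)
    ]

def skewness(xs):
    """Moment-based skewness; close to 0 for a symmetric (e.g. normal) shape."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

means_n2 = sample_means(2)
means_n100 = sample_means(100)
skew_n2 = skewness(means_n2)
skew_n100 = skewness(means_n100)

print(f"skewness of sample means, n=2:   {skew_n2:.2f}")
print(f"skewness of sample means, n=100: {skew_n100:.2f}")
```

With the larger sample size the distribution of sample means should be much closer to symmetric, even though the population itself is not.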

Survival time analyses need?

-continuous or semi-continuous survival time variable AND binary or categorical censoring variable

-Survival times are generally continuous, but almost never normally distributed

--> Kaplan-Meier & Log-Rank Test are really non-parametric

-Survival analyses are built up from conditional risk sets

Right Censoring means what? COMMON

Follow-up was terminated before the outcome was observed

--> administrative censoring: study ends

--> drop-out

Survival methods assume that the reason for censoring is NOT a risk factor for the event

Left Censoring means what? UNCOMMON

Subject had the event before follow-up started, we get to observe the event, but we don't know when it happened. Big take-home: subject entered the study!

--> age of menarche example, the 10 year-old's time was left-censored

Left Truncation means what? COMMON

Remember that truncation has everything to do with the selection process in the study design, while censoring has everything to do with once the study gets going

--> Subject had the event but did not get to enter the study

--> e.g. time-to-abortion study, time zero is date of conception, women enrolled at first pre-natal visit --> by definition, women who miscarry before their first pre-natal visit will never make it into the study

--> this is selection bias

--> can be fixed by delayed entry (shorten the amount of time at risk)

--> You observe fewer spontaneous abortions, probably more so in the exposed group, so the missing events push the HR closer to the null (bias towards the null)

Right Truncation means what? LESS COMMON

Subject is only included if the event has already happened by the time of sampling, so long event times are systematically missed (e.g. a registry that only enrolls people once they are diagnosed)

Survival function S(t) has what component parts?

-function of time

-probability

S(t) is the probability that the average person survives beyond time t

S(t) = # surviving after t / # at start (when there is no censoring)

S(t) at the beginning = 1

S(t) at the end = 0 (once everyone has had the event)
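A rough sketch of how S(t) can be estimated when some subjects are censored, using the Kaplan-Meier product of conditional survival probabilities. The function name and the toy data are assumptions for illustration; with no censoring it reduces to "# surviving after t / # at start".

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if right-censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)  # risk set just before t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk           # conditional survival at t
        curve.append((t, survival))
    return curve

# Hypothetical follow-up data: two subjects are censored (event = 0)
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

# Same estimator with no censoring at all
curve_complete = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])

for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```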

Hazard function: h(t)

# events / # in risk set

= P(you'll die in the next Δt, given you're alive today) / length of Δt

--> is a rate, not a probability

What's the opposite of S(t)?

Cumulative distribution function: F(t)

F(t) = risk = cumulative incidence

survival is the complement of risk: S(t) = 1 - F(t)

Characteristics of the Log-Rank Test

-equivalent to MH test, but rows are points in time

-gives equal weights to all events

-Caveat: hazards need to be proportional for this to be a well-powered measure
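A minimal sketch of the two-group log-rank statistic, built the way the card describes: one 2x2 table per event time, combined Mantel-Haenszel style with equal weight for every event. The function name and toy data are assumptions for illustration; the result is a chi-square statistic with 1 degree of freedom.

```python
def log_rank_chi2(times, events, groups):
    """Log-rank chi-square (1 df) comparing group 1 with group 0.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    groups : 1 or 0 for each subject
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for u in times if u >= t)                        # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e == 1 and g == 1)
        observed += d1
        expected += d * n1 / n                                     # expected under the null
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance

# Hypothetical data: group 1 fails early, group 0 fails late, no censoring
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
groups = [1, 1, 1, 0, 0, 0]
chi2 = log_rank_chi2(times, events, groups)

# Two groups with identical event times should give a statistic of 0
chi2_same = log_rank_chi2([1, 2, 3, 1, 2, 3], [1] * 6, [1, 1, 1, 0, 0, 0])

print(f"log-rank chi-square = {chi2:.3f}")
```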

Interpretation of Beta (Hazard Model)

exp(Beta1) = hazard ratio comparing those with X = 1 to those with X = 0 (referent), adjusted for all other predictor variables in the model

exp(Beta) is assumed to be

"constant across time"

Beta = log of the hazard ratio for a one-unit increment in x

Info on the baseline hazard (h₀)

-like an intercept term (except it's a function of time)

-for whatever reason (and this is what makes it semi-parametric), you don't HAVE to estimate the baseline hazard to estimate the other beta coefficients

--> as in, you're not forcing any kind of shape on the baseline hazard...

--> KEY POINT: the baseline hazard is the SAME for all individuals

Kaplan-Meier and Log-Rank test are parametric or non-parametric?

Non-parametric

--> they impose little structure on the data

How do you relax the proportional hazards assumption?

1. add an interaction term between the covariate and continuous time

2. add an interaction term between the covariate and categorical time

3. fit a special version of the model which is based on stratifying on a covariate

--> violations of PHA are important for all covariates but are ESPECIALLY important for the main exposure!

Deviance Residuals, what's that?

...

What type of likelihood function does proportional hazards model use?

--> partial likelihood

-factors out the baseline hazard function from the likelihood function

--> allows you to fit the proportional hazards model without specifying the underlying event-time distribution

[still unclear why this is...]

What is the partial likelihood?

a function of the beta coefficients; the estimates are the beta values that maximize it in the proportional hazards model

who uses the maximum likelihood estimation?

logistic regression, comparable to the partial likelihood for proportional hazards model

How to construct likelihood, e.g. L at t=3?

Ask the question: "Given that an event occurred at t = 3, what is the probability that it occurred to subject i, rather than to one of the other subjects?"

Formula for L at t

Numerator: hazard for the subject who had the event at event time t

Denominator: sum of hazards for all those at risk of the event at time point t
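A small sketch of one term of the partial likelihood, following the numerator/denominator description above. It assumes a single covariate and no ties; because the baseline hazard is shared by everyone in the risk set, only the exp(beta * x) parts are computed. All names and values are made up for illustration.

```python
import math

def partial_likelihood_term(beta, x, at_risk, case):
    """Contribution of one event time to the partial likelihood.

    beta    : coefficient for the single covariate
    x       : covariate value for each subject
    at_risk : indices of subjects still at risk at this event time
    case    : index of the subject who had the event
    """
    numerator = math.exp(beta * x[case])                        # relative hazard of the case
    denominator = sum(math.exp(beta * x[j]) for j in at_risk)   # everyone at risk
    return numerator / denominator

# Hypothetical risk set at t = 3: subject 0 is exposed (x = 1) and has the event
x = [1, 0, 0]
term_hr2 = partial_likelihood_term(math.log(2), x, at_risk=[0, 1, 2], case=0)
term_null = partial_likelihood_term(0.0, x, at_risk=[0, 1, 2], case=0)

print(f"HR = 2: {term_hr2:.3f}")
print(f"HR = 1: {term_null:.3f}")
```

With a hazard ratio of 2 the exposed subject accounts for 2 of the 4 "hazard units" in the risk set; with a hazard ratio of 1 each of the three subjects is equally likely to be the case.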

Combining the hazards model and the likelihood function

Once in the likelihood function, the baseline hazard cancels out

--> this is why we can estimate the beta coefficients in a proportional hazards model without formally specifying the underlying distribution of the event-time distribution
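Written out for a single event time, the cancellation looks like this. The notation is an assumption for illustration: subject i has the event at time t_i, R(t_i) is the risk set at that time, and each hazard has the proportional hazards form h₀(t) exp(βx).

```latex
L_i(\beta)
  = \frac{h_0(t_i)\,\exp(\beta x_i)}{\sum_{j \in R(t_i)} h_0(t_i)\,\exp(\beta x_j)}
  = \frac{\exp(\beta x_i)}{\sum_{j \in R(t_i)} \exp(\beta x_j)}
```

The same h₀(t_i) multiplies every term in the numerator and denominator, so it divides out and only β is left to estimate.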

What is a tie?

two or more events at the same time point

Why do I care about ties?

-lots of ties make the partial likelihood complicated --> all possible permutations of ordering need to be evaluated

-lots of ties are bad (use the Efron approximation rather than Breslow, which gets less accurate as ties pile up)

What is an extended proportional hazards model?

a proportional hazards model that includes time interactions

--> interactions between time and covariates also allow for those covariates to be time-varying

Incidence rates

# events / person-time at risk

Units: inverse time

Rates are non-negative numbers, bounded by 0 and infinity

Key Assumption of Person-time at risk

-Person-time units are fully exchangeable (10 people followed for 10 years each count the same as 100 people followed for 1 year each: both are 100 person-years)

[Exchangeability Assumption]

IR 95% confidence interval

SE(IR) = sqrt(# events) / person-time at risk

IR ± 1.96 * SE(IR)
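A quick sketch of the interval above in code. The event count and person-time are hypothetical, and the function name is made up for illustration.

```python
import math

def incidence_rate_ci(events, person_time, z=1.96):
    """Incidence rate with a Wald 95% CI: IR +/- z * sqrt(events) / person_time."""
    ir = events / person_time
    se = math.sqrt(events) / person_time
    return ir, ir - z * se, ir + z * se

# Hypothetical: 25 events over 1000 person-years
ir, lower, upper = incidence_rate_ci(events=25, person_time=1000.0)
print(f"IR = {ir:.4f} per person-year (95% CI {lower:.4f} to {upper:.4f})")
```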

Rate Ratio

-bounded by 0 and infinity

-null value: 1

95% Confidence Interval of Rate Ratio

1. SE(ln(IRR)) = sqrt(1/a₁ + 1/a₀)

Where a₁ = # events exposed

and a₀ = # events unexposed

2. 95% CI = exp[ln(IRR) ± 1.96 * SE(ln(IRR))]
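The two steps can be sketched as follows. The counts and person-time are hypothetical and the function name is made up for illustration; the interval is built on the log scale and then exponentiated, so it is not symmetric around the IRR.

```python
import math

def rate_ratio_ci(a1, pt1, a0, pt0, z=1.96):
    """Incidence rate ratio with a 95% CI computed on the log scale.

    a1, pt1 : events and person-time among the exposed
    a0, pt0 : events and person-time among the unexposed
    """
    irr = (a1 / pt1) / (a0 / pt0)
    se_log = math.sqrt(1 / a1 + 1 / a0)   # SE(ln(IRR))
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Hypothetical: 40 events / 1000 person-years exposed, 20 / 1000 unexposed
irr, lower, upper = rate_ratio_ci(a1=40, pt1=1000.0, a0=20, pt0=1000.0)
print(f"IRR = {irr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```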

Notes about Rates and Rate Ratios

-use for both rare and common events --> don't depend on prevalence

-Incidence rate is an instantaneous measure, like speed in physics

Relating Risks to Rates

If a disease is rare, then rate*duration approximately equals risk

If the disease is common, you need the exponential formula: risk = 1 - exp(-rate*duration)
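A short sketch comparing the simple approximation (rate x duration) with the exponential formula, risk = 1 - exp(-rate x duration). It assumes the rate is constant over the period; the rates and durations are hypothetical.

```python
import math

def risk_exact(rate, duration):
    """Risk under a constant rate: 1 - exp(-rate * duration)."""
    return 1 - math.exp(-rate * duration)

def risk_approx(rate, duration):
    """Rare-disease shortcut: rate * duration."""
    return rate * duration

# Hypothetical rare disease: 1 per 1000 person-years over 5 years
rare_exact = risk_exact(0.001, 5)
rare_approx = risk_approx(0.001, 5)

# Hypothetical common disease: 200 per 1000 person-years over 5 years
common_exact = risk_exact(0.2, 5)
common_approx = risk_approx(0.2, 5)

print(f"rare:   exact={rare_exact:.5f}, approx={rare_approx:.5f}")
print(f"common: exact={common_exact:.3f}, approx={common_approx:.3f}")
```

For the rare disease the two agree closely; for the common one the shortcut overshoots badly, since a risk cannot exceed 1.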

Assumptions of the Poisson Regression

1. events are independent

2. the effect of ordinal and continuous variables is linear in the logarithmic scale

3. the combined effect of variables is multiplicative

4. the outcome variable follows the poisson distribution with mean = variance, conditional on the predictor variables

--> 4 is specific to Poisson

Note: Poisson regression uses grouped/tabular data (observations are not individual people)

Residual variation

observed values for the outcome MINUS model-predicted values for the outcome

Is the average residual variation the variance??

Over-dispersion and under-dispersion

Over-dispersion: If residual variation (variance) is greater than the mean

--> SEs too small

--> CIs too narrow

--> Wald p-values too low

Under-dispersion: if residual variation (variance) is less than the mean
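One common way to check for this is the Pearson dispersion statistic: the sum of (observed - fitted)² / fitted, divided by the residual degrees of freedom. The sketch below assumes you already have fitted means from some Poisson model; the counts, the fitted values, and the function name are all made up for illustration.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    Near 1 when variance is about equal to the mean, above 1 suggests
    over-dispersion, below 1 suggests under-dispersion.
    """
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)

# Hypothetical counts and model-predicted means (not from a real model fit)
observed = [0, 12, 1, 15, 2, 20]
fitted = [5.0, 6.0, 7.0, 8.0, 9.0, 15.0]
phi = pearson_dispersion(observed, fitted, n_params=2)

print(f"dispersion = {phi:.2f}")
```

A value well above 1, as in this toy example, would point to over-dispersion, so the model-based SEs would be too small.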

What type of likelihood function does Poisson regression use?

Maximum likelihood, obviously!

--> through iterative fitting, finds the coefficient values that maximize the likelihood for the specified model

--> MLE just like logistic regression
