Stratified Analysis - Binary data
Stratum-specific ORs (modification)
Unadjusted and adjusted ORs (confounding)
Stratified Analysis - Incidence rates data
Stratum-specific IRRs (modification)
Unadjusted and adjusted IRRs (confounding)
Residual variation - Survival Data
Unspecified (that's the whole thing, it doesn't force a structure on the data) --> SEMI-parametric
-describes the variance of an ESTIMATOR of some population mean (or other parameter of interest)
-dependent on: sample size and standard deviation
Weird: the standard error of the mean IS the standard deviation of the sampling distribution of the sample mean (huh?)
Central Limit Theorem
Even if the underlying population distribution is severely non-normal, the sampling distribution of the estimator will be approximately normal, if the sample size is large enough (WHACK!)
Survival time analyses need?
-continuous or semi-continuous survival time variable AND binary or categorical censoring variable
-Survival times are generally continuous, but almost never normally distributed
--> Kaplan-Meier & Log-Rank Test are really non-parametric
-Survival analyses are built up from conditional risk sets
Right Censoring means what? COMMON
Follow-up was terminated before the outcome was observed
--> administrative censoring: study ends
Survival methods assume that the reason for censoring is NOT a risk factor for the event
Left Censoring means what? UNCOMMON
Subject had the event before follow-up started, we get to observe the event, but we don't know when it happened. Big take-home: subject entered the study!
--> age-of-menarche example: the 10-year-old's time was left-censored
Left Truncation means what? COMMON
Remember that truncation has everything to do with the selection process in the study design, while censoring has everything to do with once the study gets going
--> Subject had the event but did not get to enter the study
--> e.g. time-to-abortion study, time zero is date of conception, women enrolled at first pre-natal visit --> by definition, women who miscarry before their first pre-natal visit will never make it into the study
--> this is selection bias
--> can be fixed by delayed entry (subjects contribute person-time only from study entry onward, which shortens the time at risk)
--> You have fewer spontaneous abortions, probably concentrated in the exposed group, so the missing abortions pull the HR closer to the null (bias towards the null)
Right Truncation means what? LESS COMMON
Only subjects whose event occurred before some cutoff are observed (e.g., a study built only from cases already reported by a fixed date); subjects with longer event times never enter the data
Survival function S(t) has what component parts?
-function of time
S(t) is the probability that a randomly chosen individual survives beyond time t
S(t) = # surviving beyond t / # at start (valid when nobody is censored)
S(t) at the beginning = 1
S(t) at the end = 0
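The simple # surviving / # at start formula only works when nobody is censored; with right censoring, S(t) comes from a product over conditional risk sets (the Kaplan-Meier estimator). A minimal stdlib-Python sketch with made-up (time, event) data:

```python
# Kaplan-Meier estimate of S(t) from (time, event) pairs.
# event = 1 means the outcome occurred; event = 0 means right-censored.
# Data below are made up purely for illustration.

def kaplan_meier(data):
    """Return [(t, S(t))] at each event time, using conditional risk sets."""
    data = sorted(data)
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for (tt, e) in data if tt == t and e == 1)
        removed = sum(1 for (tt, e) in data if tt == t)  # deaths + censorings at t
        if deaths > 0:
            s *= 1 - deaths / n_at_risk  # conditional survival past time t
            curve.append((t, s))
        n_at_risk -= removed             # shrink the risk set
        i += removed
    return curve

times = [(2, 1), (3, 0), (5, 1), (5, 1), (8, 0), (9, 1)]
print(kaplan_meier(times))
```

Each factor (1 - deaths / n_at_risk) is the conditional probability of surviving past one event time, given survival up to it: exactly the "conditional risk sets" idea above.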
Hazard function: h(t)
= (# events / # in risk set) per unit time
= P(you'll die in the next Δt, given you're alive today) / length of Δt
--> it's a rate, not a probability
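A toy calculation (made-up numbers) showing why the hazard is a rate, not a probability:

```python
# Hazard as a rate: conditional probability of the event in (t, t + dt],
# divided by the interval length dt. All numbers are made up.
n_at_risk = 100   # alive at time t
events = 2        # deaths during the next interval
dt = 0.5          # interval length, in years

cond_prob = events / n_at_risk   # 0.02: a probability, unitless
hazard = cond_prob / dt          # 0.04 per year: a rate, units 1/time
print(cond_prob, hazard)
```

Halving dt (with the same conditional probability per interval) doubles the hazard, which is why the hazard carries units of inverse time.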
What's the opposite of S(t)?
Cumulative distribution function: F(t)
F(t) = risk = cumulative incidence
survival is the opposite of risk
Characteristics of the Log-Rank Test
-equivalent to the MH test, but the strata are points in time (one table per event time)
-gives equal weights to all events
-Caveat: hazards need to be proportional for this to be a well-powered test
Interpretation of Beta (Hazard Model)
exp(Beta1) = hazard ratio comparing those with X = 1 to those with X = 0 (referent), adjusted for all other predictor variables in the model
exp(Beta) is assumed to be
"constant across time"
Beta = log of the hazard ratio for a one-unit increment in x
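A quick numeric illustration of the interpretation; the coefficient here is made up, not from any fitted model:

```python
import math

# Interpreting a proportional-hazards coefficient: exp(beta) is the hazard
# ratio for a one-unit increment in x. beta = 0.693 is a made-up value.
beta = 0.693
hr_one_unit = math.exp(beta)        # ~2.0: the hazard doubles per unit of x
hr_five_units = math.exp(5 * beta)  # HR for a 5-unit increment (multiplicative)
print(round(hr_one_unit, 2), round(hr_five_units, 1))
```

Note the multiplicative structure: a 5-unit increment multiplies the hazard by exp(beta) five times over.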
Info on the baseline hazard (h₀)
-like an intercept term (except it's a function of time)
-for whatever reason (and this is what makes it semi-parametric), you don't HAVE to estimate the baseline hazard to estimate the other beta coefficients
--> as in, you're not forcing any kind of shape on the baseline hazard...
--> KEY POINT: the baseline hazard is the SAME for all individuals
Kaplan-Meier and Log-Rank test are parametric or non-parametric?
--> non-parametric: they impose little structure on the data
How do you relax the proportional hazards assumption?
1. Add an interaction term between the covariate and continuous time
2. Add an interaction term between the covariate and categorical time
3. fitting a special version of the model which is based on stratifying on a covariate
--> violations of PHA are important for all covariates but are ESPECIALLY important for the main exposure!
What type of likelihood function does proportional hazards model use?
--> partial likelihood
-factors out the baseline hazard function from the likelihood function
--> allows you to fit the proportional hazards model without specifying the underlying event-time distribution
[still unclear why this is...]
What is the partial likelihood?
a function of the beta coefficients in the proportional hazards model; maximizing it yields the estimates of the betas
Who uses maximum likelihood estimation?
Logistic regression; its full likelihood plays the role that the partial likelihood plays for the proportional hazards model
How to construct likelihood, e.g. L at t=3?
Ask the question: "Given that an event occurred at t = 3, what is the probability that it occurred to subject i, rather than to one of the other subjects?"
Formula for L at t
Numerator: hazard for the subject who had the event at each event time t
Denominator: sum of hazards for all those at risk of the event at time point t
Combining the hazards model and the likelihood function
Once in the likelihood function, the baseline hazard cancels out
--> this is why we can estimate the beta coefficients in a proportional hazards model without formally specifying the underlying event-time distribution
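A sketch of one event time's contribution to the partial likelihood, with made-up covariates: since h(t|x) = h0(t)·exp(beta·x), h0(t) appears in both numerator and denominator and cancels.

```python
import math

# One event time's contribution to the partial likelihood, single covariate.
# h(t|x) = h0(t) * exp(beta * x); h0(t) cancels between numerator and
# denominator, so only beta and the risk-set covariates matter.
# Covariate values below are made up for illustration.

def pl_contribution(beta, x_case, x_risk_set):
    """P(the event at time t happened to this subject | one event occurred)."""
    num = math.exp(beta * x_case)
    den = sum(math.exp(beta * x) for x in x_risk_set)  # sum over the risk set
    return num / den

# At t = 3, the subject with x = 1 has the event; risk set (case included):
risk_set = [1, 0, 0, 1, 0]
print(pl_contribution(0.0, 1, risk_set))
```

With beta = 0 the contribution is 1 / (risk-set size): every subject is equally likely to be the one with the event; a positive beta shifts probability toward exposed subjects.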
Why do I care about ties?
-lots of ties make the partial likelihood complicated --> all possible permutations of ordering need to be evaluated
-lots of ties are bad (use the Efron approximation, never Breslow)
What is an extended proportional hazards model?
a proportional hazards model that includes time interactions
--> interactions between time and covariates also allow for those covariates to be time-varying
Incidence rate
= # events / person-time at risk
Units: inverse time
Rates are positive numbers, ranging from 0 to infinity
Key Assumption of Person-time at risk
-Person-time units are fully exchangeable (10 years in 10 people is the same as 1 year in 100 people)
95% Confidence Interval of Rate Ratio
1. SE(ln(IRR)) = sqrt(1/a₁ + 1/a₀)
Where a₁ = # events exposed
and a₀ = # events unexposed
2. 95% CI = exp[ln(IRR) ± 1.96 SE ln(IRR)]
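Steps 1-2 above, worked through in stdlib Python with made-up event counts and person-time:

```python
import math

# 95% CI for an incidence-rate ratio, using SE(ln IRR) = sqrt(1/a1 + 1/a0).
# Counts and person-years below are made up for illustration.
a1, pt1 = 30, 1000.0   # events, person-years (exposed)
a0, pt0 = 15, 1200.0   # events, person-years (unexposed)

irr = (a1 / pt1) / (a0 / pt0)                  # ratio of the two rates
se_log_irr = math.sqrt(1 / a1 + 1 / a0)        # SE on the log scale
lo = math.exp(math.log(irr) - 1.96 * se_log_irr)
hi = math.exp(math.log(irr) + 1.96 * se_log_irr)
print(round(irr, 2), round(lo, 2), round(hi, 2))
```

The interval is built on the log scale (where the sampling distribution is closer to normal) and then exponentiated back, which is why it's asymmetric around the IRR.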
Notes about Rates and Rate Ratios
-use for both rare and common events --> don't depend on prevalence
-Incidence rate is an instantaneous measure, like speed in physics
Relating Risks to Rates
If a disease is rare, then rate × duration approximately equals risk
If the disease is common, you need the exponential formula: risk = 1 - exp(-rate × duration)
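A sketch (stdlib Python, made-up rates) comparing the rare-disease approximation risk ≈ rate × t with the exact constant-rate formula risk = 1 - exp(-rate × t):

```python
import math

# Relating rates to risks over a fixed follow-up time t, assuming a
# constant rate. Rates below are made up for illustration.

def risk_exact(rate, t):
    """Risk under a constant rate: 1 - exp(-rate * t)."""
    return 1 - math.exp(-rate * t)

def risk_approx(rate, t):
    """Rare-disease approximation: risk ~ rate * t."""
    return rate * t

# Rare disease: the approximation is very close to the exact value.
print(risk_approx(0.001, 5), risk_exact(0.001, 5))
# Common disease: the approximation overshoots, and can even exceed 1.
print(risk_approx(0.3, 5), risk_exact(0.3, 5))
```

The exact formula is always bounded by 1, as a risk must be; the linear approximation is not, which is why it only works when rate × duration is small.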
Assumptions of the Poisson Regression
1. events are independent
2. the effect of ordinal and continuous variables is linear on the log scale
3. the combined effect of variables is multiplicative
4. the outcome variable follows the poisson distribution with mean = variance, conditional on the predictor variables
--> assumption 4 is what's specific to Poisson
Note: Poisson regression uses grouped/tabular data (observations are not individual people)
Residual = observed value for the outcome MINUS model-predicted value for the outcome
The average squared residual (the residual variation) estimates the variance, which is then compared to the mean
Over-dispersion and under-dispersion
Over-dispersion: If residual variation (variance) is greater than the mean
--> SEs too small
--> CIs too narrow
--> Wald p-values too low
Under-dispersion: if residual variation (variance) is less than the mean
--> SEs too large, CIs too wide, Wald p-values too high (conservative)
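One common way to check dispersion (a hedged sketch, not the only diagnostic): compare the Pearson chi-square statistic to its degrees of freedom. The observed counts and hypothetical fitted means below are made up.

```python
# Dispersion check for grouped count data: Pearson chi-square / df.
# Ratio well above 1 suggests over-dispersion (SEs too small);
# well below 1 suggests under-dispersion.
# Counts and fitted means are made up; n_params is a hypothetical model size.

observed = [4, 9, 2, 12, 7, 15]
predicted = [5.0, 8.0, 3.0, 10.0, 8.0, 13.0]  # means from a hypothetical fit
n_params = 2

# Under Poisson, Var = mean, so each squared residual is scaled by the mean.
pearson = sum((o - m) ** 2 / m for o, m in zip(observed, predicted))
df = len(observed) - n_params
dispersion = pearson / df
print(round(dispersion, 2))
```

In practice the fitted means would come from the Poisson regression itself; a dispersion estimate far from 1 is the signal that the mean = variance assumption (assumption 4) is in trouble.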