Hypothesis Testing

Outline the logical basis for a hypothesis test
Clearly if we had seen 5 HIV cases in the vaccine group and 120 in the placebo group we would conclude the vaccine worked. Alternatively if we had seen 62 HIV cases in the vaccine group and 61 in the placebo group, we would conclude the vaccine did not work

However, at what stage should we begin to pay attention to the imbalance? Does the observed 51 versus 74 split supply evidence that the vaccine works, or could these results reasonably be ascribed to chance?

An hypothesis (or significance) test helps us to answer this question and assess the strength of the evidence provided by these data
The research question is first formulated in terms of a null hypothesis

The null hypothesis usually refers to no difference between groups or no association and researchers are generally keen to disprove this

More specifically, in a clinical trial the null hypothesis (H0) may be that the true difference in a parameter (mean or proportion) between treatment and placebo is 0
The probability of observing a test statistic (result) as, or more, extreme than that observed in your sample, in hypothetical repetitions of the study assuming that the null hypothesis is true

The P-value is compared with the significance level (chosen prior to the study, sometimes referred to as alpha, usually at 5%)

If P<0.05, the test is "statistically significant" at the 5% level and the null hypothesis is rejected

Alternatively if P>0.05, the test is not "statistically significant" at the 5% level and there is insufficient evidence to reject the null hypothesis
The outcome and the comparison being conducted

1. Formulate hypothesis

In a trial, the null hypothesis (H0) is often that the true difference in mean/proportion between treatment and placebo is 0

2. Conduct study

3. Use sample data to calculate a P-value

The P-value: the probability of observing a test statistic (results) as, or more, extreme than that observed in your sample, in hypothetical repetitions of the study assuming that the null hypothesis is true

4. Make decision based upon the P-value
A Type I error

Occurs if the null hypothesis is rejected when it is true

The probability of a Type 1 error, the rejection of a true null hypothesis, equals the significance level (and is therefore usually 5%)
Saying the treatment works when it doesn't
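
The link between the significance level and the Type I error rate can be checked by simulation. The following is a minimal sketch, not part of the course material: it repeatedly simulates a two-group trial in which the null hypothesis is true (both groups share the same risk) and counts how often a test at the 5% level rejects it. The function names, group size, true risk and number of repetitions are all illustrative assumptions, and for simplicity it uses the uncorrected Chi-squared statistic for a 2x2 table with the 5% critical value of 3.841 (1 degree of freedom).

```python
import random


def chi2_2x2(a, b, c, d):
    """Uncorrected Chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence of a difference
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def type1_error_rate(n_per_group=200, true_risk=0.3, n_trials=2000, seed=1):
    """Share of simulated null trials that are 'significant' at the 5% level."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rejections = 0
    for _ in range(n_trials):
        # Both groups have the same true risk, so the null hypothesis is true.
        events_1 = sum(rng.random() < true_risk for _ in range(n_per_group))
        events_2 = sum(rng.random() < true_risk for _ in range(n_per_group))
        stat = chi2_2x2(events_1, events_2,
                        n_per_group - events_1, n_per_group - events_2)
        if stat > 3.841:  # 5% critical value, Chi-squared with 1 df
            rejections += 1
    return rejections / n_trials
```

Calling type1_error_rate() should return a proportion close to 0.05: roughly 1 in 20 of these trials "finds" a difference that does not exist.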

A Type II error

Occurs if the null hypothesis is not rejected when it is false

The calculation of the probability of a Type II error is beyond the scope of this course, but for a given significance level it is related to sample size: larger studies will have a reduced risk of Type II error
A Chi-squared test for a 2x2 table is used to compare proportions between two independent groups

It is commonly used in randomised controlled trials comparing an outcome with two categories (e.g. dead/alive) in a placebo and a treatment group

The Null Hypothesis (H0) is that the population (or true) proportion in group one equals the population (or true) proportion in group two (π1 = π2)

The Alternative Hypothesis (H1) is that the population (or true) proportion in group one does not equal the population (or true) proportion in group two (π1 ≠ π2)
Test Statistic:

In this course, we are not concerned with calculating tests by hand but for transparency, the test statistic value is given by

χ² = Σ [ (|O - E| - 0.5)² / E ]

...with 1 degree of freedom

Where O is the observed cell count and E is the expected cell count if the null hypothesis were true, given by ([row total x column total] / overall total) in each cell of the 2x2 table, and the sum is taken over the four cells

To obtain a P-value the test statistic is compared against the Chi-squared distribution with 1 degree of freedom

The test is only considered reliable if expected cell values are greater than 5

Other tests such as Fisher's Exact Test (not covered in this course) are required if this criterion is not met
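
As a hedged illustration of the calculation described on this card, here is a minimal Python sketch that computes the continuity-corrected statistic and its P-value. The function name chi2_yates_2x2 is an assumption made for this sketch. The counts are those of the HIV vaccine trial example in this set (51 cases and 8146 non-cases on vaccine, 74 cases and 8124 non-cases on placebo). For 1 degree of freedom the upper tail of the Chi-squared distribution can be written with the complementary error function, so no statistics library is needed.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Continuity-corrected Chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are outcome categories, columns are the two groups.
    Returns (test statistic, P-value). Assumes every |O - E| is at least 0.5.
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under H0: row total x column total / overall total
            expected = row_totals[i] * col_totals[j] / total
            stat += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected

    # Upper tail of Chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# HIV trial: cases (vaccine, placebo), then non-cases (vaccine, placebo)
stat, p_value = chi2_yates_2x2(51, 74, 8146, 8124)
print(f"Chi-squared = {stat:.2f}, P = {p_value:.3f}")
```

For these counts the statistic comes out at roughly 3.9 and the P-value just under 0.05, so the test is significant at the 5% level. All expected counts are well above 5, so the reliability condition above is met.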
Image: How do you calculate a Chi-squared test for 2x2 table
When are the relative risk and the absolute risk reduction calculated?
In analyses of 2x2 tables, particularly in randomised controlled trials (comparing a treatment to a placebo), two measures of treatment effect are commonly calculated: the relative risk and the absolute risk reduction

What is the absolute risk reduction?
The difference in risk between the two treatment groups

Absolute risk reduction = risk in placebo group - risk in treatment group = b/(b+d) - a/(a+c)

(Here a and c are the numbers with and without the outcome in the treatment group, and b and d are the numbers with and without the outcome in the placebo group)

The absolute risk reduction is often expressed as a percentage by multiplying it by 100

What is the number needed to treat?
The number needed to treat (NNT) is based upon the absolute risk reduction and gives the number of patients that would need to be treated to prevent one adverse outcome

Number needed to treat = 1 / absolute risk reduction

What is the relative risk?
The relative risk (RR) or risk ratio is the risk in the treatment group divided by the risk in the placebo group

Relative risk = risk in treatment group / risk in placebo group = {a/(a+c)} / {b/(b+d)}

A relative risk (RR) below 1 indicates that the risk is decreased in the treatment group compared with the placebo group

For RR below 1, this is sometimes converted to a percentage by (1-RR)x100 and referred to as the relative risk reduction

Similarly, for RRs above 1 the relative risk increase is often calculated as a percentage by (RR-1)x100

What is Odds Ratio?
The odds ratio (OR) is the odds in the treatment group divided by the odds in the placebo group

Odds ratio = odds in treatment group / odds in placebo group = {a/c} / {b/d}

What is a Confidence Interval?
Confidence intervals for these measures of treatment effect are also calculated, giving a range of values which are plausible for the treatment effect. They are interpreted as in Chapter 2: for a RR, in 95% of samples taken from the population, the true RR will be captured by the calculated 95% confidence interval

The confidence intervals also allow us to predict the results of the hypothesis test as they contain values of the true RR which are plausible

Calculate confidence interval from relative risk
For instance, for a relative risk:

If the relative risk of 1 (corresponding to the Null Hypothesis of no difference in population proportions) is captured by the 95% CI then the Chi-squared 2x2 hypothesis test result will not be significant at the 5% level

If the relative risk of 1 (corresponding to the Null Hypothesis of no difference in population proportions) is not captured by the 95% CI then the Chi-squared 2x2 hypothesis test result will be significant at the 5% level

Example 1 (continued) - HIV trial
In the HIV trial the magnitude of the effect can also be quantified using relative risks or the risk reduction...

Risk reduction = b/(b+d) - a/(a+c) = 74/(74+8124) - 51/(51+8146) = 0.009 - 0.006 = 0.003

This is typically presented with CIs as: Risk reduction 0.003 (95% CI 0.0001 to 0.005)

The absolute reduction in risk in the vaccine group compared with the placebo was 0.003 or 0.3%. There is a 95% chance that the true risk reduction in the vaccine group compared with the placebo group is captured by these confidence intervals. Notice that our sample results are consistent with a very small reduction in the absolute risk of HIV of 0.01% or a larger reduction of 0.5%.

Number needed to treat = 1/0.003 = 333

The number needed to treat of 333 means that you would have to treat 333 individuals with the vaccine to prevent one case of HIV.

Relative risk = {a/(a+c)}/{b/(b+d)} = {51/(51+8146)}/{74/(74+8124)} = 0.0062/0.009 = 0.69

This is typically presented with CIs as: Relative risk 0.69 (95% CI 0.49 to 0.99).

The relative risk is 0.69. This means that the risk in the vaccine group is 0.69 times that of the placebo group or alternatively is reduced by 31% (i.e. (1-RR)x100). Notice that our sample results are consistent with a reduction in relative risk of 1% or a 51% reduction.
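
The risk reduction, number needed to treat and relative risk just calculated for the HIV trial can be reproduced with a short Python sketch. The function name risk_measures and the dictionary keys are illustrative assumptions. The cell labels follow the formulas on these cards (a and c are the treatment group with and without the outcome, b and d the placebo group).

```python
def risk_measures(a, b, c, d):
    """Treatment-effect measures from a 2x2 table.

    a, c: treatment group with / without the outcome
    b, d: placebo group with / without the outcome
    """
    risk_treatment = a / (a + c)
    risk_placebo = b / (b + d)
    arr = risk_placebo - risk_treatment  # absolute risk reduction
    return {
        "risk_treatment": risk_treatment,
        "risk_placebo": risk_placebo,
        "arr": arr,
        "nnt": 1 / arr,                  # number needed to treat
        "rr": risk_treatment / risk_placebo,
    }


# HIV trial: 51 cases / 8146 non-cases on vaccine, 74 / 8124 on placebo
hiv = risk_measures(51, 74, 8146, 8124)
print(f"ARR = {hiv['arr']:.4f}, NNT = {hiv['nnt']:.0f}, RR = {hiv['rr']:.2f}")
```

Because the sketch uses unrounded risks, the absolute risk reduction is about 0.0028 rather than exactly 0.003, so the number needed to treat comes out somewhat higher (around 357) than the 333 obtained from the rounded figure. The relative risk still rounds to 0.69.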
Odds ratio = {a/c}/{b/d} = {51/8146}/{74/8124} = 0.0063/0.0091 = 0.69

The calculations of 95% CIs for risk reductions, relative risks and odds ratios are not covered by this course

Example 2 - The Recovery Trial
The Recovery trial was conducted to investigate treatments for COVID-19

Hydroxychloroquine was proposed as a treatment for COVID-19. The Recovery trial investigated the impact of hydroxychloroquine on mortality at 28 days in patients admitted to hospital with COVID-19

In total, 27% (418) of the 1561 patients randomly allocated to hydroxychloroquine and 25% (788) of the 3155 concurrently allocated to usual care had died by 28 days

Is there evidence that hydroxychloroquine improves mortality?

A Chi-squared test for a 2x2 table was used to compare the proportion dying at 28 days in the hydroxychloroquine and usual care groups

The Null hypothesis (H0) was that in the population of patients hospitalised with COVID-19 the proportion who die within 28 days on hydroxychloroquine is the same as the proportion who die on usual care (πhydroxychloroquine = πusual care)

The Alternative hypothesis (H1) was that in the population of patients hospitalised with COVID-19 the proportion who die within 28 days on hydroxychloroquine is not the same as the proportion who die on usual care (πhydroxychloroquine ≠ πusual care)

As P>0.05 we have insufficient evidence to reject the null hypothesis. We do not have evidence of a difference in mortality between hydroxychloroquine and usual care

As we do not reject the null hypothesis there is the possibility of a Type 2 error (that hydroxychloroquine does reduce the risk of death compared with usual care but this was not detected in our study as significant)

The risk reduction = 788/(2367+788) - 418/(1143+418) = 0.25 - 0.27 = -0.02 (95% CI +0.008, -0.04)

Therefore, the absolute reduction in mortality was negative, indicating that the hydroxychloroquine group had higher mortality than usual care by 2%.
The relative risk = {418/(1143+418)}/{788/(2367+788)} = 0.27/0.25 = 1.07 (95% CI 0.97, 1.19).

Therefore, the risk of mortality in those on hydroxychloroquine was 1.07 times that of those on usual care (i.e. in relative terms increased by 7%)

What is an independent samples t-test?
An independent samples t-test is used to test whether the mean of an interval scale variable in one group is equal to the mean in another group when the two groups are independent

The Null Hypothesis (H0) is that the population (or true) mean in group 1 equals the population (or true) mean in group 2 (µ1 = µ2)

The Alternative Hypothesis (H1) is that the population (or true) mean in group 1 does not equal the population (or true) mean in group 2 (µ1 ≠ µ2)

(The symbol µ is often used to denote the population mean)

How do you calculate an independent samples t-test?
In this course, we are not concerned with calculating tests by hand but for transparency, the test statistic value is given by;

To obtain a P-value, the test statistic is then compared against the t-distribution with (n1 + n2 - 2) degrees of freedom if the sample sizes are small or the N(0,1) distribution if sample sizes are large

The independent samples t-test requires that the data being compared meet two criteria, referred to as assumptions, before it is used; otherwise the results from it can be misleading. These assumptions are only required when the size of either of the two samples is small (i.e. n1<30 or n2<30)

The assumptions are...

A. The variable of interest should be Normally distributed in the population from which the first sample came and in the population from which the second sample came. This can be checked using histograms of the data from sample 1 and sample 2

B. The standard deviation in the two populations should be similar. This can be checked by comparing the standard deviation in sample 1 and sample 2

If these assumptions are not met, a Mann-Whitney U-test can be conducted instead

State the related tests to a sample t-test
1. A separate test is available called Analysis of variance (often referred to as ANOVA) to compare the mean of an interval scale variable between 3 or more independent groups

2. A paired sample t-test is used to compare a mean between 2 paired groups (when the data in the 2 groups are paired, such as blood pressure before and after treatment)

These tests are based upon similar logic

Example 1 - SCALE trial
In the SCALE trial (NEJM 2015) 3,371 obese patients were randomly allocated to receive liraglutide (n=2,437) or placebo (n=1,225)

One year after the start of the study, change in weight from the start of the study was calculated for each individual

Patients in the liraglutide group had lost on average 8.4 kg (sd=7.3 kg) and patients in the placebo group had lost on average 2.8 kg (sd=6.5 kg)

Is there evidence that liraglutide reduced weight?

An independent samples t-test was conducted to compare the interval scale variable (change in weight) between the 2 independent groups (the liraglutide group and the placebo group)

The Null hypothesis (H0) was that in the population of obese patients the mean change in weight on liraglutide equals the mean change in weight on placebo (μliraglutide = μplacebo)

The Alternative hypothesis (H1) was that in the population of obese patients the mean change in weight on liraglutide does not equal the mean change in weight on placebo (μliraglutide ≠ μplacebo)

Therefore, assuming the null hypothesis is true (i.e. the treatment does not work) the probability of obtaining results as, or more, extreme than the results we have obtained in hypothetical repetitions of the study is <0.001

As P<0.05 we reject the null hypothesis: there is evidence that the mean change in weight in obese patients on liraglutide is different to that in obese patients on placebo

As we reject the null hypothesis, there is the possibility of a Type 1 error (which is set at 5% because we are testing at the 5% significance level), which is that the treatment does not work and the results we have seen have occurred by chance

How is the treatment effect calculated?
In a comparison of means, the treatment effect is calculated as a difference in mean

This difference in mean is presented with confidence intervals giving a range of values which are plausible for the true difference in mean, and these are interpreted as in Chapter 2 (in 95% of samples taken from the population, the true difference in mean will be captured by the 95% confidence interval)

These confidence intervals also allow us to predict the results of the hypothesis test as they contain values of the true difference in mean which are plausible

If a difference in mean of 0 is captured by the 95% CI then the independent samples t-test will not be significant at the 5% level

If a difference in mean of 0 is not captured by the 95% CI then the independent samples t-test will be significant at the 5% level

Give an example of how to calculate the difference in mean
In the SCALE trial the mean weight loss in the liraglutide group was 8.4 kg whereas the mean weight loss in the placebo group was 2.8 kg

The difference in mean is 5.6 kg, i.e. weight loss was on average 5.6 kg greater in the liraglutide group compared with placebo

A 95% confidence interval for the true difference in mean is 5.1, 6.1 kg

These are plausible values for the true difference in weight loss consistent with our sample data. Note that as 0 is not in this interval the test is significant

Chi-squared for a 2x2 table is used:
a) To compare an interval scale variable between two groups
b) To compare two proportions between two independent groups
c) To compare an interval scale variable between two groups provided that the variable is normally distributed
d) To compare two proportions between two paired groups
e) To compare any outcome between two independent groups
Answer: B To compare two proportions between two independent groups

In a study to evaluate the importance of different factors on the development of cirrhosis among hepatitis C virus positive individuals (J Vir Hep 1998; 5: 43-51), 35/79 of those without cirrhosis reported alcohol abuse compared to 10/20 of the individuals with cirrhosis (P=0.84)
a) We could use the unpaired t-test to analyse these data
b) The appropriate null hypothesis is that the same number of individuals in the populations with and without cirrhosis report alcohol abuse
c) The appropriate null hypothesis is that the same proportion of individuals in the populations with and without cirrhosis report alcohol abuse
d) An appropriate null hypothesis is that the mean alcohol abuse is the same in individuals in the populations with and without cirrhosis
e) The appropriate null hypothesis is that the same proportion of individuals in the samples with and without cirrhosis report alcohol abuse
Answer: C The appropriate null hypothesis is that the same proportion of individuals in the populations with and without cirrhosis report alcohol abuse

In the study in question 3.2 the P-value of 0.84 indicates that:
a) There is evidence of a difference in the true rates of alcohol abuse in the two groups
b) The rate of alcohol abuse in the cirrhosis group is 0.84 times that of the non-cirrhosis group
c) There is an 84% probability the rates are different
d) The rate of alcohol abuse in the non-cirrhosis group is 0.84 times that of the cirrhosis group
e) There is no evidence of a difference in the true rates of alcohol abuse in the two groups
Answer: E There is no evidence of a difference in the true rates of alcohol abuse in the two groups.

In an observational study of defibrillation in theatre, 23 surgeons and 25 anaesthetists were asked to manage simulated ventricular fibrillation. The percentage successfully managing to defibrillate according to advanced life support protocols was 28% (7/25) of the surgeons and 4% (1/23) of the anaesthetists (P=0.06)
a) This lack of significance could be due to a Type 1 error
b) The chance of a Type 2 error increases as the sample size increases
c) A larger sample size would have reduced the risk of Type 1 error
d) If this study was repeated and a difference of 24% (28%-4%) was observed again, this difference would not be significant, regardless of the size of the study
e) This lack of significance could be due to a Type 2 error
Answer: E This lack of significance could be due to a Type 2 error

In a study of critically ill patients with severe acute kidney injury, patients were randomly allocated to early initiation of renal replacement therapy (RRT) or delayed initiation of RRT. The proportion dead at 90 days was 39% (44/112) in the early group and 55% (65/119) in the delayed group (P=0.02)
a) There is a 2% chance of no difference between the two groups
b) In hypothetical repetitions of the study assuming the null hypothesis is true, there is a 2% chance of observing a difference in proportions as extreme or more extreme than that observed in our samples
c) We accept the null hypothesis
d) There is a 2% chance the difference is real
e) There is no evidence of a difference in the true rates of death in the two groups.
Answer: B In hypothetical repetitions of the study assuming the null hypothesis is true, there is a 2% chance of observing a difference in proportions as extreme or more extreme than that observed in our samples

In the study in Q3.5
a) There is the possibility of a type 2 error
b) The relative risk in the early group compared with the delayed group = 0.39 - 0.55 = 0.16
c) The relative risk is 0.02
d) The relative risk in the early group compared with the delayed group = 0.39/0.55 = 0.72
e) A type 1 error has occurred
Answer: D The relative risk in the early group compared with the delayed group = 0.39/0.55 = 0.72
Remember: absolute risk is a subtraction and relative risk is a division

The P-value is:
a) The probability that the null hypothesis is true
b) The probability that the alternative hypothesis is true
c) The probability of obtaining the observed or more extreme results if the alternative hypothesis is true
d) The probability of obtaining the observed results or results which are more extreme if the null hypothesis is true
e) Always less than 0.05
Answer: D The probability of obtaining the observed results or results which are more extreme if the null hypothesis is true

A study is conducted to investigate a new ingestible and inflatable balloon system as a noninvasive way to fill up the stomach and curb appetite. Sixty clinically obese men (selected from the population of obese men) are randomly allocated to the intervention (i.e. the balloon) or a placebo. After 6 months, BMI is measured in the two groups. Which of the following is an appropriate null hypothesis for the study...
a) At the end of the 6 month period, the difference in BMI in the placebo and obese group is not statistically significant
b) At the end of the 6 month period, the mean BMI on placebo is equal to that on the intervention in the population of obese men
c) At the end of the 6 month period, the mean BMI on placebo is less than that on the intervention in the population of obese men
d) At the end of the 6 month period, BMI in the placebo group and obese group is identical
e) At the end of the 6 month period, the difference in BMI in the placebo and obese group is statistically significant.
Answer: B At the end of the 6 month period, the mean BMI on placebo is equal to that on the intervention in the population of obese men

In a study, CD4 counts were measured in 48 HIV positive mothers who gave birth to children without HIV (mean=728, SD=274) and 11 HIV positive mothers whose children were HIV infected (mean=465, SD=271, P=0.006)
a) A paired t-test was performed on these data as the two groups were dependent
b) The results are significant at the 5% level suggesting that the null hypothesis cannot be rejected
c) There is the possibility of Type 2 error
d) The null hypothesis can be rejected, there is evidence of lower CD4 counts in mothers who transmitted HIV to their children
e) CD4 counts in these women are Normally distributed
Answer: D The null hypothesis can be rejected, there is evidence of lower CD4 counts in mothers who transmitted HIV to their children

Q3, a confidence interval was calculated for the difference in mean CD4 counts in mothers who transmitted HIV to their children compared with mothers who did not (263, 95% CI 80, 446)
a) There is a 95% chance this result is statistically significant
b) In 95% of mothers this interval captures the difference in CD4 count in the population of HIV positive mothers who transmitted HIV to their children compared with mothers who did not
c) There is a 95% chance that this interval captures the difference in mean CD4 count in the population of HIV positive mothers who transmitted HIV to their children compared with mothers who did not
d) This result is not statistically significant because 0 is not in this 95% confidence interval
e) There is a 95% chance that this interval captures the difference in mean CD4 count in the samples of HIV positive mothers who transmitted HIV to their children compared with mothers who did not
Answer: C There is a 95% chance that this interval captures the difference in mean CD4 count in the population of HIV positive mothers who transmitted HIV to their children compared with mothers who did not

For an independent samples t-test comparing two small samples to be valid:
a) The numbers of observations must be approximately the same in the two groups
b) The standard deviations of the outcome variable must be approximately the same in the two groups
c) The mean of the outcome variable must be approximately the same in the two groups
d) The outcome variable must be categorical
e) The sizes of samples must be greater than 25
Answer: B The standard deviations of the outcome variable must be approximately the same in the two groups
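
As a closing worked sketch for the independent samples t-test cards, the SCALE trial summary statistics from this set (mean weight loss 8.4 kg, sd 7.3, n=2,437 on liraglutide; 2.8 kg, sd 6.5, n=1,225 on placebo) can be put through a short Python calculation. The test statistic formula is not reproduced in these notes, so the sketch assumes the standard pooled-variance form, which matches the similar-standard-deviation assumption and the (n1 + n2 - 2) degrees of freedom stated above. Because both samples are large, the P-value and the 95% confidence interval use the N(0,1) distribution. The function name is an illustrative assumption.

```python
import math


def independent_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic with a large-sample P-value and 95% CI.

    Assumes a pooled standard deviation (the notes do not give the formula)
    and uses N(0,1) instead of the t-distribution, so it suits large samples.
    """
    diff = mean1 - mean2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of diff
    t_stat = diff / se
    # Two-sided P-value from N(0,1): P(|Z| > |t|) = erfc(|t| / sqrt(2))
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    ci_95 = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, t_stat, p_value, ci_95


# SCALE trial: weight loss in kg, liraglutide versus placebo
diff, t_stat, p_value, ci_95 = independent_t_from_summary(
    8.4, 7.3, 2437, 2.8, 6.5, 1225)
print(f"difference = {diff:.1f} kg, t = {t_stat:.1f}, "
      f"95% CI {ci_95[0]:.1f} to {ci_95[1]:.1f}")
```

Under these assumptions the difference is 5.6 kg with a 95% confidence interval of about 5.1 to 6.1 kg and a P-value far below 0.001, in line with the figures quoted in the SCALE trial cards.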