Quantitative Biochemistry - Statistics Flashcards
Terms in this set (62)
Probability can be defined as the likelihood of an event, A, occurring. What are two interpretations of this?
- Frequentist - A measure of the proportion of outcomes that lead to A occurring.
- Bayesian - A measure of how plausible it is that A may occur.
The main rule of probability is that something either happens or it doesn't. Present this as an equation.
p(A) + p(¬A) = 1
The way we represent the combined probability of two events depends on whether or not they are mutually exclusive. What does this mean?
Two events that are mutually exclusive cannot both occur at the same time; only one or the other.
Express the way in which we present the probability of either A or B (A∪B) happening as two equations. Assume the two are mutually exclusive for one, and not for the other.
If mutually exclusive:
p(A∪B) = p(A) + p(B)
If not mutually exclusive:
p(A∪B) = p(A) + p(B) - p(A and B)
Express the way in which we present the probability of both A and B (A∩B) happening as two equations. Assume the two are independent of each other for one, and not for the other.
p(A∩B) = p(A) · p(B)
If not independent:
p(A∩B) = p(A) · p(B given A)
[We can also write p(B given A) as p(B|A).]
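The addition and multiplication rules above can be sketched in Python; the die-rolling events below are hypothetical examples chosen for illustration, not taken from the notes.

```python
# A minimal sketch of the union rule, using a fair six-sided die.
# Event A: the roll is even, {2, 4, 6}.  Event B: the roll is > 4, {5, 6}.
# Both events contain the outcome 6, so they are NOT mutually exclusive.
p_A = 3 / 6
p_B = 2 / 6
p_A_and_B = 1 / 6  # only the outcome 6 lies in both events

# General addition rule: p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
p_A_or_B = p_A + p_B - p_A_and_B

# Brute-force check by enumerating the sample space:
union = [x for x in range(1, 7) if x % 2 == 0 or x > 4]  # {2, 4, 5, 6}
print(p_A_or_B, len(union) / 6)
```

Both routes give 4/6: the formula subtracts the overlap so that the outcome 6 is not counted twice.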
What does Bayes' theorem tell us? Give the formula, and outline its key parts.
It allows us to find p(A|B) from p(B|A).
p(A|B) = p(A) · [p(B|A) / p(B)]
p(A|B) is the posterior (the probability of A being true, given that B is true).
p(A) is the prior (the probability of A being true before considering B).
p(B|A)/p(B) is our "support".
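Bayes' theorem can be demonstrated with a short Python sketch; the diagnostic-test probabilities below are invented for illustration.

```python
# Bayes' theorem: p(A|B) = p(A) · p(B|A) / p(B).
# Hypothetical diagnostic test (all numbers invented for illustration):
p_disease = 0.01              # prior, p(A)
p_pos_given_disease = 0.95    # p(B|A), the test's sensitivity
p_pos_given_healthy = 0.05    # false-positive rate, p(B|¬A)

# Total probability of a positive result, p(B):
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior, p(A|B): probability of disease given a positive test.
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.161
```

Even with a sensitive test, a low prior keeps the posterior small; this is the "support" term at work.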
Define the probability distribution function. Why is it useful?
The PDF is a mathematical object that specifies the distribution of possible outcomes, telling us the probability of achieving a specific value of x. It's useful because it can describe either DISCRETE outcomes (numbers on a die) or CONTINUOUS ones (metabolite concentration).
What is a Bernoulli trial? Express the probability of a certain number of successes mathematically.
An experiment with only two possible outcomes, "success" (p(A) = p) and "failure" (p(B) = 1 - p).
If x is the number of successes in a single trial (so x = 0 or 1),
p(x) = p^x · (1-p)^(1-x).
When a Bernoulli trial is carried out n times, only r of these times will be successful. The formula for finding the probability of r successes is below, as it is on the formula sheet:
P(r) = [n! / (r!(n - r)!)] · p^r · (1-p)^(n-r)
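The binomial formula can be checked with a short Python sketch; the coin-flip numbers below are illustrative.

```python
from math import comb

def binomial_p(r, n, p):
    """P(r) = [n!/(r!(n-r)!)] · p^r · (1-p)^(n-r):
    probability of exactly r successes in n Bernoulli trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Example: exactly 3 heads in 10 fair coin flips.
print(binomial_p(3, 10, 0.5))  # 120/1024 = 0.1171875

# Sanity check: the probabilities over all possible r sum to 1.
print(sum(binomial_p(r, 10, 0.5) for r in range(11)))  # 1.0
```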
How would the PDF for this function change as n increases?
As n increases, the distribution shape will grow narrower and more symmetric around the middle. It will also become more smooth, rounded and continuous.
How does continuous distribution arise from discrete? Is there a disadvantage to this?
From the limit as n reaches infinity. The use of limits, however, means that the probability of achieving any exact value is zero; a continuous PDF can only assign probabilities to ranges of values (by integration), so it's best used over a small, local range of data.
Define the normal distribution. Give the two parameters that affect the normal distribution, as well as how they do so.
A symmetrical spread of the sum of frequency data that forms a bell-shaped pattern. The mean, median and mode are all located at the highest peak.
μ - Where along the x-axis the distribution is centred (central tendency).
σ - How wide the bell-shaped curve is (dispersion).
Define the central limit theorem.
The sum (or mean) of a large number of independent samples will tend towards a normal distribution.
Because normal distribution relies on the sum of frequency data, we can't use products. How do we use these, instead?
Products of random factors give rise to the log-normal distribution, which is asymmetrical; taking logs converts the product into a sum, which is normally distributed.
Expectation values can be used as a measure of central tendency. Define the expectation value of a discrete PDF.
The sum of the product of every possible value with its probability.
Expectation values can be used as a measure of central tendency. Define the expectation value of a continuous PDF.
The integral of the product of the probability of a possible value multiplied by the value, within the limits of positive and negative infinity.
Define the Cumulative Distribution Function (CDF).
The area under the PDF curve, from the lower end of the distribution up to a particular point, given by the upper limit of an integral. (For example, the CDF at k=2 tells you the probability of getting two or fewer of something.)
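For a discrete distribution, the CDF is just a running sum of the PDF; here is a Python sketch using an illustrative binomial example.

```python
from math import comb

def binomial_pdf(r, n, p):
    """Probability of exactly r successes in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binomial_cdf(k, n, p):
    """P(X <= k): accumulate the discrete PDF from r = 0 up to r = k."""
    return sum(binomial_pdf(r, n, p) for r in range(k + 1))

# CDF at k = 2 for 5 fair coin flips: probability of two or fewer heads.
print(binomial_cdf(2, 5, 0.5))  # (1 + 5 + 10)/32 = 0.5
```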
There are two main ways we can measure dispersion. Outline both.
1. Dividing the data into quartiles; the interquartile range (IQR) is the gap between the first and third quartiles. It has the same units as the distribution, so it is easy to interpret.
2. Variance (σ^2); this has the units of the distribution squared, so we often take the square root to obtain the standard deviation.
Give the formula for variance.
Variance (σ^2) is the sum of (x - μ)^2 over all values of x, where μ is the mean, divided by n - 1 (when estimating from a sample).
Outline how to estimate population mean.
Sum of values, divided by how many there are.
Outline how to estimate standard deviation.
The squared difference between each value and the mean, summed over every value, multiplied by 1/(n - 1), and then square-rooted.
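The two estimators above can be sketched in Python; the data values below are made up.

```python
from math import sqrt

def sample_mean(xs):
    """x̄ = (sum of values) / n"""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """s = sqrt( (1/(n-1)) · Σ (x_i - x̄)² ): the unbiased-variance route."""
    xbar = sample_mean(xs)
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

data = [4.1, 5.0, 4.7, 5.3, 4.9]  # made-up readings
print(sample_mean(data), sample_sd(data))  # 4.8 and sqrt(0.2) ≈ 0.447
```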
What is significant about x̄ and s^2? Why is this the case?
They are the best unbiased estimates for mu and sigma squared. As n approaches infinity, they approach the values they are estimating.
Why do we prefer s^2 as an estimator for sigma squared, over s as an estimator for sigma?
s is always an underestimate of σ, due to the nonlinear behaviour of the square root; it is therefore a biased estimator (whereas s^2 is unbiased).
Give the expression that standardises the sample mean (for known μ and σ), giving a standard normal distribution with mean 0 and variance 1.
z = (x̄ - μ) / (σ / √n)
Describe the t-distribution, and give the quantity containing mu and s that follows it. What does the t value depend on, and what IS t?
T-distribution is similar to the normal distribution, but relies on the degrees of freedom, v (n - 1). As v approaches infinity, the t-distribution becomes a normal distribution.
t = (x̄ - μ) / (s / √n)
[Largely differs from normal distribution as it uses s (standard deviation) as an estimate of sigma (measure of variance), rather than sigma itself.]
T is a test statistic.
How does the t-distribution change with the degrees of freedom?
- For a small v value, the distribution is broader; the tails are fatter, but the peak is less defined.
- As v approaches infinity, the distribution grows more similar to the bell curve.
What is the purpose of confidence limits?
They tell us how often a predicted range will capture the true value. For example, with 95% confidence limits for the population mean, intervals constructed in this way will contain the population mean 95% of the time.
Describe the standard error of the mean. How do we find it?
It is an estimate of how much the sample mean differs from the population mean. We can find it with:
SEM = σ / √n (population standard deviation divided by the square root of the number of samples)
Because the population standard deviation is difficult to find, in practice the SEM is estimated as s (sample SD) / √n.
When the SEM tends to zero, the sample mean tends to the population mean. What must happen to n for this to happen?
It must tend to infinity (i.e. the sample includes the whole population).
Define factors and levels in statistics. How do we tend to show factors graphically?
A factor is a categoric variable, like sex or eye colour.
A level is a possible value of a factor, like female or brown; levels cannot be ranked or ordered.
We plot factors typically as box plots, with one for each level on the same axes.
Define ordinal variables, and give an example.
Categorical variables that can be ordered. For example, the strongly agree to strongly disagree scale.
Give the alternative (H1) and null (H0) hypotheses.
Judging by sample data from two groups...
H1 - There is evidence that there is a difference in the POPULATION means of the two groups. (μ1 ≠ μ2)
H0 - There is no difference between the POPULATION means of the two groups (they come from the same underlying distribution). (μ1 = μ2)
Why might we carry out Student's t-test? Outline the necessary steps.
It will tell us how likely it is that the population mean is equal to our calculated sample mean.
- Obtain estimates (x̄ and s) for the population mean (μ) and standard deviation (σ).
- State H1 and H0.
- Compute t = (x̄ - μ) / (s / √n). [μ is often the upper end of a reference range.]
- Using a table of t values with n - 1 degrees of freedom, obtain a corresponding p value.
- If the p value is below 0.05, the sample mean is likely to differ from the population mean, so we reject H0 at the 5% level.
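The steps above can be sketched in Python. The data and the reference value are invented; the critical value 2.262 is the two-tailed 5% point for 9 degrees of freedom, taken from a standard t-table.

```python
from math import sqrt

def t_statistic(xs, mu0):
    """t = (x̄ - μ) / (s / √n) for a one-sample t-test against μ = mu0."""
    n = len(xs)
    xbar = sum(xs) / n
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return (xbar - mu0) / (s / sqrt(n))

# Invented sample; H0: the population mean equals the reference value 5.0.
data = [5.6, 6.1, 5.8, 5.2, 6.0, 5.9, 5.4, 5.7, 6.2, 5.5]
t = t_statistic(data, 5.0)

# Two-tailed 5% critical value for n - 1 = 9 degrees of freedom is 2.262,
# so |t| above that means p < 0.05 and we reject H0.
print(round(t, 2), abs(t) > 2.262)
```

In practice the p value would be read from a t-table (or computed by software); comparing |t| to the tabulated critical value is equivalent at a fixed significance level.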
What is the significance threshold (level) of 0.05?
A p value below the significance threshold is statistically significant: the observed difference is unlikely to have arisen by chance under H0, so we reject the null hypothesis at the 5% level.
When carrying out a t-test to compare the means of two samples, rather than one sample and of the population, there are two ways we can do this. Outline both.
If we assume the variance of both groups is the same, we construct a pooled estimate of s with a formula, and then use that to find our t value, which we use with nA + nB - 2 degrees of freedom.
If we assume the variances of the two groups are unequal, we carry out Welch's t-test, which is t-distributed with a number of degrees of freedom that requires a formula to find. [We probably don't need to be able to do this.]
When might we perform paired t-tests? (The method is on the formulae sheet.)
When we're repeating experiments on the same subject, but with different conditions (e.g. with and without administering a drug).
Outline the difference between Type 1 and Type 2 errors. What affects the rate at which they occur?
Type 1 - False positive. (H0 is true, but we've rejected it; the p value is incorrectly below the significance level.)
Type 2 - False negative. (H0 is false, but we've accepted it; the p value is incorrectly above the significance level.)
The rate at which we have Type 1 errors is the significance level.
The rate at which we have Type 2 errors is related to the statistical power.
What is the Bonferroni correction, and why is it necessary?
If I do "m" hypothesis tests, then to control the overall rate of Type 1 errors, I divide the significance threshold by "m" (equivalently, multiply each p value by "m").
Every time we do a test, we might have Type 1 error. This means that carrying out multiple tests ("m" tests) increases the chance of error.
How can we estimate the number of pairwise tests from "k" groups that will suffer from Type 1 error?
- Multiply k by (k - 1) and divide by two, to find the number of different possible pairwise tests.
- Multiply by the significance threshold (e.g. 0.05) to find the number affected by Type 1 error.
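The arithmetic above, as a tiny Python sketch with an illustrative k = 5:

```python
# Number of pairwise tests among k groups, and the expected number of
# Type 1 errors at a 0.05 threshold (assuming every H0 is actually true).
k = 5
n_pairs = k * (k - 1) // 2       # 5 · 4 / 2 = 10 possible pairwise tests
expected_type1 = n_pairs * 0.05  # 0.5 false positives expected
print(n_pairs, expected_type1)   # 10 0.5
```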
Bonferroni's correction is slightly conservative. Name an improved test, which is both more complex and more powerful, and suggest how it works.
The Holm-Bonferroni method, which tests the lowest "p" value in a group most strictly, and others progressively less so.
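A sketch of the Holm-Bonferroni procedure in Python (the p-values are invented): the smallest p value is tested against α/m, the next smallest against α/(m-1), and so on, stopping at the first failure.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni: test the i-th smallest p-value against
    alpha / (m - i); once one test fails, all larger p-values fail too."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Invented p-values from m = 4 tests:
print(holm_bonferroni([0.010, 0.040, 0.030, 0.005]))
# [True, False, False, True]: only the two smallest p-values survive here
```

Plain Bonferroni would test every p value against 0.05/4 = 0.0125 and reject only the 0.005 and 0.010 entries too, but in general Holm's step-down thresholds are less strict, which is where the extra power comes from.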
Name the assumptions that we make when carrying out a t-test.
- The underlying populations are normally distributed and continuous.
- Samples are drawn randomly from the populations.
- Each observation made from a population is independent.
- The underlying populations have either the same variance (Student's t-test) or possibly different variances (Welch's t-test).
Outline the difference between parametric and non-parametric tests.
- Parametric tests make an assumption of the shape of the PDF (e.g. t-tests assume the data follows a normal distribution), so that their power is greater.
- Non-parametric tests assume nothing about the shape of the PDF, and so are useful for corrupted data, although their power can be lower.
The Poisson distribution is useful to calculate the probability of an event happening a set number of times. The formula for its PDF is as follows:
P(r) = e^-λ · (λ^r) / r!
where r is the number of times an event occurs in a set interval, and λ is the average rate at which these events occur.
Give two criteria for this formula to be true, as well as two scientific examples where it applies.
- The probability of each instance of the event occurring is constant.
- Each event is independent.
- The number of radioactive decays per unit time.
- The number of mutations on a DNA strand per unit length.
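The Poisson PDF above, as a Python sketch (λ = 2 is an illustrative rate):

```python
from math import exp, factorial

def poisson_p(r, lam):
    """P(r) = e^(-λ) · λ^r / r!: probability of r events in the interval."""
    return exp(-lam) * lam ** r / factorial(r)

# e.g. radioactive decays per second at an average rate λ = 2:
print(poisson_p(0, 2.0))  # e^-2 ≈ 0.135: chance of no decays
print(poisson_p(2, 2.0))  # 2·e^-2 ≈ 0.271: chance of exactly two
```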
State the Central Limit Theorem (CLT), and suggest a problem that arises when attempting to test it, as well as a rule of thumb we follow as a result.
In the limit as the number of independent, random samples tends to infinity, the PDF tends to a normal distribution.
Methods to test whether or not your data is normal tend to require very large sample sizes. Therefore, we generally assume that the CLT is valid when at least 30 samples have been drawn from a unimodal distribution.
An example of a non-parametric test is the Wilcoxon Rank Sum test (a.k.a. the Mann-Whitney U test). What is its purpose? Outline the principle behind how it works.
To determine whether data acquired from two samples differ in a statistically significant way.
- Rank the values from all groups (both samples) from smallest (1st) to largest, taking the average rank of two tied values.
- Sum the assigned rankings for each sample separately. The rank sum is the test statistic; if the groups don't differ significantly, the order in which their values appear shouldn't matter, so neither sum should be extreme.
- Count the number of values from both groups and use the smaller, then the larger, to find a p value range from the table (e.g. 10;26).
- Compare the sum from the group with the smaller "n" value to the range; if it's outside the range or equal to one of the numbers, p < 0.05 and we can reject H0.
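The ranking step can be sketched in Python; the two samples are invented, and only the rank-sum computation (not the critical-value table lookup) is shown.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging tied ranks."""
    sorted_vals = sorted(values)
    return [
        # average of the 1-based positions at which this value appears
        sum(pos + 1 for pos, s in enumerate(sorted_vals) if s == v)
        / sorted_vals.count(v)
        for v in values
    ]

# Invented samples, pooled and ranked together:
a = [3.1, 4.5, 2.8]            # the smaller group
b = [5.0, 4.9, 6.2, 5.5]
r = ranks(a + b)
rank_sum_a = sum(r[:len(a)])   # test statistic for the smaller group
print(rank_sum_a)              # 6.0: a's values hold ranks 2, 3 and 1
```

This rank sum would then be compared against the tabulated range for group sizes 3 and 4 to decide whether p < 0.05.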
Give an advantage of the Wilcoxon Rank Sum test.
Because we are only using the ranks, and not the actual figures, we avoid many physical sources of error such as quadratic calibration issues, as well as outliers.
Give a disadvantage of the Wilcoxon Rank Sum test.
It is less efficient than a t-test for normally distributed data, due to a reduced likelihood of detecting a difference between groups.
What does effect size act as a measure of? Give an advantage of its representation by Cohen's "d".
(NOTE: The formula for Cohen's "d" is the same as the equal-variance formula, assuming that:
- n1, s1 and n2, s2 are replaced with nA, sA and nB, sB
- x-bar and y-bar are replaced with x-barA and x-barB
- Z is replaced with d
- the square root is removed from the bottom of the d formula.)
The magnitude of the difference between two groups provided by a statistical test.
"d" gives a situation-independent measure of the difference's magnitude; it functions better than a p value, as the p value will always tend to zero as "n" increases to infinity, while "d" is a good measure of the difference between the two groups no matter the "n" values.
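A sketch of Cohen's d using the usual pooled standard deviation, in Python (the measurements are invented):

```python
from math import sqrt

def cohens_d(xs, ys):
    """d = (x̄_A - x̄_B) / s_pooled, where
    s_pooled = sqrt( ((nA-1)·sA² + (nB-1)·sB²) / (nA + nB - 2) )."""
    na, nb = len(xs), len(ys)
    xbar_a, xbar_b = sum(xs) / na, sum(ys) / nb
    var_a = sum((x - xbar_a) ** 2 for x in xs) / (na - 1)
    var_b = sum((y - xbar_b) ** 2 for y in ys) / (nb - 1)
    s_pooled = sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (xbar_a - xbar_b) / s_pooled

# Invented control vs treated readings:
d = cohens_d([5.1, 5.4, 5.0, 5.3], [4.2, 4.6, 4.4, 4.1])
print(round(d, 2))  # a large effect size
```

Unlike a p value, this number would stay roughly stable if both samples were made much larger.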
Define statistical power, and suggest which type of errors a high power reduces. What does working out the power (via a power calculation) allow us to do?
The probability of being able to reject the H0 when H1 is true. A high power reduces Type II errors.
Power calculations allow us to state a minimum effect size at an appropriate level of power (usually around 80%).
Knowing both power and effect size allows us to do... what?
Predict which hypothesis can (or can't) be detected by tests.
Define likelihood, L. Suggest an alternative model we might use to represent it, and explain why.
The probability of observing a given data set under a particular model (i.e. a particular set of parameter values). Its formula is very complex.
We tend to use the log-likelihood (ℓ) instead, as the likelihood itself tends to be a very small number.
Why might we want to know the likelihood?
It allows us to maximise it via Maximum Likelihood Estimation (MLE), which is very important for certain scientific techniques as it allows us to fit "models" to our data.
Suggest how we might maximise the log-likelihood.
By varying the parameters of the PDF.
What is the MLE for mu (population mean)?
The sample mean.
What is the purpose of multivariate normal distribution? Name two parts of its formula.
To act as a joint PDF for measuring multiple variables, x and y (e.g. height and weight).
- The covariance matrix (Σ), which contains information about variance in both variables, x and y.
- Pearson's correlation coefficient (ρ, rho), which tells us the magnitude of any association between x and y.
Give the formula for covariance of x and y.
- Sum of the product of (xi - x-bar) and (yi - y-bar) for all values of the two.
- Divide by (n - 1).
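The covariance formula as a Python sketch (toy paired data):

```python
def covariance(xs, ys):
    """cov(x, y) = (1/(n-1)) · Σ (x_i - x̄)(y_i - ȳ)"""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

# Toy paired data with a positive association:
print(covariance([1, 2, 3, 4], [2, 4, 5, 9]))  # 11/3 ≈ 3.67
```

A positive covariance means above-average x values tend to pair with above-average y values.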
Give the formula for Pearson's correlation coefficient, as well as a good estimator (like x-bar, sample mean, for mu, population mean). Does the estimator make assumptions about the PDF?
ρ = the covariance for (x,y), divided by the product of the standard deviations for both x and y.
The estimator for ρ is Pearson's r; the formula for this is given on the sheet. It is PARAMETRIC, and assumes the data is normally distributed. This is a weakness of the estimator, as it isn't robust against outliers.
A hypothesis test of r may be performed against 0, to see if the correlation is significantly different from 0. Outline the steps by which this occurs.
- Compute the following quantity to obtain a t value:
t = r · √[(n - 2)/(1 - r^2)]
- Compare t against a critical value from a t-table with n - 2 degrees of freedom.
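The steps above can be sketched in Python; the data are invented, and in practice the test statistic would be compared against a t-table with n - 2 degrees of freedom.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: cov(x, y) over the product of the SDs
    (the shared (n-1) factors cancel)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def r_test_statistic(r, n):
    """t = r · sqrt((n - 2) / (1 - r²)) for testing r against 0."""
    return r * sqrt((n - 2) / (1 - r ** 2))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]  # invented, roughly linear data
r = pearson_r(xs, ys)
t = r_test_statistic(r, len(xs))
print(round(r, 3), round(t, 1))
```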
Suggest two non-parametric tests we can use to assess correlations. Compare:
- the sizes of the values they produce
- what the calculations are based on
- how accurate / prone to error they are
- Spearman's Rank - usually gives larger values; the calculation is based on the differences between paired ranks, so it is more prone to error.
- Kendall's Tau - usually gives smaller values; the calculation is based on concordant and discordant pairs of observations, and it gives more accurate p-values with smaller sample sizes.
Define regression, and present a typical formula for linear regression (a straight line through the data points).
Inferring an analytic formula from acquired data.
y = k1 + k2f(x) + k3g(x) + ... + ε (error)
Outline the least squares approach for simple linear regression, which aims to minimise the sum of the squared differences between predicted and measured (observed) values.
[NOTE: These differences are called "residuals", and should be normally distributed about zero for a good fit.]
We want to fit y = mx + c (+ ε) to acquired data points xi and yi. To do this, we need to calculate m and c.
(Formulae for m and c are given on the formula sheet.)
Suggest two quantities that can be used to quantify how well our formula fits the data, and give the formula that links them. (The formula is given on the sheet.) Are there any issues with using this formula?
SSresidual, the residual sum of squares.
-> Sum of (yi - f(xi)) squared, for all values of yi.
[We want this number to be low for a good fit.]
SStotal, the total sum of squares
-> Sum of (yi - y-bar) squared, for all values of yi.
[Very similar to estimated variance.]
R^2 = 1 - (SSresidual/SStotal)
The larger this number is, the greater the amount of variability that is accounted for by the fit. (E.g. if R^2 = 0.49, 49% has been accounted for.)
The issue with R^2 is that it isn't completely impartial; adding more parameters will always increase a model's R^2, even when the extra parameters have no real meaning.
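A sketch of simple least-squares regression and R² in Python; the data points are invented, chosen to lie roughly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m·x + c."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return m, ybar - m * xbar

def r_squared(xs, ys, m, c):
    """R² = 1 - SS_residual / SS_total."""
    ybar = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # invented, roughly y = 2x + 1
m, c = fit_line(xs, ys)
r2 = r_squared(xs, ys, m, c)
print(round(m, 2), round(c, 2), round(r2, 3))  # ≈ 1.99 1.04 0.997
```

For a good fit, the residuals y - (m·x + c) should also be normally distributed about zero, as the note above says.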
Suggest an issue with simple linear regression.
One distant outlier (with "high leverage") can dramatically affect the formula produced via regression.