5_Biostatistics/Other Non Pharm Material

Terms in this set (157)

1. Most common model for population distributions

2. Symmetric or "bell-shaped" frequency distribution

3. Landmarks for continuous, normally distributed data
a. μ: Population mean (equal to 0 for the standard normal distribution).
b. σ: Population SD (equal to 1 for the standard normal distribution).
c. x and s represent the sample mean and SD.

4. When measuring a random variable in a large enough sample of any population, some values will occur more often than will others.

5. A visual check of a distribution can help determine whether it is normally distributed (whether it appears symmetric and bell shaped). You need the raw data to perform these checks.
a. Frequency distribution and histograms (visually look at the data; you should do this anyway)
b. Median and mean will be about equal for normally distributed data (most practical and easiest to use).
c. Formal test: Kolmogorov-Smirnov test
d. More challenging to evaluate this when we do not have access to the data (when we are reading a paper), because most papers do not present all data or both the mean and median
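The mean-versus-median check in item 5b can be sketched in a few lines. This is an illustrative example, not a formal normality test (for that, use something like the Kolmogorov-Smirnov test); the `mean_median_check` helper and its 0.1-SD tolerance are assumptions chosen for the demonstration.

```python
import random
import statistics

def mean_median_check(data, tolerance=0.1):
    """Rough symmetry check: for normally distributed data, the mean and
    median should be about equal relative to the spread of the data."""
    spread = statistics.stdev(data)
    return abs(statistics.mean(data) - statistics.median(data)) <= tolerance * spread

random.seed(42)
normal_sample = [random.gauss(100, 15) for _ in range(5000)]   # symmetric
skewed_sample = [random.expovariate(1.0) for _ in range(5000)]  # right-skewed

print(mean_median_check(normal_sample))  # expect True
print(mean_median_check(skewed_sample))  # expect False: mean > median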

6. The parameters mean and SD define a normally distributed population.

7. Probability: The likelihood that any one event will occur given all the possible outcomes

8. Estimation and sampling variability
a. One method that can be used to make an inference about a population parameter
b. Separate samples (even of the same size) from a single population will give slightly
different estimates.
c. The distribution of means from random samples approximates a normal distribution.
i. The mean of this "distribution of means" is equal to the unknown population mean, μ.
ii. The SD of the means is estimated by the standard error of the mean (SEM).
iii. Like any normal distribution, 95% of the sample means lie within ±2 SEM of the
population mean.
d. The distribution of means from these random samples is about normal regardless of the
underlying population distribution (central limit theorem). You will get slightly different mean and SD values each time you repeat this experiment.
e. The SEM is estimated with a single sample by dividing the SD by the square root of the sample size (n). The SEM quantifies uncertainty in the estimate of the mean, not variability in the sample. Important for hypothesis testing and 95% confidence interval (CI) estimation
f. Why is all of this information about the difference between the SEM and SD worth knowing?
i. Calculation of CIs. (95% CI is approximately the mean ± 2 times the SEM.)
ii. Hypothesis testing
iii. Deception (e.g., makes results look less "variable," especially when used in graphic format)
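The central limit theorem and the SEM = SD/√n relationship in items d and e can be demonstrated by simulation: draw many samples, and the SD of the resulting sample means should approximate SD/√n. The population parameters (mean 50, SD 10) and sample size 25 are arbitrary values chosen for the sketch.

```python
import math
import random
import statistics

random.seed(1)
POP_MEAN, POP_SD, N = 50.0, 10.0, 25

# Draw many random samples of size N and record each sample's mean.
sample_means = [
    statistics.mean(random.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(2000)
]

observed_sem = statistics.stdev(sample_means)  # spread of the distribution of means
theoretical_sem = POP_SD / math.sqrt(N)        # SD / sqrt(n) = 2.0

print(round(observed_sem, 2), theoretical_sem)
```

The observed spread of the means comes out close to 2.0, matching SD/√n, even though each individual observation varies with SD 10.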

9. Recall the previous example about high-density lipoprotein cholesterol (HDL-C) and green tea. From the calculated values in section III, do these data appear to be normally distributed?
CIs: Commonly Reported as a Way to Estimate a Population Parameter

CIs Can also be Used for Any Sample Estimate. Estimates derived from categorical data such as risk, risk differences, and risk ratios are often presented with the CI and will be discussed below.

95% CIs are the most commonly reported CIs. In repeated samples, 95% of all CIs include the true population value (i.e., the likelihood/confidence [or probability] that the population value is contained within the interval). In some cases, 90% or 99% CIs are reported. Why are 95% CIs most often reported?

a. Assume a baseline birth weight in a group (n=51) with a mean ± SD of 1.18 ± 0.4 kg.
b. 95% CI is about equal to the mean ± 1.96 × SEM (or 2 × SEM). In reality, it depends on the distribution being used and is a bit more complicated.
c. What is the 95% CI? (1.07, 1.29), meaning there is 95% certainty that the true mean of the entire
population studied will be between 1.07 and 1.29 kg.
d. What is the 90% CI? The 90% CI is calculated to be (1.09, 1.27). Of note, the 95% CI will always be wider than the 90% CI for any given sample. The wider the CI, the more likely it is to encompass the true population mean: in general, the "more confident" we wish to be, the wider the interval must be.
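The birth-weight CIs above can be reproduced directly with mean ± z × SEM, using z = 1.96 for 95% and z = 1.645 for 90% confidence (the normal approximation mentioned in item b; an exact calculation would use the t distribution).

```python
import math

mean, sd, n = 1.18, 0.4, 51     # baseline birth weight: mean +/- SD, sample size
sem = sd / math.sqrt(n)          # standard error of the mean

def ci(mean, sem, z):
    """Normal-approximation confidence interval: mean +/- z * SEM."""
    return (round(mean - z * sem, 2), round(mean + z * sem, 2))

ci95 = ci(mean, sem, 1.96)   # z for 95% confidence
ci90 = ci(mean, sem, 1.645)  # z for 90% confidence
print(ci95)  # (1.07, 1.29)
print(ci90)  # (1.09, 1.27)
```

Note that the 90% interval is narrower, as stated above: less confidence, tighter bounds.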

The differences between the SD, SEM, and CIs should be noted when interpreting the literature because they are often used interchangeably. Although it is common for CIs to be confused with SDs, the information each provides is quite different and has to be assessed correctly.

CIs Instead of Hypothesis Testing
1. Hypothesis testing and calculation of p-values tell us (ideally) whether there is, or is not, a statistically significant difference between groups, but they do not tell us anything about the
magnitude of the difference.
2. CIs help us determine the importance of a finding(s), which we can apply to a situation.
3. CIs give us an idea of the magnitude of the difference between groups as well as the statistical significance.
4. CIs are a "range" of data, together with a point estimate of the difference.
5. Wide CIs
a. Many results are possible, either larger or smaller than the point estimate provided by the study.
b. All values contained in the CI are statistically plausible.
6. If the estimate is the difference between two continuous variables: A CI that includes zero (no difference between two variables) can be interpreted as not statistically significant (a p-value of 0.05 or greater). There is no need to show both the 95% CI and the p-value.
7. The interpretation of CIs for odds ratios and relative risks is somewhat different. In that case, a value of 1 indicates no difference in risk, and if the CI includes 1, there is no statistical difference. (See the discussion of case-control/cohort in other sections for how to interpret CIs for odds ratios and relative risks.)
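The decision rules in items 6 and 7 reduce to one check: does the CI contain the null value (0 for a difference, 1 for an odds ratio or relative risk)? A minimal sketch, with the function name and example intervals invented for illustration:

```python
def significant_from_ci(lower, upper, null_value):
    """A result is statistically significant (roughly p < 0.05 for a 95% CI)
    when the interval excludes the null value:
    0 for differences, 1 for odds ratios and relative risks."""
    return not (lower <= null_value <= upper)

# Difference in means with 95% CI (-0.3, 1.2): includes 0 -> not significant
print(significant_from_ci(-0.3, 1.2, 0))  # False

# Relative risk with 95% CI (1.1, 1.9): excludes 1 -> significant
print(significant_from_ci(1.1, 1.9, 1))   # True
```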
The probability of making a correct decision when H0 is false; the ability to detect differences between groups if one actually exists

Dependent on the following factors:
a. Predetermined α
b. Sample size n
c. The size of the difference between the outcomes you wish to detect. Often not known before conducting the experiment, so to estimate the power of your test, you will have to specify how large a change is worth detecting
d. The variability of the outcomes that are being measured
e. Items c and d are generally determined from previous data and/or the literature.

Power is decreased by the following (in addition to the above criteria):
a. Poor study design
b. Incorrect statistical tests (use of nonparametric tests when parametric tests are appropriate)

Statistical power analysis and sample size calculation
a. Related to above discussion of power and sample size
b. Sample size estimates should be performed in all studies a priori.
c. Necessary components for estimating appropriate sample size
i. Acceptable type II error rate (usually 0.10-0.20)
ii. Observed difference in predicted study outcomes that is clinically significant
iii. The expected variability in item ii
iv. Acceptable type I error rate (usually 0.05)
v. Statistical test that will be used for primary end point
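For the common case of comparing two means, the components above combine into a standard approximation: n per group = 2 × ((z_α + z_β) × SD / Δ)². The defaults below (two-sided α = 0.05 so z = 1.96; 80% power so z = 0.84) and the blood-pressure numbers are illustrative assumptions, not from the text.

```python
import math

def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group for comparing two means.
    delta: smallest clinically meaningful difference (item ii).
    sd: expected variability of the outcome (item iii).
    Defaults assume two-sided alpha = 0.05 and power = 80% (beta = 0.20)."""
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)

# e.g., to detect a 5 mmHg difference when the SD is 10 mmHg:
print(n_per_group(delta=5, sd=10))  # 63 per group
```

Note how the formula encodes the trade-offs in items a-d: a smaller clinically important difference or a larger SD inflates n quadratically.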

Statistical significance versus clinical significance
a. As stated earlier, the size of the p-value is not necessarily related to the clinical importance of the result. Smaller values mean only that "chance" is less likely to explain observed differences.
b. Statistically significant does not necessarily mean clinically significant.
c. Lack of statistical significance does not mean that results are not clinically important.
d. When considering nonsignificant findings, consider sample size, estimated power, and observed variability.
1. A statistical technique related to correlation. There are many different types; for simple linear regression: One continuous outcome (dependent) variable and one continuous independent (causative) variable
2. Two main purposes of regression: (1) Development of prediction model and (2) accuracy of prediction
3. Prediction model: Making predictions of the dependent variable from the independent variable; Y = mx + b (dependent variable = slope × independent variable + intercept)
4. Accuracy of prediction: How well the independent variable predicts the dependent variable. Regression analysis determines the extent of variability in the dependent variable that can be explained by the independent variable.
a. The coefficient of determination (r2) is the measure describing this relationship. Values of r2 can range from 0 to 1.
b. An r2 of 0.80 could be interpreted as saying that 80% of the variability in Y is "explained" by the variability in X.
c. This does not provide a mechanistic understanding of the relationship between X and Y, but rather, a description of how clearly such a model (linear or otherwise) describes the relationship between the two variables.
d. Like the interpretation of r, the interpretation of r2 is dependent on the scientific arena (e.g., clinical research, basic research, social science research) to which it is applied

5. For simple linear regression, two statistical tests can be used.
a. To test the hypothesis that the y-intercept differs from zero
b. To test the hypothesis that the slope of the line is different from zero
6. Regression is useful in constructing predictive models. The literature is full of examples of predictions. The process involves developing a formula for a regression line that best fits the observed data.
7. Like correlation, there are many different types of regression analysis.
a. Multiple linear regression: One continuous dependent variable and two or more continuous independent variables
b. Simple logistic regression: One categorical response variable and one continuous or categorical explanatory variable
c. Multiple logistic regression: One categorical response variable and two or more continuous or categorical explanatory variables
d. Nonlinear regression: Variables are not linearly related (or cannot be transformed into a linear relationship). This is where our pharmacokinetic equations come from.
e. Polynomial regression: Any number of response and continuous variables with a curvilinear relationship (e.g., cubed, squared)
The complete cost-utility analysis is the most expensive economic analysis technique because of the time required from both researchers and subjects to collect the utilities. It should be used only when quality of life is the outcome of interest or is one of the outcomes of interest.

QALY: The QALY is a function of quality multiplied by quantity of life, which are independent. The life-year is simply the change in survival (the measure of mortality)
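The QALY definition above is a simple product of its two independent components. A minimal sketch, with hypothetical utility and survival numbers:

```python
def qalys(utility, life_years):
    """QALYs = utility (0 to 1, where 1 = full health) x life-years gained."""
    return utility * life_years

# Hypothetical comparison: 10 years at utility 0.8 vs. 6 years in full health
print(qalys(0.8, 10))               # 8.0 QALYs
print(qalys(1.0, 6))                # 6.0 QALYs
print(qalys(0.8, 10) - qalys(1.0, 6))  # net gain of 2.0 QALYs
```

The example shows why both components matter: a longer survival at reduced quality can still yield more QALYs than a shorter survival in full health.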

Utilities: Health is the construct (surrogate) for being able to participate in life at the level desired; other non-health-related aspects of quality of life are not considered here.

Utility determination: Although multi-attribute utility instruments (also called MAUIs), such as the EQ-5D, HUI-3, or SF-6D, are being used more often, three direct methods of determining utilities are commonly used. A MAUI must be validated against one of the direct methods, usually standard gamble, and may include a state worse than death (range -1.0 to 1.0).

Standard gamble
A probability (p) of full health is presented (with death being 1 - p).

Time trade-off
Time trade-off assumes that people with a loss of health would be willing to give up part of their life span to live in perfect health.

Rating scale
The rating scale (usually a visual analog scale similar to the 100-mm pain scale, where the distance between each millimeter mark is equal) has been used, but because it does not give a choice between two alternatives, it has been considered an indirect technique similar to PROs rather than a utility generator. The rater places a mark at the point where he or she believes the scenario belongs relative to several other scenarios, with the top being "full health" and the bottom being "death."

All geographic subdivisions smaller than a state, including street address, city, county, precinct, and zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
a. The geographic unit formed by combining all zip codes with the same three initial digits contains
more than 20,000 people, and
b. The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people
is changed to 000.
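The zip-code rule above can be sketched as a small helper. The prefix population table here is hypothetical; real counts would come from the Census Bureau data the rule references.

```python
# Hypothetical population counts per 3-digit zip prefix (real values would
# come from current publicly available Census Bureau data).
PREFIX_POPULATION = {"191": 150_000, "036": 12_000}

def deidentify_zip(zip_code):
    """Keep only the first three digits of a zip code; replace them with
    '000' when the combined prefix area has 20,000 or fewer people."""
    prefix = zip_code[:3]
    if PREFIX_POPULATION.get(prefix, 0) > 20_000:
        return prefix
    return "000"

print(deidentify_zip("19104"))  # "191" -- prefix area has > 20,000 people
print(deidentify_zip("03601"))  # "000" -- prefix area has <= 20,000 people
```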

All elements of dates (except year) for dates directly related to an individual, including birth date,
admission date, discharge date, and date of death; and all ages older than 89 years and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 years or older

Telephone numbers

Fax numbers

Electronic mail addresses

Social Security numbers

Medical record numbers

Health plan beneficiary numbers

Account numbers

Certificate and license numbers

Vehicle identifiers and serial numbers, including license plate numbers

Medical device identifiers and serial numbers

Internet universal resource locators

Internet protocol (IP) addresses

Biometric identifiers (fingerprints and voiceprints)

Full-face photographic images and comparable images

Any other unique identifying number, characteristic, or code (may assign a code for de-identified
information to be re-identified)