Terms in this set (78)
Bivariate normal density (distribution)
Two random variables that are jointly normally distributed; a 3-D normal surface: if you sliced it at any fixed x or y, the slice would be normal. The conditional distribution is the distribution of y given a specific x. The marginal is the distribution of y regardless of x. Both the conditional and the marginal distributions are normally distributed.
Examines the relationship between two variables by determining the best-fit regression line. The properties of this line of best fit are determined by its slope and y-intercept; assumes a linear model with normally distributed errors.
a statistical measure that indicates the extent to which two or more variables fluctuate together.; The degree of relationship (or dependency) between two variables. Can range between -1 and +1, with the number indicating the degree of the relationship and the sign indicating the direction. 0 indicates no straight-line relationship; a value close to -1 or +1 indicates a strong relationship.
A measure of the relation between the mean value of one variable (e.g., output) and corresponding values of other variables (e.g., time and cost).
Ordinary Least Squares
a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of the squares of the differences/residuals between the observed and predicted values.
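For a simple one-predictor regression, the closed-form OLS estimates can be sketched in a few lines of Python (the toy data and function name here are illustrative, not from the set):

```python
# Ordinary least squares for a simple regression y = a + b*x,
# using the closed-form estimates (a minimal pure-Python sketch).
def ols_fit(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept: the line passes through (mean x, mean y)
    return a, b

# Perfectly linear toy data generated from y = 1 + 2x
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

On exactly linear data the residuals are all zero, so the fitted line recovers the generating intercept and slope.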
relates to the statistical significance of a relationship.; used as an inferential test of the independence of two nominal variables. Sum of the squared z-scores. Refers to a statistical test whose resulting test statistic is distributed approximately as the χ² distribution. The null hypothesis is that the factors are independent.
the relationship between two variables while controlling for a third variable. The purpose is to find the unique variance between two variables while eliminating the variance from a third variable.
vertical distance between the observed value and the predicted value (the point on the line): eᵢ = yᵢ − ŷᵢ
Partition of Sums of Squares
Separating Sum of Squares total into Sum of Squares residual and Sum of Squares regression. Splitting the total sum of squared deviations into components ascribes the overall variability in a dataset to different sources of variability, with the relative importance of each quantified by the size of its component of the overall sum of squares.
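The partition SStotal = SSregression + SSresidual can be verified numerically; a minimal sketch with illustrative data (function names are mine, not from the set):

```python
def ss_partition(xs, ys):
    # Fit the OLS line, then split total SS into regression + residual parts.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    yhat = [a + b * x for x in xs]
    ss_total = sum((y - my) ** 2 for y in ys)          # deviations around the mean
    ss_reg = sum((yh - my) ** 2 for yh in yhat)        # explained by the line
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # left over
    return ss_total, ss_reg, ss_res

sst, ssr, sse = ss_partition([1, 2, 3, 4], [2, 4, 5, 9])
```

Whatever the data, the two components add back up to the total (within floating-point error).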
Goodness of Fit Measures
the extent to which observed data match the values expected by theory. Measures typically summarize the discrepancy between observed values and the values expected under the model in question. Standard error of the estimate and R^2.
Standard Error of the Estimate
a measure of the accuracy of predictions made with a regression line. The standard deviation of y conditional on x.
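A sketch of the usual computation, sqrt(SSresidual / (n − 2)) for simple regression; the example values are illustrative:

```python
import math

def standard_error_of_estimate(ys, yhats):
    # sqrt(SS_residual / (n - 2)): the standard deviation of y
    # about the fitted regression line (n - 2 df for slope + intercept).
    n = len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    return math.sqrt(ss_res / (n - 2))

see = standard_error_of_estimate([1, 2, 3, 5], [1, 2, 4, 4])
```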
a multivariate statistical technique for helping to infer whether there are real differences between the means of three or more groups or variables in a population, based on sample data. Uses the variance of groups to see if there are differences between the groups. Deals with differences among or between sample means (like t), but has no restriction on the number of means (unlike t). Looks at the individual effects of each variable separately but also the interacting effects of two or more variables. Assumptions: homoscedasticity, normality, and independence.
a statistical measure of how close the data are to the fitted regression line.; represents the degree to which the variability in one measure is attributable to variability in another measure; can be explained as the proportion of total variance explained by the model; how well original values correlate with predicted values.
the OLS estimate of b provides the minimum-variance estimate of b among all unbiased linear functions of y, irrespective of the distribution of the errors. Therefore, the linear OLS estimator is the best.
a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. AKA: a plot of error versus ŷ. Ideally you would see data points equally scattered around the line. NOTE: you often cannot see outliers here because they are masked, and the plot can therefore be misleading!
Maximum Likelihood Estimation
a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. The sample mean turns out to be *this* for the normal distribution. Its big feature is minimum variance. Finding A and B for the regression line equation is *this*.
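The claim that the sample mean is the MLE of a normal mean can be checked numerically: no other candidate value of μ gives a higher log-likelihood. A small sketch with made-up data:

```python
import math

def normal_loglik(mu, data, sigma=1.0):
    # Log-likelihood of the data under N(mu, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [2.0, 3.0, 7.0, 8.0]
mle = sum(data) / len(data)  # the sample mean, 5.0
```

Evaluating the log-likelihood at the mean versus nearby values shows the mean is the maximizer.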
estimates of variance across groups; used in analysis of variance and calculated as a sum of squares divided by its appropriate degrees of freedom.
When the regression line is linear (y = ax + b), *this* is the constant (a) that represents the rate of change of one variable (y) as a function of changes in the other (x); it is the slope of the regression line.
a modified version of R-squared that has been adjusted for the number of predictors in the model. *This* increases only if a new term improves the model more than would be expected by chance; it decreases when a predictor improves the model by less than expected by chance. When the number of observations is small, the sample correlation will be a biased estimate of the population correlation coefficient; *this* gives a relatively unbiased estimate of the population correlation coefficient.
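The standard adjustment formula, 1 − (1 − R²)(n − 1)/(n − k − 1), can be sketched directly (the numbers below are illustrative):

```python
def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    # where n = number of observations and k = number of predictors.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Holding R² fixed while adding a predictor lowers the adjusted value, which is the penalty the definition describes.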
an alternative to least squares regression when data is contaminated with outliers or influential observations. It can also be used for the purpose of detecting influential observations. Certain widely used methods of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results if those assumptions are not true; thus ordinary least squares is said to not be *this* under violations of its assumptions, and outliers have more leverage and pull the line toward them.
A way of down-weighting outliers using M-estimators.
the line that minimizes the sum of squared deviations of prediction (also called the sum of squares error); represents our best prediction of Y for a given value of X for the ith observation. The height of *this* represents our best prediction of ŷ given any specified value of X.
types of data which may be divided into groups. Examples of *this* are race, sex, age group, and educational level.; variables that provide numerical labels (or names) for two or more categories.
a table showing the distribution of one factor in rows and another in columns, used to study the association between the two factors.; when data are categorized with respect to two or more factors, we are interested in whether the distribution of one variable is conditional on a second variable. We therefore construct *this* to show the distribution of one factor at each level of the other factor.
When all random variables in a sequence or vector have the same finite variance. Also known as homogeneity of variance. The assumption underlying the analysis of variance that each population has the same variance. When data are evenly spread about the line of best fit.
is the common variance across the population variances. It is the "error variance": the variance unrelated to any treatment difference. It is expected if the effect of the treatment is to add a constant to all scores.
Solutions to the Reproducibility Problem
1. Replicate. 2. Use alternatives to classical statistics. 3. Maintain a registry of research proposals that includes a statement of prospective outcomes. 4. More emphasis on study design. 5. More use of multivariate statistics. 6. More use of robust statistics. 7. Publish less. 8. Publish non-significant results too, so people know what doesn't work.
the mean of the means of several subsamples, as long as the subsamples have the same number of data points. The mean across all treatments/sub-samples (i.e. the mean of all data points).
Sum of squares regression
a mathematical approach to determining the dispersion of data points. In a regression analysis, the goal is to determine how well a data series can be fitted to a function which might help to explain how the data series was generated. The sum of squared deviations of the treatment means around the grand mean, multiplied by n to estimate population variance.
a measure of association for when both (2) variables are binary/dichotomous variables; equivalent to Pearson's r, but used specifically for situations with two dichotomous variables.; measures the degree or magnitude of a relationship.
Point biserial correlation
a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be "naturally" dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable.; Pearson's r applied to situations in which one variable is dichotomous.
Spearman's rank order correlation
a nonparametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol r_s (or the Greek letter ρ, pronounced rho).; equivalent to Pearson's r for ranked data.; used with ranked scores or when ranked scores have been substituted for raw scores. A correlation based on the rank order of data points/scores; useful with non-normal data.
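"Pearson's r for ranked data" can be demonstrated directly: rank both variables, then compute the ordinary correlation on the ranks. A pure-Python sketch (helper names are mine; ties get average ranks):

```python
def ranks(xs):
    # Average ranks: tied values share the mean of their rank positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def spearman(xs, ys):
    # Spearman's rho = Pearson's r applied to the ranks
    return pearson(ranks(xs), ranks(ys))
```

Any monotone (even nonlinear) relationship yields ρ = ±1, which is the point of using ranks.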
a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome (fitting a regression line to data in which the dependent variable is dichotomous). The outcome is measured with a binary dependent variable. The result is a probability.
a measure of association between an exposure and an outcome. Represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.; one of three main ways to quantify how strongly the presence or absence of property A is associated with the presence or absence of property B in a given population.; Measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic and plays an important role in logistic regression. Equation: OR = [p₁/(1 − p₁)] / [p₂/(1 − p₂)], where p₁ and p₂ are the probabilities of the outcome with and without the exposure.
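From a 2×2 table the ratio reduces to the familiar cross-product ad/bc; a minimal sketch (cell counts are made up for illustration):

```python
def odds_ratio(a, b, c, d):
    # 2x2 table: a = exposed with outcome,   b = exposed without outcome,
    #            c = unexposed with outcome, d = unexposed without outcome.
    # OR = (a/b) / (c/d) = (a*d) / (b*c)
    return (a * d) / (b * c)

# Toy table: 20/80 among exposed vs 10/90 among unexposed
or_ = odds_ratio(20, 80, 10, 90)
```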
an aspect of a logistic regression model where the dependent variable (DV) is categorical.; the inverse of the sigmoidal function used in mathematics, especially in statistics. When the function's parameter represents a probability p, this function gives the log-odds, or the logarithm of the odds p/(1 − p).
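The "inverse of the sigmoid" relationship is easy to verify: applying the sigmoid to the log-odds recovers the original probability. A quick sketch:

```python
import math

def logit(p):
    # Log-odds of a probability p in (0, 1)
    return math.log(p / (1 - p))

def sigmoid(x):
    # Logistic function: maps any real x back to (0, 1)
    return 1 / (1 + math.exp(-x))

p_back = sigmoid(logit(0.3))  # round-trips to 0.3
```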
the act of repeating a process, either to generate an unbounded sequence of outcomes, or with the aim of approaching a desired goal, target or result. Each repetition of the process is also called an "iteration", and the results of one iteration are used as the starting point for the next iteration.
a quality-of-fit statistic for a model that is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. Applies to logistic regression and is like SSresidual in "normal" regression. Every time you add a predictor *this* drops; more significant predictors lower *this* by more.
a method for finding successively better approximations to the roots (or zeroes) of a real-valued function. Discussed with iterative solutions in lecture.; a technique used to find the roots of nonlinear algebraic equations.
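The update rule is x_new = x − f(x)/f′(x), repeated until the step is tiny. A minimal sketch, finding √2 as the root of x² − 2 (tolerance and starting point are illustrative):

```python
def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=100):
    # Repeatedly improve x using the tangent-line approximation:
    # x_new = x - f(x) / f'(x)
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# The positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

Each iteration's result seeds the next, which is exactly the iterative scheme described above.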
Iterated least squares
used to solve certain optimization problems with objective functions by an iterative method in which each step involves solving a weighted least squares problem. used to find the maximum likelihood estimates of a generalized linear model, and in robust regression to find an M-estimator, as a way of mitigating the influence of outliers in an otherwise normally-distributed data set. For example, by minimizing the least absolute error rather than the least square error.
Regression-based statistical technique used in determining which particular classification or group (such as 'ill' or 'healthy') an item of data or an object (such as a patient) belongs to on the basis of its characteristics or essential features.; Prediction of a categorical dependent variable by one or more continuous or binary independent variables. Problems: can produce a probability outside of 0 and 1 (impossible) and makes unrealistic and restrictive assumptions of normality. Technique for distinguishing two or more groups on the basis of a set of variables.
(also called an independent variable) is an explanatory variable that can be manipulated by the experimenter. Each has two or more levels. Combinations of these levels are called treatments. Two types: fixed and random. A categorical, independent variable.
A scaling factor: the squared sum of the scores divided by n, (Σx)²/n. Corrects for the population.
a priori comparisons
Comparisons planned before an experiment is conducted, especially when you have particular research questions in mind.
post hoc comparisons
Comparisons completed after a significance test/rejection of Ho; an exploratory analysis. Need to run an F test first. Issue of Type I errors.
a linear combination of variables (parameters or statistics) whose coefficients add up to zero, allowing comparison of different treatments. Independent; no redundancy.
data that has been ranked in order of preference/specific measurement.
Kendall's Tau Coefficient
similar to Spearman's rank order correlation, though rather than treat the ranks as scores and calculate the correlation between the two sets of ranks, this statistic is based on the number of inversions (i.e. discordance) in the rankings (if perfect ordinal relationship between two sets of ranks, would not expect to find any inversions). Non-parametric measure of correlation.
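The inversion-counting idea can be sketched directly: count concordant and discordant pairs and take their normalized difference (this is tau-a, the no-ties version; function name and data are illustrative):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # tau-a: (concordant pairs - discordant pairs) / total pairs.
    # A discordant pair is an "inversion": the two rankings disagree
    # on the order of that pair.
    conc = disc = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    n = len(xs)
    return (conc - disc) / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])  # one inversion out of six pairs
```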
the variance unrelated to any treatment differences, which is the variability of scores within the same condition.
Sum of squares total
the sum of squares of all the observations, regardless of which treatment produced them.
Sum of squares treatment
sum of the squared deviations of the treatment means around the grand mean, which is multiplied by n to give us an estimate of the population variance.
sum of squares error
the sum over all groups of the sums of squared deviations of scores around their group's mean; in regression, ∑(yᵢ − ŷᵢ)².
*This* is the amount of variance that is not explained by the regression model. That is, it is the amount of variance that is described by factors other than the predictors in the model.
mean squares error
an estimate of the population variance, regardless of the truth or falsity of Ho, and is actually the average of the variances within each group when the sample sizes are equal.
mean squares treatment
the variance of the treatment means multiplied by n to produce a second estimate of the population variance. Assumes the null is true.
obtained by dividing MStreat by MSerror. If this is about 1, the data are consistent with the null; if it is well over 1, the null may not be true (compare to the critical value). Distributed as F with k − 1 and k(n − 1) degrees of freedom.
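The whole chain MStreat/MSerror can be sketched for equal-size groups (a minimal pure-Python sketch; the groups below are toy data):

```python
def one_way_anova_f(groups):
    # F = MS_treat / MS_error for k groups of equal size n.
    k = len(groups)
    n = len(groups[0])  # per-group size (assumed equal here)
    all_scores = [x for g in groups for x in g]
    grand = sum(all_scores) / len(all_scores)          # grand mean
    means = [sum(g) / n for g in groups]               # treatment means
    ss_treat = n * sum((m - grand) ** 2 for m in means)
    ss_error = sum((x - m) ** 2
                   for g, m in zip(groups, means) for x in g)
    ms_treat = ss_treat / (k - 1)
    ms_error = ss_error / (k * (n - 1))
    return ms_treat / ms_error

f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```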
Two contrasts with the sum of the products of corresponding coefficients (i.e. coefficients for the same means) adding to zero. Independent; no redundancy.; When members of a set of contrasts are independent of one another.
practice of collecting information and attempting to spot a pattern, or trend, in the information. Involves the pattern of the trend line of the data.
Advantages: (1) no 'overlapping' information (uncorrelated) (2) Can add one term at a time (because uncorrelated). (3) Independent; no redundancy.; A family of polynomials such that any two different polynomials in the sequence are orthogonal to each other under some inner product.
dot product of two vectors
aka inner product or scalar product. an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number. Algebraically, it is the sum of the products of the corresponding entries of the two vectors. Does not tell you as much as trend analysis; pair with trend analysis.
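The algebraic definition is a one-liner; a quick sketch with illustrative vectors:

```python
def dot(u, v):
    # Sum of the products of corresponding entries of two
    # equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

d = dot([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```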
generally is any process of comparing means in pairs to judge which of each entity is preferred, or has a greater amount of some quantitative property, or whether or not the two entities are identical.
Tukey's Studentized Range statistic
Statistic used for controlling Type I errors in post hoc tests. Based on max of the means - the min of the means. aka Tukey's range test. Used in Tukey's HSD.
sum of squares orthogonal contrast
The sums of squares of a complete set of orthogonal contrasts sum to SStreat.
When you have two columns of equal length, and you pair each value from the first column with each value from the second column, the differences between the paired values are called *this*.
polynomial trend coefficients
coefficients for linear, quadratic, cubic, quartic, quintic, higher-order functions.
If multiple comparisons are done or multiple hypotheses are tested, the chance of a rare event increases, and therefore, the likelihood of incorrectly rejecting a null hypothesis (i.e., making a Type I error) increases.
compensates for that increase by testing each individual hypothesis at a significance level of α/m, where α is the desired overall alpha level and m is the number of hypotheses.
The Type I error rate or significance level is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter α (alpha) and is also called the alpha level. Limits on *this*: α_PC ≤ α_FW ≤ c·α_PC.
error rate per comparison
denoted by alpha prime ( α' )
family-wise error rate
the probability of making one or more false discoveries (Type I errors) when performing multiple hypothesis tests.; probability of a false positive (Type I error) for a 'set' or 'family' of tests. 1 − (1 − α')^c
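The formula 1 − (1 − α')^c and the Bonferroni fix from the previous entry fit in a few lines (assuming independent tests, as the formula does):

```python
def familywise_error(alpha_pc, c):
    # Probability of at least one Type I error across c independent
    # tests, each run at per-comparison level alpha_pc.
    return 1 - (1 - alpha_pc) ** c

def bonferroni_alpha(alpha_fw, m):
    # Per-test level that caps the familywise rate at alpha_fw.
    return alpha_fw / m

fw = familywise_error(0.05, 10)          # uncorrected: inflates well past 0.05
per_test = bonferroni_alpha(0.05, 10)    # corrected per-test level
```

Running ten tests at α = .05 each inflates the familywise rate to about .40; the Bonferroni level keeps it at or below .05.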
Tukey's Honestly Significant Difference test
uses the Studentized q statistic for its comparisons, except that this is always taken as the maximum value of qr. It applies simultaneously to the set of all pairwise comparisons, μi − μj, and identifies any difference between two means that is greater than the expected standard error. The effect is to fix the familywise error rate at α against all possible null hypotheses, although with some loss of power. Designed for equal sample sizes. *This* is the favorite pairwise test for many people because of the control it exercises over α.
A two-step procedure. (1) Perform omnibus test (the usual F test) (2) If you reject the null, test all pairs using qT with p-1 as a parameter (where p = the number of means). A little more power, but a little more sensitive to error.
a method used to counteract the problem of multiple comparisons. It is intended to control the familywise error rate and offers a simple test uniformly more powerful than the Bonferroni correction.
a post-hoc test used in analysis of variance. After you have run the ANOVA and obtained a significant F-statistic, you run *this* to find out which pairs of means are significantly different.
corrects alpha for simple and complex (all) mean comparisons, not just pairwise comparisons. Generally use this for complex comparisons, not recommended for pairs.
Tukey's HSD adjusted for unequal sample sizes, because the usual Tukey's HSD is too conservative in that case.
A measure of effect size for use in Anova. This equals the variance in Y explained by X. Non-linear correlation coefficient. Ranges between 0 and 1. Sometimes called a correlation ratio. η^2 = SStreat / SStotal; proportion of variance due to treatments, though tends to overestimate the effect.; percent by which the error of our prediction has been reduced by considering group membership.; assumes the regression line passes through individual treatment means, and is biased because of this.
a measure of effect size, or the degree of association for a population. It is an estimate of how much variance in the response variables is accounted for by the explanatory variables. For a fixed-model ANOVA: ω² = [SStreat − (k − 1)·MSerror] / [SStotal + MSerror]; method of assessing the magnitude of an experimental effect with balanced or nearly-balanced designs.
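Both ANOVA effect sizes from the entries above fit in a few lines; the sums-of-squares inputs below are made-up numbers for illustration:

```python
def eta_squared(ss_treat, ss_total):
    # eta^2 = SS_treat / SS_total: proportion of variance due to
    # treatments (tends to overestimate the population effect).
    return ss_treat / ss_total

def omega_squared(ss_treat, ss_total, ms_error, k):
    # Fixed-model ANOVA:
    # omega^2 = (SS_treat - (k - 1) * MS_error) / (SS_total + MS_error)
    return (ss_treat - (k - 1) * ms_error) / (ss_total + ms_error)

eta = eta_squared(40.0, 100.0)
omega = omega_squared(40.0, 100.0, 2.0, 3)
```

Omega-squared comes out smaller than eta-squared on the same data, reflecting eta-squared's upward bias.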
Root Mean Square Standardized Effect (RMSSE)
This standardized measure of effect size is used in analysis of variance to characterize the overall level of population effects. It is the square root of the sum of squared standardized effects divided by the number of degrees of freedom for the effect.; a measure of effect size that directly measures the differences of group means after standardizing by dividing by the standard deviation.
a measure of effect size where 0.1 is considered weak, 0.25 is considered medium, and 0.40 is considered large.
difference between mean of the group and the grand mean; treatment effect.
error is normally distributed within conditions; assumption of anova.
observations are independent of one another. argument for random sampling. assumption of anova.
raise 'e' to that power. used with logistic regression.