SDS 332 - Exam 3
Terms in this set (48)
Categorical data analysis
All variables are categorical
ex: hair color, ethnic group, etc.
counts in one or more samples are analyzed
Numerical variables can also be used if divided into intervals (income: low, medium, high)
Comparing Counts - One sample, One variable
compare the counts of each value of the response to what's expected
ex: hair color is the variable; test the hypothesis that 65% of people are brunette, 20% blond, and 15% red-haired
**Use the chi-square goodness-of-fit test: test whether the sample counts are likely given the hypothesized proportions; if not, H0 is rejected
Comparing counts- One sample, Two response variables
one sample, with two response variables assessed on each member.
ex: class (freshman, sophomore, etc.) and suffers from anxiety (yes, no)
** Construct a contingency table - compare the counts of each combination of the two variables to what's expected
***Use the chi-square test of independence
***** (look into this for more detail) Comparing Counts - Two or more samples, One variable
One explanatory variable, one response variable.
Ex: explanatory variable: age (3 groups: young adults, middle-aged adults, elderly adults)
response variable: suffers from chronic pain (yes or no)
***Use the chi-square test of homogeneity
Chi Square Test of Independence
purpose: test whether the two variables are independent
sample: 1
variables: 2
DF: (row-1) * (column - 1)
Construct a contingency table! & compare the observed #'s to the expected #'s
The Chi Square distribution
it approaches a more normal, bell-shaped curve as the DF increases
Conditions for the Chi Square Test for Independence
Counts are being analyzed
sample size < 10% of population
random, independent collection
all expected counts greater than or equal to 5******
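The independence-test procedure on the cards above can be sketched in Python. The class-year x anxiety counts here are invented for illustration; `scipy.stats.chi2_contingency` is one standard implementation that also returns the expected counts, so the "all expected counts >= 5" condition can be checked directly:

```python
# Sketch: chi-square test of independence on a hypothetical
# class-year x anxiety contingency table (counts are made up).
import numpy as np
from scipy.stats import chi2_contingency

# rows: freshman, sophomore; columns: anxiety yes, anxiety no
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p, df, expected = chi2_contingency(observed)

# DF = (rows - 1) * (columns - 1) = 1 for a 2x2 table
# condition check: every expected count should be >= 5
all_expected_ok = expected.min() >= 5
```

If `p` is below the chosen alpha level, reject H0 (independence) in favor of the variables being associated.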
Chi Square Test for Goodness of Fit
purpose: test whether a variable's distribution follows a proposed distribution
No. of samples: 1
No. of variables: 1
DF: Number of categories - 1
H0: the pop. distribution of the variable is the same as the proposed distribution
HA: the pop. distribution of the variable differs from the proposed distribution
Chi Square Test for Goodness of Fit Procedure/ equation
Find the expected count for each category (bin): proposed proportion × sample size (25% × 100 students = 25)
then for each category compute (observed − expected)² / expected, and sum across categories; this sum is the chi-square statistic
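The goodness-of-fit procedure can be sketched with the hair-color hypothesis from the earlier card (65% / 20% / 15%); the observed counts below are invented, and `scipy.stats.chisquare` computes the sum of (O − E)²/E:

```python
# Sketch: chi-square goodness-of-fit for hair color (invented counts, n = 100).
from scipy.stats import chisquare

observed = [60, 25, 15]                  # brunette, blond, red-haired
proportions = [0.65, 0.20, 0.15]         # hypothesized distribution (H0)
n = sum(observed)

# expected count = proposed proportion x sample size
expected = [p * n for p in proportions]  # [65.0, 20.0, 15.0]

# statistic = sum over categories of (observed - expected)^2 / expected
stat, p_value = chisquare(observed, f_exp=expected)
```

DF here is number of categories − 1 = 2; a small `p_value` would lead to rejecting H0.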
Are humans like random number generators? (slide)
H0: humans are like random number generators and produce numbers in equal quantities
HA: humans do not produce numbers in equal quantities
Relative Risk
Determines how much greater (or smaller) the risk of an outcome is in one group than another
compute: the risk in the treatment group divided by the risk in the untreated group
Example of Relative Risk
Probability of getting cancer w/ treatment = .07
Prob. of getting cancer with no treatment = .02
RR = .07/ .02 = 3.5
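The relative-risk calculation from the card above is just a ratio of the two probabilities; here it is as a one-step sketch using the same numbers:

```python
# Relative risk for the example above: risk with treatment over risk without.
p_treated = 0.07    # P(cancer | treatment)
p_untreated = 0.02  # P(cancer | no treatment)

rr = p_treated / p_untreated   # about 3.5
```

An RR of 3.5 means the outcome is 3.5 times as likely in the treated group as in the untreated group.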
Interpreting Relative Risk
If RR = 1 (risk in treated = risk not treated)
If RR > 1 (risk in treated > risk in not treated)
If RR < 1 (Risk in treated < risk in not treated)
Example of RR interpretation
RR of Alzheimer's when taking estrogen supplements vs. not taking supplements = 0.353
0.353 < 1 - so taking estrogen supplements reduced the risk of contracting Alzheimer's disease.
by how much? 1 − 0.353 = 0.647 → by about 65%
**Also look at the confidence interval: if it does not contain 1, the result is statistically significant
Odds
Say E is an event
Prob (E) is the probability that E will occur
1 − Prob(E) is the probability that E will not occur
odds of event E occurring = Prob(E) / (1 − Prob(E))
Odds Ratio
OR = odds of outcome under condition A / odds of outcome under condition B
OR < 1 - outcome is less likely in condition A
OR = 1 - outcome is equally likely in condition A and condition B
OR > 1 - outcome is more likely in condition A
Two types of Logistic Regression
Binary logistic regression &
Multinomial logistic regression
Binary logistic regression
two possible outcomes
ex: pass / fail
multinomial logistic regression
more than two possible outcomes
Why can't we use ordinary regression to predict a binary outcome?
violates normality
violates homogeneity of variances
erroneously forces a linear relationship
results in nonsensical predictions
Logistic Regression
Logistic Regression allows a sensible prediction to be made.
Predicts: the probability of one outcome occurring
ex: the probability of being reading proficient at the start of kindergarten
It also tests whether the probability is affected by one or more predictor variables
The logistic Function
the model for predicting probabilities from one or more predictor variables
Log Odds
in logistic regression, the criterion variable is:
the natural log. of the odds of one outcome occurring
***** ln[p/(1−p)] is called the log odds
this is also called the logit function
Digression: natural logarithm (ln)
if x = e^b,
then ln(x) = b
Logistic regression equation
the model we want to estimate: ln[p/(1−p)] = a + bX + e
the parameters (a and b) are estimated using:
- the maximum likelihood (ML) method
ML estimates maximize the likelihood of the observed sample results
start with initial estimates and iterate until they stabilize
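The ML idea on this card (start with initial estimates, then iterate) can be sketched with plain numpy: simulate data from a known model ln[p/(1−p)] = a + bX, then climb the log-likelihood by gradient ascent. The data, learning rate, and iteration count are all invented for illustration; real software uses faster iterative schemes (e.g. Newton-Raphson), but the principle is the same:

```python
# Sketch: maximum-likelihood estimation for binary logistic regression,
# via gradient ascent on the log-likelihood (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
true_a, true_b = -0.5, 1.2                    # "true" parameters (invented)
p_true = 1 / (1 + np.exp(-(true_a + true_b * x)))
y = rng.binomial(1, p_true)                   # observed 0/1 outcomes

a, b = 0.0, 0.0                               # start with initial estimates
lr = 0.1
for _ in range(3000):                         # iterate until estimates stabilize
    p = 1 / (1 + np.exp(-(a + b * x)))        # predicted P(Y = 1)
    a += lr * np.mean(y - p)                  # gradient of mean log-likelihood wrt a
    b += lr * np.mean((y - p) * x)            # gradient wrt b
```

After enough iterations, `a` and `b` should land near the values that maximize the likelihood of the observed sample, i.e. close to the generating parameters here.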
Odds ratio example
prob of a smoker having heart attack = .115 - odds: .115/.885 = .13
prob of non smoker having a heart attack = .075 - odds: .075/ .925 = .08
OR = .13/ .08 = 1.625
The odds of having a heart attack are 1.625 times greater for smokers than for non-smokers
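The smoker example above can be reproduced with a tiny helper that converts probabilities to odds; note the card's 1.625 comes from rounding each odds to two decimals first (with unrounded odds the OR is about 1.60):

```python
# Odds and odds-ratio computation for the smoker / heart-attack example.
def odds(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1 - p)

odds_smoker = odds(0.115)        # ~0.130
odds_nonsmoker = odds(0.075)     # ~0.081
or_smoking = odds_smoker / odds_nonsmoker   # ~1.60 unrounded
```

Since the OR is greater than 1, a heart attack is more likely under the "smoker" condition.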
Interpretation of Beta
Compute exp(b) - the ratio of two odds
exp(b) = odds of outcome Y when X = x + 1 / odds of outcome Y when X = x
Dummy variables
a dummy variable is a numerical variable used in regression analysis to represent sub groups of the sample in your study.
Survival analysis:
investigates the time until an event of interest occurs
survival probability
S(t) = P(T > t)
Complicating factor: Censoring
censoring: some subjects will not experience the event by the end of the study; they either drop out of the study, or the event of interest simply does not occur for them before the study ends.
Kaplan-Meier Procedure
it is used to estimate the survival function from lifetime data
Key assumption: subjects who become unavailable (censored) in time period t are counted as:
having survived through the time period
these subjects are not counted as at risk in the next period
Coding Censoring
can display a censored event time with a + (plus sign) placed to the right of the observed time.
another way is to create a censoring status variable - C
S(t)
probability of surviving longer than time t
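The Kaplan-Meier estimate of S(t) can be hand-rolled in a few lines: at each event time, multiply the running survival probability by (1 − deaths / number at risk), while censored subjects count as at risk through their censoring time but cause no drop. The toy data below are invented:

```python
# Minimal hand-rolled Kaplan-Meier sketch (toy data).
# Each pair is (time, status): status 1 = event observed, 0 = censored
# (the "+" convention from the coding-censoring card).
data = [(2, 1), (3, 0), (4, 1), (5, 1), (5, 0), (8, 1)]

s = 1.0
survival = {}                     # estimated S(t) = P(T > t) at each time
for t in sorted({ti for ti, _ in data}):
    n_at_risk = sum(1 for ti, _ in data if ti >= t)          # at risk just before t
    deaths = sum(1 for ti, st in data if ti == t and st == 1)
    if deaths:
        s *= 1 - deaths / n_at_risk   # S(t) drops only at event times
    survival[t] = s
```

Notice that the censored subject at t = 3 leaves S(t) unchanged but is removed from the risk set for later periods, exactly as the key assumption on the card states.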
hazard ratio
the hazard ratio is a ratio of two different hazard rates:
typically h(t) for the treatment group compared to h(t) for the control group
HR = 1: there is no difference in h(t) between the two groups
HR > 1: h(t) is higher in the treatment group
HR < 1: h(t) is higher in the control group
Cox Proportional Hazards Model
this model is used to test hypotheses about the effects of one or more variables on h(t)
Using Latent variables: advantages
organizes your variables
control for measurement error
more closely connects your variables to theory
data reduction
goal is to identify a smaller set of variables
Principal Component Analysis (PCA)
a widely used and fundamental multivariate method.
used for: generating hypotheses about what is being measured by a set of variables
data reduction
properties of principal components
principal components are uncorrelated
eigenvector
provides the weights for the p variables in a principal component. There are p eigenvectors.
eigenvalue
the variance of each principal component. there are p eigenvalues.
the sum of the eigenvalues = the sum of the values in the main diagonal of the covariance or correlation matrix *****
principal component scores
PCA can be followed by a statistical analysis that uses principal component scores instead of the original variables as input
the component scores are obtained for each case by multiplying:
- the values of p original variables by the eigenvector weights
- often, the variables are first centered by their respective means.
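The scores procedure on this card can be sketched with numpy alone (random data here, purely for illustration): center the variables, eigendecompose the covariance matrix, and multiply the centered data by the eigenvector weights. The two properties from the earlier cards, components are uncorrelated and the eigenvalues sum to the diagonal of the covariance matrix, fall out directly:

```python
# Sketch: PCA by eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))        # 100 cases, p = 3 variables (made up)
Xc = X - X.mean(axis=0)              # center each variable by its mean

cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # p eigenvalues, p eigenvectors

# component scores: centered values times the eigenvector weights
scores = Xc @ eigenvectors
```

The variance of each score column equals the corresponding eigenvalue, and the score columns are uncorrelated with each other.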
interpreting Principal components:
a principal component can be thought of as a dimension with two poles or ends
a principal component's variance reflects the importance of the dimension in your dataset
a PC is often interpreted by examining its correlations with the original variables
*** ignore any value less than 0.3
Interpreting Principal components cont.
large positive correlation: the component is measuring an attribute that the variable measures. Large positive values on the variable result in a high score on the component
large negative correlation: the component is measuring an attribute that the variable measures, but in the opposite direction. Large positive values on the variable result in a low score on the component
PCA example
We're only going to focus on interpreting the first two PCs, and that's it!!
if a weight is below 0.3 we just ignore it!!
Digression Covariance:
covariance: statistic that assesses the degree to which two variables are linearly related
can be positive, negative or zero
Digression Correlation
like the covariance, assesses the extent to which two variables are linearly related:
but is unaffected by the measurement units
So we can compare correlations
values range from −1 to 1
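The unit-invariance point on these two digression cards can be checked numerically with numpy (the small data vectors below are invented): rescaling a variable changes its covariance with another variable, but leaves the correlation untouched.

```python
# Covariance depends on measurement units; correlation does not.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]          # sample covariance
corr_xy = np.corrcoef(x, y)[0, 1]    # correlation, always in [-1, 1]

# rescale x (e.g. meters -> centimeters): covariance scales by 100...
cov_scaled = np.cov(100 * x, y)[0, 1]
# ...but the correlation is unchanged
corr_scaled = np.corrcoef(100 * x, y)[0, 1]
```

This is why correlations from different studies (with different units) can be compared directly, while covariances cannot.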