Research Methods and Statistics - University of Liverpool year 1
What does it mean if sample data is independent?
no two samples are linked or related
e.g. 10 old, 10 young, but no old-old/young-young/old-young connection.
What does it mean if sample data is dependent/related?
each observation in one sample is linked or related to an observation in the other,
e.g. take ten people, measure height and weight, each person's weight is related/connected to their height.
Define inference
drawing conclusions about some population based on a sample.
What are descriptive statistics?
simply describing the sample/data, they don't prove anything!
e.g. mean, standard deviation, median, range, histograms, scatter plots...
When looking at descriptive statistics for a sample do you account for the population?
no!
Define statistic
some value computed from the data,
e.g. the sample mean.
Define parameter
some value associated with the population,
we don't know the true value of a parameter, so we have to estimate it using statistics from the sample.
Define variable
some measured quantity, e.g. "number of words recalled".
Define observation
a particular outcome e.g. "19 words".
Define independent variable
the variable which you change.
Define the dependent variable
the variable which you are observing, the thing that changes because of your altered IV.
Define Nominal Data
Categorize subjects and count how many are in each category, e.g. how many old and how many young.
Define Ordinal Data
Categorize subjects into ranked categories (doesn't need to be same interval between each rank)
Define Interval Data
When data can be ordered with equal intervals but there is no true zero, so values can fall below 0, e.g. temperature in °C.
Define Ratio Data
The same as interval except there is a true zero, so values can't fall below 0, e.g. height or reaction time.
What is a numerical summary (sample statistics)?
can give info such as the typical magnitude or the amount of variation in the data.
What are the measures of location (magnitude)?
mean - arithmetic average (mean = (∑x)/n)
median - middle value (ordered)
mode - most common value(s) <- least useful
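A minimal sketch of the three measures of location, using Python's standard statistics module. The data are made up for illustration (number of words recalled by seven hypothetical participants).

```python
from statistics import mean, median, multimode

# Hypothetical sample: number of words recalled by 7 participants.
words_recalled = [12, 15, 15, 17, 19, 21, 34]

sample_mean = mean(words_recalled)        # (sum of x) / n
sample_median = median(words_recalled)    # middle value of the ordered data
sample_modes = multimode(words_recalled)  # most common value(s), returned as a list

print(sample_mean, sample_median, sample_modes)
```

Here the single large value (34) pulls the mean above the median, which is one reason the median is often preferred for skewed data.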
What are the measures of spread?
raw range - highest value minus lowest value
variance and standard deviation
What is the formula for variance?
(∑(x²) - ((∑x)²)/n)/(n-1)
What is the formula for standard deviation?
√Variance
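The two formulas above can be sketched directly in Python. The function names and the example scores are hypothetical; the arithmetic follows the variance formula on the card, with the standard deviation taken as its square root.

```python
import math

def sample_variance(xs):
    """Variance via (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x_squared = sum(x * x for x in xs)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)

def sample_sd(xs):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical scores: sum(x) = 40, sum(x^2) = 232, n = 8.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(scores)   # (232 - 1600/8) / 7 = 32/7
sd = sample_sd(scores)
print(variance, sd)
```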
What is a histogram?
the range of values is divided into classes (usually of equal width) and the no. observations in each class is shown. The AREA, not the height, of each bar represents the no. observations.
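To illustrate why area rather than height carries the count, here is a small sketch with unequal class widths. The class edges and data are invented; each bar's height is taken as count divided by class width, so that height × width gives back the count.

```python
# Hypothetical class boundaries (unequal widths) and data.
class_edges = [0, 10, 20, 40]
values = [3, 7, 12, 15, 18, 19, 25, 31, 36, 39]

def histogram_bars(values, edges):
    """Return one bar per class, with height chosen so that area equals the count."""
    bars = []
    for lower, upper in zip(edges, edges[1:]):
        # A value belongs to a class if lower <= value < upper.
        count = sum(lower <= v < upper for v in values)
        width = upper - lower
        bars.append({"class": (lower, upper), "count": count, "height": count / width})
    return bars

bars = histogram_bars(values, class_edges)
for bar in bars:
    print(bar)
```

The last two classes both hold 4 observations, but the wider class gets a bar half as tall.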
What are scatter plots used for?
to show the relationship between two variables (one on each axis).
What does EDA stand for?
Exploratory Data Analysis.
What is EDA?
looking carefully at the sample of data, after EDA you can do statistical inference - see what the sample tells us about the population.
What does EDA emphasise?
robust measures and plots.
What is robustness?
when measures are not overly affected by a few very extreme observations,
e.g. for location; the median is more robust than the mean
for spread; the mid-spread is more robust than the range.
How do you calculate the mid-spread?
Upper hinge (Q₃) - the lower hinge (Q₁)
(DATA MUST BE ORDERED)
How do you calculate the lower hinge position?
(median position + 1)/2
(DATA MUST BE ORDERED)
How do you calculate the upper hinge position?
take the lower hinge position from the other end.
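The hinge rules above can be sketched as follows. Two details are assumptions not stated on the cards: the median position is taken as (n + 1)/2 and rounded down before use if it is fractional (Tukey's usual convention), and a position ending in .5 is read as the average of the two neighbouring values. Function names are hypothetical.

```python
import math

def value_at_position(ordered, position):
    """Value at a 1-based position; a .5 position averages its two neighbours."""
    lower_index = math.floor(position) - 1
    upper_index = math.ceil(position) - 1
    return (ordered[lower_index] + ordered[upper_index]) / 2

def hinges(data):
    ordered = sorted(data)                      # DATA MUST BE ORDERED
    n = len(ordered)
    median_position = (n + 1) / 2
    # Assumption: a fractional median position is rounded down first.
    lower_position = (math.floor(median_position) + 1) / 2
    # Upper hinge: the same position counted from the other end.
    upper_position = n + 1 - lower_position
    return (value_at_position(ordered, lower_position),
            value_at_position(ordered, upper_position))

def mid_spread(data):
    lower_hinge, upper_hinge = hinges(data)
    return upper_hinge - lower_hinge

# Hypothetical sample of 9 values: median position 5, hinge position 3.
sample = [7, 1, 12, 4, 15, 3, 8, 10, 6]
print(hinges(sample), mid_spread(sample))
```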
What must you include in a stem and leaf diagram?
a KEY!!
What are the aspects of a box plot?
- the box extends from the lower hinge to the upper hinge
- length of box = mid-spread/h-spread/IQR
- line inside box shows the median
- the "whiskers" extend to the highest/lowest values unless outliers are involved
- if there are outliers, the "whiskers" instead extend to the highest/lowest values within the Inner Fences
- outliers are values above/below the fences and are marked separately on the diagram.
How do you calculate the inner fence?
1.5 × box length, above/below appropriate hinge.
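A short sketch of the fence rule, assuming the two hinges have already been found. The function names and data are hypothetical; values beyond a fence are treated as outliers and the whiskers stop at the most extreme values still inside the fences.

```python
def inner_fences(lower_hinge, upper_hinge):
    """Fences sit 1.5 box lengths below the lower hinge and above the upper hinge."""
    box_length = upper_hinge - lower_hinge
    step = 1.5 * box_length
    return lower_hinge - step, upper_hinge + step

def split_outliers(data, lower_hinge, upper_hinge):
    """Return the whisker ends and the list of outliers."""
    low_fence, high_fence = inner_fences(lower_hinge, upper_hinge)
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    whiskers = (min(inside), max(inside))
    return whiskers, outliers

# Hypothetical ordered sample whose hinges are 4 and 10 (box length 6).
sample = [1, 3, 4, 6, 7, 8, 10, 12, 30]
fences = inner_fences(4, 10)
whiskers, outliers = split_outliers(sample, 4, 10)
print(fences, whiskers, outliers)
```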
How do you see Skew on a box plot?
- if the median line is in the middle it's symmetric, no skew
- if the median line is to the left (toward lower values) there is a positive skew
- if the median line is to the right (toward higher values) there is a negative skew
How many samples can you compare through box plots?
as many as you want, side-by-side comparison.
What does CDA stand for?
Confirmatory Data Analysis.
Define probability
a numerical description of how likely it is that a particular outcome will occur,
any outcome has a probability between 0 (impossible) and 1 (certain).
Define random variable
a numerical outcome of an experiment.
What does probability distribution specify?
the probability of each possible outcome of a random variable.
Define discrete
if it can only take separate, countable values, e.g. a whole-number count.
Define continuous
if it can take any value within a range, e.g. a height or a reaction time.
What is normal distribution?
the most important continuous probability distribution,
looks quite symmetrical on a histogram,
bell-shaped graph.
What are the parameters for normal distribution (parametric data)?
mean and standard deviation.
What is the standard normal distribution?
has a mean 0 and an SD 1.
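As a small illustration, the standard normal distribution can be built with Python's standard-library NormalDist class. The probabilities computed here are not on the cards; they are the familiar "about 68% within 1 SD, about 95% within 2 SD" figures.

```python
from statistics import NormalDist

# Standard normal: mean 0, SD 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Symmetry: half of the probability lies below the mean.
below_mean = standard_normal.cdf(0)

# Probability of falling within 1 and within 2 SDs of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(below_mean, round(within_one_sd, 4), round(within_two_sd, 4))
```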
What does correlation measure?
the strength of the linear relationship between two variables,
how close the points are to being on a straight line.
What is the first thing you do with paired data?
draw a scatterplot and see if there is a clear pattern, e.g. do the points lie along a straight line?
What are the main measures of correlation?
- Spearman's Rank Correlation (ρ, rho or rs) -> ordinal (rank) data
- Pearson's Correlation Coefficient (r) -> interval/ratio data
What are correlation values ALWAYS within?
-1 and +1
-1.00 is a perfect -ve correlation
0.00 is no linear relationship
+1.00 is a perfect +ve correlation
How do you work out Spearman's rho for two paired samples?
1. RANK each sample, don't forget to take into account tied ranks
2. For each individual, subtract their ranks in the two samples, giving the rank difference D
3. Square each D to get D²
4. Use the formula;
rs = 1 - (6∑D²)/(N(N²-1))
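The four steps above can be sketched as below. Function names and data are hypothetical. Tied values are given the average of the ranks they would occupy, which is the usual convention; with many ties the formula is only approximate.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1        # first 1-based position of this value
        count = ordered.count(value)            # how many times it is tied
        ranks[value] = first + (count - 1) / 2  # mean of positions first..first+count-1
    return [ranks[v] for v in values]

def spearman_rho(xs, ys):
    n = len(xs)
    rank_x = average_ranks(xs)                  # step 1: rank each sample
    rank_y = average_ranks(ys)
    # Steps 2 and 3: rank difference D for each individual, then D squared.
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    # Step 4: rs = 1 - 6 * sum(D^2) / (N * (N^2 - 1))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical paired scores for 5 individuals: sum(D^2) = 4.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
rho = spearman_rho(xs, ys)   # 1 - 24/120
print(rho)
```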
When is Spearman's rho most appropriate?
When the data only consists of two (paired) sets of ranks.
Give some examples of other correlations and when they'd be used.
1. Kendall's tau; used for ranked data, in the same situations as Spearman's rho
2. Kendall's coefficient of concordance (W); a measure of agreement between the rankings of multiple objects by multiple judges.
3. Point Biserial Correlation; true dichotomy versus a true (normally distributed) measurement
4. Biserial Correlation; artificial dichotomy vs true (normal) measurement
5. Phi Coefficient; true dichotomy vs true dichotomy
What is the full name of Pearson's coefficient?
Pearson product-moment correlation coefficient.
Does correlation have a scale? Is it affected by scale?
no it is scale independent.
What kind of relationship is it appropriate for?
a linear relationship.
What is the formula for Pearson's Coefficient for two variables?
r = covariance(X, Y)/(SDx × SDy)
where SDx and SDy are the standard deviations of X and Y,
and covariance is the joint variance of X and Y.
What is the more general formula for Pearson's Coefficient?
r = (∑xy - (1/n)(∑x)(∑y))/√([∑x² - (1/n)(∑x)²][∑y² - (1/n)(∑y)²])
where x and y are the variables and n is the number of observations.
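A minimal sketch of the general formula, term by term. The function name and data are hypothetical. The last lines also check that rescaling a variable (e.g. changing units) leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical paired observations: numerator 8, denominator sqrt(10 * 10).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)

# Same data with x measured on a different scale.
r_rescaled = pearson_r([100 * x for x in xs], ys)
print(r, r_rescaled)
```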
On what type of data do you use a Spearman's correlation?
skewed data, non-parametric.
On what type of data do you use a Pearson's correlation?
normally distributed data, parametric.
How do you represent a straight line?
Y = a + bX
where a and b are parameters
Y is the DV, response or criterion
X is the IV, predictor or experimental variable
a is the Y intercept/constant
b is the gradient/slope
What is linear regression?
finding the line of best fit to a set of points,
amounts to estimating the values of a and b.
What is the most common method for linear regression?
least squares, minimises the sum of the squared vertical deviations of the points from the line.
For any given x value we can use the fitted line to estimate/predict a y value (y hat), what is the formula for this?
y (hat) = a + bX
Y = y (hat) + residual
where the residual is the "error" from the observed point not lying exactly on the line.
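A sketch of a least-squares fit, a prediction and the residuals. The cards do not give formulas for a and b; the ones used here are the standard least-squares solution, b = Sxy/Sxx and a = mean(y) - b × mean(x). Function names and data are hypothetical. Each residual is the observed Y minus y (hat).

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a and slope b in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # gradient/slope
    a = mean_y - b * mean_x     # Y intercept/constant
    return a, b

def predict(a, b, x):
    """y (hat) = a + bX"""
    return a + b * x

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
a, b = fit_line(xs, ys)

# Residual = observed Y minus predicted y (hat).
residuals = [y - predict(a, b, x) for x, y in zip(xs, ys)]
print(a, b, residuals)
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding).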
As there are only two variables what is this sometimes called?
a bivariate linear regression.
What is it called if there are more than two variables?
a multiple regression (often loosely called a multivariate regression).
How do you test a null hypothesis, H₀?
- decide on a critical value, c
- if |r|>c then reject H₀
What is a type 1 error?
when H₀ is true and you reject H₀
the probability of doing this is α
What is a type 2 error?
when H₀ is false and you accept (fail to reject) H₀
the probability of doing this is β
What is the value of α called? what values do we usually take as α?
the significance level,
0.05 (some evidence) or 0.01 (strong evidence)
To use standard tables you should compute the t value, how do you do this?
t = r√((n-2)/(1-r²))
where n is the no. observations and (n-2) is the degrees of freedom.
What do you find when looking in the t tables?
the critical value c which satisfies;
Pr(|t|>c) = α, where t has (n-2) degrees of freedom
When would you reject H₀?
if the value of |t|>c at level α.
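The test of a correlation can be sketched as follows, using t = r√((n-2)/(1-r²)) with n - 2 degrees of freedom. The function names and the example values of r and n are hypothetical. The critical value is not computed here; it is assumed to be read from t tables (2.306 is the two-tailed value for α = 0.05 with 8 degrees of freedom).

```python
import math

def t_from_r(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def reject_null(r, n, critical_value):
    """Reject H0 if |t| exceeds the critical value c taken from the t tables."""
    return abs(t_from_r(r, n)) > critical_value

# Assumption: c looked up in t tables for alpha = 0.05 (two-tailed), df = 10 - 2 = 8.
critical_value = 2.306

t_strong = t_from_r(0.8, 10)
t_weak = t_from_r(0.5, 10)
print(t_strong, reject_null(0.8, 10, critical_value))
print(t_weak, reject_null(0.5, 10, critical_value))
```

With these made-up values, r = 0.8 leads to rejecting H₀ while r = 0.5 does not, even though both come from the same sample size.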
What kind of distribution uses a t test?
normally distributed data.
If H₀: ρ = 0, then what is the alternative H₁?
H₁: ρ ≠ 0
If you include H₀ and H₁ what type of test is this?
a two-tailed test.
When do you use a one-tailed test?
when your hypothesis states the direction of the connection between the data.
When do you use a two-tailed test?
when you have stated there is a connection but not the direction in your hypothesis.
