Tests & Measurements Chap. 3-5

Test Scores
mathematical representation of an examinee's performance
Raw scores: number of items scored in a specific manner
to give raw scores more meaning, we need to transform them into standard scores
Standard scores
norm-referenced OR criterion-referenced
Norm-referenced interpretations: examinee's performance is compared to that of other people (most psychological tests are norm-referenced)
-norms: average scores of an identified group of individuals
-norm-based interpretation: process of comparing an individual's test score to a norm group
Standardization samples should be representative of the type of individuals expected to take the test
Developing normative data: define population, select random sample and test it
National standardization samples are obtained through stratified random sampling; in the U.S., samples are stratified on gender, age, ethnicity, etc. (and typically exceed 1,000 participants)
once standardization sample is selected, normative tables or norms are developed
Nationally representative samples are common
other samples are available for some tests like local norms and clinical norms
Standardized administration: test should be administered under the same conditions and same administrative procedures
-standard scores: raw scores are transformed to another unit of measurement
-use SD units to indicate where an examinee's score is located relative to the mean of the distribution
There are several standard score formats (ways of transforming raw scores into standard scores): z-scores (M=0, SD=1), T-scores (M=50, SD=10), IQs (M=100, SD=15)
standard scores can be set to any desired M and SD (with the fancy of the test author frequently being the sole determining factor)
Z-scores (+ is above mean, and - is below mean): z=(X-M)/SD
z score to raw score: X=(Z)(SD)+M
Disadvantages of z scores: difficult to interpret
half of the z scores in a distribution will be negative, carry decimal places, few test publishers routinely report z-scores
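A minimal sketch of the conversions above, assuming a hypothetical test with M=30 and SD=4 (the function names are illustrative, not from the text). It applies z=(X-M)/SD, the reverse formula, and rescales z to the T-score and IQ formats:

```python
def z_score(raw, mean, sd):
    """z = (X - M) / SD"""
    return (raw - mean) / sd

def raw_from_z(z, mean, sd):
    """X = (z)(SD) + M"""
    return z * sd + mean

def rescale(z, new_mean, new_sd):
    """Express a z-score in another standard score format."""
    return z * new_sd + new_mean

# Hypothetical example: raw score of 38 on a test with M=30, SD=4
z = z_score(38, 30, 4)        # 2.0 (two SDs above the mean)
t = rescale(z, 50, 10)        # T-score format: 70.0
iq = rescale(z, 100, 15)      # IQ format: 130.0
print(z, t, iq, raw_from_z(z, 30, 4))
```

The T-score and IQ formats avoid the negative values and decimals that make z-scores hard to report.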
Percentile rank: reflects the percentage of people scoring below a given point (so a percentile rank of 20 indicates that only 20% of individuals scored below this point)
-range from 1 to 99 (rank of 50 indicates the median, which equals the mean in a normal distribution)
-percentile rank is not the same as percentage correct: a percentile rank of 60 means that examinee scored better than 60% of sample, NOT correctly answered 60% of questions
-percentile (not percentile rank): point in a distribution at or below which a specific percentage of scores fall (so a 60th percentile of 104 indicates that 60% of scores are 104 or below)
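A rough sketch of the percentile rank definition above (percentage of people scoring below a given point). The scores are made up for illustration:

```python
def percentile_rank(scores, x):
    """Percentage of scores in the group that fall below x."""
    below = sum(1 for s in scores if s < x)
    return 100 * below / len(scores)

# Hypothetical norm group of 10 examinees
norm_group = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(norm_group, 85))  # 6 of 10 scores are below 85 -> 60.0
```

Note that the result says nothing about the percentage of items answered correctly; it only locates the score relative to the group.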
Quartile scores: lower 25%=1, 26 to 50%=2, 51 to 75%=3, upper 25%=4
Stanine: not as common as percentiles, expressed in whole numbers from 1 to 9, with 4, 5, and 6 being considered average
Criterion-referenced score interpretations: the examinee's performance is compared to a specified level of performance
-criterion-referenced interpretations are absolute: compared to an absolute standard
-often used in educational settings
Examples of criterion-referenced interpretations:
-percentage correct (e.g., 85% on a classroom test)
-mastery testing: a cut score is established (e.g., pass/fail on a driver's license exam)
-standards-based interpretations: involves 3 to 5 performance categories (e.g., assigning an "A" to reflect superior work)
The terms norm-referenced and criterion-referenced apply to score interpretations
NOT tests!
Norm-referenced interpretations can be applied to both maximum performance and typical response tests
Criterion-referenced are typically applied only to maximum performance
Item Response Theory Scores (Rasch/IRT-scores, Change Sensitive Scores (or CSS)): fundamental for computer adaptive testing
-theory holds that responses to items on a test are accounted for by latent traits
-latent trait: a trait that cannot be observed directly; it is inferred to exist based on theory and evidence of its existence
-intelligence is a latent trait
IRT Scores cont'd: each examinee possesses a certain amount of intelligence
-IRT describes how examinees at different levels of ability will respond to individual test items
-the specific ability level of an examinee is defined as the level at which examinee can get half of the items correct
-they can be transformed to either norm or criterion referenced scores
Qualitative descriptions of test scores: help communicate test results (e.g., IQs 145 and above=very gifted, IQs 90-109=average)
-Test manuals should provide information on: normative samples (type of sample like national, size of sample, how well it matched U.S. population) and test scores (type of scores provided like T-score, how to transform raw scores, information on confidence intervals)
Reliability refers to the consistency, accuracy, or stability of test scores
Factors that may affect reliability: time test was administered, items included, external distractions, internal distractions, person grading the test
Measurement Error: error is present in all measurement
even in physics it is reduced but not eliminated
Classical Test Theory (or CTT) is the most influential theory to help us understand measurement issues (Charles Spearman in the early 1900s):
-holds that every score has two components: true score that reflects the examinee's true skills AND error score which is the unexplained difference between a person's actual score on a test and that person's true score
Xi = T + E
Xi = Obtained or observed score
T = True score
E = Random measurement error
Random measurement error varies from:
-person to person
-test to test
-administration to administration
True score can not be directly measured:
It is a theoretical reflection of the actual amount of the trait so all we see is an observed score
Measurement error:
Random measurement error is the result of chance factors
-It can increase or decrease an individual's observed score
-It reduces:
the usefulness of measurement, ability to generalize, confidence in test results
-Random error reduces the reliability of test results
if errors are responsible for much of the variability, test scores will be inconsistent; if errors have little effect on test scores, the test reflects mainly consistent aspects of performance
Systematic measurement error: increases or decreases the observed score by the same amount each time (e.g., a scale that adds 2 pounds, social desirability)
-Does not lower reliability: test is reliably inaccurate the same each time
-It is difficult to identify
-It is not considered in reliability analysis
Measurement errors are random: Equally likely to be positive or negative, over an infinite number of testings the error will increase and decrease a person's score by the same amount, and errors will tend to average zero
-Making a test longer also reduces the influence of random error for the same reason
-Error is normally distributed
-Reduce the error and reliability increases
-Job is to reduce the sources of error as much as possible
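A small simulation sketch of X = T + E, assuming a hypothetical true score of 100 and normally distributed random error with SD 5 (both values are invented for illustration). Over many simulated administrations the errors tend to average out, so the mean observed score approaches the true score:

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_SCORE = 100          # hypothetical true score T
ERROR_SD = 5              # hypothetical spread of random error E

def observed_score(true_score, error_sd, rng):
    """One administration: X = T + E, with E drawn from a normal distribution."""
    return true_score + rng.gauss(0, error_sd)

# Any single score can be off by several points...
scores = [observed_score(TRUE_SCORE, ERROR_SD, rng) for _ in range(10_000)]
# ...but the average error across many administrations is close to zero.
avg_error = mean(s - TRUE_SCORE for s in scores)
print(round(mean(scores), 2), round(avg_error, 2))
```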
Sources of measurement error: tests rarely include every possible question
-Content sampling error (considered largest source of measurement error): differences between sample of items on test and total domain of items like all possible items, if items are a good sample of domain then content error will be small
-Time sampling error (temporal stability): random fluctuations in performance over time, includes changes in examinee like fatigue and the environment like distractions
-inter-rater differences: when scoring is subjective
-errors in administration
-clerical errors
Reliability coefficients: CTT: Xi = T + E, extended to incorporate the concept of variance: σ²_X = σ²_T + σ²_E
σ²_X = Observed score variance
σ²_T = True score variance
σ²_E = Error score variance
General symbol for reliability is rxx: rxx = σ²_T / σ²_X
(reliability coefficients range from 0 to 1; values closer to 1 indicate more reliable scores)
-Reliability is the ratio of true score variance to total score variance
-Reliability is the proportion of test score variance due to true score variance
Reliability coefficients are correlation coefficients: reflect the proportion of test score variance attributable to true score variance
-so rxx = .90 indicates that 90% of the score variance is due to true score variance
-there are different ways to obtain the scores that are correlated
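A minimal sketch of reliability as a variance ratio, using made-up variance components (in practice true score variance is never observed directly, so this only illustrates the definition):

```python
def reliability(true_var, error_var):
    """rxx = true score variance / observed score variance."""
    observed_var = true_var + error_var   # observed = true + error variance
    return true_var / observed_var

# Hypothetical components: 90 units of true variance, 10 of error variance
rxx = reliability(90, 10)
print(rxx)  # 0.9 -> 90% of score variance is due to true score variance
```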
Psychologists use different methods for checking reliability:
-Test-retest reliability
-Alternate forms
-Internal consistency
-Inter-rater agreement
Test-Retest Reliability: administer the same test on two occasions, correlate the scores from both administrations, primarily reflects time sampling error
-reflects the degree to which test scores can be generalized to different situations or over time
-important to consider length of interval between testing
-optimal interval is determined by the way test results are used (e.g., longer intervals suit stable traits like intelligence, shorter intervals suit changeable states like mood)
-Carry-over effects
-Practice and memory effects
-Characteristics of attribute may change with time, also time consuming and expensive
Procedure: test-retest
-Administering a test to a group of individuals
-Re-administering the same test at a later time
-Compute the correlation between both scores, should be above .70
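The procedure above can be sketched as follows, with invented scores for six examinees tested on two occasions. The Pearson correlation between the two administrations serves as the test-retest coefficient:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same six examinees on two occasions
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]

r = pearson_r(time1, time2)
print(round(r, 2), r > 0.70)   # roughly 0.96, above the .70 guideline
```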
Alternate-Form Reliability (like test forms "A" and "B"): requires two equivalent or parallel forms; correlate the scores on the two forms. Forms can be administered simultaneously (reflects content sampling error) or after a delay (reflects both content and time sampling error)
-Alternate-form reliability may reduce, but typically not eliminate carryover effects
-Few tests have alternate forms
Internal Consistency: Estimates errors related to content sampling, Extent to which individuals respond similarly to items measuring the same concept, single administration
-Split-Half Reliability
-Coefficient alpha
Split-Half Reliability: Administer the test, then divide it into two equivalent halves, Correlate the scores for the half tests
-How to split a test? First half -second half, Odd-even split, Randomly
-longer tests more reliable
-twice as many test items, able to sample domain more accurately
-better sample of domain, lower error due to content sampling and higher reliability
-BUT, splitting test makes it shorter, less reliability
Adjusting Split-Half Estimates: Correction formula: The Spearman-Brown formula; because each half is only half as long as the full test, the half-test correlation underestimates reliability, so the formula statistically adjusts it to estimate the reliability of the full-length test
rt = 2rh / (1 + rh)
rt = estimated reliability of the full test
rh = the half correlation
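A short sketch of the Spearman-Brown correction, with a made-up half-test correlation of .60:

```python
def spearman_brown(half_r):
    """Estimate full-test reliability from the correlation between two halves."""
    return 2 * half_r / (1 + half_r)

# Hypothetical half-test correlation
rt = spearman_brown(0.60)
print(round(rt, 2))  # 0.75: the full-length test is more reliable than either half
```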
Split-Half Method
-Advantages: No need for separate administrations or alternate forms
-Problems: Primarily reflects content-sampling error and
correlation may vary depending on how test is split
Coefficient Alpha: sensitive to content-sampling error and item heterogeneity;
can be calculated from one test administration; used as a measure of reliability
-Examines the consistency of responding to all items
-Represents the mean reliability coefficient from all possible split halves
-Especially useful for tests that do not have right or wrong answers
(E.g., attitudes, personality)
Reliability coefficient is a function of:
-Extent to which each item represents an observation of the same "thing" observed by other test items
-Number of observations one makes
rxx = k(rij) / [1 + (k - 1)(rij)]
k = number of items in the test
rij = average inter-correlation among test items
-Compute the correlations among all items
-Compute the average of those inter-correlations
-Use formula to obtain standardized estimate
One way to increase reliability is to increase the number of items:
-Each item represents an individual assessment of the true score
-With multiple items combined, errors will tend to average out
-Therefore, increasing the number of items increases reliability
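A sketch of the standardized estimate above, assuming the average inter-item correlation has already been computed (the value .25 is invented). Holding that correlation fixed, doubling the number of items raises the estimate:

```python
def standardized_alpha(k, mean_r):
    """rxx = k * rij / (1 + (k - 1) * rij), where rij is the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical average inter-item correlation of .25
alpha_10 = standardized_alpha(10, 0.25)   # 10 items
alpha_20 = standardized_alpha(20, 0.25)   # 20 items, same average correlation
print(round(alpha_10, 2), round(alpha_20, 2))  # 0.77 0.87
```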
Kuder-Richardson Reliability:
Applicable when tests are scored dichotomously (i.e., right or wrong, scored 0 or 1)
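The notes do not give a formula, so the sketch below assumes the commonly cited KR-20 form, k/(k-1) × (1 − Σpq / variance of total scores), using the population variance. The response matrix is invented:

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for 0/1 item scores; one row per examinee, one column per item."""
    n_people = len(responses)
    k = len(responses[0])
    # p = proportion answering each item correctly; q = 1 - p
    p = [sum(row[i] for row in responses) / n_people for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Variance of examinees' total scores (population form, an assumption here)
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical data: 5 examinees, 4 items scored right (1) or wrong (0)
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 2))  # 0.8
```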
Inter-Rater Reliability: Two or more individuals score the same test independently
-Calculate correlation between the scores
- Appropriate when scoring requires making judgments
-Important when scoring is subjective
-A popular index to estimate inter-rater agreement is Cohen's Kappa (categorical data)
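A minimal sketch of Cohen's kappa for two raters assigning categories, assuming the usual definition (observed agreement corrected for the agreement expected by chance). The ratings are invented:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's category proportions, summed
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail ratings of 10 essays by two independent raters
rater_a = ["P", "P", "P", "P", "P", "P", "F", "F", "F", "F"]
rater_b = ["P", "P", "P", "P", "P", "F", "F", "F", "F", "P"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.58, lower than the raw 80% agreement
```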
Interpreting Reliability Coefficients: The proportion of a scale's total variance that is attributable to a true score
rxx = 1 - (error variance / observed score variance), i.e., 1 minus the proportion of variance due to error
SO, for example, rxx = .80 means that 20% of the variability is due to random (unsystematic) error
Composite scores: when scores are combined to form a composite (like IQ scores)
-the reliability of composite scores is better than individual scores in composite
-tests are simply sample of the test domain
-combining multiple measures is analogous to increasing the number of observations
Difference scores: involve calculating the difference between two scores (e.g., D = X - Y, where D = Achievement test score - IQ score)
-the reliability of difference scores is typically lower than the individual scores
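The notes give no formula for this, so the sketch below assumes the commonly cited one for two scores on the same scale with equal variances: rDD = [.5(rxx + ryy) - rxy] / (1 - rxy). All input values are invented:

```python
def difference_score_reliability(rxx, ryy, rxy):
    """Reliability of D = X - Y, assuming X and Y have equal variances."""
    return (0.5 * (rxx + ryy) - rxy) / (1 - rxy)

# Hypothetical: two reliable tests (.90 and .80) that correlate .60 with each other
r_dd = difference_score_reliability(0.90, 0.80, 0.60)
print(round(r_dd, 3))  # 0.625, lower than either test's own reliability
```

The more strongly the two tests correlate, the less reliable their difference becomes.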
If a test is to be administered multiple times: Test-Retest Reliability
Tests to be administered one time:
-Homogeneous content - coefficient alpha
-Heterogeneous content - split-half coefficient
Alternate Forms available:
Alternate form reliability: delayed and simultaneous
Factors to consider when evaluating reliability coefficients:
-Construct: what might be acceptable for measure of personality may not be for intelligence
-Time available for testing
-How the scores will be used
-Method of estimating reliability
The standard error of measurement (SEM) is more useful when interpreting test scores.
Reliability coefficients are most useful in comparing the scores produced by different tests.
Standard error of measurement: the SD of the distribution of scores that would be obtained by one person if he or she were tested on an infinite number of parallel forms of a test composed of items randomly sampled from the same content domain
-Function of the reliability coefficient and standard deviation of the scores
-As reliability increases, the SEM decreases
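A sketch assuming the standard formula SEM = SD × sqrt(1 − rxx), which the notes describe only in words. The SD of 15 and the reliability values are illustrative:

```python
from math import sqrt

def sem(sd, rxx):
    """Standard error of measurement from the score SD and reliability coefficient."""
    return sd * sqrt(1 - rxx)

# Hypothetical test with SD = 15 at two levels of reliability
print(round(sem(15, 0.91), 2))  # 4.5
print(round(sem(15, 0.96), 2))  # 3.0 -> higher reliability, smaller SEM
```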
Confidence Intervals: reflect a range of scores that will contain the examinee's true score with a prescribed probability
-Confidence intervals are calculated from the obtained score and the SEM (obtained score ± z × SEM)
-As reliability increases, SEM and confidence intervals get smaller
About 68% of the scores in a normal distribution are located between 1 SD above and below the mean
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 67 and 73 (68% confidence interval)
About 95% of the scores in a normal distribution are located between 1.96 SD above and below the mean
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 64.12 and 75.88 (95% confidence interval)
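The two worked examples can be reproduced with a short sketch (function name is illustrative): the interval is the obtained score plus or minus z times the SEM, with z = 1 for about 68% and z = 1.96 for about 95%:

```python
def confidence_interval(obtained, sem, z=1.96):
    """Return (lower, upper) bounds: obtained score +/- z * SEM."""
    margin = z * sem
    return round(obtained - margin, 2), round(obtained + margin, 2)

print(confidence_interval(70, 3.0, z=1.0))   # (67.0, 73.0)
print(confidence_interval(70, 3.0, z=1.96))  # (64.12, 75.88)
```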
The SEM and confidence intervals remind us that scores are not perfect
-When the reliability of the test scores is high, the SEM is low because high reliability implies low random measurement error
-The smaller the standard error of measurement, the narrower the range
CTT: Only an undifferentiated error component
Generalizability theory: Shows how much variance is associated with different sources of error
In IRT, reliability information is reported as a Test Information Function (TIF): A TIF illustrates reliability at different points along the score distribution.
TIFs can be converted into an analog of the SEM.
How Test Manuals Report Reliability Information:
At a minimum, manuals should report: internal consistency reliability estimates, test-retest reliability, standard error of measurement (SEM), and information on confidence intervals (typically 90% and 95% intervals)
Validity: refers to the appropriateness and accuracy of the interpretation of test scores (does the test measure what it is designed to measure?)
if test scores are interpreted in multiple ways, each interpretation needs to be evaluated
An achievement test can be used to:
-evaluate students' performance
-assign a student to an appropriate instructional program
-evaluate a learning disability
(the validity of each of these interpretations needs to be evaluated)
Reliability tells us whether a test measures whatever it measures consistently
Validity is about our confidence that interpretations we make from a test score are likely to be correct
Reliability is a necessary, but insufficient, condition for validity.
-For interpretation of scores to be valid, test scores must be reliable.
-However, reliable scores do not guarantee valid score interpretations.
Construct underrepresentation: Present when the test does not measure important aspects of the specified construct.
A test of math skills that contains division problems only
Construct-irrelevant variance:
Present when the test measures features that are unrelated to the specified construct.
A math test with complex written instructions
External Features that Can Impact Validity
-Examinee characteristics (e.g., anxiety): max performance test: low motivation/high anxiety impact interpretations AND typical response test: client may attempt to present him/herself in a more/less pathological manner
-deviation from standard test administration/scoring procedures (follow time limits/provide instructions)
-instruction and coaching
-appropriateness of standardization sample (norm-referenced interpretations)
Traditional validity nomenclature:
-Content Validity: is the content of the test relevant and representative of the domain?
-Criterion-Related Validity: involves examining the relationships between the test and external variables
-Construct Validity: involves an integration of evidence that relates to the meaning of the test scores
Traditional validity nomenclature suggests that there are different "types" of validity
-Modern conceptualization views validity as a unitary concept.
-Not types of validity but sources of validity evidence.
-The current view is that validity is a single concept with multiple sources of evidence to demonstrate it
Sources of Validity Evidence: Standards for Educational and Psychological Testing (1999) describe five sources of evidence:
-Evidence Based on Test Content
-Evidence Based on Relations to Other Variables
-Evidence Based on Internal Structure
-Evidence Based on Response Processes
-Evidence Based on Consequences of Testing
Evidence Based on Test Content: Traditionally referred to as content validity; examines the relationship between the content of the test and the construct it is designed to measure. Does the test cover the content that it is supposed to cover?
-Process of relevance of the content starts at early stages of development: Identify what we want to measure and delineate the construct or content domain to be measured
-Typical response scale to measure anxiety: Experts review clinical and research literature and develop items designed to assess the theoretical construct being measured
-Test developers include a detailed description of procedures for writing items as validity evidence
After the test is developed, developers continue collecting validity evidence based on content
-A qualitative process: expert judges review correspondence of test content and its construct
-Experts: the same experts who helped during test construction or an independent group
Experts evaluate two major issues:
-Item Relevance: Does each individual item reflect content in the specified domain?
-Content Coverage: Does the overall test reflect essential content in the domain?
Content-based validity evidence is especially important for:
-Academic achievement tests
-Employment tests: sample of skills needed to succeed at job and used to demonstrate consistency between content of test and job requirements
Face Validity
-not a form of validity
-Does the test "appear to measure" what it is designed to measure to the general public?
-Tests with "face validity" are usually better received by the public.
Evidence Based on Relations to Other Variables: Historically referred to as criterion validity
-Obtained by examining relationships between test scores and other variables
-Several distinct applications:
Test-Criterion Evidence, Convergent and Discriminant Evidence, and Contrasted Groups Studies
Test-Criterion Evidence: Criterion: Measure of some outcome of interest
-Many tests are designed to predict performance on some variable (the criterion)
-Can test scores predict performance on a criterion? (e.g., SAT predict college GPA)
-Types of studies to collect test-criterion evidence: Predictive Studies and Concurrent Studies
Predictive studies involve a time interval between test and criterion.
In concurrent studies, the test and criterion are measured at the same time.
Predictive evidence of validity:
-Administering a test to applicants of a job
-Holding their scores for a pre-established period of time but not using those scores as part of selection process
-When time has elapsed, a measure of the behavior that the test was designed to predict (criteria) is taken
-A test has predictive validity when its scores are significantly correlated with the scores on the criteria
Concurrent evidence of validity:
-Collect criterion data from a group of current employees
-Give those same employees the test they wish to use as part of their selection process
-The test demonstrates evidence of concurrent validity if its scores are significantly correlated with the scores on the criteria
Researchers use a correlation coefficient to examine the relationship between the criterion and the predictor
In this context, the correlation coefficient is referred to as the validity coefficient (rxy)
Issues in test-criterion studies
-Selecting a criterion: Criterion's measure must be both valid and reliable
-Criterion contamination: Predictor and criterion scores must be obtained independently
-Interpreting validity coefficients: How large should validity coefficients be?
-Validity generalization
Convergent Evidence: Construct Validity
-Correlate test scores with tests of the same or similar construct
-Expect moderate to strong positive correlations (like anxiety and depression)
Discriminant Evidence: Construct Validity
-Correlate test with tests of a dissimilar construct
-Expect weak or negative correlations (like self-esteem and anxiety)
Multitrait-Multimethod Studies combine convergent and discriminant (divergent) strategies
-Require examining two or more traits using two or more measurement methods
-Allow one to determine what the test correlates with (and does not correlate with) as well as how the method of measurement influences the relationship
Contrasted Group Studies: Examine different groups expected to differ on the construct measured by the test
-Contrast depressed vs. non-depressed
-Young vs. old examinees
Evidence Based on Internal Structure: Examine the internal structure and determine if it matches the construct being measured
Factor analysis is a prominent technique.
Factor Analysis: A statistical method that evaluates the interrelationships of variables and derives factors
-Factor analysis allows one to detect the presence and structure of latent constructs among a set of variables.
-Factor analysis starts with a correlation matrix.
Evidence Based on Response Processes
-Are the responses invoked by the test consistent with the construct being assessed?
-Does a test of math reasoning require actual analysis and reasoning, or simply rote calculations?
-Can also include actions of those administering and grading the test.
Evidence Based on Consequences of Testing: informally called "consequential validity evidence"
-If the test is thought to result in benefits, are those benefits being achieved?
-Some suggest that this concept should incorporate social issues and values.
Validity Argument: Validation should involve the integration of multiple sources of evidence into a coherent commentary.
-All information on test quality is relevant to validity: score reliability, standardized administration and scoring, accurate scaling, equating, and standard setting, attention to fairness
How Test Manuals Report Validity Evidence
-Different types of validity evidence are most applicable to different types of tests.
-The manual should use multiple sources of validity evidence to build a compelling validity argument.
In classical test theory, T stands for ____ score, X stands for ___ score and E stands for ____
true; observed; random measurement error
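Define random error of measurement and provide an example
error that results from chance factors and varies unpredictably; e.g., distractions in the testing environment
Define systematic error of measurement and provide an example
error that shifts scores by the same amount each time; e.g., a scale that adds two extra pounds to every measurement of weight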
____ error reduces the reliability of test results while ____ error does not lower reliability (test is reliably inaccurate by the same amount each time). Therefore, ____ error is the main focus of classical test theory.
Random; systematic; random
What conclusion could be drawn from a reliability coefficient of .75?
75% of the score variance is due to true score variance and 25% is due to error variance
____ reliability requires that two forms of the test are administered to the same group of individuals while in ____ a test developer gives the same test to the same group of test takers on two different occasions.
alternate form; test/retest
____ method of estimating reliability requires dividing the test into halves, then correlating the set of individual test scores on the first half with the scores on the second half.
split-half reliability
The coefficient alpha is also known as the ____ of all possible split-half coefficients.
average (mean)
____ tests produce more reliable scores than ____ tests.
long; short
Unreliable test scores will lead to ____ standard errors of measurement.
larger
When interpreting the test scores of individuals, the ____ is more practical than the ____.
standard error of measurement; reliability coefficient
In terms of threats to validity....
construct underrepresentation is present when the test does not measure important aspects of the specified construct
On the other hand, ....
construct irrelevant variance is present when the test measures features that are unrelated to the specified construct
Common threats to validity
-examinee characteristics (high test anxiety)
-deviations from standard test procedures
Contemporary conceptualizations view validity as a....
unitary concept, while
Traditional nomenclature suggests that there are three different....
types of validity
Validity evidence based on ....
test content is produced by an examination of the relationship between the content of the test and the construct or domain the test is designed to measure
____ validity is not technically a form of validity and refers to the degree to which a test 'appears' to measure what it is designed to measure
Face
Examples in which validity evidence is based on relations to other variables
GRE given to students prior to entering their first year of grad school
____studies involve a time interval between test and criterion but in ____studies the test and criterion are measured at the same time.
Predictive; concurrent
"Correlating scores on a new test to measure anxiety with a measure of sensation seeking" is an example of ____ evidence of validity
discriminant
"Correlating scores on a new IQ test with scores on the Wechsler Intelligence Scale" is an example of ____ evidence of validity
convergent
____ ____ studies combine convergent and discriminant (divergent) strategies.
Multitrait-multimethod
____ ____ allows one to detect the presence and structure of latent constructs among a set of variables.
Factor analysis
____ ____ is a statistical procedure that allows one to predict performance on one test from performance on another (given that both are correlated with each other).
Linear regression
____ ____ is a method of obtaining validity that examines different groups expected to differ on the construct measured by the test, e.g., contrasting depressed vs. non-depressed groups.
Contrasted group studies
It is important that predictor and criterion scores be obtained independently in order to avoid ____ ____
criterion contamination