Test Scores

mathematical representation of an examinee's performance

Raw scores: number of items scored in a specific manner

to give raw scores more meaning, we need to transform them into standard scores

Standard scores

norm-referenced OR criterion-referenced

Norm-referenced interpretations: examinee's performance is compared to that of other people (most psychological tests are norm-referenced)

-norms: average scores of an identified group of individuals

-norm-based interpretation: process of comparing an individual's test score to a norm group

Standardization samples should be representative of the type of individuals expected to take the test

Developing normative data: define population, select random sample and test it

National standardization samples are obtained through stratified random sampling; in the U.S., samples are stratified based on gender, age, ethnicity, etc. (must exceed 1,000 participants)

once standardization sample is selected, normative tables or norms are developed

Nationally representative samples are common

other samples are available for some tests like local norms and clinical norms

Standardized administration: test should be administered under the same conditions and same administrative procedures

-standard scores: raw scores are transformed to another unit of measurement

-use SD units to indicate where an examinee's score is located relative to the mean of the distribution

There are several standard score formats (for transforming raw scores into standard scores): z-scores (M=0, SD=1), T-scores (M=50, SD=10), IQs (M=100, SD=15)

standard scores can be set to any desired M and SD (with the fancy of the test author frequently being the sole determining factor)

Z-scores (+ is above mean, and - is below mean): z=(X-M)/SD

z score to raw score: X=(Z)(SD)+M
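
The two conversion formulas above can be sketched in Python (the raw-score mean and SD below are invented illustration values):

```python
# Sketch of the standard-score conversions: z-score, then rescaling
# onto T-score and IQ metrics. Raw-score values are made up.

def z_score(x, mean, sd):
    """z = (X - M) / SD: distance from the mean in SD units."""
    return (x - mean) / sd

def rescale(z, new_mean, new_sd):
    """X = (z)(SD) + M: map a z-score onto any desired scale."""
    return z * new_sd + new_mean

raw_mean, raw_sd = 40, 8            # hypothetical raw-score distribution
z = z_score(52, raw_mean, raw_sd)   # 52 is 1.5 SD above the mean
print(z)                            # 1.5
print(rescale(z, 50, 10))           # T-score: 65.0
print(rescale(z, 100, 15))          # IQ metric: 122.5
```

The same `rescale` call reproduces any of the score formats listed above by plugging in that format's mean and SD.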

Disadvantages of z scores: difficult to interpret

half of the z-scores in a distribution will be negative, they carry decimal places, and few test publishers routinely report z-scores

Percentile rank: reflects the percentage of people scoring below a given point (so a percentile rank of 20 indicates that only 20% of individuals scored below this point)

-range from 1 to 99 (a rank of 50 indicates the median score)

-percentile rank is not the same as percentage correct: a percentile rank of 60 means the examinee scored better than 60% of the sample, NOT that they correctly answered 60% of the questions

-percentile (not percentile rank): the point in a distribution at which a specific percentage of scores are less than or equal to a specified score (so a 60th percentile of 104 indicates that 60% of scores are 104 or below)
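
A minimal sketch of percentile rank in Python (the norm-group scores are invented, and this uses the common convention of counting only scores strictly below the given score):

```python
# Percentile rank = percentage of the norm group scoring below a given
# score. Norm-group scores here are invented illustration values.

def percentile_rank(score, norm_scores):
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

norms = [85, 90, 95, 100, 100, 105, 110, 115, 120, 125]
print(percentile_rank(105, norms))  # 50.0: better than half the group
```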

Quartile scores: lower 25%=1, 26 to 50%=2, 51 to 75%=3, upper 25%=4

Stanine: not as common as percentiles, expressed in whole numbers from 1 to 9, with 4, 5, and 6 being considered average

Criterion-referenced score interpretations: the examinee's performance is compared to a specified level of performance

-criterion-referenced interpretations are absolute: compared to an absolute standard

-often used in educational settings

Examples of criterion-referenced interpretations:

-percentage correct (i.e. 85% on a classroom test)

-mastery testing: a cut score is established (pass/fail driver's license)

-standards-based interpretations: involves 3 to 5 performance categories (i.e. assigned "A" to reflect superior work)

The terms norm-referenced and criterion-referenced apply to score interpretations

NOT tests!

Norm-referenced interpretations can be applied to both maximum performance and typical response tests

Criterion-referenced are typically applied only to maximum performance

Item Response Theory Scores (Rasch/IRT-scores, Change Sensitive Scores (or CSS)): fundamental for computer adaptive testing

-the theory holds that responses to items on a test are accounted for by latent traits

-latent trait: inferred to exist based on theory and evidence of its existence

-intelligence is a latent trait

IRT Scores cont'd: each examinee possesses a certain amount of intelligence

-IRT describes how examinees at different levels of ability will respond to individual test items

-the specific ability level of an examinee is defined as the level at which the examinee can answer half of the items correctly

-they can be transformed to either norm or criterion referenced scores

Qualitative descriptions of test scores: helps communicate test results (i.e. IQs 145 and above=very gifted, IQs 90-109=average)

-Test manuals should provide information on: normative samples (type of sample like national, size of sample, how well it matched U.S. population) and test scores (type of scores provided like T-score, how to transform raw scores, information on confidence intervals)

Reliability refers to the: consistency, accuracy, or stability of test scores

Factors that may affect reliability: time test was administered, items included, external distractions, internal distractions, person grading the test

Measurement Error: error is present in all measurement

even in physics it is reduced but not eliminated

Classical Test Theory (or CTT) is the most influential theory to help us understand measurement issues (Charles Spearman in the early 1900s):

-holds that every score has two components: true score that reflects the examinee's true skills AND error score which is the unexplained difference between a person's actual score on a test and that person's true score

Xi = T + E

Xi = Obtained or observed score

T = True score

E = Random measurement error

Random measurement error varies from:

-person to person

-test to test

-administration to administration

True score can not be directly measured:

It is a theoretical reflection of the actual amount of the trait so all we see is an observed score

Measurement error:

-Random

-Systematic

Random measurement error is the result of chance factors

-It can increase or decrease an individual's observed score

-It reduces:

the usefulness of measurement, ability to generalize, confidence in test results

-Random error reduces the reliability of test results

if errors are responsible for much of the variability, test scores will be inconsistent; if errors have little effect on test scores, the test reflects mainly consistent aspects of performance

Systematic measurement error: increases or decreases the true score by same amount each time (E.g., scale that adds 2 pounds, social desirability)

-Does not lower reliability: the test is reliably inaccurate by the same amount each time

-It is difficult to identify

-It is not considered in reliability analysis

Measurement errors are random: Equally likely to be positive or negative, over an infinite number of testings the error will increase and decrease a person's score by the same amount, and errors will tend to average zero

-Making a test longer also reduces the influence of random error for the same reason

-Error is normally distributed

-Reduce the error and reliability increases

-Job is to reduce the sources of error as much as possible

Sources of measurement error: tests rarely include every possible question

-Content sampling error (considered the largest source of measurement error): differences between the sample of items on the test and the total domain of items (all possible items); if the items are a good sample of the domain, content error will be small

-Time sampling error (temporal stability): random fluctuations in performance over time, includes changes in examinee like fatigue and the environment like distractions

-inter-rater differences: when scoring is subjective

-errors in administration

-clerical errors

Reliability coefficients: CTT: Xi = T + E, extended to incorporate the concept of variance: σ²X = σ²T + σ²E

σ²X = Observed score variance

σ²T = True score variance

σ²E = Error score variance

General symbol for reliability is rxx: rxx = σ²T / σ²X

(reliable tests will have positive signs)

-Reliability is the ratio of true score variance to total score variance

-Reliability is the proportion of test score variance due to true score variance

Reliability coefficients are correlation coefficients: reflect the proportion of test score variance attributable to true score variance

-so rxx = .90 indicates that 90% of the score variance is due to true score variance

-there are different ways to obtain the scores that are correlated
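
The idea that reliability is the ratio of true score variance to observed score variance can be checked with a quick simulation of the CTT model X = T + E (all numbers below are invented):

```python
# Simulate observed scores as true score + random error, then verify
# that rxx ≈ σ²T / σ²X. The SDs (15 for T, 5 for E) are arbitrary.
import random
from statistics import pvariance

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(10000)]   # T
observed = [t + random.gauss(0, 5) for t in true_scores]      # X = T + E

rxx = pvariance(true_scores) / pvariance(observed)
# Expected value: 15² / (15² + 5²) = 225 / 250 = 0.90
print(round(rxx, 2))
```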

Psychologists use different methods for checking reliability:

-Test-retest reliability

-Alternate forms

-Internal consistency

-Inter-rater agreement

Test-Retest Reliability: administer the same test on two occasions, correlate the scores from both administrations, primarily reflects time sampling error

-reflects the degree to which test scores can be generalized to different situations or over time

-important to consider length of interval between testing

-optimal interval is determined by the way test results are used (e.g., intelligence vs. mood)

-Carry-over effects

-Practice and memory effects

-Characteristics of attribute may change with time, also time consuming and expensive

Procedure: test-retest

-Administering a test to a group of individuals

-Re-administering the same test at a later time

-Compute the correlation between both scores, should be above .70

Alternate-Form Reliability (like test form "A" and "B"): Requires two equivalent or parallel forms, correlate the scores of the different forms, can be administered simultaneously (time error) or delayed (content and time error)

-Alternate-form reliability may reduce, but typically not eliminate carryover effects

-Few tests have alternate forms

Internal Consistency: Estimates errors related to content sampling, Extent to which individuals respond similarly to items measuring the same concept, single administration

-Split-Half Reliability

-Coefficient alpha

-Kuder-Richardson

Split-Half Reliability: Administer the test, then divide it into two equivalent halves, Correlate the scores for the half tests

-How to split a test? First half vs. second half, odd-even split, or randomly

-longer tests more reliable

-twice as many test items, able to sample domain more accurately

-better sample of domain, lower error due to content sampling and higher reliability

-BUT, splitting the test makes it shorter, which lowers reliability

Adjusting Split-Half Estimates: Correction formula: The Spearman-Brown formula; statistically adjusts reliability coefficient when test length is reduced to estimate what the reliability would have been if test were longer

rt = 2rh / (1 + rh)

rh = the half correlation

Split-Half Method

-Advantages: No need for separate administrations or alternate forms

-Problems: Primarily reflects content-sampling error, and the correlation may vary depending on how the test is split

Coefficient Alpha: sensitive to content-sampling error and item heterogeneity; can be calculated from one test administration; used as a measure of reliability

-Examines the consistency of responding to all items

-Represents the mean reliability coefficient from all possible split halves

-Especially useful for tests that do not have right or wrong answers

(E.g., attitudes, personality)

Reliability coefficient is a function of:

-Extent to which each item represents an observation of the same "thing" observed by other test items

-Number of observations one makes

rxx = k(rij) / [1 + (k-1)(rij)]

k = number of items in the test

rij = average inter-correlation among test items

-Compute the correlations among all items

-Compute the average of those inter-correlations

-Use formula to obtain standardized estimate

One way to increase reliability is to increase the number of items:

-Each item represents an individual assessment of the true score

-With multiple items combined, errors will tend to average out

-Therefore, increasing the number of items increases reliability

Kuder-Richardson Reliability:

Applicable when tests are scored dichotomously (i.e., right or wrong, scored 0 or 1)
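
For dichotomous items, the standard KR-20 formula (a special case of coefficient alpha; the formula itself is not shown in these notes) can be sketched as follows, with invented 0/1 item scores:

```python
# KR-20 = (k / (k-1)) * (1 - Σ(p*q) / σ²_total), where p is the
# proportion passing each item, q = 1 - p, and σ²_total is the
# variance of total scores. Item data are invented.
from statistics import pvariance

scores = [        # rows = examinees, columns = items scored 0/1
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
k = len(scores[0])
totals = [sum(row) for row in scores]
p = [sum(col) / len(col) for col in zip(*scores)]   # proportion correct per item
pq = sum(pi * (1 - pi) for pi in p)
kr20 = (k / (k - 1)) * (1 - pq / pvariance(totals))
print(round(kr20, 2))
```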

Inter-Rater Reliability: Two or more individuals score the same test independently

-Calculate correlation between the scores

- Appropriate when scoring requires making judgments

-Important when scoring is subjective

-A popular index to estimate inter-rater agreement is Cohen's Kappa (categorical data)
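
Cohen's kappa corrects observed agreement for agreement expected by chance, kappa = (po - pe) / (1 - pe). A minimal sketch with invented rater codes:

```python
# Cohen's kappa for two raters assigning categorical codes.
# po = observed agreement; pe = chance agreement from the raters'
# marginal category frequencies. Labels are invented.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
n = len(rater_a)

p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
ca, cb = Counter(rater_a), Counter(rater_b)
p_e = sum(ca[c] * cb[c] for c in ca) / n**2               # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))
```

Here the raters agree on 6 of 8 cases (po = .75), but kappa is noticeably lower because some of that agreement is expected by chance.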

Interpreting Reliability Coefficients: The proportion of a scale's total variance that is attributable to a true score

rxx = 1 - proportion of error variance

SO, for example, rxx = .80, i.e., 20% of variability is due to unsystematic variance

Composite scores: when scores are combined to form a composite (like IQ scores)

-the reliability of a composite score is better than that of the individual scores making up the composite

-tests are simply samples of the test domain

-combining multiple measures is analogous to increasing the number of observations

Difference scores: involves calculating the difference between two scores (i.e. D = X - Y, where D = Achievement test - IQ Score)

-the reliability of difference scores is typically lower than the individual scores
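
One common formula for the reliability of a difference score (assuming the two measures have equal variances) is r_dd = ((rxx + ryy)/2 - rxy) / (1 - rxy); the coefficients below are invented illustration values:

```python
# Why difference scores are less reliable than their components:
# r_dd = ((rxx + ryy) / 2 - rxy) / (1 - rxy), under an equal-variances
# assumption. All coefficients here are invented.

def diff_reliability(rxx, ryy, rxy):
    return ((rxx + ryy) / 2 - rxy) / (1 - rxy)

# Two reliable tests (.90 each) that correlate .60 with each other:
print(round(diff_reliability(0.90, 0.90, 0.60), 2))  # 0.75
```

The more strongly the two tests correlate, the less reliable their difference becomes, since the correlated (true) part cancels while the errors remain.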

If a test is to be administered multiple times: Test-Retest Reliability

Tests to be administered one time:

-Homogeneous content - coefficient alpha

-Heterogeneous content - split-half coefficient

Alternate Forms available:

Alternate form reliability: delayed and simultaneous

Factors to consider when evaluating reliability coefficients:

-Construct: what might be acceptable for measure of personality may not be for intelligence

-Time available for testing

-How the scores will be used

-Method of estimating reliability

The standard error of measurement (SEM) is more useful when interpreting test scores.

Reliability coefficients are most useful in comparing the scores produced by different tests.

Standard error of measurement: the SD of the distribution of scores that would be obtained by one person if he or she were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same content domain

-Function of the reliability coefficient and standard deviation of the scores

-As reliability increases, the SEM decreases

Confidence Intervals: reflect a range that, with a given level of confidence, contains the examinee's true score

-Confidence intervals are calculated using the SEM and the SD of the scores

-As reliability increases, SEM and confidence intervals get smaller

About 68% of the scores in a normal distribution are located between 1 SD above and below the mean

If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 67 and 73

About 95% of the scores in a normal distribution are located between 1.96 SD above and below the mean

If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 64.12 and 75.88
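
The arithmetic in the two examples above can be sketched in Python; SEM = SD × sqrt(1 - rxx), and the interval is X ± z × SEM (the SD of 15 and rxx of .96 are invented values chosen to give SEM = 3.0):

```python
# SEM and confidence intervals around an obtained score.
# SEM = SD * sqrt(1 - rxx); interval = X ± z * SEM.
import math

def sem(sd, rxx):
    """Standard error of measurement from the score SD and reliability."""
    return sd * math.sqrt(1 - rxx)

def confidence_interval(score, sem_value, z=1.96):
    """z = 1.0 gives a ~68% interval; z = 1.96 gives a ~95% interval."""
    return (score - z * sem_value, score + z * sem_value)

s = sem(15, 0.96)                       # SD = 15, rxx = .96 -> SEM = 3.0
lo, hi = confidence_interval(70, s, z=1.0)
print(round(lo, 2), round(hi, 2))       # 67.0 73.0  (~68% interval)
lo, hi = confidence_interval(70, s)
print(round(lo, 2), round(hi, 2))       # 64.12 75.88  (~95% interval)
```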

The SEM and confidence intervals remind us that scores are not perfect

-When the reliability of the test scores is high, the SEM is low because high reliability implies low random measurement error

-The smaller the standard error of measurement, the narrower the range

CTT: Only an undifferentiated error component

Generalizability theory: Shows how much variance is associated with different sources of error

Reliability information reported as a Test Information Function (TIF): A TIF illustrates reliability at different points along the distribution.

TIFs can be converted into an analog of the SEM.

How Test Manuals Report Reliability Information:

At a minimum, manuals should report: internal consistency reliability estimates, test-retest reliability, standard error of measurement (SEM), and information on confidence intervals (typically 90% and 95% intervals)

Validity: refers to the appropriateness and accuracy of the interpretation of test scores (does the test measure what it is designed to measure?)

if test scores are interpreted in multiple ways, each interpretation needs to be evaluated

An achievement test can be used to:

-evaluate students' performance

-assign a student to an appropriate instructional program

-evaluate a learning disability

(the validity of each of these interpretations needs to be evaluated)

Reliability tells us whether a test measures whatever it measures consistently

Validity is about our confidence that interpretations we make from a test score are likely to be correct

Reliability is a necessary, but insufficient, condition for validity.

-For interpretation of scores to be valid, test scores must be reliable.

-However, reliable scores do not guarantee valid score interpretations.

Construct underrepresentation: Present when the test does not measure important aspects of the specified construct.

A test of math skills that contains division problems only

Construct-irrelevant variance:

Present when the test measures features that are unrelated to the specified construct.

A math test with complex written instructions

External Features that Can Impact Validity

-Examinee characteristics (e.g., anxiety): on a maximum performance test, low motivation or high anxiety can affect interpretations; on a typical response test, the client may attempt to present him/herself in a more or less pathological manner

-deviation from standard test administration/scoring procedures (follow time limits/provide instructions)

-instruction and coaching

-appropriateness of standardization sample (norm-referenced interpretations)

Traditional validity nomenclature:

-Content Validity: is the content of the test relevant and representative of the domain?

-Criterion-Related Validity: involves examining the relationships between the test and external variables

-Construct Validity: involves an integration of evidence that relates to the meaning of the test scores

Traditional validity nomenclature suggests that there are different "types" of validity

-Modern conceptualization views validity as a unitary concept.

-Not types of validity but sources of validity evidence.

-The current view is that validity is a single concept with multiple sources of evidence to demonstrate it

Sources of Validity Evidence: Standards for Educational and Psychological Testing (1999) describe five sources of evidence:

-Evidence Based on Test Content

-Evidence Based on Relations to Other Variables

-Evidence Based on Internal Structure

-Evidence Based on Response Processes

-Evidence Based on Consequences of Testing

Evidence Based on Test Content: Traditionally referred to as content validity; examines the relationship between the content of the test and the construct it is designed to measure. Does the test cover the content that it is supposed to cover?

-The process of establishing content relevance starts at the early stages of development: identify what we want to measure and delineate the construct or content domain to be measured

-Typical response scale to measure anxiety: Experts review clinical and research literature and develop items designed to assess the theoretical construct being measured

-Test developers include a detailed description of procedures for writing items as validity evidence

After the test is developed, developers continue collecting validity evidence based on content

-A qualitative process: expert judges review the correspondence between test content and the construct

-Experts: same who help during test construction or independent group

Experts evaluate two major issues:

-Item Relevance: Does each individual item reflect content in the specified domain?

-Content Coverage: Does the overall test reflect the essential content in the domain?

Content-based validity is especially important for:

-Academic achievement tests

-Employment tests: sample of skills needed to succeed at job and used to demonstrate consistency between content of test and job requirements

Face Validity

-not a form of validity

-Does the test "appear to measure" what it is designed to measure to the general public?

-Tests with "face validity" are usually better received by the public.

Evidence Based on Relations to Other Variables: Historically referred to as criterion validity

-Obtained by examining relationships between test scores and other variables

-Several distinct applications:

Test-Criterion Evidence, Convergent and Discriminant Evidence, and Contrasted Groups Studies

Test-Criterion Evidence: Criterion: Measure of some outcome of interest

-Many tests are designed to predict performance on some variable (the criterion)

-Can test scores predict performance on a criterion? (e.g., SAT predict college GPA)

-Types of studies to collect test-criterion evidence: Predictive Studies and Concurrent Studies

Predictive studies involve a time interval between test and criterion.

In concurrent studies, the test and criterion are measured at the same time.

Predictive evidence of validity:

-Administering a test to applicants of a job

-Holding their scores for a pre-established period of time but not using those scores as part of selection process

-When time has elapsed, a measure of the behavior that the test was designed to predict (the criterion) is taken

-A test has predictive validity when its scores are significantly correlated with scores on the criterion

Concurrent evidence of validity:

-Collect criterion data from a group of current employees

-Give those same employees the test they wish to use as part of their selection process

-The test demonstrates evidence of concurrent validity if its scores are significantly correlated with scores on the criterion

Researchers use a correlation coefficient to examine the relationship between the criterion and the predictor

In this context, the correlation coefficient is referred to as the validity coefficient (rxy)

Issues in test-criterion studies

-Selecting a criterion: The criterion measure must be both valid and reliable

-Criterion contamination: Predictor and criterion scores must be obtained independently

-Interpreting validity coefficients: How large should validity coefficients be?

-Validity generalization

Convergent Evidence: Construct Validity

-Correlate test scores with tests of the same or similar construct

-Expect moderate to strong positive correlations (like anxiety and depression)

Discriminant Evidence: Construct Validity

-Correlate test with tests of a dissimilar construct

-Expect negative correlations (like self-esteem and anxiety)

Multitrait-Multimethod Studies: combine convergent and discriminant strategies

-Require examining two or more traits using two or more measurement methods

-Allow one to determine what the test correlates with (and does not correlate with), as well as how the method of measurement influences the relationship

Contrasted Group Studies: Examine different groups expected to differ on the construct measured by the test

Examples:

-Contrast depressed vs. non-depressed

-Young vs. old examinees

Evidence Based on Internal Structure: Examine the internal structure and determine if it matches the construct being measured

Factor analysis is a prominent technique.

Factor Analysis: A statistical method that evaluates the interrelationships of variables and derives factors

-Factor analysis allows one to detect the presence and structure of latent constructs among a set of variables.

-Factor analysis starts with a correlation matrix.

Evidence Based on Response Processes

-Are the responses invoked by the test consistent with the construct being assessed?

-Does a test of math reasoning require actual analysis and reasoning, or simply rote calculations?

-Can also include actions of those administering and grading the test.

Evidence Based on Consequences of Testing: informally called "consequential validity evidence"

-If the test is thought to result in benefits, are those benefits being achieved?

-Controversial

-Some suggest that this concept should incorporate social issues and values.

Validity Argument: Validation should involve the integration of multiple sources of evidence into a coherent commentary.

-All information on test quality is relevant to validity: score reliability, standardized administration and scoring, accurate scaling, equating, and standard setting, and attention to fairness

How Test Manuals Report Validity Evidence

-Different types of validity evidence are most applicable to different types of tests.

-The manual should use multiple sources of validity evidence to build a compelling validity argument.

In classical test theory, T stands for ____ score, X stands for ___ score and E stands for ____

true; observed; random measurement error

Define random error of measurement and provide an example

unpredictable error that affects scores as a result of chance factors; e.g., noise or distractions in the testing environment

Define systematic error of measurement and provide an example

error that affects scores in a consistent direction and amount on every administration; e.g., a scale that adds two extra pounds to every measurement of weight

____ error reduces the reliability of test results while ____ error does not lower reliability (test is reliably inaccurate by the same amount each time). Therefore, ____ error is the main focus of classical test theory.

Random; systematic; random

What conclusion could be drawn from a reliability coefficient of .75?

75% of observed-score variance is true-score variance; the remaining 25% is attributable to measurement error
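The arithmetic behind that interpretation, as a one-line sketch:

```python
# Reliability = proportion of observed-score variance that is true-score
# variance; the remainder is attributable to measurement error.
reliability = 0.75
error_proportion = 1 - reliability
print(error_proportion)  # 0.25
```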

____ reliability requires that two forms of the test are administered to the same group of individuals while in ____ a test developer gives the same test to the same group of test takers on two different occasion.

alternate form; test/retest

____ method of estimating reliability requires dividing the test into halves, then correlating examinees' scores on the first half with their scores on the second half.

split-half reliability
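A sketch with hypothetical 0/1 item scores, using an odd-even split and the Spearman-Brown correction (needed because each half is only half as long as the full test):

```python
import numpy as np

# Hypothetical responses: 6 examinees x 4 items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

# Odd-even split: total on items 1 and 3 vs. total on items 2 and 4.
half1 = scores[:, ::2].sum(axis=1)
half2 = scores[:, 1::2].sum(axis=1)

# Correlate the two half-test scores...
r_half = np.corrcoef(half1, half2)[0, 1]

# ...then apply the Spearman-Brown correction to estimate the
# reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
```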

The coefficient alpha is also known as the ____ of all possible split-half coefficients.

average (mean)
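Alpha can also be computed directly from item and total-score variances; a sketch with hypothetical 0/1 item scores (the helper name is mine):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 examinees x 4 items.
scores = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
```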

____ tests produce more reliable scores than ____ tests.

long; short
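The Spearman-Brown prophecy formula quantifies this: lengthening a test by a factor k (with comparable items) raises the estimated reliability, and shortening it lowers the estimate. A sketch:

```python
def spearman_brown(r, k):
    """Estimated reliability of a test lengthened by factor k, given current reliability r."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose scores have reliability .60 raises the estimate
# to about .75; halving the test drops it below .60.
doubled = spearman_brown(0.60, 2)
halved = spearman_brown(0.60, 0.5)
```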

Unreliable test scores will lead to ____ standard error of measurements.

larger

When interpreting the test scores of individuals, the ____ is more practical than the ____.

standard error of measurement; reliability coefficient
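A sketch of why the SEM is the more practical figure: it converts the reliability coefficient into score units and supports a confidence band around an individual's observed score (values below are hypothetical, using an IQ-style metric):

```python
import math

# SEM = SD * sqrt(1 - reliability), expressed in test-score units.
sd = 15.0           # IQ-style standard deviation
reliability = 0.91  # hypothetical reliability coefficient
sem = sd * math.sqrt(1 - reliability)   # 4.5 points

# Roughly 68% confidence band around an observed score of 100:
observed = 100
band = (observed - sem, observed + sem)
```

Note that a less reliable test would yield a larger SEM and therefore a wider band.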

In terms of threats to validity....

construct underrepresentation is present when the test does not measure important aspects of the specified construct

On the other hand, ....

construct irrelevant variance is present when the test measures features that are unrelated to the specified construct

Common threats to validity

-examinee characteristics (high test anxiety)

-deviations from standard test procedures

Contemporary conceptualizations view validity as a....

unitary construct while

Traditional nomenclature suggests that there are three different....

types of validity

Validity evidence based on ....

test content is produced by an examination of the relationship between the content of the test and the construct or domain the test is designed to measure

____validity is not technically a form of validity and refers to the degree to which a test 'appears' to measure what it is designed to measure

Face

Examples in which validity evidence is based on relations to other variables

The GRE given to students prior to entering their first year of graduate school

____studies involve a time interval between test and criterion but in ____studies the test and criterion are measured at the same time.

Predictive; concurrent

"Correlating scores on a new test to measure anxiety with a measure of sensation seeking" is an example of ____validity

discriminant

"Correlating scores on a new IQ test with scores on the Wechsler Intelligence Scale" is an example of ____validity

convergent

____ ____ studies combine convergent and divergent strategies.

Multitrait-multimethod

____ ____ allows one to detect the presence and structure of latent constructs among a set of variables.

Factor analysis

____ ____ is a statistical procedure that allows one to predict performance on one test from performance on another (given that both are correlated with each other).

Linear regression
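A sketch of that prediction with hypothetical paired test scores, fitting the least-squares line Y-hat = a + bX:

```python
import numpy as np

# Hypothetical scores: predictor test X and criterion test Y for 6 examinees.
X = np.array([85, 90, 100, 105, 110, 120], dtype=float)
Y = np.array([80, 92, 98, 108, 107, 125], dtype=float)

# Least-squares slope and intercept.
b = np.cov(X, Y, ddof=1)[0, 1] / X.var(ddof=1)
a = Y.mean() - b * X.mean()

# Predicted criterion score for a new examinee with X = 115.
predicted = a + b * 115
```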

____ ____ is a method of obtaining validity that examines different groups expected to differ on the construct measured by the test, e.g., contrasting depressed vs. non-depressed groups.

Contrasted group studies
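A sketch of a contrasted-group check using a standardized mean difference (Cohen's d) on hypothetical inventory scores; a valid measure of the construct should clearly separate groups known to differ on it:

```python
import numpy as np

# Hypothetical depression-inventory totals for two contrasted groups.
depressed = np.array([28, 31, 25, 30, 27, 29], dtype=float)
control = np.array([12, 15, 10, 14, 11, 13], dtype=float)

# Cohen's d with a pooled standard deviation (equal group sizes assumed).
pooled_sd = np.sqrt((depressed.var(ddof=1) + control.var(ddof=1)) / 2)
d = (depressed.mean() - control.mean()) / pooled_sd
```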

It is important that predictor and criterion scores be obtained independently in order to avoid ____ ____

criterion contamination

"Correlating scores on a new IQ test with scores on the Wechsler Intelligence Scale" is an example of ____validity

convergent