Terms in this set (22)
Behavior or performance that is usual, average, normal, standard, expected, or typical.
Normative data; the test performance data of a group of test takers, designed as a reference for interpreting, or otherwise placing in context individual test scores.
Criterion-Referenced Test
Describes the specific types of skills, tasks, or knowledge that the test taker can demonstrate
Knowledge that we want the person to have
Contrast with norm-referenced test
Mostly applied in educational settings
Setting cutoffs to determine pass/fail
Ex: Driver's license test, college exam
Norm-Referenced Test
Compares one person's results with those of others to see where that person stands relative to the mean and standard deviation (SD).
Useful primarily when we need to compare individuals with one another or with a reference group in order to evaluate differences between them on the characteristic the test measures
Ex: Career, personality inventories
Relates a level of test performance to the age of people who have taken the test:
Mainly relates to child development
Most IQ tests (e.g., Stanford-Binet IQ test)
Derived by locating the performance of test takers within the norms of the students at each grade level, and fractions of grade levels, in the standardization sample
Can be misleading
Equating alternate forms (same examinees)
Distribute both forms of the test randomly to a large representative sample of examinees.
Generate the descriptive statistics (mean, SD) on both forms of the test.
Equate a raw score on one form to the scale score of the second form (using z scores).
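The z-score step above can be sketched as a linear equating function. This is a minimal illustration, not a full equating procedure; the function name `equate_linear` and the score lists are hypothetical, and population SDs are assumed.

```python
from statistics import mean, pstdev

def equate_linear(raw_score, form_a_scores, form_b_scores):
    """Map a raw score on form A onto the form B scale via z scores."""
    # Convert the raw score to a z score within form A.
    z = (raw_score - mean(form_a_scores)) / pstdev(form_a_scores)
    # Express that same z score in form B units.
    return mean(form_b_scores) + z * pstdev(form_b_scores)

# Hypothetical scores from the same examinees on both forms.
form_a = [40, 50, 60]
form_b = [70, 85, 100]

equated = equate_linear(60, form_a, form_b)
print(round(equated, 2))
```

A score one SD above the form A mean lands one SD above the form B mean, which is the sense in which the two forms are put on a common scale.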
Equating alternate forms (different examinees)
Give a new sample of examinees a selection of items from the old test (an anchor test) together with the new test.
The use of an anchor test allows us to separate differences among examinees from differences between the test forms.
The consistency or stability of a measure of behavior
The extent to which a score from a test is consistent and free from errors of measurement.
Reliability coefficients above 0.8 are very good; coefficients below 0.6 are considered unusable, and such measures should be removed.
Is caused by any factors that systematically affect measurement of the variable across the sample
________ _______ tend to be consistently either positive or negative; because of this, systematic error is sometimes considered to be bias in measurement
Does affect the average
Is caused by any factors that randomly affect measurement of the variable across the sample
It does not have any consistent effects across the entire sample
It adds variability to the data but does not affect average performance for the group
Examples that may cause this are: stress, guessing, external distractions, subjective scoring etc.
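The contrast between the two kinds of error can be seen in a small numeric sketch. The scores and error values below are made up for illustration: a constant bias stands in for systematic error, and a set of errors that sum to zero stands in for random error.

```python
from statistics import mean, pvariance

true_scores = [10, 20, 30, 40]  # hypothetical true scores

# Systematic error: the same +2 bias applied to every examinee.
with_systematic = [t + 2 for t in true_scores]

# Random error: hypothetical errors that average to zero across the sample.
random_errors = [1, -1, -1, 1]
with_random = [t + e for t, e in zip(true_scores, random_errors)]

# Systematic error shifts the group average; random error does not.
print(mean(true_scores), mean(with_systematic), mean(with_random))

# Random error adds variability; a constant bias does not.
print(pvariance(true_scores), pvariance(with_systematic), pvariance(with_random))
```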
True Score Theory
• When you take a test, the developer is interested in the differences among people. We don't want every person to get the same score.
• We want a measure that will be as accurate as possible.
• Interested in the variance among all the participants.
Characteristics of Errors
Mean error of measurement = 0
True scores & error are uncorrelated (rte=0)
The standard deviation of the error is greater than zero (sde>0)
The proportion of variance in test scores that is due to or accounted for by variability in true scores.
r_xx = S_true^2 / S_observed^2 (true-score variance divided by observed-score variance)
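As a rough illustration, reliability as a ratio of true-score variance to observed-score variance can be computed on made-up data. True scores are never observed directly in practice, so the values below are hypothetical and serve only to show the arithmetic; the errors are chosen to have a mean of zero and to be uncorrelated with the true scores.

```python
from statistics import pvariance

true_scores = [10, 20, 30, 40]   # hypothetical, unobservable in practice
errors = [1, -1, -1, 1]          # mean 0, uncorrelated with true scores
observed = [t + e for t, e in zip(true_scores, errors)]

def reliability(true_values, observed_values):
    """Proportion of observed-score variance due to true-score variance."""
    return pvariance(true_values) / pvariance(observed_values)

r_xx = reliability(true_scores, observed)
print(round(r_xx, 3))
```

Because the errors are uncorrelated with the true scores, the observed variance here equals the true variance plus the error variance.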
Reflects the stability of a test over time
The same test is administered to the same applicants at two different testing periods.
Scores at time one are correlated with scores at time two.
Appropriate time to re-test is between 2 weeks and 6 months.
This is good for tests that we do not expect to have practice effects, and tests in which the measured trait will not change.
Errors tend to be related to the administrator or the individual (possibly reflecting random error).
Good: tests of achievement
Bad: tests of traits that change over time (e.g., developmental traits)
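Correlating time-one scores with time-two scores can be sketched with a plain Pearson correlation. The helper `pearson_r` and the two score lists are hypothetical and only illustrate the calculation.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five applicants at two testing periods.
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]

test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 2))
```

The same calculation applies when correlating scores on two alternate forms instead of two testing periods.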
Alternate forms reliability
Reflects the equivalence of 2 alternate forms of a test
Both forms are administered to the same applicants. The forms measure the same attribute and are similar. They can be administered in the same session or at two different testing periods. Scores on the first form are then correlated with scores on the second form.
Tests should be equivalent in terms of content, response process and statistical characteristics.
Counterbalancing (within one testing period, giving half the group test A first and the other half test B first) is used to guard against order effects.
This type of test is difficult to develop: it requires two tests that measure the same thing but differ in their specific content.
Reflects the consistency of the test items by measuring item homogeneity/heterogeneity. Do the items overlap and measure the same thing?
The longer the test, the higher its ________ ________, only if all the questions measure the same trait.
Common methods for determining internal reliability:
Split-half method, Internal consistency methods (Cronbach's Coefficient Alpha; Kuder-Richardson Formula 20)
Split-half method
A method used to estimate internal reliability. The test is administered once to the same applicants, then split in half, and scores on the two halves are correlated with each other. Splitting the test in two is like turning one long test into two shorter tests, which lowers its estimated internal reliability. To correct for this, the Spearman-Brown prophecy formula is used to adjust the correlation.
A formula that takes the correlation obtained by the split-half method and adjusts it to estimate the reliability of the full-length test. It accounts for the change in test length.
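A minimal sketch of the Spearman-Brown adjustment, using the standard form r_full = n * r / (1 + (n - 1) * r), where n is the factor by which the test is lengthened (2 for a split-half correction). The function name and the example correlation are illustrative.

```python
def spearman_brown(r_half, length_factor=2):
    """Estimate reliability after lengthening a test by length_factor."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)

# Hypothetical split-half correlation of 0.60 between the two halves.
corrected = spearman_brown(0.60)
print(round(corrected, 2))
```

The corrected value is higher than the raw half-test correlation, reflecting that the full test is twice as long as either half.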
Cronbach's coefficient alpha
Used with ratio or interval data to estimate internal reliability. Based on a single administration of the same test to the same applicants, and computed from the average intercorrelation among test items.
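One common way to compute coefficient alpha uses item variances and the variance of total scores: alpha = k / (k - 1) * (1 - sum of item variances / total-score variance). The sketch below assumes that form; the function name and the response matrix (rows are respondents, columns are items) are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha from a list of respondents' item scores."""
    k = len(rows[0])  # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings from five respondents on three items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```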
Kuder-Richardson Formula 20
Used for tests with dichotomous items (yes-no; true-false) to estimate internal reliability. Based on a single administration of the same test to the same applicants, and computed from the average intercorrelation among test items.
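For dichotomous items, KR-20 is usually written as k / (k - 1) * (1 - sum(p * q) / total-score variance), where p is the proportion answering an item correctly and q = 1 - p. A small sketch under that assumption, with a hypothetical function name and made-up 0/1 responses:

```python
from statistics import pvariance

def kr20(rows):
    """KR-20 for rows of 0/1 item scores (one row per respondent)."""
    n = len(rows)
    k = len(rows[0])
    # Proportion of respondents passing each item.
    p = [sum(row[i] for row in rows) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Population variance of total scores, matching the p * q item variances.
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical true-false responses: four respondents, three items.
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

print(kr20(answers))
```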
Inter-rater (interobserver) reliability
Used when human judgment of performance is involved in the scoring process
Refers to the degree of agreement between two or more raters, estimated by correlating the judges' ratings.
Standard error of measurement (SEM)
A number representing how far an observed score is likely to be from the true score. It indicates how much measurement error affects scores. A low value is desirable.
_____ ______ ____ _______ = standard deviation of the errors of measurement (also known as SEest or SEE)
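The SEM is commonly computed from the test's standard deviation and its reliability coefficient as SEM = SD * sqrt(1 - r_xx). A short sketch of that formula; the function name and the example values (SD of 15, reliability of 0.91) are illustrative.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM from the score standard deviation and the reliability coefficient."""
    return sd * sqrt(1 - reliability)

# Hypothetical test with SD = 15 and reliability r_xx = 0.91.
sem = standard_error_of_measurement(15, 0.91)
print(round(sem, 2))
```

Higher reliability gives a smaller SEM, which matches the idea that a low value means observed scores sit close to true scores.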
When a tester divides the number of students who answered a test item correctly by the number of students who attempted the item, the result is called
According to the discussion of the race and IQ controversy
Wechsler developed the deviation IQ based on the normal distribution. What would an overall IQ of 130 mean on his test?
Mr. Nicklaus has requested a TAKS test which has fewer choices for his student based on the student's IEP. Which accommodation is he using?