Psych 440 ch 8
Terms in this set (86)
For a norm-referenced test, a good item is one where
people who scored high on the test tended to get it right, and people who scored low tended to get it wrong
For a criterion-referenced test, the items need to
assess mastery of the concepts
Scaling is the process of
selecting rules for assigning numbers to measurements of varying amounts of some trait, attribute, or characteristic
Likert (& Likert-type)
test-taker is presented with five alternative responses on some continuum
-generally reliable
-result in ordinal-level data
-summative scale (see the scoring sketch below)
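Because Likert scales are summative, a respondent's scale score is simply the sum of their item ratings. A minimal sketch in Python, with hypothetical item names and ratings:

```python
# Summative (Likert) scoring: each item is rated on a 1-5 continuum and the
# ratings are summed into a total scale score. All data here are hypothetical.
responses = {"item_1": 4, "item_2": 5, "item_3": 2, "item_4": 4}

total_score = sum(responses.values())  # summative scale: total = sum of ratings
print(total_score)  # 15
```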
Method of Paired Comparisons
test-taker is presented with two test stimuli and asked to make some sort of comparison
Sorting Tasks
test-takers are asked to order stimuli on the basis of some rule
-Categorical - stimuli are placed in categories
-Comparative - stimuli are placed in an order
Guttman Scale
-Items range from weaker to stronger expressions of variable being measured
-Arranged so that agreement with stronger statements implies agreement with milder statements as well
-ordinal data
Thurstone Scaling Method
Process designed for developing a 'true' interval scale
Thurstone Scaling Method: steps
-Start with a large item pool
-Get ratings of the items from experts
-Items are selected using a statistical evaluation of the judges' ratings
Test Construction: Choosing Your Item Type
selected or constructed
Selected response items
generally take less time to answer and are often used when breadth of knowledge is being assessed
Constructed response items
are more time consuming to answer and are often used to assess depth of knowledge
Item types: Advantages and Disadvantages
...
Test Construction: Writing Items
A rule of thumb is to write twice as many items for each construct as are intended for the final version of the test
An ITEM POOL is a
reservoir of potential items that may or may not be used on a test
Test Construction: Scoring Items
Decisions about scoring of items are related to the scaling methods used when designing the test
options: cumulative, class (categorical), or ipsative scoring
Stage 3: Test Tryout
-Should use participants and conditions that match the test's intended use
-A rule of thumb is that initial studies should use five or more participants for each item in the test
Guessing and Faking
-Guessing is only an issue for tests where a "correct answer" exists
-Faking can be an issue with attitude measures
Guessing Correction Methods
- Verbal or written instructions that discourage guessing
- Penalties for incorrect answers (e.g., the test-taker gets no points for a blank answer but loses points for an incorrect answer)
- Not counting omitted answers as incorrect
- Ignoring the issue
Faking Corrections
- Lie scales
- Social Desirability scales
- Fake Good/Bad scales
- Infrequent response items
- Total score corrections based on scores obtained from measures of faking
- Using measures with low face validity
Stage 4: Item Analysis
- A good test is made up of good items
• Good items are reliable (consistent)
• Good items are valid (measure what they are supposed to measure)
• Just like a good test!
- Good items also help discriminate between test-takers on the basis of some attribute.
- Item Analysis is used to differentiate good items from bad items
Item Analysis - Basic Procedures
Procedures used may vary depending upon the goals of the developer
-goals: enhancing certain forms of reliability, certain forms of validity, or discrimination
Four indices are used to analyze and select items:
-Indices of item difficulty
-Indices of item reliability
-Indices of item validity
-Indices of item discrimination
Ideally, if we develop a test that has "correct" and "incorrect" answers
we would like test-takers who are highest on the attribute to get more items correct than those who are not high on that attribute
The proportion of the total number of test-takers who got the item right (p_i)
p_1 = .90: 90% got the item correct
p_2 = .80: 80% got the item correct
p_3 = .75: 75% got the item correct
p_4 = .25: 25% got the item correct
Ideal Average
Ideal average p_i is halfway between the chance-guessing rate and 1.0 (see the sketch below)
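A minimal sketch of both calculations, using hypothetical response data; for guessing, chance is 1 divided by the number of answer options:

```python
# Item difficulty p_i = proportion of test-takers who answered item i correctly.
# Ideal average difficulty = halfway between the chance-guessing rate and 1.0.

def item_difficulty(scores):
    """scores: list of 0/1 item scores, one per test-taker."""
    return sum(scores) / len(scores)

def ideal_average_difficulty(n_options):
    """Halfway between chance (1 / number of options) and 1.0."""
    chance = 1 / n_options
    return (chance + 1.0) / 2

print(item_difficulty([1, 1, 1, 0]))   # 0.75: three of four test-takers correct
print(ideal_average_difficulty(4))     # 0.625 for a 4-option multiple-choice item
print(ideal_average_difficulty(2))     # 0.75 for a true/false item
```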
Item-Total Correlation
A simple correlation between the score on an item and the total test score
Advantages of Item-Total Correlation
-can test statistical significance of the correlation
-can interpret the % of variability in total scores that the item accounts for (r_it^2)
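A minimal sketch with hypothetical scores (statistics.correlation requires Python 3.10+; a library function such as scipy.stats.pearsonr would also return a significance test):

```python
# Item-total correlation: Pearson r between one item's scores and total test
# scores. In practice the item is often removed from the total first (a
# "corrected" item-total correlation) to avoid inflating r. Data hypothetical.
from statistics import correlation  # Python 3.10+

item_scores  = [1, 0, 1, 1, 0, 1, 0, 1]
total_scores = [42, 30, 38, 45, 28, 40, 33, 44]

r_it = correlation(item_scores, total_scores)
print(round(r_it, 3))       # strong positive r for these data
print(round(r_it ** 2, 3))  # r_it^2: share of total-score variance the item accounts for
```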
Item-Reliability Index
is the product of the item-score standard deviation and the correlation between the item score and the total test score
-Provides an indication of the test's internal consistency; the higher the index, the higher the consistency
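A minimal sketch, reusing the hypothetical data above:

```python
# Item-reliability index = item-score standard deviation x item-total correlation.
from statistics import correlation, pstdev  # Python 3.10+

item_scores  = [1, 0, 1, 1, 0, 1, 0, 1]
total_scores = [42, 30, 38, 45, 28, 40, 33, 44]

s_i  = pstdev(item_scores)                     # item-score standard deviation
r_it = correlation(item_scores, total_scores)  # item-total correlation
print(round(s_i * r_it, 3))                    # item-reliability index
```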
Item-Reliability
Remember, internal consistency is a measure of how well all items on a test are measuring the same construct
-another way to assess internal consistency is factor analysis
Item-Discrimination Index
If discrimination between those who are high and those who are low on some construct is the goal
-we would want items with higher proportions of high scorers getting the item "correct"
-and lower proportions of low scorers getting the item "correct"
Item-Discrimination Index is used to
compare the performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
Item-Discrimination Index symbolized by
d - compares proportion of high scorers getting item "correct" and proportion of low scorers getting item "correct"
d = (U - L) / n, where U = the number of high scorers who answered the item correctly, L = the number of low scorers who answered it correctly, and n = the number of test-takers in each group
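A worked example with hypothetical counts (matching the d = .60 interpretation card later in this set):

```python
# Item-discrimination index d = (U - L) / n, with hypothetical group counts.
U = 27  # high scorers (upper group) who got the item correct
L = 9   # low scorers (lower group) who got the item correct
n = 30  # number of test-takers in each group

d = (U - L) / n
print(d)  # 0.6 -> large and positive: the item discriminates well
```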
Item-Discrimination (Method 2)
...
Empirically Keyed Scales
- Goal is to choose items that produce differences between the groups that are better than chance
- Resulting scales often have heterogeneous content and have a limited range of interpretation
- Used in clinical settings, especially for diagnosis of mental disorders
• Also used in career counseling
Stage 5: Test Revision
Modifying the test stimuli, administration, etc., on the basis of either quantitative or qualitative item analysis
Cross-Validation
Re-establishing the reliability and validity of the test with other samples
Item Fairness
-An item is unfair if it favors one particular group of examinees in relation to another
-Results in systematic differences between groups that are not due to the construct being tested
Items can be designed to measure breadth or depth of knowledge
It is difficult to measure both breadth and depth at the same time
Item difficulty and item discrimination are both important considerations for selecting effective items for a test
Optimal item difficulty (from a psychometric standpoint) may be impractical sometimes
Classical Test Theory formula
X=T+E
Standard Error of Measurement formula
SEM = σ√(1 - α)
α= reliability of the test
σ= variability of test scores
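A worked example with hypothetical values for the test's standard deviation and reliability:

```python
# SEM = sigma * sqrt(1 - alpha). Values below are hypothetical.
import math

sigma = 15.0  # variability (standard deviation) of test scores
alpha = 0.91  # reliability of the test

sem = sigma * math.sqrt(1 - alpha)
print(round(sem, 2))  # 4.5 -- a less reliable test would give a larger SEM
```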
Error in test construction
Item or content sampling: differences in item wording and how content is selected may produce error. This error is due to variation of items within a test or between different tests. It may have to do with how a behavior is sampled or what behavior is sampled.
Error in test administration
Anything that occurs during the administration of a test that could affect performance:
-Environmental factors
-Test-taker factors
-Examiner factors
Error in test scoring and interpretation
Subjectivity in scoring is a source of error variance. It is more likely to be a problem in non-objective personality tests, essay tests, and behavioral observations; computer scoring errors are another source.
The higher an item difficulty index is...
the easier the item is
An item difficulty index of .28 means that
28% of the test takers answered the item correctly; an item difficulty index of .73 means 73% of the test takers answered the item correctly. Therefore, the second item would be easier than the first one.
Cronbach's alpha is a measure of the
overall internal consistency of a scale. It is the mean of all possible split-half reliability measures for the scale
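A minimal sketch of the common computational form, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), using a hypothetical item-score matrix:

```python
# Cronbach's alpha from an item-score matrix (rows = test-takers, columns = items).
from statistics import pvariance

data = [  # hypothetical 0/1 item scores: 5 test-takers x 4 items
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

k = len(data[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*data)]  # variance of each item
total_var = pvariance([sum(row) for row in data])   # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # ~0.79 for these hypothetical data
```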
Item-total correlations are a measure of the
amount of covariance (overlap in variability) between an item and the rest of the scale
-They're measures of the validity of individual items within a scale.
If an item-total correlation is low
that's a sign that an item should be eliminated
An item will have poor content validity if it is
not measuring the same construct as other items in the scale
Items that measure the same construct as the rest of the scale will have
good content validity and strong, positive item-total correlations
Item difficulty indexes are calculated through
dividing the total number of correct responses by the total number of all responses
Item discrimination indexes are calculated through
dividing the difference between the number of correct responses from experts and from novices by the average number of people in each group
The lower the reliability
the larger the SEM
If the test is reliable, the observed score will be
close to the true score, meaning measurement error is small; a test with high reliability will have a low SEM, and a test with low reliability will have a high SEM
a simple way of thinking about reliability is how consistent a test is
one way to check whether a test is reliable is to trial it with different samples and see whether you get the results you'd expect; a reliability index can also show a test's internal consistency
One of the simplest ways to tell if a test is reliable is to use the
test-retest method, in which the same group of participants takes the same exam twice over a set period of time. If the participants' scores on both administrations are similar, the test demonstrates reliability (see the sketch below)
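A minimal sketch with hypothetical scores from two administrations:

```python
# Test-retest reliability: correlate the same participants' scores across two
# administrations of the same test. Scores below are hypothetical.
from statistics import correlation  # Python 3.10+

time_1 = [88, 72, 95, 60, 81]
time_2 = [85, 75, 93, 63, 80]

r_tt = correlation(time_1, time_2)
print(round(r_tt, 3))  # close to 1.0 -> stable scores, evidence of reliability
```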
Can you have high reliability and low validity on a test?
Yes, you can have highly consistent results, or highly consistent data/questions, and also have low validity. The questions/results can be reliable, but they can also have nothing to do with what the test is actually trying to measure
Likert Scaling: good for
assessing constructs related to degree or frequency, such as political opinions or prevalence of happy moods
Guttman Scaling: good for assessing
constructs where ideas build on each other, such as attitudes toward how to best treat mental health challenges
Thurstone Scaling: good for assessing
attitude-related constructs that can be adapted to agree/disagree statements, where these statements correspond to a clear level of favorability toward the attitude topic
The observed score will be the best estimate of the
true score
The observed score will be the best estimate of the true score but not
an exact indicator, because of measurement error
SEM forces us to think of observed test scores as
indicating a potential range of scores for the individual
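A short illustration, assuming normally distributed measurement error and hypothetical values for the observed score and SEM:

```python
# Turning an observed score into a likely range for the true score:
# roughly 68% of the time the true score falls within +/- 1 SEM of the
# observed score, and roughly 95% within +/- 2 SEM (normal-error assumption).
observed, sem = 100, 4.5  # hypothetical values

print(observed - sem, observed + sem)          # ~68% band: 95.5 to 104.5
print(observed - 2 * sem, observed + 2 * sem)  # ~95% band: 91.0 to 109.0
```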
what is validity
how well a test measures what it's trying to measure
Content validity
1. find a precise definition of the construct being measured
2. use domain sampling to determine all possible behaviors that represent that construct
3. determine how well test items sample full domain of those behaviors
concurrent validity
how well results of this test align with other outcomes measured at the same time
predictive validity
how well test scores align with other outcomes measured after the test has been taken
predictive validity - incremental validity
type of predictive validity, does this measure add any predictive power?
convergent validity
does our measure highly correlate with other tests designed to measure the same construct? if so, good
discriminant validity
does our measure correlate with other tests of dissimilar constructs? if so, bad
fairness and bias determine test validity in
practice
bias
a problem with the test itself (statistical)
high systematic error that results in inaccurate measurements across groups
means the test cannot possibly be used fairly
fairness
a problem with the way the test is used
regardless of statistical properties, means the test is being used in a discriminatory way
item format
form, plan, structure, arrangement and layout of individual test items
selected response
fast, good for breadth of knowledge
more structured (more reliable)
constructed response
slower
good for depth of knowledge
more subjective (less reliable)
discrimination
how well items separate people who should do well from people who should not do well
item analysis - differentiating good from bad items
4 indices used
-item difficulty
-item reliability
-item validity
-item discrimination
item reliability and validity
often measured through confirmatory factor analysis
-FA (exploratory factor analysis): a measure of how many sources of variance there are in a test
-CFA (confirmatory factor analysis): a measure of the extent to which test variance aligns with theory
item difficulty
proportion of all people who got the question right
# of correct responses / total # of responses
item discrimination
based on the number of people in two subgroups who got it right
(# correct for experts - # correct for novices) / average # in each group
item discrimination interpretation
d=.60
positive - more experts than novices answered the item correctly
above 0 - there is a reasonable difference in performance between experts and novices
signs of a bad item
everyone getting it right or everyone getting it wrong (the item cannot discriminate)
distractors
incorrect answer options on a multiple-choice test
-distractor quality affects item difficulty and discrimination