Psych Testing Chapter 8: Test Development
Terms in this set (61)
test development is an umbrella term for all that goes into the process of creating a test
5 Stages of Test Development
1) test conceptualization
2) test construction
3) test tryout
4) item analysis
5) test revision
-the impetus for developing a new test is some thought that "there ought to be a test for..."
Stimulus of Test Conceptualization
-The stimulus could be knowledge of psychometric problems with other tests, a new social phenomenon, or any number of things.
-there may be a need to assess mastery in an emerging occupation
Preliminary Questions of Test Conceptualization
regarding the test:
-what is it designed to measure?
-what is the objective?
-is there a need for it?
-who will take/use it?
-what content will it cover?
-how will it be administered?
-what is the ideal format of it?
-should more than one form be developed?
-what special training will be required of users for administering or interpreting it?
-what types of responses will be required of testtakers?
-who benefits from an administration?
-is there any potential harm as a result of administration?
-how will meaning be attributed to scores on the test?
Item Development in Tests
-test items may be pilot studied to evaluate whether they should be included in the final form of the instrument
Item Development in Norm-Referenced Tests
Generally, a good item on a norm-referenced achievement test is one that high scorers on the test overall answer correctly and low scorers answer incorrectly.
Item Development in Criterion-Referenced Tests
-Ideally, each item on a criterion-oriented test addresses the issue of whether the respondent has met certain criteria.
-Development of a criterion-referenced test may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it.
Scaling
the process of setting rules for assigning numbers in measurement
-types of scales
Types of Scales
Scales are instruments to measure some trait, state, or ability. They may be categorized in many ways (e.g., unidimensional, multidimensional, etc.).
-L. L. Thurstone was influential in the development of sound scaling methods
Scaling Methods of Test Construction
Numbers can be assigned to responses to calculate test scores using a number of methods
-Method of Paired Comparisons
-Method of Equal-Appearing Intervals
Rating Scale
a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
-all rating scales result in ordinal level data
-some are unidimensional, others are multidimensional
Unidimensional Rating Scales
only one dimension is presumed to underlie the ratings
Multidimensional Rating Scales
more than one dimension is thought to underlie the ratings
Likert Scale
Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree/disagree or approve/disapprove continuum.
Method of Paired Comparisons
ex: select the behavior that you think would be more justified:
a) cheating on taxes if one has a chance
b) accepting a bribe in the course of one's duties
-For each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.
-The test score would reflect the number of times the choices of a testtaker agreed with those of the judges.
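The scoring rule described above can be sketched in a few lines of Python. The pairs, option labels, and function name here are hypothetical illustrations, not from the chapter:

```python
# Illustrative scoring for the method of paired comparisons (hypothetical data).
# For each pair, the testtaker earns a point when their choice matches the
# option a panel of judges deemed more justifiable.

def paired_comparison_score(choices, judge_key):
    """Count agreements between a testtaker's choices and the judges' key."""
    return sum(1 for pair, choice in choices.items() if judge_key[pair] == choice)

# Hypothetical pairs: each key is a pair of behaviors, each value the option
# the judges (or the testtaker) selected as more justified.
judge_key = {("cheat_taxes", "accept_bribe"): "cheat_taxes",
             ("cheat_taxes", "shoplift"): "shoplift"}
testtaker = {("cheat_taxes", "accept_bribe"): "cheat_taxes",
             ("cheat_taxes", "shoplift"): "cheat_taxes"}

print(paired_comparison_score(testtaker, judge_key))  # agrees on 1 of 2 pairs
```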
Comparative Scaling
Entails judgments of a stimulus in comparison with every other stimulus on the scale.
Categorical Scaling
Stimuli (e.g., index cards) are placed into one of two or more alternative categories.
Guttman Scale
Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
-All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
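That cumulative property can be checked programmatically. A minimal sketch, assuming items are ordered from mildest to strongest and coded 1 = endorsed, 0 = not endorsed (data and function name are hypothetical):

```python
# Sketch of a Guttman-pattern check (hypothetical responses).
# In a perfect Guttman scale, a respondent who endorses a stronger item
# also endorses every milder item before it.

def is_guttman_pattern(responses):
    """True if endorsements form an unbroken run starting at the mildest item."""
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True       # once a non-endorsement appears...
        elif seen_zero:
            return False           # ...no endorsement may follow it
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True: consistent with a Guttman scale
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False: strong item endorsed, milder one not
```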
Method of Equal-Appearing Intervals
can be used to obtain data that are interval in nature
Test Construction: Writing Items
Item Pool
The reservoir or well from which items will or will not be drawn for the final version of the test.
-comprehensive sampling provides a basis for content validity of the final version of the test.
Item Format
Includes variables such as the form, plan, structure, arrangement, and layout of individual test items.
-selected-response format
-constructed-response format
Selected-Response Format (Item Format)
items require testtakers to select a response from a set of alternative responses.
Constructed-Response Format (Item Format)
items require testtakers to supply or to create the correct answer, not merely to select it.
Multiple-Choice Format
has 3 elements:
1) a stem
2) a correct alternative or option
3) several incorrect alternatives (distractors)
-stem--> A psychological test, an interview, and a case study are:
-correct alt. --> a)psychological assessment tools
-distractors--> b) standardized behavioral samples; c) reliable assessment instruments; d) theory-linked measures
Writing Items for Computer Administration
-Item-bank: a relatively large and easily accessible collection of test questions
-computerized adaptive testing (CAT)
Computerized Adaptive Testing (CAT)
an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items.
-able to provide economy in testing time and number of items presented
-tends to reduce floor effects and ceiling effects
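The adaptive logic can be illustrated with a toy item-selection loop. This is only a sketch of the general idea; real CAT systems estimate ability with IRT models rather than simply stepping up and down a difficulty ladder, and all names and data here are hypothetical:

```python
# Minimal sketch of adaptive item selection (NOT a production CAT).
# Items are a hypothetical bank sorted by difficulty; after a correct answer
# the next item is harder, after an incorrect answer, easier.

def run_cat(answer_fn, difficulties, n_items=5):
    """Present n_items adaptively; answer_fn(difficulty) -> True if correct."""
    idx = len(difficulties) // 2          # start at a medium-difficulty item
    administered = []
    for _ in range(n_items):
        d = difficulties[idx]
        correct = answer_fn(d)
        administered.append((d, correct))
        # Step toward harder items on success, easier items on failure.
        idx = min(idx + 1, len(difficulties) - 1) if correct else max(idx - 1, 0)
    return administered

# Hypothetical testtaker who answers correctly up to difficulty 0.6.
history = run_cat(lambda d: d <= 0.6, [0.1, 0.3, 0.5, 0.7, 0.9])
print(history)  # oscillates between 0.5 and 0.7, bracketing the ability level
```

Note how the test quickly homes in on the region where the testtaker's success is uncertain, which is why CAT needs fewer items than a fixed-form test.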
Scoring Items in Test Construction
-cumulative scoring
-class or category scoring
-ipsative scoring
Cumulatively Scored Test
assumption that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.
Class (Category) Scoring
responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).
Ipsative Scoring
comparing a testtaker's score on one scale within a test to another scale within that same test.
Test Tryout
-test should be tried out on the same population for which it was designed
-5-10 respondents per item
-should be administered in the same manner, and have the same instructions, as the final product
What is a "Good Item" in Test Tryout
-reliable and valid
-discriminates testtakers: high scorers on the test overall answer the item correctly
Item Analysis
the nature of the item analysis will vary depending on the goals of the test developer
-among the tools test developers might employ to analyze and select items are indexes of item difficulty, item reliability, item validity, and item discrimination
Item-Difficulty Index
the proportion of respondents answering an item correctly
-For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
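The index is just a proportion, computed per item. A minimal sketch with hypothetical response data (1 = correct, 0 = incorrect):

```python
# Item-difficulty index: proportion of testtakers answering the item correctly.
# Hypothetical responses for one item across 10 testtakers.

def item_difficulty(item_responses):
    return sum(item_responses) / len(item_responses)

responses_item1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 7 of 10 correct
p = item_difficulty(responses_item1)
print(p)  # 0.7 -- within the ~.3 to .8 range suggested above
```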
Item Reliability Index
indication of the internal consistency of the scale
-Factor analysis can also provide an indication of whether items that are supposed to be measuring the same thing load on a common factor.
The Item-Validity Index
Allows test developers to evaluate the validity of items in relation to a criterion measure.
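Both the item-reliability and item-validity indexes are commonly computed as the item-score standard deviation multiplied by a correlation: the item-total correlation for reliability, the item-criterion correlation for validity. A sketch under that assumption, with entirely hypothetical data:

```python
import statistics

# Sketch of the item-reliability and item-validity indexes (hypothetical data).
# Both multiply the item's standard deviation by a Pearson correlation:
# item-total r for reliability, item-criterion r for validity.

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_index(item_scores, other_scores):
    """Item SD times the correlation of item scores with another measure."""
    return statistics.pstdev(item_scores) * pearson_r(item_scores, other_scores)

item = [1, 0, 1, 1, 0, 1]        # 1 = correct, 0 = incorrect on one item
totals = [9, 4, 8, 7, 5, 9]      # total test scores -> item-reliability index
criterion = [3, 1, 3, 2, 2, 3]   # external criterion -> item-validity index

print(round(item_index(item, totals), 3))     # item-reliability index
print(round(item_index(item, criterion), 3))  # item-validity index
```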
The Item-Discrimination Index
Indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test.
-a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
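The difference described above is the index d. A minimal sketch with hypothetical upper- and lower-group data:

```python
# Item-discrimination index d: proportion of high scorers passing the item
# minus the proportion of low scorers passing it (hypothetical data).

def discrimination_index(upper_group, lower_group):
    """upper/lower are lists of 1 (correct) / 0 (incorrect) on one item."""
    p_upper = sum(upper_group) / len(upper_group)
    p_lower = sum(lower_group) / len(lower_group)
    return p_upper - p_lower

upper = [1, 1, 1, 1, 0]  # e.g., top 27% of the total-score distribution
lower = [1, 0, 0, 0, 0]  # e.g., bottom 27%
d = discrimination_index(upper, lower)
print(round(d, 2))  # 0.6 -- the item separates high from low scorers well
```

A negative d would flag an item that low scorers pass more often than high scorers, a strong candidate for revision or removal.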
Analysis of Item Alternatives
The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
Item Characteristic Curves (ICC)
a graphic representation of item difficulty and discrimination
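One common parametric form of an ICC (not the only one) is the two-parameter logistic model, where item difficulty b shifts the curve along the ability axis and item discrimination a sets its slope. A sketch with hypothetical parameter values:

```python
import math

# Two-parameter logistic (2PL) item characteristic curve:
# P(theta) = 1 / (1 + exp(-a * (theta - b)))
# theta = ability, b = item difficulty, a = item discrimination.

def icc(theta, a, b):
    """Probability of a correct response at ability level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A hypothetical item of moderate difficulty (b = 0) and good discrimination:
for theta in (-2, 0, 2):
    print(round(icc(theta, a=1.5, b=0.0), 3))
```

At theta = b the probability of success is exactly .5; steeper curves (larger a) discriminate more sharply around that point.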
Other Considerations in Item Analysis
-guessing
-a biased test item
-speed tests
Guessing
Test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory solution to the problem has been achieved.
Item Fairness
the degree, if any, to which a test item is biased
Biased Test Item
an item that favors one particular group of examinees in relation to another when differences in group ability are controlled
Speed Tests
Item analyses of tests taken under speed conditions yield misleading or uninterpretable results: the closer an item is to the end of the test, the more difficult it may appear to be.
Qualitative Item Analysis
a general term for various nonstatistical procedures designed to explore how individual test items work.
-think aloud test administration
Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Think Aloud Test Administration
respondents are asked to verbalize their thoughts as they occur during testing.
Expert Panels
experts may be employed to conduct a qualitative item analysis
-items are examined in relation to fairness to all prospective testtakers; check for offensive language, stereotypes, etc.
Test Revision
-revision in new test development
-revision in the life cycle of a test
-the use of IRT in building and revising tests
Revision in New Test Development
-Items are evaluated as to their strengths and weaknesses - some items may be eliminated.
-Some items may be replaced by others from the item pool.
-Revised tests will then be administered under standardized conditions to a second sample
-Once a test has been finalized, norms may be developed from the data and it is said to be standardized.
Revision in the Life Cycle of a Test
Existing tests may be revised if the stimulus material or verbal material is dated, some outdated words have become offensive, the norms no longer represent the population, the psychometric properties could be improved, or the underlying theory behind the test has changed.
-In test revision the same steps are followed as with new tests (i.e., test conceptualization, test construction, test tryout, item analysis, and test revision).
Cross-Validation
Cross-validation refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
-Item validities inevitably become smaller when administered to a second sample - validity shrinkage.
Co-Validation
a test validation process conducted on two or more tests using the same sample of testtakers.
-economical for test developers
Quality Assurance during Test Revision
Test developers employ examiners who have experience testing members of the population targeted by the test. Examiners follow standardized procedures and undergo training.
-anchor protocols are used
Anchor Protocol
a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.
-A discrepancy between scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.
The Use of IRT in Building and Revising Tests
Items are evaluated on item-characteristic curves (ICC) in which performance on items is related to underlying ability.
-3 possible applications of IRT in building and revising tests
3 Possible Applications of IRT
1) evaluating existing tests for the purpose of mapping test revisions,
2) determining measurement equivalence across testtaker populations, and
3) developing item banks
YOU MIGHT ALSO LIKE...
Assessment - Chapter 8 (Test Development)
Psych Measurement 2
Psych Testing Exam 2
Final Exam Review
OTHER SETS BY THIS CREATOR
Learning and Memory Test 4: Facial Cognition
(Learning and Memory) Test 3 Review
Learning and Memory Test 2