Assessment - Chapter 8 (Test Development)
Terms in this set (80)
Test development
Umbrella term for all that goes into the process of creating a test.
5 stages of test development
1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision
Test conceptualization
The stage in which the idea for a test is conceived.
Test construction
A stage in the process of test development that entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building the test.
Test tryout
Once a preliminary form of the test is developed, it is administered to a representative sample of testtakers. Data on testtakers' performance on the test as a whole and on each item are collected to assist in judging which items are good as they are, which need to be revised, and which should be discarded.
Test revision
Action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement, usually based on item analysis and related information derived from the test tryout. The revised version is then tried out on a new sample of testtakers.
Some preliminary questions in test conceptualization
1. What is the test designed to measure?
2. What is the objective of the test?
3. Is there a need for this test?
4. Who will use this test?
5. Who will take this test?
6. What content will the test cover?
7. How will the test be administered?
Criterion referenced tests
individuals' scores are given meaning by comparison to a standard or criterion. They are typically used in occupational licensing.
Examples: Driver's license exam; SAT; academic skills assessment
Criterion-referenced instruments derive from:
a conceptualization of the knowledge or skills to be mastered.
Pilot work (pilot study, pilot research)
Preliminary research surrounding the creation of a prototype of the test. Items may be pilot studied to evaluate whether they should be included in the final form of the instrument. The test developer typically attempts to determine how best to measure a targeted construct.
Scaling
The process of setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured. L. L. Thurstone was at the forefront of efforts to develop methodologically sound scaling methods.
Types of scales
a. age-based scale
b. grade-based scale
c. stanine scale
Rating scale
A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. On a summative scale, the final test score is obtained by summing the ratings across all the items.
Likert scale
A type of summative rating scale, usually used to scale attitudes. Each item presents the testtaker with five alternative responses (e.g., strongly agree through strongly disagree). Usually reliable.
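The summing described above can be sketched in a few lines. This is a minimal illustration, not from the chapter: the 1-to-5 ratings and the reverse-keyed item are hypothetical, and the function name is my own.

```python
def likert_score(ratings, reverse_keyed=()):
    """Summative (Likert-type) scoring: sum the 1-5 item ratings.
    Reverse-keyed items (hypothetical here) are flipped to 6 - rating."""
    return sum(6 - r if i in reverse_keyed else r
               for i, r in enumerate(ratings))

ratings = [4, 5, 2, 3, 5]                          # one testtaker's item ratings
total = likert_score(ratings, reverse_keyed={2})   # item 2 flips to 6 - 2 = 4
```

With the sample ratings above, the scale score is 4 + 5 + 4 + 3 + 5 = 21.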
Method of paired comparisons
Testtakers are presented with pairs of stimuli, which they are asked to compare, selecting one according to some rule (e.g., the statement they agree with more). Katz et al., for example, could have used this method on their scale.
Comparative scaling
A method of sorting that entails judgments of a stimulus in comparison with every other stimulus on the scale.
Categorical scaling
Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
Guttman scale
A scaling method that yields ordinal-level measures. Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured, so that all respondents who agree with the stronger statements also agree with the milder ones (anyone who agrees with item a should also agree with items b, c, and d).
Scalogram analysis
The data resulting from a Guttman scale are analyzed through this item-analysis procedure and approach to test development, which involves a graphic mapping of a testtaker's responses.
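The cumulative pattern that scalogram analysis looks for can be checked mechanically. A minimal sketch, assuming items are ordered from weakest to strongest expression and responses are coded 1 (endorse) / 0 (reject); the function name is illustrative:

```python
def is_guttman_consistent(pattern):
    """True if the response pattern fits a perfect Guttman (cumulative) scale:
    once a respondent rejects (0) an item, every stronger item that follows
    is also rejected. Items are assumed ordered weakest -> strongest."""
    rejected = False
    for response in pattern:
        if response == 0:
            rejected = True
        elif rejected:   # an endorsement after a rejection breaks the pattern
            return False
    return True
```

For example, `[1, 1, 1, 0, 0]` fits the cumulative pattern, while `[1, 0, 1, 0, 0]` does not.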
Item pool
The reservoir or well from which items will or will not be drawn for the final version of the test.
Item format
Variables such as the form, plan, structure, arrangement, and layout of individual test items. Two types: the selected-response format and the constructed-response format.
Selected-response format
Requires testtakers to select a response from a set of alternative responses.
Constructed-response format
Requires testtakers to supply or create the correct answer, not merely select it.
Elements of a multiple-choice item
1. a stem, 2. a correct alternative or option, 3. several incorrect alternatives or options, variously referred to as distractors or foils.
Matching item
The testtaker is presented with two columns: premises on the left and responses on the right, and must determine which response is best associated with which premise.
Binary-choice item
A multiple-choice item that contains only two possible responses (e.g., true-false). A good binary-choice item contains a single idea, is not excessively long, and is not subject to debate; the correct response must undoubtedly be one of the two choices.
Completion item
Requires the examinee to provide a word or phrase that completes a sentence; also known as a short-answer item.
Item bank
A relatively large and easily accessible collection of test questions from which items can be drawn for a test.
Computerized adaptive testing (CAT)
an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items.
Floor effect
Refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured.
Designing an item bank
1. Items (a. acquisition and development, b. classification, c. management).
2. Tests (a. assembly, b. administration, scoring, and reporting, c. evaluation).
3. System (a. acquisition and development, b. software and hardware, c. monitoring and training, d. access and security).
4. Use and Acceptance (a. general, b. instructional improvement, c. adaptive testing, d. certification of competence, e. program and curriculum evaluation, f. testing and reporting requirements imposed by external agencies).
5. Costs (a. cost feasibility, b. cost comparisons).
Ceiling effect
The diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured.
Item branching
The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
Class scoring (category scoring)
testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way.
Ipsative scoring
Comparing a testtaker's score on one scale within a test with that testtaker's score on another scale within the same test.
The test should be tried out on people who are similar in critical respects to the people for whom the test was designed, and executed under conditions as identical as possible to the conditions under which the standardized test will be administered (all instructions, etc.).
Item analysis
After the first draft of the test has been administered to a representative group of examinees, the test developer analyzes test scores and responses to individual items. The test data can potentially undergo different types of statistical scrutiny.
Item analysis tools
Item-difficulty index
Obtained by calculating the proportion of the total number of testtakers who got the item right. An item with a mid-range difficulty level is likely to be "good."
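The proportion-correct calculation above is simple enough to sketch directly. The 0/1 response vector below is made up for illustration:

```python
def item_difficulty(responses):
    """Item-difficulty index p: the proportion of testtakers who answered
    the item correctly (responses coded 1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

# 7 of 10 hypothetical testtakers got the item right -> p = 0.7
p = item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
```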
Item-endorsement index
This statistic provides not a measure of the percentage of people passing an item but a measure of the percentage of people who said yes to, agreed with, or otherwise endorsed the item.
Item-reliability index
Provides an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency. It is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
Factor analysis
A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).
Item-validity index
A statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
Item-validity index can be calculated once these factors are known:
*the item-score standard deviation.
*the correlation between the item score and the criterion score.
Item-discrimination index
A measure of item discrimination, symbolized by a lowercase italic d. This estimate of item discrimination compares performance on a particular item by testtakers in the upper and lower regions of a distribution of continuous test scores.
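One common way to compute d is as the difference in passing rates between equal-sized upper and lower scoring groups (often the top and bottom 27% by total score). A minimal sketch; the counts below are hypothetical:

```python
def discrimination_index(upper_pass, lower_pass, group_size):
    """Item-discrimination index d = (U - L) / n, where U and L are the
    numbers passing the item in the upper and lower groups (equal size n)."""
    return (upper_pass - lower_pass) / group_size

# 20 of 25 high scorers vs. 8 of 25 low scorers pass the item -> d = 0.48
d = discrimination_index(upper_pass=20, lower_pass=8, group_size=25)
```

A d near +1 means the item sharply separates high from low scorers; a negative d flags an item that low scorers pass more often than high scorers.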
Item-characteristic curve (ICC)
A graphic representation of item difficulty and discrimination.
Three criteria that any correction for guessing must meet, as well as other interacting issues that must be addressed:
1. A correction must recognize that a guess is not typically made on a totally random basis.
2. Correction for guessing must also deal with the problem of omitted items.
3. Lucky guessing
Item fairness
Refers to the degree, if any, a test item is biased.
Biased test item
an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Item analysis of tests taken under speed conditions yields misleading or uninterpretable results; discrimination levels appear higher for items toward the end of the test.
Qualitative methods
Techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Qualitative item analysis
a general term for various nonstatistical procedures designed to explore how individual test items work. Compares individual test items to each other and to the test as a whole.
"think aloud" test administration
A qualitative research tool proposed by Cohen et al., designed to shed light on the testtaker's thought processes during the administration of a test.
Expert panels
May provide qualitative analysis of test items. In the test development process, a group of people knowledgeable about the subject matter being tested and/or the population for whom the test was designed, who can provide input to improve the test's content, fairness, and other qualities. Used in the process of test development to screen test items for possible bias.
Sensitivity review
A study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
Some forms of content bias
Status, stereotype, familiarity, offensive choice of words, other.
A stage in new test development: the test developer acts judiciously on all the information collected and molds the test into its final form. Some items from the original item pool will be eliminated and others will be rewritten.
Tests are deemed to be due for revision if the following exist:
1. Stimulus materials look dated and current testtakers cannot relate to them.
2. Verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers.
3. As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate.
4. Test norms are no longer adequate as a result of group membership changes in the population of potential testtakers.
5. Test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, so an age extension of the norms is necessary.
6. Reliability or validity of the test, as well as the effectiveness of individual test items, can be significantly improved by revision.
7. Theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.
Cross-validation
The revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Validity shrinkage
The decrease in item validities that inevitably occurs after cross-validation of findings; it is expected and viewed as integral to the test development process.
Co-validation
A test validation process conducted on two or more tests using the same sample of testtakers.
Co-norming
Co-validation used in conjunction with the creation of norms or the revision of existing norms.
Anchor protocol
A test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.
Scoring drift
A discrepancy between the scoring in an anchor protocol and the scoring of another protocol.
Item response theory (IRT)
Item statistics are independent of the samples to which the test has been administered. Test items can be matched to ability levels. Facilitates advanced psychometric tools and methods.
Classical test theory
Smaller sample sizes are required for testing. Utilizes relatively simple mathematical models. The assumptions underlying this theory are weak, allowing it wide applicability. Most researchers are familiar with the basic approach, and many data analysis and statistics-related software packages are built from this perspective.
Differential item functioning (DIF)
An item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait.
Test developers scrutinize group-by-group item response curves, looking for what are termed DIF items.
DIF items
Those items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership.
Developing item banks
Each of the test items assembled as part of an item bank has undergone rigorous qualitative and quantitative evaluation. Many item-banking efforts begin with the collection of appropriate items from existing instruments.
Norm referenced tests
individuals' scores are given meaning by comparison to a normative sample.
Examples: ACT, GRE, WAIS III, Iowa Tests of Basic Skills
A testtaker is presumed to have more or less of the characteristic measured by a (valid) test as a function of the test score: the higher or lower the score, the more or less of the characteristic he or she presumably possesses.
Writing items for computer administration
The most commonly used model is the cumulative model, due in large part to its simplicity and logic. Typically, the rule in a cumulatively scored test is that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic the test purports to measure.
Testtakers may predict or presume the correct response rather than know it. Test developers should plan corrections for guessing, which poses methodological problems for the test developer.
Methods of evaluating item bias:
a) noting differences between item-characteristic curves, b) noting differences in item-difficulty levels, c) noting differences in item-discrimination indexes.
Test Revision in the Life Cycle of an Existing Test includes:
cross-validation, co-validation, and quality assurance during test revision.
The test revision process typically includes all of the steps of the initial test development process.
Idea for a test may come from:
social need, review of the available literature, and common sense appeal.