D7.2 Test construction (cloze narratives)

ITEM RESPONSE THEORY

An (1) is conducted to determine which items to retain in the final version of a test. An (2) (p) is calculated by dividing the number of examinees who answered the item correctly by the (3). It ranges in value from (4) to (5). In general, an item difficulty level of (6) is preferred because it not only maximizes (7) between examinees of low and high ability but also helps ensure that the test has high (8). However, the optimal difficulty level is affected by the probability that an examinee can (9). For this reason, the optimal p value for true/false items is (10). An (11) index (D) is calculated by subtracting the percent of examinees in the lower-scoring group from the percent of examinees in the upper-scoring group who answered the item correctly. It ranges in value from (12) to (13). Advantages of IRT are that item parameters are (14) and performance on different sets of items or tests can be easily (15). In summary, use of IRT involves deriving an item (16) for each item that provides information on one, two, or three parameters, i.e., (17), (18), and (19).
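A minimal sketch of the two classical item-analysis statistics described above (item difficulty p and discrimination index D). The 0/1 item responses, total scores, and the upper/lower 27% grouping rule are illustrative assumptions, not part of the card.

    # Illustrative item analysis: p = proportion answering correctly, D = p(upper group) - p(lower group)
    def item_difficulty(item_scores):
        # p: number of examinees answering the item correctly divided by the total number of examinees
        return sum(item_scores) / len(item_scores)

    def discrimination_index(item_scores, total_scores, group_fraction=0.27):
        # D: percent correct in the upper-scoring group minus percent correct in the lower-scoring group,
        # with the groups defined here by total test score (a common convention, assumed for this sketch)
        n_group = max(1, int(len(total_scores) * group_fraction))
        order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
        lower, upper = order[:n_group], order[-n_group:]
        p_upper = sum(item_scores[i] for i in upper) / n_group
        p_lower = sum(item_scores[i] for i in lower) / n_group
        return p_upper - p_lower

    item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]               # made-up 0/1 responses to one item
    totals = [42, 38, 20, 45, 18, 40, 36, 22, 44, 39]   # made-up total test scores
    print(item_difficulty(item))                 # 0.7
    print(discrimination_index(item, totals))    # compares the top vs. bottom ~27% of scorers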
OVERVIEW OF RELIABILITY

Reliability is a measure of (3). A test's reliability is commonly estimated by calculating a reliability coefficient, which is a type of (4). The reliability coefficient ranges in value from (5) and is interpreted directly as a measure of (6) variability. For example, if a test has a reliability coefficient of .91, this means that (7)% of variability in obtained test scores is due to (8) variability, while the remainder reflects (9).
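A quick worked restatement of this interpretation under classical test theory, using an illustrative coefficient of .80 rather than the one in the card:

    reliability = 0.80                                 # illustrative reliability coefficient
    true_variance_pct = reliability * 100              # % of observed-score variability attributable to true-score variability
    error_variance_pct = (1 - reliability) * 100       # remainder attributable to measurement error
    print(true_variance_pct, error_variance_pct)       # roughly 80 and 20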
TEST-RETEST RELIABILITY

Test-retest reliability is assessed by administering a test to the same group of examinees at (10) and then (11) the two sets of scores. The test-retest reliability coefficient is also known as the coefficient of (12). An alternate forms reliability coefficient is calculated by administering two (13) and correlating the two sets of scores. The alternate forms reliability coefficient is also referred to as the coefficient of (14).
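The correlation step in both procedures is an ordinary Pearson r between the two sets of scores; a minimal sketch with made-up data (the scores and group size are assumptions):

    from statistics import correlation   # Pearson r; available in Python 3.10+

    # Made-up scores for the same examinees at two administrations (test-retest)
    # or on two equivalent forms (alternate forms)
    set_1 = [85, 92, 78, 88, 95, 70, 82]
    set_2 = [83, 94, 75, 90, 93, 72, 80]

    r = correlation(set_1, set_2)   # test-retest or alternate forms reliability coefficient
    print(round(r, 2))              # about .96 for these made-up scores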
INTERNAL CONSISTENCY

To assess internal consistency reliability, a test is administered once to a single group of examinees. A (15) reliability coefficient is calculated by splitting the test in half and correlating examinees' scores on the two halves. Because the size of a reliability coefficient is affected by test length, the (15) method tends to (16) a test's true reliability. Consequently, the (17) formula is often used in conjunction with (15) reliability to obtain an estimate of true reliability. Coefficient (18), another method used to assess internal consistency reliability, indicates the average (19) consistency rather than the consistency between two halves of the test. A specific mathematical formula, the (20), can be used as a substitute for coefficient (18) when test items are scored (21). (15) reliability, coefficient (18), and (20) are not appropriate for speed tests because they tend to (22) the reliability of these tests.
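A sketch of the internal-consistency computations referred to above: a split-half r corrected for the fact that each half is only half the test length, and coefficient alpha computed from item variances. The item-response matrix is made-up example data; for dichotomously scored (0/1) items this alpha computation reduces to the KR-20 result.

    from statistics import pvariance

    # Made-up response matrix: rows = examinees, columns = items (0/1 scoring)
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 0],
    ]

    def cronbach_alpha(rows):
        # alpha = k/(k-1) * (1 - sum of item variances / variance of examinees' total scores)
        k = len(rows[0])
        items = list(zip(*rows))                          # item columns
        item_vars = sum(pvariance(col) for col in items)
        total_var = pvariance([sum(r) for r in rows])     # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    def split_half_corrected(r_half):
        # Correction for a split-half r: the full test is twice as long as each half
        return (2 * r_half) / (1 + r_half)

    print(round(cronbach_alpha(responses), 2))        # about .31 for this tiny made-up matrix
    print(round(split_half_corrected(0.60), 2))       # .60 between halves -> .75 estimated full-test reliability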
INTER-RATER RELIABILITY

IRR should be assessed whenever a test is (21) scored. The scores assigned by different raters can be used to calculate a (22). For example, the (23) statistic can be used when ratings represent a (24) scale of measurement. Alternatively, percent agreement between raters can be calculated. A problem with this approach is that the resulting index of reliability can be artificially inflated due to the effects of (25).
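A sketch of the chance-corrected agreement idea for two raters using nominal ratings (Cohen's kappa), contrasted with raw percent agreement; the ratings below are made-up:

    from collections import Counter

    # Made-up nominal ratings assigned by two raters to the same 10 cases
    rater_a = ['yes', 'yes', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no']
    rater_b = ['yes', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no']

    def cohens_kappa(a, b):
        n = len(a)
        p_observed = sum(x == y for x, y in zip(a, b)) / n   # raw percent agreement
        freq_a, freq_b = Counter(a), Counter(b)
        categories = set(a) | set(b)
        # agreement expected by chance, based on each rater's marginal category frequencies
        p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
        return (p_observed - p_chance) / (1 - p_chance)

    print(round(cohens_kappa(rater_a, rater_b), 2))   # about .58; raw agreement is .80, but chance agreement (.52) is removed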
FACTORS THAT AFFECT RELIABILITY COEFFICIENTS

The magnitude of a reliability coefficient is affected by several factors. In general, the longer a test, the (1) its reliability coefficient. The (2) formula is used to estimate the effects of (3) a test on its reliability coefficient. If the new items do not represent the same (4) as the original items or are more susceptible to (5), this formula is likely to (6) the effects of lengthening the test. Like other correlation coefficients, the reliability coefficient is affected by the range of scores. The greater the range, the (7) the reliability coefficient. To maximize a test's reliability coefficient, the tryout sample should include people who are (8) with regard to the attribute(s) measured by the test. A reliability coefficient is also affected by the probability that an examinee can (9). The easier it is to (9), the (10).
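A sketch of the lengthening estimate referred to here, using the Spearman-Brown prophecy formula; the .80 reliability and the tripling/halving factors are illustrative numbers:

    def spearman_brown(r_original, length_factor):
        # Estimated reliability when a test is lengthened (or shortened) by the given factor,
        # assuming the added items are comparable in content and quality to the original items
        return (length_factor * r_original) / (1 + (length_factor - 1) * r_original)

    print(round(spearman_brown(0.80, 3), 2))     # tripling a test with r = .80 -> about .92
    print(round(spearman_brown(0.80, 0.5), 2))   # halving it -> about .67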
MEASUREMENT ERROR

While the reliability coefficient is useful for assessing (8), it does not directly indicate how much we can expect an individual examinee's obtained test score to reflect (9). For this purpose, the (10) is more useful. It is calculated by multiplying the standard deviation of the test scores by the (11) of (12), which is expressed as one minus the reliability coefficient. For example, if a test's standard deviation is 10 and its reliability coefficient is .91, the (12) is equal to (13). The (10) is used to construct a (14) around an examinee's obtained score. In terms of magnitude, the standard error of the difference between two scores is always (15) than the (12) of either score because it reflects measurement error from (16).
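A sketch of the measurement-error quantities discussed here: the standard error of measurement, a band around an obtained score, and the standard error of the difference between two scores. The SD of 15, reliability of .84, obtained score of 100, and the 1.96 multiplier for a 95% band are illustrative assumptions.

    import math

    def sem(sd, reliability):
        # Standard error of measurement: SD of the test scores times the square root of (1 - reliability)
        return sd * math.sqrt(1 - reliability)

    def confidence_interval(obtained, sd, reliability, z=1.96):
        # Band constructed around an examinee's obtained score (here the conventional 95% multiplier)
        e = z * sem(sd, reliability)
        return obtained - e, obtained + e

    def se_difference(sem_a, sem_b):
        # Standard error of the difference between two scores: always larger than either SEM,
        # because it pools the measurement error of both scores
        return math.sqrt(sem_a ** 2 + sem_b ** 2)

    print(sem(15, 0.84))                        # about 6.0
    print(confidence_interval(100, 15, 0.84))   # roughly (88.2, 111.8)
    print(se_difference(6.0, 6.0))              # about 8.49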
OVERVIEW OF VALIDITY

A test is valid when it (1). There are three main forms of validity. (2) validity is of concern whenever a test has been designed to measure one or more (3). (4) validity is important when a test will be used to measure (5), such as achievement, motivation, intelligence, or mechanical aptitude. (6) validity is of interest when a test has been designed to (7) on another measure. (8) validity reflects the extent to which a test over-(8), under-(8), or excludes the elements required to measure the construct. Content validity is usually built into a test as it is being constructed through a(n) (8) sample of items. After a test has been developed, its content validity is checked by having (9) evaluate the test in a systematic way.
CONSTRUCT VALIDITY

A test has construct validity when it has been shown that the test (1). One method for assessing construct validity is to determine if the test has both (2) and (3) validity. When a test has (4) correlations with measures that assess the same construct, this provides evidence of (2). When a test has (5) correlations with measures of unrelated characteristics, this indicates that the test has (3). The (6) matrix provides a systematic way to organize the data collected when assessing a test's (2) and (3) validity. The matrix is a table of (7). It indicates that a test has (2) validity when there are large (8). It indicates that a test has (3) validity when there are small (9) and (10).
How do you interpret a multitrait-multimethod matrix?
- Convergent validity: obtained scores for the same trait (construct) are highly correlated across methods, i.e., high monotrait-heteromethod coefficients.
- Discriminant validity: obtained scores for different traits on the same test AND for different traits on different tests are poorly correlated, i.e., low heterotrait-monomethod AND heterotrait-heteromethod coefficients (the low "triangles" on the MTMM diagram).
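A small sketch of the MTMM bookkeeping described above: each correlation is between two trait-method combinations, and its label (monotrait-heteromethod, heterotrait-monomethod, heterotrait-heteromethod) determines which validity claim it bears on. The two traits, two methods, and correlation values are made-up.

    # Made-up correlations between trait-method combinations: keys are ((trait, method), (trait, method))
    correlations = {
        (('anxiety', 'self_report'), ('anxiety', 'observer_rating')): 0.68,        # same trait, different methods
        (('sociability', 'self_report'), ('sociability', 'observer_rating')): 0.71,
        (('anxiety', 'self_report'), ('sociability', 'self_report')): 0.22,        # different traits, same method
        (('anxiety', 'self_report'), ('sociability', 'observer_rating')): 0.15,    # different traits, different methods
    }

    def mtmm_label(pair_a, pair_b):
        same_trait = pair_a[0] == pair_b[0]
        same_method = pair_a[1] == pair_b[1]
        if same_trait and not same_method:
            return 'monotrait-heteromethod (should be HIGH -> convergent validity)'
        if not same_trait and same_method:
            return 'heterotrait-monomethod (should be LOW -> discriminant validity)'
        if not same_trait and not same_method:
            return 'heterotrait-heteromethod (should be LOW -> discriminant validity)'
        return 'monotrait-monomethod (reliability diagonal)'

    for (a, b), r in correlations.items():
        print(f'{r:.2f}  {mtmm_label(a, b)}')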