Final Exam Review
PSY721 - Advanced Tests and Measurements Study Guide for the final exam.
Reliability, in a broad statistical sense, is synonymous with:
A source of error variance may take the form of:
item sampling, test takers' reactions to environment-related variables such as room temperature and lighting, and test taker variables such as amount of sleep the night before a test, amount of anxiety, or drug effects (all of these.)
Which type of reliability estimate is obtained by correlating pairs of scores from the same person (or people) on two different administrations of the same test?
a test-retest estimate
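A test-retest estimate is, in practice, a correlation between the two sets of scores. The sketch below is a minimal illustration using the Pearson correlation; the function name and the score lists are made up for the example and are not from the course material.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people on two administrations.
first_administration = [10, 12, 15, 18, 20]
second_administration = [11, 13, 14, 19, 21]

test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 3))
```

A value near 1 means people kept roughly the same rank order across the two administrations.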
A reliability coefficient is:
an index, a ratio of the total variance attributed to true variance, and unaffected by a systematic source of error (All of these.)
What is the difference between alternate forms and parallel forms of a test?
Alternate forms do not necessarily yield test scores with equal means and variances.
Which of the following types of reliability estimates is the most expensive due to the costs involved in test development?
An estimate of test-retest reliability is often referred to as a coefficient of stability when the time interval between the test and retest is more than:
As the reliability of a test increases, the standard error of measurement:
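The relationship can be checked with the classical test theory formula SEM = SD * sqrt(1 - r). The sketch below assumes a score standard deviation of 15 purely for illustration; the function name is hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values only: same SD, two different reliability coefficients.
sem_lower_reliability = standard_error_of_measurement(15, 0.84)
sem_higher_reliability = standard_error_of_measurement(15, 0.96)
print(sem_lower_reliability, sem_higher_reliability)
```

With the standard deviation held constant, the higher reliability coefficient yields the smaller standard error of measurement.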
Which type of reliability estimate would be appropriate only when evaluating the reliability of a test that measures a trait that is relatively stable over time?
Which of the following is true of systematic error?
It has no effect on the reliability of a measure.
Computer-scorable items have tended to eliminate error variance due to:
Which of the following might lead to a decrease in test-retest reliability?
the passage of time between the two administrations of the test, coaching designed to increase test scores between the two administrations of the test, and practice with similar test materials between the two administrations of the test (All of these.)
If items from a test are measuring the same trait, estimates of reliability yielded from KR-20 will typically be ________ as compared to estimates from split-half methods.
Which of the following is TRUE for estimates of alternate- and parallel-forms reliability?
Two test administrations with the same group are required, test scores may be affected by factors such as motivation, fatigue, or intervening events like practice, learning, or therapy, and item sampling is a source of error variance (All of these.)
If traditional measures of reliability are applied to criterion-referenced tests, the reliability estimates will likely be:
Test-retest estimates of reliability are referred to as measures of ________, and split-half reliability estimates are referred to as measures of ________.
stability; internal consistency
For a heterogeneous test, measures of internal-consistency reliability will tend to be ________ compared with other methods of estimating reliability.
Which of the following factors may influence a split-half reliability estimate?
fatigue, anxiety, and item difficulty (all of these.)
KR-20 is the statistic of choice for tests with which types of items?
multiple-choice and true-false (all of these.)
The Spearman-Brown formula is used for:
correcting for one half of the test by estimating the reliability of the whole test, determining how many additional items are needed to increase reliability up to a certain level, and determining how many items can be eliminated without reducing reliability below a predetermined level (all of these.)
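The uses listed above all follow from one formula, r_new = n * r / (1 + (n - 1) * r), where n is the factor by which test length changes. The sketch below is a minimal illustration; the function names and the example values are assumptions made for the example.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """Factor by which the test must be lengthened to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))

# Correcting a half-test correlation of .60 to estimate whole-test reliability (n = 2).
whole_test = spearman_brown(0.60, 2)

# How much longer must a test with reliability .75 be to reach .90?
factor = length_factor_needed(0.75, 0.90)
print(whole_test, factor)
```

A length factor below 1 applies the same formula to shortening a test, which is how one estimates how many items can be dropped before reliability falls under a chosen floor.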
Typically, adding items to a test will have what effect on the test's reliability?
Reliability will increase.
Which of the following is NOT an acceptable way to divide a test when using the split-half reliability method?
Assign easy items to one half of the test and difficult items to the other half.
Coefficient alpha is appropriate to use with all of the following test formats EXCEPT:
essay exam with no partial credit awarded.
Which of the following is TRUE about coefficient alpha?
It is a characteristic of a particular set of scores, not of the test itself.
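Because alpha is computed from the item and total-score variances of one particular data set, a different sample of test takers can yield a different value. The sketch below uses the usual formula alpha = k/(k-1) * (1 - sum of item variances / total-score variance), with population variances as one common convention; the function name and data are invented for illustration.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """scores: list of rows, one per test taker, each row a list of item scores."""
    k = len(scores[0])                       # number of items
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: three items that rank three test takers identically.
perfectly_consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_consistent = coefficient_alpha(perfectly_consistent)

# A second hypothetical sample on the same three items, with less consistent responses.
less_consistent = [[1, 3, 2], [2, 2, 3], [3, 2, 3], [3, 3, 1]]
alpha_other_sample = coefficient_alpha(less_consistent)
print(alpha_consistent, alpha_other_sample)
```

The same three items produce different alpha values in the two samples, which is the sense in which alpha describes a set of scores rather than the test itself.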
A police officer mistakenly records the blood alcohol level of a suspected drunk driver after administering a breathalyzer test. This mistake is most related to which type of reliability?
A coefficient alpha over .9 may indicate that:
the items in the test are redundant.
Which best conveys the meaning of an inter-scorer reliability estimate of .90?
Ninety percent of the variance in the scores assigned by the scorers was attributed to true differences and 10% to error.
If a time limit is long enough to allow test takers to attempt all items, and if some items are so difficult that no test taker is able to obtain a perfect score, then the test is referred to as a ________ test.
If a test is homogeneous:
it is functionally uniform throughout, it will likely yield a high internal-consistency reliability estimate compared with test-retest, and it would be reasonable to expect a high degree of internal consistency (all of these.)
Which type(s) of reliability estimates would be most appropriate for a measure of heart rate?
Typically, speed tests:
contain items of a uniform difficulty level.
Which type(s) of reliability estimates would be appropriate for a speed test?
test-retest, alternate-form, and split-half from two independent testing sessions (all of these.)
Generalizability theory is most closely related to:
In classical test theory, there exists only one true score. In Cronbach's generalizability theory, how many of these true scores exist?
many, depending on the number of different universes
Traditional measures of reliability are inappropriate for criterion-referenced tests because variability:
is minimized with criterion-referenced tests.
A test is considered valid when the test:
measures what it purports to measure.
Face validity refers to:
the appearance of relevancy of the test items.
Which is NOT a method of evaluating the validity of a test?
evaluating the percentage of passing and failing grades on the test
Predictive and concurrent validity can be subsumed under:
may influence the way the test-taker approaches the situation, relates more to what the test appears to measure than what the test may actually measure, and has received little attention and is given short-shrift as compared to other indices of validity (all of these.)
Which assessment technique has the MOST face validity?
administering a word processing test to a person applying to be a word processor
Relating scores obtained on a test to other test scores or data from other assessment procedures is typically done in an effort to establish the __________ validity of a test.
An instructor announces that an examination will cover the topics of reliability and validity. A student boasts that he will read and study only the material on reliability. In fact, all the test questions are only on reliability. The best conclusion a student of assessment could draw from this is that:
the examination lacked content validity.
Before constructing a comprehensive final examination, your instructor reviews the objectives of the course, the textbook, and all lecture notes. Your instructor is making an effort to maximize the __________ validity of the final examination.
Lawshe devised a method for determining agreement among raters or judges who rate items on how essential they are. This method provides a way to quantify what type of validity?
In calculating the content validity ratio, panelists are asked to determine:
if the skill or knowledge measured by the item is essential.
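Lawshe's content validity ratio turns those panelist judgments into a number: CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item "essential" and N is the panel size. The sketch below is a minimal illustration with a hypothetical ten-person panel.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: ranges from -1 (no one says essential) to +1 (everyone does)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 judges rating a single item.
cvr_eight_of_ten = content_validity_ratio(8, 10)
cvr_half = content_validity_ratio(5, 10)
cvr_all = content_validity_ratio(10, 10)
print(cvr_eight_of_ten, cvr_half, cvr_all)
```

A CVR of 0 means exactly half the panel judged the item essential; positive values mean more than half did.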
A standard against which a test or test score is evaluated is known as:
The minimum value of a content validity ratio necessary to be statistically significant at the .05 level is dependent on:
the number of panelists judging the items.
Which may best be viewed as varieties of criterion-related validity?
concurrent validity and predictive validity
The form of criterion-related validity that reflects the degree to which a test score is correlated with a criterion measure obtained at the same time that the test score was obtained is known as:
The form of criterion-related validity that reflects the degree to which a test score correlates with a criterion measure that was obtained some time subsequent to the test score is known as:
A key difference between concurrent and predictive validity has to do with:
the time frame during which data on the criterion measure is collected.
Which is an example of a criterion?
achievement test scores, success in being able to repair a defective toaster, and student ratings of teaching effectiveness (all of these.)
An index of utility can be distinguished from an index of reliability and an index of validity in that an index of utility can tell us something about:
the practical value of the information derived from what a test measures.
sets a ceiling on test utility.
One of the noneconomic benefits of a diagnostic test used to make decisions about involuntary hospitalization of psychiatric patients is a benefit to:
Costs associated with testing include all of the following EXCEPT:
return on investment.
The end-point of a utility analysis is typically an educated decision about:
which of many possible courses of action is optimal.
A utility analysis is conducted using:
expectancy tables, Naylor-Shine tables, and Taylor-Russell tables (All of these.)
If targeted test-takers for a particular test consistently fail to follow the directions for taking the test then:
the test could still have great utility and the test could still be valid (both of these.)
Validity is to ____________ as utility is to ____________.
A potential noneconomic benefit of a well-run evaluation program is:
increase in quantity and quality of workers' on-the-job performance, decrease in time it takes to train new workers, and reduction in the number of workplace accidents (All of these.)
The Angoff method of setting cutting scores relies heavily on:
the judgment of experts.
The "Achilles heel" of the Angoff method is:
A hospital uses a compensatory model of selection in hiring surgeons. In their hiring evaluations, ratings regarding past safety record are given more weight than ratings regarding the surgeon's "bedside manner." From this, one could reasonably conclude that the people who are in charge of hiring surgeons believe that:
bedside manner is less important compared to surgical safety.
The term item-mapping refers to an IRT-based method of:
setting cut scores that entails an ordering or histographic representation of test items.
Which of the following is a direct economic cost that could result as a consequence of NOT evaluating personnel for employment positions within a large corporation?
the cost of lawsuits against the corporation
The idea for a new test may come from:
social need, review of the available literature, and common sense appeal (all of these.)
This term is used to refer to the preliminary research surrounding the creation of a prototype of a test:
pilot work, pilot study, and pilot research (all of these.)
Often used for the purpose of licensing persons in professions, these tests are called:
Likert scales measure attitudes using continuums. A continuum of items measuring ___________ could be used for a Likert scale.
like it or not, agree/disagree, and approve to do not approve (All of these.)
Test items that contain alternatives with five points ranging from "strongly agree" to "strongly disagree" are characterized as using this approach to scaling:
typically are constructed so that agreement with one statement may predict agreement with another statement.
Which is an example of the selected-response item format?
a multiple-choice item
Having a large item pool available during test revision is:
an advantage because poor items can be deleted in favor of the good items.
A well-written true-false item:
has a correct response that is veritably true or false, and not subject to debate.
Computer-adaptive testing has been found to:
reduce by as much as half the number of test items administered.
Item branching refers to:
administering certain test items on a test depending on the test-takers' responses to previous test items.
Which statement is TRUE of the test tryout phase of test construction?
Test conditions should be as similar to the actual administration as possible.
The item-validity index is key in determining:
An item-difficulty index of 1 occurs when:
all examinees answer the item correctly.
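The item-difficulty index is simply the proportion of examinees who answer the item correctly. The sketch below assumes responses are coded 1 for correct and 0 for incorrect; the function name and response lists are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

# Hypothetical responses from four examinees to three different items.
everyone_correct = item_difficulty([1, 1, 1, 1])
half_correct = item_difficulty([1, 0, 1, 0])
no_one_correct = item_difficulty([0, 0, 0, 0])
print(everyone_correct, half_correct, no_one_correct)
```

Because it is a proportion, the index cannot fall below 0 or rise above 1.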
The higher the item-difficulty index, the ________ the item.
In item analysis, the term item endorsement refers to the percent of test-takers who:
indicate that they agree with a particular item.
An item-reliability index provides a measure of a test's:
An item-difficulty index can range from ________ to ________.
0; 1
In Sternberg's study of the characteristics of academic intelligence, laypeople stressed the "interpersonal and social aspects," whereas experts stressed:
The test that launched the testing movement in the United States was the ______ test.
Neisser argued that intelligence:
cannot be explicitly defined.
What conclusion concerning intelligence could reasonably be drawn based on the 1921 symposium published in the Journal of Educational Psychology?
Experts had a multitude of definitions of intelligence.
Binet believed that the primary purpose of an intelligence test was to assist the test user in:
Galton's conception of intelligence focused on:
The Wechsler tests of intelligence:
measure more than two factors.
According to Wechsler, as cited in the text, intelligence should be conceived as a __________ capacity that is best measured by measuring ______________ abilities.
global; qualitatively differentiable
The Stanford-Binet-5 is based on which theory?
Cattell-Horn-Carroll theory of intellectual abilities.
Binet, Wechsler, and Piaget would most likely agree with which of the following statements?
"Heredity and environment interact to influence the development of intelligence, but a person may not exceed his or her genetic potential."
According to Wechsler's approach to cognitive assessment of adults and children, which of the following is TRUE?
Similar tasks may be used on different tests, but the actual content of the items will differ at different age levels.
The WPPSI-III is used to measure the intelligence of children from ages ________ through ________.
The concepts of social intelligence, concrete intelligence, and abstract intelligence are collectively best associated with which theorist?
Which of the following is NOT true of Piaget's stages?
The stages were adapted from Binet's work with children.
According to Piaget, a form of cognitive structure or organization is referred to as:
Which statement is NOT true of Cattell's two-factor theory of intelligence?
Crystallized intelligence is relatively culture-free.
Who first hypothesized that the proportion of the variance that a number of tests have in common accounts for a general factor of intelligence?
Logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal, and intrapersonal intelligence are all associated with which theory of intelligence?
Crystallized intelligence includes:
application of general knowledge.
Which of the following best characterizes the basis of CHC theory?
According to Howard Gardner, the ability to form an accurate and realistic view of oneself would be referred to as what type of intelligence?
Spearman's g factor refers to:
what different intelligence tests have in common.
The best measure of "intelligence" in very young children could probably be obtained by:
assessment of sensorimotor skills.
In discussing the role of personality in the measured intelligence of infants, the term ________ is used.
Which is a technique or method used to minimize cultural bias in tests?
minimized verbal instruction, use of teaching items, and use of sample items (All of these.)
Public Law 95-561 defines giftedness with reference to:
creativity, leadership ability, and intellectual ability (All of these.)
Since the 1921 Symposium on Intelligence, researchers and theorists have agreed that:
None of these (They didn't agree on anything.)
Children's intelligence is assessed primarily for:
educational placement and planning.
A child is administered an IQ test at age 5 and another at age 10. The reported score at age 10 is much higher than the reported score at age 5. This may be because:
the child's IQ naturally unfolded with maturation, the child is receiving an excellent education at school and at home, and the examiner used a different IQ test that assesses different abilities (all of these.)
A child's IQ test score may be influenced by:
the person's temperament, the IQ test administered, and environmental stressors such as divorced parents (all of these.)
The Flynn effect is characterized by:
an average rise in measured intelligence each year from the year a test was normed.
Which of the following is TRUE regarding the stability of intelligence?
Intelligence is generally stable through adulthood.
Which is a reasonable conclusion regarding our current state of knowledge regarding intelligence?
There exists widespread disagreement on the definition of intelligence.
Starting with moderately difficult test items and then giving easier or harder items, depending on the test-taker's performance, is termed:
A ceiling level refers to the:
point at which a subtest is discontinued.
On the Wechsler tests of intelligence, the Full Scale IQ has a mean of ________________ and a standard deviation of _______________.
The WISC-IV is appropriate for:
children ages 6-16.
Group intelligence tests:
are efficient and cost-effective, can be useful as screening instruments, and can be useful for research purposes (all of these.)
Compared with individually administered intelligence tests, group intelligence tests:
are more psychometrically sound, have a higher degree of predictive validity, and have the advantage in terms of cost efficiency (all of these.)
Children deemed to be at risk are:
preschool children who may not be ready for school, and preschool children with documented difficulties in one or more psychological, social, and academic areas requiring intervention (both of these.)
Psychoeducational test batteries are designed to measure:
ability and achievement.
If John earns a full-scale IQ of 90 on the WISC-IV:
John scored at the low end of the average range.
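To see roughly where a score of 90 falls, it can be converted to a percentile. The sketch below assumes the conventional deviation-IQ metric (normally distributed, mean 100, standard deviation 15); those parameters are an assumption of this example, not values stated in this set.

```python
from statistics import NormalDist

# Assumed deviation-IQ metric: mean 100, standard deviation 15.
iq_distribution = NormalDist(mu=100, sigma=15)

# Proportion of the norm group expected to score at or below 90.
percentile_of_90 = iq_distribution.cdf(90)
print(round(percentile_of_90 * 100))
```

Under these assumptions a score of 90 sits about two-thirds of a standard deviation below the mean, at roughly the 25th percentile.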
Normative information is available in the test's manual for WAIS-IV test-takers:
as old as 90 years, 11 months.
The fifth edition of the Stanford-Binet Intelligence Scale was based on which theory of intelligence?
the CHC model
How many people were in the standardization sample for the fifth edition of the Stanford-Binet Intelligence Scale?
When administering an individual test of intelligence, the examiner is alert to:
cues that the examinee is not alert, how the examinee copes with frustration, and the cooperative level of the examinee (All of these.)
Which would NOT be considered extra-test behavior on the part of a test-taker?
responding to the examiner's questions
Stanford-Binet Full Scale scores are converted into nominal categories designated by certain cutoff boundaries. For example, an SB5 measured IQ in the range of 110 to 119 falls into the __________ category.
Which of the following is NOT a variable assessed as part of an APGAR evaluation?
An instrument used to identify which children should receive a more comprehensive evaluation is, most likely, a ______ instrument.
The history of personality types dates at least as far back as the days of:
A personality trait:
is relatively enduring, varies within and between individuals, and is distinguishable (All of these.)
Personality tests are used for:
evaluating influences on health, evaluating influences on academic performance, and planning psychotherapeutic interventions (All of these.)
Which BEST describes what is typically measured in personality assessment?
traits and states.
On the Self-Directed Search, terms such as Artistic, Enterprising, and Investigative are examples of:
Neuroticism, Extraversion, Openness, Agreeableness, and Conscientiousness. These variables are all measured by which personality assessment instrument?
The Big 5 was developed by:
McCrae and Costa
Projective tests are _____ methods of personality assessment.
are increasingly becoming norm-referenced
The Rorschach test:
continues to be a widely used clinical tool, despite its questionable validity.
Behavioral assessment tends to focus on:
Which of the following is TRUE of behavioral assessment?
The frequency, intensity, or duration of the behavior is generally specified.
Which is NOT a quantifiable definition of a target behavior?
the number of seconds Johnny spends daydreaming during his social studies class.
A culturally sensitive psychological assessment includes sensitivity to which of the following?
acculturation and language, personal identity, and values and worldview (all of these.)
The DSM-IV has _____ axes:
A clinical psychologist would be LEAST likely to use individually administered tests:
to evaluate and counsel clients regarding potential career choices.
The DSM-IV-TR is a diagnostic system that is used by psychologists:
to diagnose patients, for insurance reimbursement purposes, and for research purposes (all of these.)