Writing Test Items
Introduction to test items
Because in school we generally receive instruction before taking the test, intuitively we tend to think of the teacher writing the instruction before writing the test as well. This is what often happens in practice, outside the discipline of instructional systems design (ISD). The practice of writing the test after writing the instruction, coupled with the lack of a system for writing sound test items, usually has the unintended consequence of testing learners primarily on fact-type knowledge rather than on the full range of performances the instruction is intended to develop.
In the ISD approach we develop assessment instruments--and write specific test items or assessment exercises--before developing the instruction. This helps ensure that both the test and the instruction match the full range of performances indicated in your objectives. Remember that we have based our objectives on real-world performance, derived from the optimals identified during the analysis process. When test items are written to match these real-world objectives, we increase the chances that performance on our test will equate with performance in the real world (i.e., on the job).
In this module we will look at ways to measure different types of performances, and offer a system for developing test items that match the defined performances.
The major points to be covered in this chapter include the following:
- Test items are used to ensure learning has taken place. The writing of test questions should be done before the instructional material is created, and matched precisely to the previously stated objectives.
- There are two broad categories of tests: norm-referenced tests compare learners to each other, and criterion-based tests match each learner to pre-specified criteria. Many assessment situations combine both types of questions.
- Instructional designers can be assisted in the writing of appropriate test questions with a seven-step heuristic presented in the chapter.
- Different types of knowledge (facts, concepts, procedures, and principles), and different types of performances (remember and apply), require different types of test questions.
- Although scoring is often done by selecting a single correct answer, more intricate tests use checklists or rubrics to assess comprehension.
In this module
- Norm referenced and criteria referenced tests
- How many test items should I write?
- A seven-step approach to test item writing
- Matching test items to objectives
Norm referenced and criteria referenced tests
Remember 8th grade geography class in which Coach Andersen urged you on to higher performance by grading the class on a curve? The "curve" he had in mind was the "normal" curve--a graph of the normal distribution, or "bell-shaped" curve (Figure 1).
Figure 1. The normal distribution of test scores, or "bell-shaped" curve.
His reasoning seemed to make sense, didn't it? The idea was that in any group (like your class) that reasonably represented the population as a whole, any given performance test would turn up a few stars, a whole lot of plain folks somewhere in the middle, and a few poor performers bringing up the rear. It even sounds like a description of your class, if you think back on it, doesn't it? There were a few terrific spellers, lots of OK spellers, and a few kids who couldn't spell the word "I".
Lo and behold, when Coach gave a test, there were always a few low scores, a few excellent scores, and a whole bunch of in-between scores. All he had to do was arbitrarily assign the highest actual score to the right end of the curve and the lowest actual score to the left end of the curve. He could assign cutoff points for the various letter grades, and, presto, a fair assessment of the class. Your performance was compared to that of your classmates, and the grade you earned was plotted based upon the distribution. This is norm-referenced testing, based upon the normal curve, that compares your performance to that of the larger group.
You bought into the idea for two reasons. First, it almost guaranteed a passing grade. Only a few poor losers at the low end of the curve would get left out. Second, it made it easier to get a higher grade. If everyone missed Question 6 on the final exam, that lowered the top of the curve and brought the higher grades closer to your score.
It seemed so fair, so… right. What this type of norm referenced assessment doesn't take into account, however, is whether you, anyone else in your class, or even Coach, actually accomplished your goals for the course. What if everyone, for instance, did miss Question 6? It means that no one learned whatever Question 6 was supposed to test, that Coach didn't teach you about that very well, or that Coach didn't write the test item very well. On the other hand, what if everyone answered at least 93% of the questions correctly? In a normative system, that would mean that a 93% is "failing," and a 96% is "average." Coach probably fudged the curve in that case.
If no one learned it (whether or not Coach taught it well), then those "A's" don't mean much with respect to what your parents or the school board hoped you were going to learn in geography class. If, on the other hand, everyone demonstrated that they had learned it, then those "C" or lower grades meant little. In any case, by grading you on a curve, Coach precluded any feedback you or he could have obtained either about how well you had mastered the goals of the course or how well he was teaching or assessing you.
Many of our educational institutions, public and private, are designed at least in part to operate within a norm referenced framework. Concerns about "grade inflation," where "too many students are getting A's," pervade all levels of education. This assumes a normative system. Normative assessment tells kids that not all students can learn, and not everyone can achieve. It tells teachers they can't design instruction that will serve all their students--inevitably, some students must fail, or they're not doing their job.
The alternative to norm-referenced assessment is criteria-referenced assessment. It is also sometimes called objective-referenced assessment, because it is tightly coordinated with instructional objectives. Criteria-referenced assessment compares a learner's actual performances with those spelled out in the instructional objectives. Mastery is defined as a specific level of individual performance rather than comparison with other learners. If everyone demonstrates mastery, everyone gets an "A." The learners have accomplished their goals, and the instructor has accomplished her goals. Criteria-referenced assessment might be a multiple choice examination, but it also could be some type of performance-based exercise in which the learner "does" the performance and is assessed using a rubric (e.g., your performance analysis assignment).
Notice that with criteria-referenced assessment, we don't always assume that the instruction is perfect. If individuals can't perform according to pre-established criteria, the responsibility may lie either with them or with the instruction itself, or perhaps with both. You may need to train the learners again or you may need to revise the instruction. This is in stark contrast to norm-referenced assessment.
One way to distinguish between norm-referenced and criteria-referenced tests is to ask whether you are basing individual assessment on individual performance or on comparison with group performance. In practice, almost all assessment systems contain elements of both. The ISD model assumes an emphasis on criteria-referenced assessment. That is why we spend so much time defining outcomes--our instructional objectives. The remainder of this chapter describes how to design criteria-referenced assessment tools and how to write criteria-referenced test items.
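To make the contrast concrete, here is a minimal Python sketch that grades one set of scores both ways. The z-score cutoffs in curve_grade and the 90% mastery cutoff in criterion_grade are illustrative assumptions, not values from this module; the scores reproduce the scenario in which everyone answers at least 93% of the questions correctly.

```python
from statistics import mean, pstdev

# Hypothetical z-score cutoffs for grading on a curve (an assumption for
# illustration; Coach could have drawn the lines anywhere).
CURVE_CUTOFFS = [(1.5, "A"), (0.5, "B"), (-0.5, "C"), (-1.5, "D")]


def curve_grade(score, class_scores):
    """Norm-referenced: the grade depends on where the score falls in the class."""
    spread = pstdev(class_scores)
    if spread == 0:
        return "C"  # identical scores: everyone sits at the middle of the curve
    z = (score - mean(class_scores)) / spread
    for cutoff, letter in CURVE_CUTOFFS:
        if z >= cutoff:
            return letter
    return "F"


def criterion_grade(score, mastery_cutoff=90):
    """Criterion-referenced: the grade depends only on the pre-set standard."""
    return "A" if score >= mastery_cutoff else "not yet mastered"


scores = [93, 94, 95, 96, 96, 97, 98, 99]
for s in sorted(set(scores)):
    print(s, curve_grade(s, scores), criterion_grade(s))
```

On the curve, the 93 lands at the bottom and the 96 in the middle even though every learner cleared the standard; against the criterion, all eight have demonstrated mastery.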
How many test items should I write?
Notice that we use the term test items instead of test questions. Since we are designing assessments for a broad range of performances, there will be many instances in which we don't ask any questions. We might observe and rate actual performances instead. More on that later.
The number of test items you write depends on several factors. First, if the purpose of your test is to determine the degree of mastery of the instructional objectives, you'll need at least one test item per objective. Second, if part of the purpose of the test is to diagnose learners' errors, you'll need at least one test item for each sub-task and/or prerequisite skill, or at least for those that represent common stumbling blocks. Third, if learners can guess the answer to an individual test item, you may need one or two more items to test mastery of that objective.
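As a rough illustration, the three factors above can be turned into a lower bound on the number of items. This is only a sketch: the dictionary keys (stumbling_blocks, guessable), the function name, and the sample objectives are hypothetical, and the default of one extra item for guessable objectives is an assumption within the "one or two more" range mentioned above.

```python
def minimum_item_count(objectives, diagnose=False, extra_for_guessable=1):
    """Estimate the fewest test items needed for a list of objectives."""
    total = 0
    for objective in objectives:
        count = 1  # at least one item per objective
        if diagnose:
            # one item per sub-task or prerequisite that commonly trips learners
            count += len(objective["stumbling_blocks"])
        if objective["guessable"]:
            # extra items when a single item could be answered by guessing
            count += extra_for_guessable
        total += count
    return total


# Made-up objectives for illustration only.
objectives = [
    {"name": "label brain diagram", "stumbling_blocks": [], "guessable": False},
    {
        "name": "classify governments",
        "stumbling_blocks": ["division of power", "method of representation"],
        "guessable": True,
    },
]

print(minimum_item_count(objectives))
print(minimum_item_count(objectives, diagnose=True))
```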
A seven-step approach to test item writing
Use this heuristic for writing criteria referenced test items that match instructional objectives (adapted from Williams & Haladyna, 19**).
Step One: Identify knowledge type
Using our old friend the content/performance matrix (Figure 2), identify the objective's knowledge type.
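Knowledge types (columns): Facts, Concepts, Procedures, Principles
- Apply: perform the steps (procedures); apply a rule (principles)
- Remember: recall an association (facts); recall a definition (concepts); list the steps (procedures); state a rule (principles)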
Figure 2. The content/performance matrix (see Module 5 for more details).
You may already have done this when you first drafted your objective, but jot it down again now, because it will serve as the framework for your new test item.
Step Two: Jot down related information.
In the case of facts, that information consists of the association itself:
First president of the United States: George Washington
Related information for concepts includes all the characteristics used to classify things in the category. If you want students to be able to classify different governments as monarchies, oligarchies, dictatorships, democracies, or republics, for instance, you might include characteristics such as
- independence of judicial system
- method of selecting government authorities
- division of power
- method of representation
You can list the steps of a procedure: "(1) withdraw dipstick; (2) wipe dipstick clean; (3) insert dipstick fully; (4) withdraw dipstick again; (5) compare oil level with index mark on dipstick."
Write down the "if-then" statement(s) that make up most principles: "If at first you don't succeed, then try, try again."
When you've made a note of all the information that relates to the objective, you're ready to identify the performance.
Step Three: Identify the performance
Again, you may already have done this step if you used the content/performance matrix when you first drafted your objective. Do you want to test whether learners can remember or apply the fact, concept, procedure, or principle?
Step Four: Select the response mode
Do you want learners to select or construct the correct response? This is akin to asking the familiar question, "Is it multiple choice or essay?" It applies to all remember and apply types of performances (though asking learners to apply a procedure by choosing among alternative examples of the application is probably marginally useful).
Choose the select or construct mode of response based on your situation. Select mode is easier to score. Multiple choice tests can be completed on forms or online and scored by a computer. On the other hand, select mode involves prompting learners with the correct response along with incorrect distracters. This may require less learning on the part of students (you are giving them the correct answer) and more time constructing items on the part of teachers (who must craft distracters carefully to make the item work). It is possible to "guess" the correct answer, even in the absence of the necessary knowledge or skill.
Constructing answers requires stronger learning on the part of students, but also means that some human usually has to read all those answers and make some judgments about them. This can be time-consuming and tedious if you have many students. You may need to strike a balance between efficiency and effectiveness when planning tests.
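The guessing risk that comes with select mode can be estimated with simple arithmetic. The sketch below assumes blind, independent guessing with every option equally likely, which is a simplification (plausible distracters and partial knowledge change the odds); the function name is hypothetical.

```python
from fractions import Fraction


def chance_of_guessing_all(num_options, num_items):
    """Probability of answering every item correctly by blind guessing."""
    return Fraction(1, num_options) ** num_items


for items in (1, 2, 3):
    print(items, "item(s):", chance_of_guessing_all(4, items))
```

With four options, a single item can be guessed one time in four, but three items on the same objective only one time in 64. That is the reasoning behind adding an item or two when guessing is possible.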
Step Five: Write item stem
The item stem consists of two parts: (1) the description of the situation, or setup; and (2) the instructions about how to respond to that situation. Sometimes the description and the instructions are combined in a single statement or question.
The description includes all the information from step two (above) that the learner needs to develop the correct answer minus the skill or knowledge itself that you are testing.
The instructions might be a question or a statement, depending on what you want the learner to do. For instance, if you are testing learners' recall of a fact, the description and instructions are combined by writing:
Name the first president of the United States.
For an apply principle type objective, the setup could either describe a situation in which the principle applies, or name or even state the principle itself.
Step Six: Write the correct answer
It is important to write the correct answer for both select and construct type items. For select items, the correct answer is one of the choices you will present to the learners. Where you want students to construct an answer, you should write it so that you can develop appropriate criteria for scoring student answers (see rubrics, below).
Step Seven: Write distracters (select type items only)
There are some guidelines for writing good distracters for select type items. First, distracters should be plausible responses, designed to attract examinees whose mastery of the objective is incomplete. At the same time, distracters should not trick knowledgeable examinees.
Second, anticipate common errors or misunderstandings that learners have about the performance and generate distracters that will help you (and them) diagnose those problems. For instance, if a student chooses "Soil" or "Water" as the answer to a question about the origin of most of the mass of a plant, it could indicate misunderstanding of the carbon cycle. One way to generate this type of distracter is to administer the stems to a trial group as construct type items and then use their most prevalent incorrect responses as distracters in your select type items.
Third, leave out part of the correct answer or include an incorrect element. This works well when the answer consists of a list of steps (of a procedure) or characteristics or members (of a concept). Consider this item:
Choose the correct list of characteristics of a mammal:
- a. fur, live birth, doesn't suckle
- b. fins, live birth, suckles
- c. fur, lays eggs, suckles
- d. fur, live birth, suckles
The learner must have complete understanding of mammalian characteristics (at least the ones mentioned) to choose the correct response ("d," in this example).
Fourth, avoid response patterns. The correct answer should be randomly and pretty evenly distributed among the available response positions. It shouldn't consistently show up in the third, or "c" position, for instance.
Fifth, and finally, don't make the distracters consistently longer or shorter than the correct answer. A common pitfall is to make the correct answer longer than the distracters. Examinees pick up on this quickly and can guess the correct answer.
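Two of these guidelines lend themselves to a small script: harvesting distracters from a trial group's most common wrong answers, and randomizing the position of the correct answer. The following is a minimal sketch, not a finished item-writing tool. The function name, the trial responses, and the choice of "air" as the keyed answer are assumptions made for illustration.

```python
import random
from collections import Counter


def build_select_item(stem, correct, trial_responses, num_distracters=3, rng=None):
    """Assemble a select-type item from trial-group answers to the same stem."""
    rng = rng or random.Random()
    key = correct.strip().lower()
    # Tally the wrong answers the trial group constructed.
    wrong = Counter(
        r.strip().lower() for r in trial_responses if r.strip().lower() != key
    )
    # The most prevalent errors become the distracters.
    distracters = [resp for resp, _count in wrong.most_common(num_distracters)]
    options = [key] + distracters
    rng.shuffle(options)  # avoid a predictable position for the correct answer
    return {"stem": stem, "options": options, "answer_index": options.index(key)}


# Made-up trial data for illustration only.
trial = ["soil", "Soil", "water", "air", "sunlight", "soil", "water", "air"]
item = build_select_item(
    "Where does most of the mass of a plant come from?",
    correct="air",
    trial_responses=trial,
    num_distracters=2,
    rng=random.Random(0),
)
print(item["stem"])
for letter, option in zip("abc", item["options"]):
    print(f"  {letter}. {option}")
```

Shuffling each item independently tends to spread the correct answer across positions over a whole test, but it does nothing about option length, so the fifth guideline still has to be checked by hand.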
Matching test items to objectives
Following these seven steps (above) will help you construct sound test items that match the instructional objectives you wrote earlier. This is vital, because those objectives are your link to the real world. The performances you constructed there are supposed to reflect, as closely as possible, real world performances the learner is trying to master.
Let's look at some examples of how to construct test items.
Since we can only remember facts (not apply them), you can only test people's ability to recall associations. As an example, take this objective:
Given an unlabeled diagram of the brain, medical students will be able to label all the parts, using the Latin terms.
This is fact type knowledge, since there is no system to accurately predict the name of the cerebellum, for instance. To test recall of these facts (the names of the brain parts), the objective calls for learners to label a diagram of the brain. So the test item might supply a diagram of the brain with blank labels and read simply:
Label the diagram using the Latin terms.
For a remember concept type objective, you could either describe the characteristics of the class and direct learners to select or construct the name of the class, or you could give them the name of the class and ask them to select or construct its characteristics. For instance:
What is covered with fur, gives live birth, and suckles its young?
or
Describe three important characteristics of mammals.
For apply concept type objectives, you can give the description or an illustration of a specific member of the class and ask learners to select or construct the correct category, or you can give them the class and ask them to select or construct the characteristics. So you could show them a picture of a falcon and write:
Walking through the woods you come upon this animal. Is it a(n)
Or you could explain that:
If the City Zoo called you and told you they were bringing a reptile over for you to examine, describe three characteristics you would anticipate.
When an objective calls for learners to remember a procedure, you can test their recall by asking them to either select or construct the steps involved. So:
Which of the following lists the steps for adjusting a microscope in the correct order?
- insert slide; use coarse focus to pull element up; use coarse focus to lower element almost to surface of slide; look through eye piece; use coarse focus to pull element up to approximate focus; use fine focus to achieve clear image.
- use coarse focus to pull element up; insert slide; use coarse focus to lower element almost to surface of slide; look through eye piece; use coarse focus to pull element up to approximate focus; use fine focus to achieve clear image.
- look through eye piece; use coarse focus to pull element up; insert slide; use coarse focus to lower element almost to surface of slide; use coarse focus to pull element up to approximate focus; use fine focus to achieve clear image.
- use coarse focus to pull element up to approximate focus; insert slide; look through eye piece; use coarse focus to pull element up; use fine focus to achieve clear image; use coarse focus to lower element almost to surface of slide.
Or:
List and describe the steps for focusing a microscope.
The best way to test objectives that call for learners to apply procedures is to actually have them perform the procedure, under real or simulated conditions, if possible.
To remember a principle means simply to state it correctly. You can give the name or other designation of the principle and ask learners to either select or construct the correct statement of the principle. So:
State the principle of conservation of matter and energy.
The apply principle objective calls for applying a general rule to a specific situation. So:
You notice that a planet is orbiting around a star at a distance of 4.7 million miles. What else would you need to know in order to calculate the mass of the star?
Scoring selected response test items is fairly straightforward, since it involves simply matching the learner's response with the correct response. It can even be done by computer. Some simpler construct type test items also lend themselves to routine or computer scoring. One-word responses, for example, can be scored by a computer with appropriate consideration for misspelling, capitalization, and so forth.
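As a sketch of what "appropriate consideration for misspelling, capitalization, and so forth" might look like, the function below normalizes case and whitespace and then accepts near matches using SequenceMatcher from Python's standard difflib module. The function name and the 0.8 similarity threshold are assumptions, not a recommended standard.

```python
from difflib import SequenceMatcher


def score_one_word(response, correct, min_similarity=0.8):
    """Return True if a one-word response is close enough to the keyed answer."""
    given = response.strip().lower()   # ignore capitalization and stray spaces
    keyed = correct.strip().lower()
    similarity = SequenceMatcher(None, given, keyed).ratio()
    return similarity >= min_similarity  # tolerate minor misspellings


for attempt in ("Cerebellum", "  cerebelum ", "cerebrum", "medulla"):
    print(repr(attempt), score_one_word(attempt, "cerebellum"))
```

A threshold that is too lenient can give credit for a different term entirely (here "cerebrum" is rejected only narrowly), so any cutoff should be checked against real learner responses before it is trusted.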
Complex construct type answers, however, including most authentic tasks such as actually assembling a circuit board, writing a paper, or performing heart surgery, are more difficult to score "objectively." There are a number of tools for judging these types of performances, including checklists and rubrics. Checklists can be either product or activity checklists, or combinations of the two, and are relatively easy to construct.
Rubrics, on the other hand, lend themselves to scoring complex tasks in authentic settings and, if carefully constructed, can provide useful feedback to learners and instructors alike. Rubrics list and describe specific criteria against which both learners and instructors can judge learner performances. They facilitate peer evaluation (a learning strategy in its own right).
Let us suppose that you have as an objective that "Real estate sales people will be able to present the 26-point marketing plan to prospective clients clearly, completely, and in less than a half hour." Your test item reads something like:
Present the 26-point marketing plan to a prospective client in less than a half hour. Describe all 26 points clearly and maintain good eye contact.
You might construct a rubric to score or judge this performance. First, make a list of the criteria mentioned in the objective, separating each one out:
Mentions all 26 points
Completes in half hour
Describes points clearly
Maintains eye contact
Now, divide each criterion into three or more categories, from incomplete or unacceptable at the low end to excellent or perfect at the upper end. Describe what you would look for in each category to agree that examinees had attained that level of mastery. Assign specific points in each category. This rubric is shown in Figure 3.
Points awarded at each level (1, 2, or 3):
- Mentions all 26 points. 1: Covered few points. 2: Covered most points. 3: Mentioned all 26 points.
- Describes points clearly. 1: Seemed confused about some points. Was inarticulate. 2: Clear about most points. Reasonably articulate. 3: Described points very clearly, in an articulate fashion.
- Completes in half hour or less. 1: Went more than 5 minutes over time. 2: Completed in a half hour ± 5 minutes. 3: Completed in under a half hour.
- Maintains eye contact. 1: Maintained eye contact little of the time. 2: Maintained eye contact most of the time. 3: Maintained eye contact at all times.
Figure 3. Rubric for scoring real estate sales marketing plan presentation.
Rubrics allow you to further define or elaborate what you mean by the criteria stated in your objective, and assign points to specified levels of performance on each criterion. This is useful to learners because they can get more feedback on their strengths and weaknesses. Instead of merely "passing" or "failing" an instructional objective, they get some direction on how to focus their efforts as they pursue complete mastery.
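A rubric like the one in Figure 3 maps naturally onto a small lookup table. The sketch below stores the level descriptors from the figure and returns both a total and per-criterion feedback; the function name, the validation rules, and the sample ratings are assumptions added for illustration.

```python
# Level descriptors from Figure 3; list position 0 is worth 1 point.
RUBRIC = {
    "Mentions all 26 points": [
        "Covered few points.",
        "Covered most points.",
        "Mentioned all 26 points.",
    ],
    "Describes points clearly": [
        "Seemed confused about some points. Was inarticulate.",
        "Clear about most points. Reasonably articulate.",
        "Described points very clearly, in an articulate fashion.",
    ],
    "Completes in half hour or less": [
        "Went more than 5 minutes over time.",
        "Completed in a half hour ± 5 minutes.",
        "Completed in under a half hour.",
    ],
    "Maintains eye contact": [
        "Maintained eye contact little of the time.",
        "Maintained eye contact most of the time.",
        "Maintained eye contact at all times.",
    ],
}


def score_presentation(ratings):
    """Return (total points, feedback per criterion) for one observed presentation."""
    if set(ratings) != set(RUBRIC):
        raise ValueError("rate every criterion exactly once")
    feedback = {}
    for criterion, level in ratings.items():
        if level not in (1, 2, 3):
            raise ValueError(f"level for {criterion!r} must be 1, 2, or 3")
        feedback[criterion] = RUBRIC[criterion][level - 1]
    return sum(ratings.values()), feedback


# Made-up ratings from a single observer.
ratings = {
    "Mentions all 26 points": 3,
    "Describes points clearly": 2,
    "Completes in half hour or less": 3,
    "Maintains eye contact": 1,
}
total, feedback = score_presentation(ratings)
print(f"{total} of {3 * len(RUBRIC)} points")
for criterion, note in feedback.items():
    print(f"- {criterion}: {note}")
```

Returning the descriptor alongside the points keeps the feedback function of the rubric intact: the learner sees which criterion needs work, not just a number.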
Identify the weakness in each test item.
Objective: With use of a periodic table of the elements, learners will be able to list the number of subatomic particles for at least five elements.
Original test item: List the number of electrons, neutrons, and protons for the following elements: Hydrogen, Helium, Boron, Carbon, and Oxygen.
Objective: Students will be able to identify the island of Hawaii on a map.
Original test item: On the accompanying globe, write in the name of the island of Hawaii on its location.
Objective: With access to a job aid, the assembly line worker will be able to identify which of three welds was placed with a mixture too rich in oxygen.
Original test item: On the table are three faulty welds. Identify the problems that caused the faults with the welds. You may use the reference job aid on welding problems if you wish.
Objective: The computer technician will be able to list the four most common problems that occur with a computer network.
Original test item: Here is a malfunctioning computer network. Identify the problem.
Objective: The dental assistants will know the names of the 12 tools most often requested by a dentist.
Original test item: Here are the 12 most common tools requested by a dentist. As I call off each name, please hand me the tool.
Next week, you will complete a graded Objectives Exercise. The ability to write defined, explicit learning objectives is a fundamental skill required by instructional design. The Objectives Exercise will test your ability to construct objectives as well as identify and repair faulty ones.
This activity is designed to help you prepare for the graded exercise. Complete this activity by visiting the Assignments section of Blackboard. Here, you will find the "Practice Objectives Exercise". You may complete this exercise as many times as you like. Feedback will be provided.
In the past few weeks you described a learning experience that was less than ideal. You posted this to the discussion board and may have received comments from classmates along the way.
This week, revisit your learning experience. Consider what type of assessment was involved in the learning experience you described. Perhaps it was selected-response, or maybe no formal assessment took place.
Reflect on the assessment, or lack thereof, by adding a follow-up to your original posting. In this follow-up, consider whether your performance was assessed:
- If so, how? Was it a valid assessment of your learning? Did it test the outcome(s) you were expected to apply following the class? Was it based on defined objectives?
- If not, what type of assessment should have been conducted? Make a recommendation regarding selected response or constructed response and describe, briefly, the type of testing you envision.
Overview of this section
People in action
We left Barbara just as she was generating her objectives for next Friday's experiment. Over the weekend, she rewrote the lab experiment in simpler terms, color coded her beakers, and posted her five lab rules. She decided that before she would allow the students to complete the lab exercise, they would need to pass a test, based on her four objectives. Below are her objectives and test questions.
Objective: Using a job aid, students will be able to summarize each of the required steps before beginning an experiment.
Test item: Summarize the five steps for the acid rain experiment. You may use the acid rain experiment job aid.
Objective: Using the color coded equipment job aid, students will be able to correctly identify each of the two flasks and five beakers.
Test item: On the teacher's desk I have placed a color-coded Erlenmeyer and Florence flask, and a 50 ml, 100 ml, 250 ml, 500 ml, and 1 L beaker. Each piece of equipment has a letter identifying it. Identify each according to its letter. You may use the color-coded job aid.
Objective: From memory, students will be able to write word-for-word the five safety rules for laboratory exercises.
Test item: Write the lab's five safety rules. They must be written exactly as shown on the handout provided earlier this week.
Objective: Using either the lab book or experiment job aid, students will be able to list an experiment's a) required materials and b) special safety precautions.
Test item: Using either the lab book or job aid on the acid rain experiment, list the following:
- required materials
- special safety precautions
Types of tests
- In the ISD model, test items are written before creating the instructional material, and are always matched to the objectives.
- A norm-based grading system (based on the "Bell Curve" principle) is computed to ensure that, no matter what range of scores are obtained, a small percentage of students will receive an "A" and a small percentage will fail.
- In a criterion referenced assessment, students are not graded against one another, but against instructional objectives. If everyone masters the objectives, everyone passes.
- Most assessment systems contain both norm-based and criterion referenced components.
Writing test items
- The number of items needed in a test will be based on the number of objectives, how critical the tasks are which make up the objectives, and how easily a learner might be able to guess the right answer.
- To match the test question to the objective, use the seven step heuristic:
  - Identify whether the objective is a fact, concept, process, procedure, or principle.
  - Jot down relevant information related to the objective.
  - Identify if you want the learner to remember or apply the information.
  - Identify if you want the learner to select or construct the correct response.
  - Write the test item.
  - Write a correct answer for use in measuring learner's knowledge.
  - Write distracters for select type items that are both plausible and can be used to identify common errors.
Test items and the content/performance matrix
- When writing test items, it is best to start with the content/performance matrix. Each component in the content/performance matrix requires a special type of test item.
- Testing whether a learner can "remember facts" is done with either a selection or recall question.
- "Remember concept" objectives are tested by describing the characteristics of the class and asking for the class's name, or providing the name of the class and asking learners to describe the characteristics.
- "Apply concept" objectives are tested by giving the description or illustration of a specific member of a class and asking learners to construct the category, or providing the class and asking learners to construct the characteristics.
- "Remember process" objectives can be tested by asking learners to recall the phases, stages, or relationships of the process's components.
- "Apply process" objectives are tested by asking learners to predict what will occur at various stages of a process, or by giving a situation and asking what led to those conditions.
- "Remember procedure" objectives are tested by asking learners to select or list steps in the specific sequence.
- To test the ability to "apply procedure," it is best to have the learner perform the procedure under real or simulated conditions.
- A "remember principle" objective is tested by having the student select or list the principle.
- To test an "apply principle" objective, provide a situation and ask the learner to apply the principle's rule.
- Since there is usually one correct answer in a "select" test item, scoring tends to be accomplished easily.
- Test items in which students need to "construct" an answer are often scored with a checklist, or for more complex items, with a rubric.
- Rubrics list and describe specific criteria against which the learner's performance is judged.
Now that we have designed the objectives and test items, there's one last step to accomplish before leaving the design phase. We still need to decide upon and organize a coherent strategy to use when implementing the instruction. This phase, in which the instructional sequence is organized based on our understanding of how people learn, is known as the instructional analysis, and is the topic of our next chapter.
For more information
Dick, W., & Carey, L. (1996). The systematic design of instruction, fourth edition. New York: Harper Collins Publishers, Inc.
Mager, R. (1984). Measuring instructional results. Belmont, CA: David S. Lake Publishers.
Williams, R. & Haladyna, T. (19**). Logical operations for generating intended questions (LOGIQ): A typology for higher level test items. In **
Page authors: Bob Hoffman & Donn Ritchie & James Marshall. Last updated: Marshall, Spring, 2006.
All rights reserved