Table of Contents: Using Tests to Measure Learning

 

Measuring learning: an overview

Testing formats

Sources of tests

Item types

Simulations

Reprising Fraenkel and Wallen: validity and reliability

Just a few resources (of which you should be aware!)

 

Measuring learning: an overview

When measuring learning, you’re making the assumption that people are (or have been) part of an instructional intervention with clear outcomes: behavioral, psychomotor, attitudinal, etc. Although one-time training or learning events abound, evaluators tend to involve themselves with interventions that are implemented or delivered multiple times, often with differences or variations in delivery modality, duration or time frame, audience targets, activities and tasks, and deliverables. Core outcomes and processes remain constant, but you’d find these “divergences” even in our own courses – among them EdTec 541, 561 and 572 (summer/traditional semester; online/place-bound).

 

Testing, then, is about …

§         Content

§         Structure

§         Criteria (for measuring “success” or competence)

§         Alignment (with objectives, content delivery, and opportunities for practice and feedback)

§         Environment (where the test is “administered”)

 

Evaluators, you’ll see, are often called upon to perform two different but related functions relative to measurement of learning:

§         Creating (or locating) actual tests … and then deploying them (often as part of an objectives-oriented study)

§         Determining the quality of existing measures already in place

 

As an evaluator, you’re responsible for doing your homework: understanding the nuances of test design, distinguishing between and among item types, knowing the content the test targets, and knowing the audience (prerequisite skills/knowledge, reading level, etc.).

 

Some food for thought …

§         The summative evaluator’s primary interest is to find and use instruments that measure whether a program (product/process) attained its overall goals. Example: Now that it’s been in place for several years, in what ways has community-based policing affected relationships between police and residents of different SD communities?

Because a summative evaluation report can affect important decisions about a program or product’s future, you need to ensure high credibility in the instruments you select or develop.

§         By contrast, the formative evaluator’s reasons for measuring performance are less “official.”  In essence, you’re making a progress check to ensure participants or users are learning (accomplishing) what is expected …and maintaining the anticipated pace.  The audience for your research tends to be program or product staff and planners.

The formative evaluator has more flexibility than his/her summative counterpart in choosing or developing performance instruments.  For example, an end-of-unit test might be an appropriate instrument on which to rely.

 

Morris, Fitz-Gibbon, and Lindheim (1987) argue that performance “scores” can address a number of fundamental evaluation issues:

§         Which participants should be targeted for a program (or a product).

§         Whether or not participation in the program makes a difference (incrementally or over time).

§         To what extent program outcomes have been attained.

§         To what factors variations in performance might be attributed.

§         Whether or not a program (or product) warrants differentiated features (localization).

 


 

Testing formats

There is no one way to write a test – but there are certainly “better” and “worse” ways to assess learning/performance. A corollary caveat is that different “modes” may be used to assess the same domain. For example, one might assess knowledge through paper/pencil assessments (with varying item types), performance tasks, observations, and content analyses (reviews of lesson plans, for example). Many tests are conducted orally – especially important if language may be an unfair barrier to success, if flexibility in response is important, or when body language is important to note/consider.

This is a good time to remember that tests may be norm- or criterion-referenced.

§         Norm-referenced measures provide information about how examinees perform relative to others.

§         Criterion-referenced measures provide information about how examinees perform relative to specified standards or performance criteria.

A number of ERIC Digests cover this very well!
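To make the distinction concrete, here is a minimal sketch in Python that interprets the same raw score both ways. The scores, the ten-person norm group, and the cut score of 40 are all hypothetical, and the function names are ours, not drawn from any published test.

```python
def percentile_rank(score, group_scores):
    """Norm-referenced view: percent of the comparison group scoring below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def meets_criterion(score, cut_score):
    """Criterion-referenced view: has the examinee reached the stated standard?"""
    return score >= cut_score


# Hypothetical raw scores (out of 50 items) for a ten-person norm group.
norm_group = [22, 25, 28, 30, 31, 33, 35, 38, 41, 44]
examinee_score = 36
CUT_SCORE = 40  # hypothetical mastery standard (80% of 50 items)

rank = percentile_rank(examinee_score, norm_group)
mastered = meets_criterion(examinee_score, CUT_SCORE)

print(f"Percentile rank in norm group: {rank:.0f}")
print(f"Meets the mastery criterion:   {mastered}")
```

With these made-up numbers, the examinee outscores most of the norm group yet still falls short of the mastery standard, which is why the two kinds of measures answer different evaluation questions.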

 


 

Sources of tests

The advantages of assessing performance using a published measure are many; the two most often cited are time/labor savings and the benefit of others’ experience/expertise.

But where can such tests be found?

§         The curriculum materials used for a program might be accompanied by pre- or posttests, unit tests, or curriculum-embedded progress or mastery tests.

§         A state, district, or funding agency may administer a test as part of its area-wide assessment program.

§         A test can be purchased from a test publisher or borrowed from a researcher, professional association, etc. 

Note that:

§         An increasing number of these tests are criterion- rather than norm-referenced.

§         Published tests are almost always standardized.  By the time they’ve reached the marketplace, they’ve been through a validation phase with carefully selected groups of subjects.

§         Published tests tend to come with technical manuals that provide norms or comparative data based on the scores of the tryout group(s); information about the validity and reliability of the instruments; and instructions for administering, scoring, and interpreting results.

 

As important is awareness of the following:

§         That tests built into a program (end of workshop, for instance) are generally not appropriate to employ in a summative evaluation.  Their periodic passing does not guarantee that the program as a whole is accomplishing its objectives. 

§         That tests designed for summative assessment of a program or product are not necessarily implemented/administered when learners “are ready” (as tends to be the case with tests designed for formative assessment of a program or product).

Certainly, purchasing or using an existing test has its limitations. Avoid going this route if there’s not a close fit between the items or tasks constituting the test and the major objectives of the program being evaluated.

 


 

Item types

Selected response types include: multiple choice, true/false, and matching.

Constructed response types include: short answer, problem-solving, and making a (structured) oral presentation. One might be asked to write a letter that demonstrates both content knowledge and use of proper grammar … although some might argue that this scenario also depicts (quite broadly) one type of simulation.

Simulations tend to include: exercises that test decision-making skills, or processes and procedures.

Work samples may feature: observations (clandestine or planned) of on-the-job performance (where integration with existing skills is critical) and/or review of documents.

 


 

A closer look at simulations (performance-based testing)

Among the advantages of simulations are these, according to Phillips (2000): reproducibility (meaning jobs or parts of jobs nearly replicate the “real thing”); cost effectiveness (better, for example, to train pilots in simulators – at least at first – than in multi-million dollar aircraft!); and safety considerations.

Among the techniques with which you might want to be familiar are these: electrical or mechanical (simulated patients are classified here); task; games; in-basket; case studies; role plays; and assessment center. Each has advantages and disadvantages; not surprisingly, some are better suited to certain tasks (job classifications, industries) than others. Each will be more fully explained on the handout you’ll receive as part of our in-class test-writing activity.

 


 

Reprising Fraenkel and Wallen: validity and reliability

Assessments of validity and reliability help to determine the amount of faith people should place in a measurement instrument.

§         Validity 

Is the instrument an appropriate one for measuring what you want to know?

There are many kinds of validity. The following supplements/reviews your reading in Fraenkel and Wallen:

CONTENT – Does the test/data collection instrument represent the content in question?  Here you look at how well the test aligns with the objectives and practice.

CRITERION – In this instance, you’re seeing how well your test/data collection instrument measures up against “outside” criteria.  In practice, this may mean that you’re measuring your test/data collection instrument against an older, well-established test/data collection instrument.

PREDICTIVE – The GRE represents this category.  How well does your test/data collection instrument predict or suggest future performance or behavior?

CONSTRUCT – In this case, you’re validating that the test/data collection instrument accurately measures a construct like self-efficacy or open-mindedness.  For example, you might measure the construct of creative writing with a test that looks at the learner’s original thinking, as well as his/her use of descriptive language and dialogue. 

What might a test that measures “learning to learn” look like? Without defining/describing the construct, there’s no way to meaningfully measure it.

 

Threats to validity

o        Lack of standardization in test administration

o        Response bias or evaluation apprehension

o        Too few items per objective

o        Tests that measure the skill too narrowly

o        Mismatch between the skills called for by the test and the stated objective of the test

o        Tests that attempt to measure very complex constructs

o        Tests whose format and wording are tied to the idiosyncrasies of a particular set of instructional materials or of a particular program

o        Coding/scoring errors

§         Reliability

Does the test produce consistent results? People often confuse validity with reliability, but a test or data collection instrument is RELIABLE if it provides consistent results time after time after time—if the test seems to be free of unexpected kinds of errors.

 

Ways to establish reliability

  • Test/retest
  • Alternate form
  • Split-half
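Each of the three approaches above comes down to correlating two sets of scores from the same examinees. The sketch below is one way to compute them in plain Python; all data are hypothetical. The odd/even item split and the Spearman-Brown step-up for the split-half estimate are common conventions, assumed here rather than taken from Fraenkel and Wallen.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown.

    item_scores holds one list of item scores (1 = correct, 0 = incorrect)
    per examinee.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    # Spearman-Brown: estimate full-length reliability from the half-test correlation.
    return 2 * r_half / (1 + r_half)


# Test/retest: the same five examinees take the same test twice (hypothetical totals).
# Alternate form works the same way, with Form A and Form B totals in place of
# first and second administrations.
first_administration = [12, 15, 18, 20, 25]
second_administration = [13, 14, 19, 21, 24]
test_retest_r = pearson_r(first_administration, second_administration)

# Split-half: one administration of a six-item test to five examinees (hypothetical).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
split_half = split_half_reliability(responses)

print(f"Test/retest correlation:            {test_retest_r:.2f}")
print(f"Split-half (Spearman-Brown) value:  {split_half:.2f}")
```

Values closer to 1.0 suggest more consistent measurement. With samples this small the numbers are illustrative only.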

 

What are some ways to establish interrater reliability? Why is interrater reliability important?
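As one starting point for the first question, two commonly used indices are simple percent agreement and Cohen's kappa, which adjusts agreement for what two raters would be expected to reach by chance. The sketch below assumes two raters scoring the same ten essays as pass ("P") or fail ("F"); the ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters gave the same rating."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance, for two raters using the same categories.

    Note: undefined (division by zero) if both raters use a single,
    identical category for every case.
    """
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical pass/fail ratings of the same ten essays by two raters.
rater_1 = ["P", "P", "P", "F", "F", "P", "F", "P", "P", "F"]
rater_2 = ["P", "P", "F", "F", "F", "P", "F", "P", "P", "P"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)

print(f"Percent agreement: {agreement:.0%}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

Kappa comes out lower than raw agreement because some matches would occur even if the raters were guessing. That gap is one reason to report more than percent agreement when scores depend on human judgment.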

 

Threats to reliability abound. Among them are fluctuations in the mood or alertness of respondents because of illness, fatigue, recent personal experiences, or other temporary differences among members of the group(s) being measured. Variations in the conditions of test administration also affect reliability.

 

§         Common test construction errors that may impact validity and/or reliability …

o        Content or language that strikes respondents as racially, ethnically, religiously, sexually, geographically, or economically biased

o        Content or language (construction as well as words) inappropriate for the target group(s)

o        Response types that are inappropriate for the type of question asked.

o        Questions that are ambiguous or cover multiple things (ideas, themes, topics, etc.)

o        Expectations (in an essay, for instance) that are not spelled out (e.g., how many examples the respondent must give; what ‘compare’ means; what ‘describe’ means, and so on)

 


 

Just a few resources (of which you should be aware!)

Mager, R. F. (1997). Measuring instructional results (3rd ed.). Atlanta: Center for Effective Performance.

Morris, L. L., Fitz-Gibbon, C. T., & Lindheim, E. (1987). How to measure performance and use tests. Thousand Oaks, CA: SAGE Publications.

Robinson, D. G., & Robinson, J. C. (1989). Training for impact: How to link training to business needs and measure the results. San Francisco: Jossey-Bass Publishers.

 

I also urge you to purchase an Info-line from ASTD by Jack Phillips (2000) entitled: Level 2 Evaluation: Learning.

 
