
06: Instruments, Data, & Statistics

 

introduction

This module provides a broad overview of data, statistics, and instruments for data collection. Statistics is a powerful "foreign language." Keep the handout "Statistical Family Tree" handy because it will be your navigator in the "ocean" of statistics.

The rest of the modules will zoom in on the major topics of this module and further illustrate how statistics is used in research, especially in the field of educational technology. You will have opportunities to practice "speaking" it in class activities and by working on your projects.

Data and Statistics

In summary, data are information and evidence collected in systematic ways. Data can be numerical, narrative, aural, and visual. Statistics is used to analyze data "in such a way as to obtain a more efficient and comprehensive summary of the overall results" (Coolidge, 2000, p. 5).

In general, statistics is about the difference between what you expect and what you observe. It compares chance and random error with systematic influence to determine how likely it is that an effect occurred because of systematic influence. A difference or relationship that is too large to have plausibly occurred by chance is called statistically significant.

Suppose you toss a coin 100 times. How many heads and how many tails would you expect to see? Common sense says 50 of each. However, the results rarely turn out to be exactly 50-50. How do we explain this variability? Is it due to chance or to other factors (e.g., the skill of the coin tosser)? We generate a hypothesis and then use a statistical test to test it. We follow statistical rules to decide whether to reject or retain the hypothesis.
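The coin-toss question can be made concrete with a short calculation. The sketch below is a minimal illustration in Python, assuming the hypothesis being tested is that the coin is fair; the function name `two_sided_p` is made up for this example. It adds up the exact binomial probabilities of every outcome at least as far from an even split as the one observed.

```python
from math import comb


def two_sided_p(heads, tosses):
    """Probability, for a fair coin, of a result at least as far from
    an even split as the observed number of heads."""
    expected = tosses / 2
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(tosses + 1):
        if abs(k - expected) >= observed_gap:
            total += comb(tosses, k) / 2 ** tosses
    return total


for heads in (50, 55, 60, 61, 70):
    p = two_sided_p(heads, 100)
    verdict = "unlikely to be chance alone" if p < 0.05 else "consistent with chance"
    print(f"{heads} heads out of 100: p = {p:.4f} ({verdict})")
```

Under these assumptions, a result such as 55 heads is well within what chance produces, while a result far from 50 is not.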

The chance-to-influence test is also called signal-to-noise ratio, which is "borrowed from signal detection theory where the effect of a treatment is considered the signal, and random variation in the numbers is considered the noise" (Coolidge, 1998, p. 106).

Statistical tests therefore compare chance against non-chance factors and conclude whether a difference is statistically significant. Such a test is therefore also called a test of significance. For example, suppose employees trained through web-based training achieved higher performance scores than those trained in a classroom setting. Is this difference due to chance or due to the intervention (the instructional strategies)? A t-test tests the signal-to-noise ratio. If the p (probability) level of the t-test is less than 0.05 (the alpha, or level of significance, chosen for this test), we conclude that the difference is NOT due to chance, because the probability that chance alone produced the score difference is very small.
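To see the signal-to-noise idea in numbers, here is a minimal sketch of an independent-samples t-test calculated "by hand" in Python. The scores are invented for illustration, and the function name `independent_t` is an assumption of this example, not part of any course material. The signal is the difference between the two group means; the noise is the standard error of that difference.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical performance scores, invented for illustration only.
web_based = [85, 88, 90, 78, 92, 87, 84, 91, 89, 86]
classroom = [80, 75, 83, 79, 77, 82, 74, 81, 78, 76]


def independent_t(group1, group2):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n1, n2 = len(group1), len(group2)
    df = (n1 - 1) + (n2 - 1)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    noise = sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    signal = mean(group1) - mean(group2)          # difference between group means
    return signal / noise, df


t, df = independent_t(web_based, classroom)
CRITICAL_T = 2.101  # two-tailed critical value for alpha = .05 and df = 18, from a standard t-table
print(f"t = {t:.2f}, df = {df}")
print("significant at .05" if abs(t) > CRITICAL_T else "not significant at .05")
```

Because this sketch does not compute a p level, it compares the t value with a tabled critical value instead; statistical software would normally report p directly.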

Types of Statistics

There are two types of statistics: descriptive and inferential. The former are the building blocks for the latter.

Descriptive Statistics

"Descriptive Statistics involve measuring data using graphs, tables, and basic descriptions of numbers such as averages or means. These universally accepted descriptions of numbers are called parameters" (Coolidge, 2000, p. 5-6).

Descriptive statistics describe a sample’s characteristics through measures of central tendency, variability, and relationship.
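As a small illustration, the following Python sketch computes common measures of central tendency and variability for a made-up set of scores (the data are hypothetical). It uses only the standard `statistics` module.

```python
from statistics import mean, median, mode, stdev

scores = [70, 75, 80, 80, 85, 90]  # hypothetical test scores

summary = {
    "mean": mean(scores),                # central tendency
    "median": median(scores),            # central tendency
    "mode": mode(scores),                # central tendency
    "range": max(scores) - min(scores),  # variability
    "standard deviation": stdev(scores), # variability (sample standard deviation)
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```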

When and how to use graphs: The "why" of using graphs is to support conclusions and arguments. Read the chapters on frequency distribution in all the textbooks. It is important that you know how to make a distribution graph both by hand and with software.

Inferential Statistics

Inferential statistics draw conclusions about the population (a large group of data) from the sample’s characteristics (a small group of data).

A general formula for using inferential statistics is Fly IDAIR: identify the problem, design the statistical test, apply the method, infer from the test, and report the results.

Parametric

Statistical techniques used for group comparison when the characteristic being studied (e.g., learning outcomes) is normally distributed in the population, the sample was randomly selected, and the data being analyzed are interval or ratio (e.g., test scores).

  • t-test: for dependent samples (same group) and independent samples (two different groups)
  • Analysis of Variance (ANOVA)
  • ANCOVA: similar to ANOVA but controls for the influence of a variable (a covariate) that may vary between groups before the treatment is implemented.
  • MANOVA: multivariate ANOVA. Used when there is more than one dependent variable (DV).

Nonparametric

Statistical techniques used for group comparison when the characteristic being studied is not normally distributed in the population, the sample is small and not randomly selected, and the data being analyzed are ordinal (rank) or nominal (categories).

  • Wilcoxon matched-pairs test (t-equivalent): used with dependent samples and ordinal data.
  • Mann-Whitney U Test (t-equivalent): used with two independent samples and ordinal data.
  • Friedman Two-Way Analysis of Variance: used with more than two dependent samples and ordinal data.
  • Kruskal-Wallis One-Way Analysis of Variance: used with more than two independent samples and ordinal data.
  • Chi-Square (for categorical data): used to test the statistical independence of two variables (e.g., gender and learning styles).

Note: The t test and ANOVA are the foci of 690. The t test is used to test the statistical significance of mean differences for one or two groups.

ANOVA is similar to the t test but is used when you compare more than two groups or have more than one independent variable.


[updated:05.23.04]

connect

Instruments

Educational Knowledge

How do we know what works in a training or educational setting? Usually our knowledge comes from Experience, Observations, or Assessment (tests).

Core Instrumentation within the Field of Educational Technology

Surveys/questionnaires, interview or focus group protocols, observation guidelines, rubrics/checklists for review of extant data, etc.

Quantitative study relies heavily on assessment (test). Qualitative study normally uses a variety of instruments in data collection, so as to investigate the problem from different perspectives and to triangulate information collected.

Primary Data Sources - people (surveys/questionnaires), observations, documents (e.g., student portfolios), and assessment (tests)

Secondary Data Sources - administrative records, prior research studies, extant databases (e.g., National Assessment of Educational Progress, the High School and Beyond Longitudinal Studies), and documentary evidence (e.g., evaluation reports).

Zoom in On Some of the Instrumentation Methods

Survey/questionnaire: Used in both quantitative and qualitative studies. More and more surveys/questionnaires are conducted via the Internet.

In qualitative studies, investigators have to create a survey/questionnaire for a special group of participants. These instruments can include Likert Scale items and/or open-ended questions.

Internet Resource: sdsu.edu - Read "Creating web-based survey" (by Dr. Hoffman and Bober)

Observation: Used predominantly in qualitative studies. Investigators normally engage in one of five types of participation: nonparticipation (e.g., watching a videotape), passive participation (no interaction), moderate participation, active participation, and complete participation.

Focal points for observation: research setting (physical, human, and social environment), participant activities and behaviors, informal interaction and spontaneous activities, nonverbal communication, etc.

Interviewing (Individual and Focus Group): Observation requires direct and prolonged involvement with participants. Interviewing is an alternative, "quicker" way of collecting data. Group interviewing (a focus group) is a 1.5- to 2-hour session of "guided" discussion among participants.

The investigator provides fewer than 10 semi-structured questions for the focus group. The session is usually recorded for further analysis. The investigator should avoid asking "why" questions (which can trigger defensive reactions), develop the questions carefully, establish the context for the questions, and arrange them in a logical order (Mertens, 1998).

Document Analysis: Used predominantly in qualitative research, especially historical research. Documents include memos, reports, letters, field notes, chat transcripts (online), computer files, tapes, and many other artifacts related to the participants under investigation. Refer to the data analysis methods introduced in Module 5.
Criteria for Selecting an Instrument (Mertens, 1998)
  1. Identify the purpose and format of the construct as conceptualized by the author.
  2. Identify your purpose in collecting data.
  3. Identify the constructs and variables the instrument measures.
  4. Examine the validity and reliability information of the instrument. For example, read reviews in the Mental Measurements Yearbook, which describes about 500 tests in 18 major categories.
  5. Examine the conditions for instrument administration, scoring, and interpretation.
  6. Check whether the instrument satisfies concerns about language and culture, avoiding bias on the basis of gender, race and ethnicity, and disability.
  7. Synthesize the above information and decide if this will be a valid and reliable instrument for your study.
Guidance for Developing your Own Instruments

Many times investigators develop their own instruments when no existing instrument can measure exactly the construct they are interested in. Mertens (1998) recommended the following steps:

  1. Define the objective of your instrument.
  2. Identify the intended respondents.
  3. Review existing measures.
  4. Adapt questions from existing instruments and incorporate expert opinions.
  5. Prepare and pilot test the prototype.
  6. Conduct an item analysis of the pilot testing results and revise the instrument.
Quality of Instrumentation

Collecting trustworthy and believable information is crucial in any type of research. Quantitative and qualitative research have different criteria for evaluating the quality of instrumentation. Because assessment (testing) is widely used in data collection, we will examine methods for establishing validity and reliability when developing and administering assessments.

Assessment (Test)

Testing is the most exact method, but the process presents many problems. The biggest problem is flawed tests, which lead to flawed results and conclusions. Flaws originate when tests are:

  • not administered properly,
  • not valid, or
  • not reliable

To help with the administration, many tests are standardized. This means:

  • Personal beliefs of graders will not influence grades
  • Administrative procedures are established and followed (replicable)
  • Normative data is established for populations

Standardized tests allow for comparison of scores across time or distance, as long as procedures and scoring are followed and tests are valid and reliable.

Test Validity (p. 169 Table 5.1)

Validity is the degree to which a test measures what it is supposed to measure for a specific group. There are four types of validity, which concern how well the test matches the content to be measured, its ability to predict attributes of a variable, the similarity of its results to those of another test, and the attitudes of people.

Content Validity: How well does the test measure the intended content? (Will your midterm?)
Predictive Validity: How well does the test tell us how you will do in the future? (Does the GRE?)
Concurrent Validity: To what extent does one test correlate with another test? How validly will a simpler comprehensive exam reflect your knowledge compared to a more complex one?

(Correlate scores from one instrument with scores on a criterion measure, obtained either at the same time or at a different time.)

Construct Validity: Allows us to measure a non-observed trait. (To what extent does this test reflect a person's personality? creativity? ability? motivation? achievement? team functioning?)

Constructs are abstractions that cannot be observed directly and therefore need to be measured by variables.

Variables are constructs that can take on two or more values or scores (nominal, ordinal, interval, ratio).

Word Document: m06_compatt.rtf - Attitudes Towards Computers (a measure of computer anxiety)

Internet Resource: scientology.org - Free Personality Test

Internet Resource: rcn.com - Visual Personality

Internet Resource: sdsu.edu - A diagram of constructs and variables in iExpeditions

Internet Resource: m06_testval.htm - Additional information on test validity

Test Reliability

If you take the GRE twice, will you obtain the same score?

Reliability tells us how consistently a test measures what it is intended to measure. (Obtained scores are estimates of true scores; i.e., obtained score = true score + error.)

High reliability gives us confidence that the scores an individual receives would be the same if the individual took the test again later or took an equivalent form of the test. The strength of reliability is expressed as a coefficient of reliability on a scale from .00 to 1.00.

A variety of methods can be used to measure reliability, including the following (p. 176, Table 5.2 of Gay, 2000):

Stability (test-retest): The same test is given to the same group twice, usually 7-10 days apart.
Equivalence (alternative forms): Relationship between two versions of a test intended to be equivalent; give alternative test forms to a single group and correlate the two sets of scores.
Equivalence and stability: Relationship between equivalent versions of a test given at two different times.
Internal consistency (split-half): One test given to one group; correlates students' scores on one half of the test with their scores on the other half.
Scorer/rater: Agreement between judges (interjudge) or within one judge (intrajudge) when test scoring is subjective.
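A reliability coefficient of this kind is typically a correlation between two sets of scores. The sketch below is a minimal Python illustration for the test-retest case, assuming the coefficient is a Pearson correlation; the scores and the function name `pearson_r` are invented for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six students who took the same test twice.
first_attempt = [70, 82, 91, 65, 78, 88]
second_attempt = [72, 80, 93, 68, 75, 90]

reliability = pearson_r(first_attempt, second_attempt)
print(f"test-retest reliability coefficient: {reliability:.2f}")
```

The same function could be applied to alternative forms or to the two halves of a split-half design, since each of those methods also correlates two sets of scores.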

If reliability is low, obtained scores may not reflect true scores.

"Better than nothing. No, graduate admissions tests aren't perfect. They're just the most reliable measures that we have, argues Philip D. Shelton, the president and executive director of the Law School Admission Council"
- a headline from the Chronicle of Higher Education

Internet Resource: chronicle.com


[updated:01.26.04]

apply

Choosing an Instrument

Most educational assessment tools involve examining one of the following:

  • Achievement assessment to see what has been learned
  • Personality assessment to assess feelings, attitudes, creativity, and interests
  • Aptitude assessment to measure potential

Selecting an instrument is easier than creating one. Two reference books to check are the Mental Measurements Yearbook (MMY) and Tests in Print. The ERIC Clearinghouse on Assessment and Evaluation can help you find instruments and instrument reviews online.

Internet Resource: ericae.net - ERIC Clearinghouse on Assessment and Evaluation

Before selecting an instrument, ask yourself:

  • Is it valid?
  • Is it reliable?
  • Is it easy to administer, interpret, complete, and score?
  • Is it within your budget?

Creating your own instrument?

  • Pretest and refine with pilot group
  • Establish standards
  • Provide reason for subjects to take the time to do well

Descriptions of the Instruments in Your Report:

  • the name, publisher, and cost (if you made the instrument, then name it after yourself, e.g., Wang statistics fear test for 690 students);
  • a description of the items in the instrument;
  • validity and reliability data;
  • the type of participants for whom the instrument is appropriate;
  • procedures for administering the instruments;
  • information regarding scoring and interpretation; and
  • reviewers' overall impressions.

Common Mistakes

Many researchers make a variety of mistakes when choosing or creating an assessment instrument. These include choosing an instrument:

  • which isn't valid
  • which isn't reliable
  • because its name sounds like it should work
  • which has been normed for a population you are not studying
  • without considering what needs to be done by the participants

Other mistakes include:

  • trying to create an instrument without necessary knowledge
  • administering an instrument without controlling testers
  • administering an instrument, but ignoring the manual
  • trying to give too many assessments in a single session

Examples

Internet Resource: m06_pspub.htm - Theory-based assessment instruments

Internet Resource: personalstrengths.com - Personal Strength Publishing


[updated:05.23.04]

reflect

Reflection Questions

  • What types of test validity concerns do you have in your study, and how do you plan to address these concerns?
  • What are the characteristics of standardized tests?
  • What is test validity, and what are the different types?
  • What is test reliability, and why is this component important?
  • What should you consider when selecting a test?

[updated:01.26.04]

extend

Key Statistical Concepts

All activities and materials in the Extend section are optional unless noted in class. The instructor will selectively guide you through them in class. Click "cancel" if the site asks you for a password to access any of the materials.

Descriptive Statistics

Frequency Distribution: "a set of scores arranged in order of magnitude along the x-axis and the frequency of each score is represented along the y-axis" (Coolidge, 2000, p. 48).

Types:

  • frequency histogram: similar to bar graphs but has no spaces between the bars
  • frequency polygon: points are connected with straight lines.

Types based on the general shape of the distribution

  • normal distribution: bell-curve
  • skewed distribution: negatively or positively skewed (Students in Colorado remember positive skew as the right side of the snowy mountain that is good for skiing, so they ski to the right -> skew to the right)

A "job-aid" on how to construct a frequency distribution graph is available in the Salkind chapter. General procedures and advice (Coolidge, 2000):

  • group data into intervals (5 to 10)
  • define the size of the interval widths based on understandable units
  • make sure the intervals do not overlap
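The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, they are whole numbers, and the function name `frequency_table` and the interval width of 10 are choices made for this example.

```python
def frequency_table(scores, width):
    """Group whole-number scores into non-overlapping intervals of equal width."""
    start = (min(scores) // width) * width
    stop = (max(scores) // width) * width
    table = []
    for low in range(start, stop + 1, width):
        high = low + width - 1  # e.g., 70-79, so 80 starts the next interval
        count = sum(1 for s in scores if low <= s <= high)
        table.append((low, high, count))
    return table


# Hypothetical test scores, invented for illustration.
scores = [52, 67, 71, 74, 78, 78, 81, 83, 85, 85, 86, 88, 90, 92, 97]

table = frequency_table(scores, width=10)
for low, high, count in table:
    print(f"{low}-{high} | {'*' * count}")
```

Each row of asterisks plays the role of one bar in a frequency histogram turned on its side.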

Activities

Review the Salkind chapter on "frequency distribution" and create a paper-and-pencil distribution for one of the following sample data sets.

Excel Spreadsheet: m06_fakespring.xls

Excel Spreadsheet: m06_fakefall.xls - see what quantitative data look like

Inferential Statistics

  • Used to make inferences about populations based on the behavior of a sample.
  • Concerned with how likely it is that a result based on a sample or samples is the same as the result that would be obtained from the entire population.

Hypothesis

Hypothesis (chapter 7 of Salkind; continuation of Module 2): "An if-then statement of conjecture that relates variables to one another." A good hypothesis translates a problem statement or a research question into a form that can be tested through statistical techniques. Some of the statistics books state that inferential statistics is about hypothesis testing.

Null Hypothesis (Ho) (what is not true)

At the starting point of a study, the hypothesis is normally stated as a "null hypothesis," in the absence of any other information or a priori (before the fact) knowledge: "There is no significant difference in learning outcomes between formats of training." In theory, all hypothesis testing should start with a null hypothesis when there is no other evidence to support a non-null (alternative) hypothesis. It is like a default position, safe and conservative (Coolidge, 1998).

Research (Alternative) Hypothesis (Ha) (what is true)

A definite statement of the relationship between variables. Each null hypothesis corresponds to one or more research hypotheses.

"There is a significant difference in learning outcomes between formats of training (online or classroom setting)." - nondirectional

"Learning outcomes as indicated by test scores are significantly higher in an online setting than in a classroom setting." - directional

The research hypothesis is less conservative because it is more sensitive to differences than the null hypothesis. That is, it is more likely to show a test as significant.

Tails of a Test

Statistically Significant and Level of Significance

Statistically significant means that differences are attributed to systematic influence rather than to chance or random error. As in the following example, if the difference between the fear-of-fat scores of Australian and Indian students is statistically significant (p<0.05), we attribute the difference to cultural difference. However, the world is not perfect, and chance has effects on many things. A level of significance is associated with every statistical test.

If the findings are significant at 0.05, the translation is that there is 1 chance in 20 that any difference found was not due to the systematic influence. So the level of significance is the risk associated with not being 100% confident that the difference is due to cultural difference. In other words, it is an estimate of the probability that we are wrong when we say there is a difference between the two samples (the alpha level).

The alpha level (the level of risk or uncertainty) should be set based on the nature of the test. 0.1 (being 90% confident) is used for exploratory tests that allow for larger chance factors; 0.05 (being 95% confident) is used for many educational tests; 0.01 (being 99% confident) is used for science that needs higher accuracy and tolerates only small chance factors.

Type I and II Error, alpha, and p level

Another way to understand alpha and the p level is through Type I and Type II errors. A Type I error occurs when a researcher rejects the null hypothesis when it is actually true. This is considered a serious error because it can mislead people into believing in the effects of a treatment. For example, the researcher claims that a new drug works when it actually does not. A Type II error occurs when a researcher retains the null hypothesis when it is actually false. For example, the researcher concludes that a new drug does not work when it actually does.

Alpha is also the probability of committing a Type I error that the researcher is willing to accept. p<.05 means that the chance of a Type I error is less than 5 in 100.
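A simulation can make the Type I error rate tangible. In the Python sketch below, every simulated coin really is fair, so the null hypothesis is true in every experiment and each rejection is a Type I error. The decision rule (reject when 100 tosses give 39 or fewer heads, or 61 or more) is an assumption of this example, chosen as a two-tailed rule near alpha = .05; the seed and the number of experiments are arbitrary.

```python
import random

random.seed(690)  # fixed seed so the run is repeatable

TOSSES = 100
EXPERIMENTS = 2000
# Decision rule assumed for this sketch: call the coin "unfair" when heads
# fall at 39 or fewer, or at 61 or more.
LOW_CUTOFF, HIGH_CUTOFF = 39, 61

false_alarms = 0
for _ in range(EXPERIMENTS):
    heads = sum(random.random() < 0.5 for _ in range(TOSSES))
    if heads <= LOW_CUTOFF or heads >= HIGH_CUTOFF:
        false_alarms += 1  # the coin is fair, so every rejection is a Type I error

type_i_rate = false_alarms / EXPERIMENTS
print(f"Type I error rate over {EXPERIMENTS} experiments: {type_i_rate:.3f}")
```

The printed rate should land in the neighborhood of the chosen alpha: even when nothing systematic is going on, a small share of experiments still look "significant."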

Note: In statistics, we use "significant" and "nonsignificant" to report findings, but not "insignificant." "Insignificant" is a value judgment, not a statistical concept.

An "off-topic" but interesting example: cultural difference and fear of fat.

The mean Fear-of-Fat score of Australian students is 100; that of Indian students is 125. By eyeballing, we see there is a difference. But is this difference statistically significant? In other words, is the difference due to chance (e.g., sampling error) or due to the cultural difference between the two groups of students?

We answer the question through hypothesis testing. The null hypothesis is: "There is no significant difference between the two groups of students in fear of fat." We set a level of significance (known as "alpha") and then run an independent-samples t-test. When alpha = 0.1 (10%), there is a 10% probability that we are wrong when we say the results are not due to chance; the risk is 5% when alpha = 0.05 and 1% when alpha = 0.01. In other words, the significance level is the risk associated with not being 100% confident that your results are due to the intervention.

When interpreting the statistical results, we compare p (the probability of seeing a difference at least this large if chance alone were at work) with alpha (the level of risk). When p < alpha, we reject the null hypothesis and conclude that there is a significant difference between the fear-of-fat scores. The same procedure applies to testing a research hypothesis.
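The comparison of p with alpha is a simple rule, sketched here in Python. The function name `decide` and the p level of 0.03 are hypothetical, chosen only to show how the same result can be significant at one alpha level and nonsignificant at another.

```python
def decide(p, alpha):
    """Compare an obtained p level with the chosen alpha."""
    if p < alpha:
        return "reject the null hypothesis (significant)"
    return "retain the null hypothesis (nonsignificant)"


p_level = 0.03  # hypothetical p level, as statistical software might report it
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: {decide(p_level, alpha)}")
```

This is why alpha should be chosen before the test is run, rather than after seeing the p level.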

Types of Errors

Confidence Interval

Because a sample normally doesn't perfectly represent the population, inferential statistics identifies how likely it is (90%, 95%, or 99%) that the sample results represent the results that would occur in the population.

By generating hypotheses, we make probability statements that the results we see in samples would also be found in the population. Our confidence in these probability statements is set at 90%, 95%, or 99%. This level of confidence is what defines a confidence interval.

The confidence level corresponds to alpha. When alpha = 0.05, we are 95% confident about our probability statement.

Degree of Freedom (df)

"The value that is different for different statistical tests and approximates the sample size of number of individual cells" (Salkind, 2000, p.367).

"A complicated statistical term which in some statistical tests is roughly correlated with the total number of participants or observations but always slightly less. The df is actually based in the estimation of the standard deviation and indicates the number of numbers that are free to vary in estimation theory." (Coolidge, 2000, p. 153).

  • Correlation: df=N-2 (N: no of participants)
  • t test: df=N-1 (t for one); and N1-1 + N2-1 (tea for two).

The value of df should always be specified in statistical reports. The df is used when comparing a test statistic (the t value in a t test; the F ratio in ANOVA) to a critical value at the chosen alpha level (e.g., alpha = .05). The test is significant if the test statistic is larger than the critical value. As the t-table and F-table indicate, critical values vary with the degrees of freedom.
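The df formulas and the critical-value comparison can be sketched as follows. This is a minimal Python illustration: the function names are made up for this example, and the small lookup table holds only a few two-tailed critical t values at alpha = .05 copied from a standard t-table, so it is not a substitute for a full table.

```python
# Two-tailed critical t values at alpha = .05, copied from a standard t-table.
CRITICAL_T_05 = {1: 12.706, 2: 4.303, 5: 2.571, 10: 2.228, 18: 2.101, 20: 2.086, 30: 2.042}


def df_correlation(n):
    """df for a correlation with n participants."""
    return n - 2


def df_one_sample_t(n):
    """df for a t test on one group of n participants."""
    return n - 1


def df_independent_t(n1, n2):
    """df for a t test on two independent groups."""
    return (n1 - 1) + (n2 - 1)


def is_significant(t_value, df):
    """True if |t| exceeds the tabled critical value for this df."""
    return abs(t_value) > CRITICAL_T_05[df]


df = df_independent_t(10, 10)
print(df, is_significant(2.50, df), is_significant(1.90, df))
```

Notice in the table that the critical value shrinks as df grows, which is why the same t value can be significant with a larger sample and nonsignificant with a smaller one.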

Note: If statistical software is used to run a test (rather than hand calculation with formulas), one can draw conclusions from the p level obtained alone. The t and F tables are used when the results are obtained by hand calculation.

Resources

Word Document: m06_glossary.doc - Glossary of Key Terms used in the Research Essays


[updated:05.23.04]