Ken Winters, Ph.D.
University of Minnesota
Hello, everyone. This is Ken Winters. I'm a clinical psychologist and professor in the Department of Psychiatry at the University of Minnesota. The talk I'm giving for the webinar series on survey methods is the following: Purposes of surveys and questionnaires: Assessing prevalence, risk, and outcomes.
My talk will cover three main topics: general characteristics of measures and tools; an overview of core purposes focusing on epidemiological surveys, measuring risk and protective factors, and measuring outcomes; the final topic will be a review of sources of existing tools.
So, let's begin with an overview of general principles of accurate and user-friendly assessment tools and measures. Let’s first talk about characteristics of a good questionnaire and survey.
We can identify basic properties of a good test. There are four we are going to focus on: a clearly defined purpose; a specific and standard content; a standardized administration procedure; and a set of scoring rules. Let’s consider each of these.
A good test has a clearly defined purpose. To define the purpose, the test developer and the test user must be familiar with these questions: what is the test supposed to measure; what is the domain or the content it is measuring; who will take the test; who is the test intended for; and how will the test score be used. Is the test designed to compare the performance of test takers to each other, or is it designed to diagnose? Different types of items and scores are used for these different types of comparisons.
The second property has to do with the specificity and standardization of the content of the test. The content has to be specific to the domain the test is designed to cover, and also the content needs to be standardized, meaning that all test takers are tested on the same attributes or knowledge. This may seem obvious but there are some situations in which examinees may answer different but comparable questions. For example, many tests are available in more than one form. These alternate forms can be useful when people must be tested repeatedly, such as when you're looking at progress before an intervention and after an intervention. An alternate form is helpful so that you are not just measuring practice effects.
The third property has to do with making sure there is a standard administration procedure or procedures. It is critical that all test takers receive the same instructions and materials and that roughly the same amount of time is allowed to complete the test.
The fourth property pertains to having a standard scoring procedure. This procedure must be applied the same way to all individuals who take the test. Objective tests, such as multiple-choice tests, are relatively easy to score in a standardized and consistent way. It is more difficult to ensure consistency in the scoring of structured or semi-structured interviews, which we will talk about later.
In addition to design properties, a good test is also one that has favorable psychometric properties. These properties are determined by analyzing responses to test and questionnaire items during development of the test. There are two important psychometric properties of a good test: reliability and validity. Synonyms for reliability are consistency and stability. Just as a reliable person will be consistent in his or her actions and reactions, a reliable test will provide a consistent measure of current knowledge, skills, or characteristics. Without changes in one's knowledge or characteristics, an individual taking a reliable test can expect to obtain about the same score on a second administration. Why is this important? When test scores change we would like to be able to conclude that the test-taker has learned more or has changed somehow because of an intervention, for example. Without a reliable test we cannot determine what a change in scores really means.
Let's look at an example. Here are two items from a measure of extroversion. If these items are reliable we would assume that a person would answer them in the same way the first time he took the test and the second time, let's say a couple days later. But if the person was receiving some kind of intervention to help improve extroversion to make the person more socially outgoing, then the second time the person took the test we would hope to see a change in their answers in the expected direction.
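To make the idea of consistency across two administrations concrete, here is a minimal sketch in Python of one common way to summarize it: correlating the scores from the first and second administrations. The scores below are made-up numbers for six hypothetical respondents, not data from any real extroversion measure, and the function name pearson_r is just an illustrative choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary to compute a correlation")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical total scores for the same six people, a couple of days apart,
# with no intervention in between (assumed data for illustration only).
time1 = [12, 18, 9, 15, 20, 11]
time2 = [13, 17, 10, 15, 19, 12]

test_retest_r = pearson_r(time1, time2)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A correlation close to 1 would suggest that people keep roughly the same relative standing from one administration to the next, which is what we would hope to see when nothing has changed in between.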
The second basic psychometric property is called validity. Whereas reliability analysis indicates whether the test provides a consistent measure, that does not tell us what the test actually measures. Validity indicates whether the test measures what it was designed to measure. When you make a valid point in the course of a discussion you make a point that is relevant to the issue being discussed. When a test is valid it measures the test-taker characteristics that are relevant to the purpose of the test. Typically, there are two kinds of validity for a given test, content and construct validity.
Let’s turn to an example. Imagine there is a questionnaire that measures health habits. Scales would be valid for this questionnaire if they measured things that we understand to be related to health, such as exercise, diet, how to balance work and recreation, and one’s sleeping habits. But having items on this measure related to political preference would not be pertinent and thus would represent an invalid or irrelevant scale.
There are several ways of administering the survey. The choice between administration modes is influenced by several factors, including costs, coverage of the target population, flexibility of asking questions, the respondent’s willingness to participate, and response accuracy. Different methods create mode effects that change how respondents answer, and different methods have different advantages and disadvantages. Two types we’re going to focus on: self-administered, which contain scales and individual items, versus interviews, the structured and semi-structured types.
With self-administered formats, this usually means administration is going to occur at the individual level. The respondent or the research subject is going to complete your measure privately. Basically there are two ways this is done, either the paper and pencil version or a computerized version. Computerized versions are very popular. They provide great ease in administration and they help promote easy scoring. Within this type of format, let's focus on two basic subtypes of measures, one would be scale-based and the other based on individual items. Let's turn first to scale-based measures.
As mentioned earlier, measuring a domain with a scale needs to be done with reliability and validity. A good scale will have multiple items measuring the same domain or construct. For most scales this means somewhere in the range of six to ten items. It can be tough to measure something reliably and with validity with fewer than six items, and sometimes one will need several more items than ten to do a good job in measuring the domain of interest. It's typical for items within a scale to have the same response format, usually forced choice, such as true or false or agree or disagree, and the score for this scale is determined by performing some mathematical operation on the responses to those items, usually summing them up. Perhaps some of the responses are weighted; often, though, a good scale can simply sum up the items in an unweighted fashion.
Let's look at an example of some items from a good scale. These are taken from a psychometrically sound scale that measures readiness for drug treatment. Each of these items is worded in a relatively simple way and each of them has the same response options: strongly disagree, disagree, agree, or strongly agree. If we consider a person who is showing high readiness to seek drug treatment, then we could assume that he or she would answer items one, three, and four in the agree and strongly agree direction and would answer item two in the strongly disagree or disagree direction.
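The scoring just described, where item two runs in the opposite direction from the other items, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-to-4 coding of the response options and the choice to reverse item two before summing are assumptions made for the example, not the published scoring rules of the readiness scale.

```python
# Assumed numeric coding of the four response options (not from the scale's manual).
RESPONSE_VALUES = {
    "strongly disagree": 1,
    "disagree": 2,
    "agree": 3,
    "strongly agree": 4,
}

# Item two is worded so that disagreement indicates high readiness,
# so it is reverse-keyed before the items are summed.
REVERSE_KEYED = {2}


def score_scale(responses):
    """Unweighted sum of item responses, with reverse-keyed items flipped."""
    total = 0
    for item_number, answer in responses.items():
        value = RESPONSE_VALUES[answer.lower()]
        if item_number in REVERSE_KEYED:
            value = 5 - value  # 1 <-> 4 and 2 <-> 3 on a four-point format
        total += value
    return total


# Two hypothetical respondents.
high_readiness = {1: "strongly agree", 2: "strongly disagree", 3: "agree", 4: "strongly agree"}
low_readiness = {1: "disagree", 2: "agree", 3: "strongly disagree", 4: "disagree"}

print(score_scale(high_readiness))  # 4 + 4 + 3 + 4 = 15
print(score_scale(low_readiness))   # 2 + 2 + 1 + 2 = 7
```

Flipping the reverse-keyed item first means that a higher total always points in the same direction, here toward greater readiness for treatment.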
In contrast to a scale-based measure or instrument is the example of measuring some type of domain with individual items. This could mean either a single item or a few items are used to measure the domain of interest and because each item is scored individually one can have variability in the types of options for the respondent.
Let's look at some examples from a standardized survey. On the left we see an item that's trying to measure satisfaction with one's life. The question is, How satisfied are you with your life as a whole these days?, and the response options range from completely dissatisfied to the other end completely satisfied. So, the seven options provide variability along the dissatisfaction and satisfaction dimension. To the right are three items related to school performance. They're not simply no or yes items because the researcher was interested in an intermediate type response as well. So, take for example item 10, Have you ever had to repeat a grade in school? No; Yes, one time; Yes, two or more times.
The other type of administration format that we’re going to talk about is the interview format. Interviews are useful particularly when detailed information is needed about an individual, such as historical and current diagnostic information. Most popular interviews now are aided by a computerized version that assists with both administration and scoring.
Generally speaking, there are two types of interviews: structured and semi-structured. They differ in the degree of clinical judgment needed by the interviewer when rating or assigning symptoms and scores. Highly structured or respondent-based interviews direct the interviewer to read each question exactly as it is written and then decide whether the described symptom is absent or present based largely on the respondent’s answers. These interviews can be administered with acceptable reliability by well-trained staff without clinical experience.
Semi-structured or interviewer-based interviews require the interviewer to elicit an initial response from the interviewee then permit unstructured probing to determine whether a symptom is present. Users of these types of interviews are going to require more advanced training and assessment but this flexibility leads some to believe that semi-structured interviews provide greater opportunity to obtain richer information about the respondent and the issues and domains being assessed.
Let's look at a couple of examples. First, here is a structured format item. The interviewer would just read this question to the respondent: During the past month, did you feel very sad, blue, or down in the dumps to the point where you were not your normal self? The interviewer would just record the respondent’s answer as Yes or No; there would be no probing to determine details behind the answer.
So, then, let's turn to the other kind of format, a semi-structured format. Questions in such an interview might go like this: During the past month, what kind of mood were you in? Did you feel very sad, blue, or down in the dumps? Was this feeling not your normal self? The interviewer could probe with additional follow-up questions: Have others commented on your mood? Did you feel this way every day or nearly every day during the past month? It's typical for these kinds of interviews not to have just dichotomous response options, but to have some variation on that. For this one, I've shown you a common one: three options are possible for recording the respondent’s answers, True, Possibly true, or False.
Now we move to the next main topic area of this presentation, core purposes of measures. First, let's talk about epidemiological surveys.
So, these types of surveys are often used when studying large populations. Typically, they follow a self-administered format, perhaps computer based, and it's typical to have such surveys combine single items as well as scale-based items. A reliable and valid survey that probably is also user-friendly often follows some basic principles. One set has to do with the structure of the instrument. If it's too long, compliance is going to be diminished. A survey that takes more than 60 minutes can be a problem. The content flow of the items has to be logical and user-friendly. So, typically, surveys begin with straightforward demographic items and then move later into the more sensitive topics about personal behavior. Many of them have branching rules to help reduce response burden. At the item level, it’s important that both the item wording and the response options are relatively simple and kept at a sixth-grade or lower reading level, and it's important to avoid double negatives that can confuse people.
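To show what a branching rule can look like in a computerized survey, here is a small Python sketch. Everything in it is hypothetical: the item ids, the question wording, and the "ask_if" convention are invented for illustration and do not come from any particular survey. It also follows the ordering principle above, starting with a demographic item before moving to more sensitive ones.

```python
# A hypothetical five-item survey. An item with an "ask_if" rule is shown only
# when an earlier item was answered with the stated value.
SURVEY = [
    {"id": "age", "text": "How old are you?"},
    {"id": "ever_smoked", "text": "Have you ever smoked a cigarette?"},
    {"id": "smoke_days", "text": "On how many of the past 30 days did you smoke?",
     "ask_if": ("ever_smoked", "yes")},
    {"id": "quit_attempts", "text": "How many times have you tried to quit?",
     "ask_if": ("ever_smoked", "yes")},
    {"id": "sleep_hours", "text": "How many hours do you usually sleep?"},
]


def administer(survey, get_answer):
    """Walk the survey in order, skipping items whose branching rule is not met.

    get_answer is any function that takes an item and returns the response.
    Returns the list of item ids actually asked and the collected answers.
    """
    answers = {}
    asked = []
    for item in survey:
        rule = item.get("ask_if")
        if rule is not None:
            gate_id, required_answer = rule
            if answers.get(gate_id) != required_answer:
                continue  # branch past this item to reduce response burden
        answers[item["id"]] = get_answer(item)
        asked.append(item["id"])
    return asked, answers


# Scripted answers standing in for a respondent who has never smoked.
never_smoked = {"age": "16", "ever_smoked": "no", "sleep_hours": "8"}
asked_items, collected = administer(SURVEY, lambda item: never_smoked[item["id"]])
print(asked_items)  # the two follow-up smoking items are skipped
```

A respondent who answers "no" to the gate question sees three items instead of five, which is the kind of reduction in response burden that branching is meant to provide.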
Here are four items from a popular epidemiological survey. Each item has a fixed number of responses and the range of responses depends on the complexity of the question and the interest of the researcher. So, to the left, item five, Where did you grow up mostly?, 10 response options are offered. This is a fairly complicated question but the one below it, What is your present marital status?, can be captured with four response options.
A second core purpose of measures is to assess the risk and protective factors of the respondent.
Measuring risk and protective factors is commonly important in research studies. These factors often underlie the onset of a given problem or they may contribute to that problem’s course or the clinical outcome of something being studied. A given risk or protective factor can represent the same dimension and sometimes can be accurately measured as either the risk or protective side of the dimension. For example, somebody's values can be viewed or measured as conventional values which would be an asset, or perhaps the scale would be measuring nonconventional values and scoring it as a risk.
Research has shown that some variables are more accurately measured as an asset rather than a risk and vice versa. This may have to do with measurement issues, willingness to report, response bias, variability, et cetera. And so, it’s useful to make sure an instrument or scale has determined whether the asset or risk direction that's being measured is the best way to assess such a factor or behavior. I have one example here: self-esteem. One could technically measure it as either low self-esteem or high self-esteem, but it looks like from a lot of risk research with adolescents, for example, that measuring it at the low end, low self-esteem, provides a more accurate measure of health behaviors than if one were trying to measure high self-esteem.
After determining the necessary domains of interest within risk and protective factors the investigator is faced with the challenging task of locating and selecting from available measures for assessing that specific dimension. The task of locating appropriate measures has been considerably facilitated by the emergence of several information service sources and resources for this purpose. We will review them a little later in the talk.
The third core purpose of measures that we're going to talk about is measuring outcomes.
Now we turn to the third core purpose of measurement, that being to evaluate outcome. Outcome measures are intended to help measure what was accomplished by the particular program or research, what was done to accomplish those outcomes, and what specifically has been accomplished. There is a whole field of program assessment and evaluation research but I'm going to provide a straightforward basic four-step model of program evaluation. It is called the logic model.
Our interest is with the four circles at the bottom. All of them are related to measurement tasks to help with evaluating outcome. Let’s talk first about Step One, measuring goals.
So, in Step One, the main tasks are to identify the primary goals of the project, what is it that you hope to accomplish, and then also identify the target groups, the target individuals that are going to receive the program, and then what are the short- and long-term outcomes that are desired. So, the measures need to be organized around these three themes: goals, target group, and desired outcomes.
In Step Two, the focus is on process assessment.
Process assessment involves activities that are undertaken by the program staff or the researchers to accomplish the objectives. The purposes of process assessment are to help monitor the activities around which the program is organized. This can help the investigator make sure that parts of the planned program are being undertaken and not being neglected, and it can help provide information to manage the program or to change and add to activities. Another purpose is that this information can help provide data for accountability to any parties or groups that are interested in your efforts, such as funding sources. Actually, this information would be needed for progress reports. Also, process assessment provides information relevant to why the program worked or did not work. By providing and measuring information on what was done and what was accomplished you can try to identify reasons for why your outcomes were achieved or not achieved.
The next two steps, Steps Three and Four, pertain to assessing the outcomes and impacts of the program. For Step Three, we are referring to the short-term outcomes.
Basically, what is involved in this step is looking at the desired outcomes that are stated in Step One and looking for evidence regarding the extent to which the outcomes were achieved. Evidence here could include changes in the number of referrals, an increase in the number of students attending the activity, or increased publicity about the health issue that you're trying to address.
Step Four is an extension, often called impact assessment. Here, the goal is to measure the longer-term or more significant outcomes of the program.
So, let's take for example a drug prevention program. The ultimate effects or the areas one would like to impact might include reduction in overall drug use, reduction in the rate of new students starting to use drugs, a decrease in DUI arrests, and a decrease in school disciplinary actions. Also considered here would be the reduction of risk factors related to the target behavior. So, for a drug prevention program this might include reduction in school absences and a drop in school dropout rates. Information about other impact areas can be obtained from archival data from school and hospital records, for example.
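As a simple illustration of how impact indicators like these might be summarized, here is a short Python sketch that computes the percent change from a baseline count to a follow-up count. The indicator names echo the examples above, but all of the counts are invented for illustration; they are not results from any actual program.

```python
def percent_change(baseline, follow_up):
    """Percent change from baseline to follow-up (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percent change")
    return (follow_up - baseline) / baseline * 100.0


# Hypothetical (baseline, follow-up) counts, e.g. taken from archival records.
indicators = {
    "students reporting any drug use": (120, 90),
    "DUI arrests": (40, 30),
    "school disciplinary actions": (200, 170),
}

changes = {name: percent_change(before, after)
           for name, (before, after) in indicators.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

A figure like this describes what changed over time; on its own it does not show that the program caused the change, which is one reason the process information from Step Two is worth collecting as well.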
The final topic for this presentation is to provide a summary of sources of existing tools.
Now let's turn to the last segment of this talk, sources of existing tools. Literally thousands of psychological and behavioral and health tests and measures are available today; so many that obtaining information about published tests and measures can be a problem. Let's talk about some relevant sources for those in the biomedical research field. A new source available for the public is called PhenX Toolkit. The website is phenxtoolkit.org. This web-based service is free and available to the public and is intended for use by investigators who are designing or expanding health-related studies. The toolkit consists of a catalog of high priority and scientifically rigorous measures for use in research efforts.
The PhenX Toolkit provides the following information: a brief description of each measure; the protocol or protocols for collecting the measure, with supporting images and tables; the basis and rationale for selecting that protocol for inclusion in the toolkit, such as why would such a measure be important for researchers; details about the personnel, training and equipment needed to collect the measure; other information such as any special procedures for collecting the information from the measure; and finally, any selected references that point you to the literature and provide documentation of the instrument’s psychometric properties.
This slide shows you the 21 separate domains which cover 295 measures that are included in the PhenX Toolkit. Domains range from alcohol, tobacco, and other substances to nutrition and dietary supplements, to oral health, to speech and hearing domains.
For those interested in researching prevention, I want to point you to an excellent resource, mentorfoundation.org/about_prevention. Several resources, including measures for building an evaluation plan or strengthening your current program evaluation are provided. Each resource is summarized and then followed by either its web-based source or the link to a supporting document. All of these resources are free. Some examples are the Center for Substance Abuse Prevention’s Prevention Toolkit; the Logic Model, described by the Kellogg Foundation; and a library of multiple sources from CDC.
A final set of sources is provided on this slide. Here are four reference volumes that are available to help you locate and learn about tests. Each resource is free. One of them is particularly strong, the first one, Mental Measurements Yearbook. This is published every few years and each volume contains comprehensive critical reviews of tests. In 1990, the reviews also became available online. For those interested in psychologically based tests, I suggest the last entry there, PsychINFO at www.psych.org.
That concludes this talk. Thank you for your attention today. I hope you’ve found this presentation informative and useful. Feel free to e-mail me if you would like more information or have some questions.