Gordon Willis, Ph.D.
National Cancer Institute
Developing and Testing Survey Questions, by Gordon Willis, National Cancer Institute, NIH. In this talk I present a brief overview of methods for developing and testing survey questionnaires of the type commonly used in population-based research at the National Institutes of Health. The principles I present will apply to a range of questions and are based on 25 years of experience with surveys involving knowledge, attitudes, risk behaviors, and health status.
Questionnaires are a popular way to collect data in health research. Always remember, though, that survey data ultimately come from people and are therefore limited due to errors in the self-reporting of information.
For example, this slide, summarizing demographic breakdowns in tobacco use, has the title ‘Prevalence of Current Smoking.’ It depicts differences in current smoking behavior as a function of region in the U.S., age, and educational level. We use this type of title a lot and it’s not necessarily wrong, but it is somewhat of an oversimplification. In truth, we don’t know the actual prevalence of smoking.
The title would be more accurate if expressed as the ‘Estimated Prevalence of Current Smoking as Determined through Self-report.’ These estimates could be inaccurate because of a range of survey error types, including sampling error, non-response bias, and in particular, response error.
This talk will focus on just one of these, response error; that is, the difference between a true score and the one we obtain through survey self-reporting.
In particular, I will address the issue of how to go about minimizing response error that is associated with the questionnaire instrument; that is, the survey questions may have a number of defects that cause us to obtain incorrect information and that we would like to be able to avoid or to correct.
Although the job of putting together a questionnaire is sometimes seen as straightforward, it can be a fairly involved process. What we try to do is strike a balance between oversimplification, on the one hand, and being overwhelmed by complexity, on the other.
A good general developmental sequence is provided by Aday and Cornelius, who break down the questionnaire design and evaluation process into a series of steps. Based on this, I put together the sequence shown in this slide.
The first step is to consider your analytic objectives: what kind of data or statistical estimates do you want at the end of the process? It can help to put together empty table shells that make clear the cross-tabulations, or estimates, that you want once the data are statistically analyzed. This is much better than simply saying ‘we want information on X, Y, and Z,’ which is a common, but extremely vague, way for researchers to begin.
A simple example is given in this table, concerning inquiries about smoking status and checking for oral cancer by a dentist or a doctor. The investigators may decide to collect information on the frequency of preventive care visits within the past year to both a dentist and a doctor. They may also want to know how many of these visits included an inquiry concerning current smoking and a check for oral cancer. Putting these elements into an empty table will clarify what survey questions need to be developed in order to obtain the desired information.
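As a rough illustration, an empty table shell of the kind just described could be sketched in Python as follows. The row and column labels are assumptions drawn loosely from the example, not the actual slide contents, and the helper names are hypothetical.

```python
# Hypothetical table shell for the dentist/doctor example.
# Labels are illustrative assumptions, not copied from the slide.
PROVIDERS = ["Dentist", "Doctor"]
ESTIMATES = [
    "Preventive care visits, past year",
    "Visits with inquiry about current smoking",
    "Visits with check for oral cancer",
]


def make_table_shell(rows, columns):
    """Return a nested dict in which every cell is still empty (None)."""
    return {row: {col: None for col in columns} for row in rows}


def render_shell(shell):
    """Format the shell as plain text, drawing empty cells as blanks to fill."""
    columns = list(next(iter(shell.values())))
    width = max(len(row) for row in shell)
    header = " " * width + "".join(col.rjust(10) for col in columns)
    lines = [header]
    for row, cells in shell.items():
        filled = "".join(
            ("____" if value is None else str(value)).rjust(10)
            for value in cells.values()
        )
        lines.append(row.ljust(width) + filled)
    return "\n".join(lines)


shell = make_table_shell(ESTIMATES, PROVIDERS)
print(render_shell(shell))
```

Each blank cell in the printed shell corresponds to an estimate that some survey question must eventually supply, which is what makes the shell useful as a planning device.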
One of the benefits of this step is that we can pause to consider a very basic question – Is a survey questionnaire even the best way to get these data? Maybe we are asking something that people can’t actually give us. For example, many people may not know if their dentist has checked their mouth for oral cancer if the dentist doesn’t mention this explicitly. Maybe we should instead be conducting a survey of dentists. Or, we may be asking something that can only be addressed through medical records. Often, survey researchers fall into the old trap best expressed by the saying “When all you have is a hammer, everything looks like a nail.” A survey questionnaire can be a good tool, but it’s not the only one in our toolbox.
Following this initial consideration of measurement objectives, a second step is to decide what we want to ask about more specifically. Before scripting survey questions, make a list of the topics to be covered in some detail. For example, visits to a dentist’s office in the past 12 months; types of services received; whether these were provided by a dentist, dental hygienist, or someone else; out of pocket expenses for services; and so on. These items are not yet survey questions, but the key conceptual elements that we want to collect information on.
Once that initial list is put together, then go to step 3 to create actual question wording, using what could be considered good questionnaire design principles. This is something I will focus on extensively in this talk.
Then, after crafting the questions themselves, one can evaluate or appraise them for classic questionnaire design defects. This is largely a process of expert review, and benefits from the contributions of others who are expert either in the design of survey questions, or in the topic being surveyed.
A final, and very important step, involves evaluating the questions by testing them out on real people. This can be done through the practice of Cognitive Interviewing, which I will describe, and several other methods that I will simply mention. That’s an overview of the process. I will next focus in a more detailed way on steps 4 and 5 in particular, appraisal and empirical testing.
Through our appraisal and testing procedures, there are particular types of problems we are attempting to ferret out and to fix before fielding our survey. The first consideration involves the specifics of administering it, in particular, survey administration mode. I will focus on mode because the problems that we tend to see occurring in survey questionnaires are largely mode-specific.
There are several modes that can be used. I have listed them here: telephone, in-person, mailed paper questionnaires, Internet, and increasingly, the use of personal digital assistants, smartphones, or other handheld computerized devices. Each of these has benefits and drawbacks. Increasingly, self-administered modes for population surveys seem to be becoming much more common than our interviewer-administered techniques.
Telephone surveys, in particular, have become very difficult, due to refusal of household members to even answer a landline phone, and to the proliferation of cell phones, which are not generally included in survey samples.
We can instead put together a questionnaire to be administered in-person, by a live interviewer. But that tends to be very expensive if attempted door-to-door and tends to work best in a more limited environment, such as a clinic.
We can avoid the need for an interviewer altogether by developing a self-administered questionnaire either in mailed paper form, or delivered by a computer, especially over the Internet. Paper questionnaires have the disadvantage of including skip patterns or sequencing instructions which some respondents find difficult to follow.
Internet, or web-based surveys, do avoid this problem, as the computer can handle all those skips. But, there is a strong feeling in the survey world that Internet surveys are simply not ready for prime-time given the fact that population coverage, the percentage of individuals who have access to the Internet and regularly do access it, is not anywhere near 100 percent. However, for certain subgroups or for special populations, such as students, the Internet can be a useful survey administration tool.
Finally, the use of smartphones, or PDAs, is in its infancy but may become more prevalent over time.
Turning now to the appraisal of the effects that survey questions have on respondents, there are several features of these questions that can give rise to error. Several cognitive aspects of the survey response process that we need to consider are captured by the Tourangeau Four-Stage model.
First, there is the possibility that the questions are simply not well encoded or understood. They therefore produce sources of comprehension or communication error. This tends to be a very large category of defects. For example, asking the question, “Have you ever received care from an oral surgeon?” depends on the respondent knowing what an oral surgeon actually is.
Second, even if respondents understand us, we may ask for information they do not have, either because they never had it or because they did but have forgotten. Asking the number of times one has ridden in a passenger airplane may be easy enough to understand, but that doesn’t mean that someone who travels a lot will know the exact or even approximate answer to that question.
Then, there are decision-related issues. Even if the respondent understands the question and knows the answer, that doesn’t necessarily mean that he or she will decide to give us that answer. A classic case would be asking the respondent about the number of sex partners he or she has had in the past 12 months when their spouse may be present listening intently to the interviewer-administered interview.
Then, there can be problems with response formatting, where the answer that is expected does not match the response that’s in the person’s head. We can ask about someone’s health and whether it is excellent, very good, good, fair, or poor, but the person may respond with something else, such as “it’s ok”, or “I wish it were better”, or “better than some, worse than others”, or even “none of your business.”
Beyond these major cognitive issues, there are a range of problems that exist within survey questions. We can identify these basic difficulties by making use of a systematic checklist system that brings potential problems to our attention. The one that I developed, along with Judy Lessler at Research Triangle Institute, is called the Question Appraisal System, or QAS. This system includes seven major categories of pitfalls with a number of subcategories. The QAS is available online at the listed NCI website address, as a simple 2- to 3-page form along with an accompanying manual.
This slide illustrates what the QAS checklist system looks like. It assumes interviewer-based administration, but most of these categories apply to self-administered questionnaires as well. In turn, we look for problems with:
- Instructions to survey respondents
- Question clarity
- Underlying assumptions that may be untrue
- Knowledge or memory limitations
- Sensitive or biased questions
- Problems with question response categories
- And formatting or question ordering problems.
Again, the QAS manual contains in-depth information on each of these problem categories with suggestions on how to rectify them.
Problems with survey questions are best illustrated by means of targeted examples, and there are several problem types that are worth highlighting because they are so common. One of the QAS subcategories concerns questions that are unclear and therefore not well understood, because they are too long, dense, and complex. Consider the following question, which was drafted for a national survey but thankfully never fielded. It asked:
“The last time that you were seen by a doctor, nurse, or other health professional, as part of a regular medical check-up, did you receive any tests specifically designed to diagnose the presence of certain types of cancer?”
When this question was tested on real people, a common response was simply “what?” However, you can get the idea just by reading it that it is long, convoluted, and contains too many sub-concepts. In fact, a good trick in reviewing questions that are meant to be interviewer-administered is to read them aloud. If they sound long, awkward, or even nonsensical when we listen to ourselves, that’s probably the way they’re going to sound to our survey respondent.
A ready solution to such problems is called decomposition. We break the question up into smaller chunks, and feed them to our respondents one chunk at a time.
Decomposing the question into its component parts does make it easier to swallow cognitively. So, we can break the previous question up into several parts, asking in turn when the respondent last got a medical check-up. Then, if he or she did so, whether they received cancer screening tests, and if they received any, what type they were.
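To make the idea concrete, here is a minimal sketch of how the decomposed sequence might be represented, assuming three shorter questions. The wording, the question ids, and the answer codes are all hypothetical and are not the wording that was actually fielded.

```python
# Hypothetical decomposition of the long cancer-screening question.
# Question wording and answer codes are illustrative assumptions.
QUESTIONS = {
    "Q1": "When did you last have a regular medical check-up?",
    "Q2": "At that check-up, did you receive any tests for cancer?",
    "Q3": "What type of cancer tests were they?",
}


def questions_asked(answers):
    """Return the question ids an interviewer would read, in order.

    `answers` maps a question id to the respondent's reply. Each later
    question is asked only if the earlier reply makes it applicable.
    """
    asked = ["Q1"]
    if answers.get("Q1") == "never":
        return asked  # no check-up, so skip the screening questions
    asked.append("Q2")
    if answers.get("Q2") != "yes":
        return asked  # no cancer tests, so skip the follow-up on type
    asked.append("Q3")
    return asked


print(questions_asked({"Q1": "within past year", "Q2": "yes"}))
print(questions_asked({"Q1": "never"}))
```

The two early returns are the skip patterns that decomposition introduces. An interviewer or a computer can follow them easily; on paper they have to be spelled out as written instructions.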
Of course, even if we succeed in giving the question enough clarity, that still doesn’t mean the person knows the answer or will be willing to share it with us. However, making the question easier to understand is a vitally important first step.
Another consideration, always, is survey administration mode. The original question form likely functions much better under self-administration, because literate respondents can normally read something that’s more complex than they can listen to if it is being read to them. Further, decomposing also tends to introduce skip pattern complexity, which we want to avoid for paper questionnaires. For interviewer-administration, decomposition of the questions would therefore be advisable; perhaps not so much for a paper questionnaire, however.
A related problem that impedes comprehension involves the case where particular terms or phrases are difficult to understand, even if the question as a whole is of reasonable length and not overly complex.
Members of general populations of patients, or household respondents, tend not to carry around in their heads accurate definitions of terms commonly used in the medical and health fields. They may not know what an ‘inpatient’ or an ‘outpatient’ is, and they may not know the difference between a colonoscopy and a sigmoidoscopy.
Solving problems of complex terminology would seem straightforward. We substitute simple language for complicated terms. So, instead of asking about inpatient status, we can ask whether the person was admitted to the hospital and stayed overnight. We can also take pains to describe the colonoscopy or the sigmoidoscopy, and this may help substantially. In the case of the sigmoidoscopy, when the patient was fully awake during the procedure, the simple mention of the use of a tube often produces instant recognition of what it is that we’re talking about.
There is a catch, however, to thinking that using simple, everyday language will necessarily make all things right. Unfortunately, the way we talk in normal everyday English or Spanish, or likely any other language, tends to be hopelessly vague for purposes of survey question wording. So, many questions that are perfectly reasonable to ask a friend or a neighbor will create problems in a survey context. For example, asking whether one has been a ‘regular smoker’ leaves the definition of what this is wide open, and totally up to the respondent. Instead we tend to ask, for example, whether he or she ‘has ever smoked every day for 30 days.’
Second, asking whether anyone in the family has ‘a car’ leaves vague the definition first of family, which could be immediate or extended, and of what exactly a ‘car’ is. Should pickup trucks be included? What about vans? Even motorcycles? And what do we mean by ‘having’ a car? Does that mean we own it, rent it, borrowed it? As such, we tend to ask instead whether ‘anyone in the household now owns or leases any type of motor vehicle.’ This is not the way people tend to talk, but it is the way to write survey questions.
Finally, consider the example “Do you think that headaches can be effectively treated?” The problem with this concerns WHOSE headaches are being referred to. Chronic headache sufferers will sometimes feel that, oh, yes, headaches in general can be treated, but not theirs. We need to specifically indicate whether we want them to be talking about ‘most people’s headaches,’ ‘headaches generally,’ ‘your own headaches,’ or something else.
Because the issue of lack of clarity is so central to health survey questionnaires, I will present one further, real-life example. I once evaluated a question asking simply enough, “Have you had your blood tested for the AIDS virus?”
It became evident that ‘having your blood tested’ has two interpretations concerning the degree of active decision-making involved. This could mean either “I took the initiative to have my blood tested,” or “my blood was tested even though it wasn’t my idea.” Soldiers being inducted into the military, for example, fit the latter category.
In this case, I didn’t necessarily know how to rephrase the question to be clearer, as we didn’t have sufficient clarity concerning the question objectives. So, I asked my collaborators at the sponsoring agency, who responded that they simply wanted to know whether HIV testing had been done. As a result, the wording used in the actual survey asked “As far as you know, has your blood ever been tested for the AIDS virus?”
This example brings out a critical point. Evaluating and testing our draft survey questions often has the critical effect of revealing that our own objectives may be somewhat vague and need to be further thought out and better specified. As such, the simple linear flow I presented earlier may sometimes be more complicated. We need to return to previous steps and reconsider certain issues.
Let’s move on to another pitfall, asking for information the respondent likely cannot easily come up with.
Note that most surveys require, or at least expect a fairly instant response. The respondent may remember the answer an hour from now, but that’s too late for our purposes. So, we instead need to consider what they know right now, or can easily come up with, even if we give them the option of looking at records.
This question, asking oncologists about enrollment in clinical trials, puts a very high burden on the respondent. The oncologist must be able to determine the number of women with whom he or she has had discussions regarding treatment trials, totaled over the past 12 months, separately for all cancer treatment trials and for breast cancer treatment trials, and both for all women and for Asian American women only.
This is likely not something that the physician has access to, either in memory, or in any type of existing medical record. So, consider when it’s the case that our respondent will look at the question and simply think something like ‘give me a break, I can’t do this.’ We want to maintain credibility with our respondents and keep them motivated to answer carefully.
A further very important category can be labeled Logical Problems. The question may in fact be fine, but just not for the person being asked it at the time. Sometimes this is related to inappropriate assumptions and it’s often related to cultural factors.
In one study, tobacco researchers attempted to ask about switching from a stronger to a lighter cigarette. This works well for American smokers, who are used to relying on listings of tar and nicotine content on cigarette packs, and for whom the notion of a ‘light’ cigarette makes sense. However, it was found that for some Asian immigrants, the cigarettes they had smoked in their country of origin were not labeled in any of these ways. Hence, it was not possible for them to answer the question.
Often, such problems can only be spotted and resolved by having adequate knowledge of the circumstances in which your respondents have lived. In this sense, questionnaire design is sometimes similar to cultural anthropology.
A final large category of problems that is described in the Question Appraisal System checklist is that of question formatting. Self-administered questionnaires should avoid crowding, as it has been demonstrated that respondents are more likely to respond to a questionnaire that has room in the margins, has font that is comfortably large, and has an uncrowded look, even if this means that we have more pages overall in the questionnaire instrument.
Paper-based questionnaires sometimes have complex sequencing instructions as I’ve mentioned. For example, saying ‘If yes, go to Question 12.’ These are sometimes overlooked or incorrectly followed by respondents.
One effective approach to self-administered paper-based questionnaire formatting is to follow a format developed by Don Dillman at Washington State University. He advocates the use of two columns, with a heavy line in between these to indicate the respondent is supposed to go down one column, and then the other in turn. Further, the skip patterns are indicated both by visual cues, that is arrows next to the appropriate question, as well as written instructions, such as “Go To Question 3.”
For interviewer-administered questionnaires, remember that the respondent is listening, but generally not reading along. As such, some conventions that may work in self-administered formats, like the use of parentheses, do not apply for interviewer administration.
Further, long lists of response categories can be scanned when they are in self-administered questionnaires, but under interviewer administration each one must be read aloud, which can get tedious and makes the response choices difficult to remember. In such cases, we can sometimes give a list of the response categories to the respondent to look at, even if these must be mailed ahead of time, as has been done for some telephone interviews.
Finally, a pitfall to avoid is the use of what can be labeled Response Category Mismatch.
To illustrate response category mismatch, consider the question “How do you feel about your present weight?” Left open-ended like this, the respondent may not know that she is supposed to answer with either “Overweight”, “Underweight”, or “About the right weight.” Instead, we may get responses like “I feel just fine about my weight, thank you,” which clearly is not what we are looking for. Instead, we can help the respondent out by reading the response categories aloud to them.
A related problem concerns the use of unread response category ranges. Consider the question, “In the past ten years, how many times have you had a headache severe enough to cause you to stay in bed for more than half a day?” Asking this in a totally open-ended manner forces respondents to try to report an exact value. However, a person struggling to decide between 20 and 30 is effectively wasting time and effort, as the interviewer is only going to enter “more than ten times” in either case. Again, reading the response category ranges serves to better communicate exactly how precise we would like the respondent to be.
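A small sketch can show why the exact value does not matter here. Only the ‘more than ten times’ category comes from the example; the other ranges, and the function name, are assumptions made for illustration.

```python
# Hypothetical response ranges; only "More than ten times" is mentioned
# in the talk, the other categories are assumed for illustration.
def code_headache_count(count):
    """Map an exact reported count onto the range the interviewer records."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if count == 0:
        return "None"
    if count <= 5:
        return "1 to 5 times"
    if count <= 10:
        return "6 to 10 times"
    return "More than ten times"


for reported in (0, 3, 20, 30):
    print(reported, "->", code_headache_count(reported))
```

Reports of 20 and 30 end up in the same category, so any effort the respondent spends choosing between them is lost once the answer is coded.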
Excessive length is a very important pitfall for Federal surveys. Long surveys cost more and may get lower response rates, especially if we need to tell the respondent up front how long this is expected to take.
So, how long is too long? It depends a lot on the nature of the questionnaire and on the respondent we are talking to. I try to keep face-to-face interviews to around 30 minutes for an average length interview. I don’t generally like conducting telephone or web surveys that are longer on average than 15 minutes; 10 would be even better.
I will now turn to the use of empirical testing to evaluate survey questions. Mainly I will describe Cognitive Interviewing, as that has become the most popular and well entrenched form of evaluating survey questionnaires for Federal surveys. Metaphorically, cognitive interviewing involves looking at the part of the iceberg that is submerged and that doesn’t readily appear until we look beneath the surface.
I will only touch the surface of a technique that has its own fairly elaborate set of procedures and considerations. To find out more, there is an older, but still useful, training manual on the NCI website listed. For a full treatment, you can consult my book “Cognitive Interviewing: A Tool for Improving Questionnaire Design,” published in 2005.
The cognitive testing process involves several steps, normally involving the production of a draft survey questionnaire that has undergone the type of appraisal process that I have talked about so far.
Through cognitive interviewing, we are searching for more subtle problems that emerge once we try the questions out on small samples of real people. We do not consider these subjects to be statistically representative of the larger targeted population, however, because our samples for cognitive testing are generally tiny and are, in any event, not statistically sampled. Rather, we look for variation in characteristics, such that we can cover as wide as possible a range of individuals and life situations.
These groups of subjects, sometimes no more than 10, are interviewed individually by specially trained cognitive interviewers. Interviews may be conducted in an established cognitive laboratory, as at the CDC National Center for Health Statistics, Census Bureau, or Bureau of Labor Statistics. Or, they may be conducted in a location where we need to go to find the types of people who will be helpful in identifying the underlying problems with our survey questions. Over the years I have traveled to a wide variety of locations to conduct cognitive interviews, including subjects’ homes, homeless shelters, various types of health clinics, such as community health centers, and elderly centers.
When conducting the interviews, we administer survey questions, usually by reading them aloud, as in an interviewer-administered survey questionnaire. However, in addition, we make use of two specialized techniques. First, we ask our subjects to Think-Aloud as they are answering and to simply say everything that comes to mind. Second, we ask our own Verbal Probes to gain additional insights into how people are thinking about and answering the tested questions.
The think-aloud activities within the cognitive interview are up to the subject. The subjects verbalize their thoughts to varying degrees and the cognitive interviewer simply, carefully listens.
Verbal probing, on the other hand, requires a lot of active control by the interviewer. The key to cognitive probing is to figure out how to ask probe questions that end up being informative.
There are several varieties of such probe questions. First, targeted comprehension probes can be used to investigate terms, such as the technical terms I mentioned earlier, that may not be well understood. For example, in one study, I evaluated a survey question asking whether anyone in the household now has dental sealants. Several subjects quickly answered, “Yes.” But when I probed by asking “What does the term ‘dental sealant’ mean to you?”, they answered by describing fillings for cavities, rather than protective coverings placed on the teeth to prevent decay. This is a good example of the use of cognitive testing to identify what are labeled ‘silent misunderstandings’ that simply do not come across unless we probe for them.
Secondly, we can also ask our subjects to paraphrase the question, or recite that question in their own words. If they produce something very different from what we are attempting to ask, that tends to be a problem.
Third, confidence judgments are used to investigate further how much people really know about what they are reporting. Often a seemingly confident response will mask a less than confident underlying judgment. For various reasons, survey respondents will tell us, for example, that their health insurance covers a list of specific services, such as mental health coverage. However, they’ll also tell us, once we probe, that they are simply guessing or assuming that they have such coverage.
Recall probes, next, ask directly how someone knows something. If we have just asked how many times they have gone to the dentist in the past 12 months, and they say three, that is a good opportunity to investigate where this number came from. Often, when we probe, we are told, for example, that, to quote, “I can remember going once to get a cavity filled, and I also generally go every six months, so that would make about three.” What this tells us is that there is a good degree of estimation going on, as opposed to actual recall of three visits. Finally, a very good generic probe is simply asking “How did you arrive at that answer,” and then, to listen.
I will provide several examples that hopefully will bring to life how probing can work to identify problems with tested survey questions.
I will illustrate one classic case of a survey question that presented multiple problems when evaluated through cognitive testing. Several decades ago, researchers at the National Center for Health Statistics were presented the task of evaluating survey questions on digestive disorders. One of these questions asked, “In the last year have you been bothered by pain in the abdomen?”
In this case, the sponsor, or client, felt that the question was likely to be well enough understood. The cognitive interviewing staff had some doubts, however. First, it seemed that ‘last year’ could be vague. So, this was investigated by asking subjects the probe “What time period were you thinking about, exactly?”
A small number of interviews revealed that ‘the last year’ had three distinct interpretations. First, this could mean ‘the last calendar year.’ So, for an interview done in 2012, the person may be led to think about the year 2011. Second, this came across to some as ‘at some time within the current year,’ that is, since January 1. Still others thought that the last year meant the past 365 days, or 12 months, counting back from the present. This third interpretation was the one desired, but that was evidently not clear.
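The three readings can be laid out as date windows. The sketch below assumes a hypothetical interview date in 2012, the year used in the example; the function names and labels are invented for illustration.

```python
from datetime import date


def one_year_before(day):
    """Same calendar day one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def reference_periods(interview_date):
    """Return (start, end) dates for each reading of 'the last year'."""
    prior_year = interview_date.year - 1
    return {
        "last calendar year": (date(prior_year, 1, 1), date(prior_year, 12, 31)),
        "current year to date": (date(interview_date.year, 1, 1), interview_date),
        "past 12 months": (one_year_before(interview_date), interview_date),
    }


# Hypothetical interview date in 2012, matching the year used in the talk.
for label, (start, end) in reference_periods(date(2012, 6, 15)).items():
    print(f"{label}: {start} to {end}")
```

For a mid-year interview the three windows barely overlap, so the same respondent could truthfully answer ‘yes’ under one reading and ‘no’ under another.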
As a second cognitive issue, the phrase ‘bothered by pain’ also seemed as though it could be interpreted in various ways. Again, we can investigate these by simply asking “What does ‘bothered by pain’ mean to you?”
This produced some very interesting responses, and I have seen this same phenomenon when I have tested this question. For example, a retired marine may say, “Well, it hurt like hell, but we were trained to be tough, so I didn’t let it bother me.” This is of course a red flag, because our objective is to find out about the experience of pain and not the person’s psychological reaction to it, or whether the pain additionally bothered them or not. In this case, the inclusion of the element ‘bothered by pain’ seemed to be misleading and extraneous.
Finally, as yet a third potential cognitive issue, the cognitive interviewers simply suspected that the comprehension of the term ‘abdomen’ may not be uniform. To assess this, they used a simple but elegant test.
This diagram, depicting a human torso, partitioned into 19 numbered sections, was presented to around a dozen subjects. They were then asked, “Where is your abdomen?” It turned out that no two people chose the same set of regions, providing adequate evidence that asking about ‘the abdomen’ is probably not a good idea.
So, these findings are all interesting, but what do we do about them? In this case, the cognitive interviewing team used these findings to propose an alternative approach. First, instead of asking about ‘the last year,’ they made clear that the relevant reference period was ‘the past 12 months.’
Second, they eliminated the phrase ‘bothered by pain’ and just asked about whether one had pain.
Finally, instead of asking about the abdomen, they were able to supply a diagram for the survey respondent, as this question was for an in-person interview that allows such use of helpful devices.
Note that it is not guaranteed that this version will be perfect, or that it won’t create additional problems. That is why an optimal approach is to conduct additional cognitive testing, now using this version on another round of subjects. The practice of conducting multiple rounds in this way, called iterative testing, is a major positive feature of the cognitive interviewing approach.
A question that skeptics sometimes ask is, how much difference do the changes we make based on cognitive interviewing really make? Sometimes our wording modifications are subtle, and it has been argued that although this may influence the responses we get, we also need to realize that a self-report survey is a blunt instrument and that we are simply looking to get responses that are in the ballpark. That is, a few percentage points of difference here and there may not be a big deal. So, the additional time, effort, and energy required to do cognitive testing may not really be worth it. My colleague Susan Schechter and I tested this proposition once, and I will present one piece of that research.
The issue concerned asking about duration of strenuous physical activity over the course of a typical day. The initial version that we tested asked, “On a typical day, how much time do you spend doing strenuous physical activities such as lifting, pushing, or pulling? Would you say ‘None,’ ‘Less than 1 hour,’ ‘1-4 hours,’ or ‘5 or more hours’?” Cognitive testing suggested a problem. Subjects gave answers like ‘2 hours,’ but when we probed, it turned out they typically worked in an office and did nothing more strenuous during a normal day than perhaps reloading the photocopy machine. It appeared that the question presented a subtle form of bias. By asking ‘how much time’ one does this, the implication is that one in fact does. So, who wants to answer with ‘zero’ and look like, as one subject put it, a total couch potato?
Our use of iterative cognitive testing then allowed us to establish that an alternative version appeared to reduce over-reporting. That is, when we asked whether someone spent any time doing such activities, they tended to say things like ‘not really, I work in an office.’ This seemed to be a more reasonable and accurate response.
So, our recommendation was to reduce this bias by breaking the question up, first asking whether one spends ANY time doing strenuous physical activities and getting a ‘yes’ or ‘no’ answer. Only if the person says ‘yes’ do we continue to the next part, which is identical to the originally tested question and asks ‘how much time’ they spend.
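As a rough illustration of this skip pattern, here is a minimal Python sketch. The function name, the `ask` callback, and the exact wording of the filter item are hypothetical, introduced only for illustration; the follow-up wording and the response categories are the ones quoted for the originally tested question. It is not the instrument that was actually fielded.

```python
from typing import Callable, List

# Response categories quoted from the originally tested question.
DURATION_CATEGORIES = ["None", "Less than 1 hour", "1-4 hours", "5 or more hours"]

# Hypothetical wording for the filter item (the talk does not give it verbatim).
FILTER_ITEM = ("On a typical day, do you spend any time doing strenuous "
               "physical activities such as lifting, pushing, or pulling?")

# Follow-up item: same wording as the originally tested question.
DURATION_ITEM = ("On a typical day, how much time do you spend doing strenuous "
                 "physical activities such as lifting, pushing, or pulling?")


def administer_filtered(ask: Callable[[str, List[str]], str]) -> str:
    """Ask the yes/no filter first; ask about duration only after a 'Yes'.

    `ask` is a hypothetical callback that presents a question with its
    response options and returns the option the respondent chose.
    """
    if ask(FILTER_ITEM, ["Yes", "No"]) == "No":
        # Skip the duration item entirely and record no strenuous activity.
        return "None"
    return ask(DURATION_ITEM, DURATION_CATEGORIES)
```

In an interviewer-administered survey, `ask` would stand in for the interviewer reading the item aloud and recording the answer; in a self-administered one, it would stand in for the form's skip instruction.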
Of course, there are drawbacks and tradeoffs to many questionnaire design solutions. In this case, we end up adding a question. One senior staff member objected that it simply wasn’t worth it to add our new ‘filter’ question; that this may be arguably better, but would likely have so little effect on the overall estimates that, on balance, it just wasn’t worth the additional burden. Based on our cognitive testing, we did not agree; this seemed like a pretty large effect. So, well after the fact, we decided to empirically test this notion, for scientific and evaluation purposes.
We conducted our empirical test by embedding both versions of this survey question in several fielded surveys. I will describe two of these, although the findings were clear and consistent across all of our tests.
For the first experiment we conducted a split-sample experiment within an interviewer-administered pretest. In the second, we included both versions in a very different environment, a study of women’s health conducted at a clinic. In both cases, our hypothesis was that the use of the filtered version, that is, first asking if someone spends ANY time in strenuous activities, would decrease reporting bias, and so increase the frequency of reports of NO strenuous activity. If we were correct, then the reports in this table under 0 should be measurably higher for the filtered version (in red), than for the unfiltered version (in black).
In both cases, the effect was substantial. Including the extra filter question increased the percentage of NO reports of strenuous activity from 32 to 72 percent in one case, and from 4 to 49 percent in the other.
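To put those figures in terms of the "few percentage points" that skeptics had in mind, the arithmetic can be sketched in a few lines of Python. The labels "case A" and "case B" are placeholders, since the talk does not say which result belongs to which experiment; only the four percentages come from the talk.

```python
# Percent reporting NO strenuous activity: (unfiltered version, filtered version).
# "case A" / "case B" are placeholder labels, not names used in the study.
REPORTED = {
    "case A": (32, 72),
    "case B": (4, 49),
}


def point_change(unfiltered_pct: int, filtered_pct: int) -> int:
    """Difference between the two versions, in percentage points."""
    return filtered_pct - unfiltered_pct


for label, (unfiltered, filtered) in REPORTED.items():
    change = point_change(unfiltered, filtered)
    print(f"{label}: {unfiltered}% -> {filtered}% (+{change} percentage points)")
```

The differences come to 40 and 45 percentage points, far beyond the small shifts the objection anticipated.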
There are two lessons that we took from this. Most basically, small changes in survey question wording and format can have enormous effects on the data obtained. Second, we concluded that in this case, and for several other tested items I am not showing, hypotheses derived from cognitive testing were accurate as far as predicting the behavior of survey questions in a field survey environment.
I have just provided two examples of the use of cognitive interviewing, as it is typically conducted. However, there are a number of variations in this procedure. I will mention only one very basic one here.
The examples I presented rely on the procedure called Concurrent probing. We read a survey question to be evaluated to the subject. The person answers the question and we immediately probe to gain further information. Then, we go on and administer the next question to be tested. Again, this is Concurrent probing. An alternative to this is Retrospective probing, also called debriefing. In this case, we don’t stop to probe, but go all the way through the questionnaire, just asking questions as in a regular survey without any probes. Then, at the end we go back and probe all of these questions.
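The difference between the two procedures is purely one of ordering, which a short Python sketch can make concrete. The function name, the `mode` strings, and the sample items in the usage example are hypothetical and serve only to illustrate the sequence of steps described above.

```python
from typing import Dict, List, Tuple


def interview_script(questions: List[str],
                     probes: Dict[str, str],
                     mode: str) -> List[Tuple[str, str]]:
    """Return the ordered (kind, text) steps of a cognitive interview.

    mode="concurrent": each question is followed immediately by its probe.
    mode="retrospective": all questions first, then all probes (debriefing).
    """
    if mode not in ("concurrent", "retrospective"):
        raise ValueError(f"unknown mode: {mode!r}")

    steps: List[Tuple[str, str]] = []
    if mode == "concurrent":
        for q in questions:
            steps.append(("question", q))
            steps.append(("probe", probes[q]))
    else:
        # Administer the whole questionnaire as in a regular survey...
        steps.extend(("question", q) for q in questions)
        # ...then go back and probe each question at the end.
        steps.extend(("probe", probes[q]) for q in questions)
    return steps


# Hypothetical usage with two made-up items.
items = ["Q1", "Q2"]
item_probes = {"Q1": "Probe for Q1", "Q2": "Probe for Q2"}
for kind, text in interview_script(items, item_probes, "retrospective"):
    print(kind, text)
```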
Each of these procedures has advantages and disadvantages and these are listed here. Concurrent probing allows us to ask about thinking when the memory is fresh, right after the subject has answered the question. This comes across as a very natural type of activity to most subjects.
However, one can argue that the interjection of these probes is a form of contamination. It makes the interview different from what will happen in the field, where no probing is done. So, do we really know what the effects of our probing are, especially on responses to questions that are tested later in the interview?
The desire to avoid such so-called reactivity effects may lead one to conclude that it’s much better not to disturb the interview and to probe only after the fact. However, this practice—debriefing, retrospective probing—introduces another potential problem. If we wait until well after the subject has answered any particular question to probe, how do we know they will remember what they were thinking about at the time they answered it? That is, will their responses to probe questions be useful or will they be subject to fabrication, or so-called Demand effects?
I will simply suggest that there is a place for both types of probing. I tend to use concurrent probing for interviewer-administered questionnaires and retrospective probing for self-administered ones, but this is certainly not an absolute rule.
Increasingly, cognitive interviewing is being applied to web surveys. For some years, evaluation of web survey questionnaires has been conducted through Usability Testing. There is a clear connection between cognitive testing and usability testing. In fact, sometimes these can hardly be distinguished.
Although cognitive testing theoretically deals with content, the functioning of the questions themselves, and usability testing strives to detect problems with the underlying vehicle, the web-based system, in practice both procedures tend to address both issues. When evaluating web surveys, it does make sense to conduct some cognitive testing of the items, particularly before they are programmed, and then to put together the web survey and follow up with usability testing.
Most importantly, designers of web surveys need to make sure not to focus only on the operation of the computerized instrument as a vehicle, but make sure they are also paying some attention to whether the questions that are posed are themselves working and making sense.
Next is an example of a complex web survey with potential cognitive and usability issues.
This National Cancer Institute Diet History Questionnaire prototype is a good example that shows how web surveys have been developing. Rather than only consisting of a linear administration of individual survey questions, it incorporates multiple visual elements, panels, and levels of organization. The question to be answered is at the top, in the middle, where the instruction says “Please check the box next to each food that you ate at least once in the past 12 months.”
However, on the left is a vertical panel that is typical of many web applications. This is a navigation panel that shows the person where he or she is in the questionnaire and contains additional information such as which subsections have been completed. As such, the system is flexible and allows the respondent to move around the questionnaire in the way they would prefer.
However, these potential improvements could come with some important costs. For one, is it even clear where to look, or is this confusing? Many people learn to look for question numbers as a guide to filling out a questionnaire, but the computer doesn’t need these and often they are not included with the questions. Further, do users even know what the panel on the left is for? Is there anything about it that’s confusing? Do the colors that are used match their expectations? Each of these questions gives rise to cognitive probes and issues for usability testers to investigate.
I will mention only one example of a clear finding from the testing of this web application. It turned out that there was a problem related to the use of color. Although red was meant to indicate only that the section had not been completed, it drew unnecessary attention and caused concern for tested individuals. This turned out to be because in many computer applications, red is used to signify an error, a problem, or something that needs fixing. So, red was changed to a more neutral color, such as gray.
Turning to another issue, an important consideration is how to analyze and write up the results of cognitive testing. This is a science in itself that can be best appreciated by looking at existing reports, as examples.
Fortunately, a compendium of cognitive interviewing project reports now exists. This is called Q-BANK and is maintained on the National Center for Health Statistics website. If you are interested in how cognitive testing of health survey questions and other survey questions has been done at U.S. Federal cognitive labs, or by their contractors, take a look at this resource.
Although I have focused on cognitive interviewing, there are several other means for pretesting survey questions. One is to conduct a formal field pretest, to record the interviews, and then to review these using behavior coding, a formal coding system that identifies and quantifies instances of problems in the interaction between interviewers and respondents.
Finally, once we have actual quantitative survey data, we can further evaluate the functioning of our questions and scales by making use of a range of psychometric techniques, such as item response theory or differential item functioning.
For anyone interested in more information, the resources on this slide may be of further interest. These include the resources that I have mentioned previously in this talk.
To close, I present a quote that I consider relevant to questionnaire design and pretesting, “The uncreative mind can spot wrong answers, but it takes a very creative mind to spot wrong questions.”
As questionnaire designers, our task is not only to spot wrong questions, but also to make them right. Hopefully, the methods I have presented in this talk will be useful in this regard.