David Victorson, Ph.D.
Northwestern University Feinberg School of Medicine
Hi, my name is David Victorson, and this talk will focus on patient-reported outcome tools for measuring health-related quality of life.
In this talk we’re going to provide an overview of what patient-reported outcomes, or PROs, are and why they are important. We’ll discuss some challenges to PROs as well as best practices for creating them. And, finally, we’ll discuss ways in which PROs can be organized through conceptual models and organizational frameworks.
There are different types of patient outcomes assessment. For example, investigator-reported patient outcomes would be an investigator or clinician indicating their global impression, observations from physical exam findings, or tests of functioning or performance. Another type of outcomes assessment is physiological: for example, FEV1, lab values, vital signs, tumor size, and other clinical indicators. Another type is caregiver-reported, covering areas such as quality of life, caregiver burden, and functional status. And, finally, there’s patient-reported outcomes assessment. This covers health-related quality of life, symptoms and side effects, the patient’s global impression of life satisfaction or quality of life, satisfaction with treatment, and treatment adherence.
A patient-reported outcome is the measurement of any aspect of a patient’s health status that comes directly from the patient, that is, without interpretation of the patient’s responses by a physician or anyone else.
PROs may be administered through paper-and-pencil forms, digital capture over the Internet or on a handheld device, or an interviewer-administered questionnaire. The most commonly used PRO questionnaires assess one of the following: symptoms and impairments, functioning or disability, wellness or health, or health-related quality of life.
Measures can commonly be grouped as either generic or targeted. Generic measures are designed to be used in any population and to cover a broad aspect of the concept being measured; examples include the SF-36 and PROMIS Global Health. Targeted, or disease-specific, questionnaires are designed to assess the concerns that are most important to a given population.
PROs offer a unique perspective on treatment effectiveness. Physiological assessments often do not reflect how a patient functions or feels. PROs may be more reliable than an informal interview, and some treatment effects are known only to the patient: symptoms such as fatigue, depression, and pain; how well the patient feels and functions; and how the patient perceives their care or treatment.
PROs are often used in clinical trials to describe patients or disease severity, to assess trial eligibility, to evaluate treatment effects, to examine convergence with other outcomes of interest, and to provide a risk-benefit evaluation.
PROs are not without their challenges. Some have development limitations; for example, they may not have been developed with patient input. Others can be unreliable or unresponsive to change. Some have poor validity evidence and lack the necessary sensitivity or specificity. For others, a score difference or a change in scores over time may not be meaningful. Sometimes it’s difficult to know whether to use a single item or multiple items to measure a given construct. It can also be challenging to choose the best recall period, the best response options, and even the best mode of administration, such as paper and pencil or the Internet. All of these are important considerations when creating a new PRO measure.
So, now we're going to transition into looking at some of the best practices for how to create a PRO measurement tool.
One of the first things most people do is consult the extant literature. So, before you set out to develop a new tool, first see what’s available; there are many good measures out there. If nothing exists, however, or if nothing is specific enough to your interest, begin by conducting a literature review of the condition you’re interested in, as well as of the measurement tools that most closely approximate it, looking at the most common symptoms, side effects, and issues.
It’s also common to consult experts, either through an open-ended survey or an interview, asking them which PRO measures they typically use and why, what the most important and frequent symptoms and quality-of-life impacts are that their patients experience because of the condition, and which issues are most challenging based on their own clinical experience. During these expert interviews you might also ask the experts to rate or rank existing PRO questionnaires or items, to get a good sense of how relevant, clear, and understandable they find them, so that you may be able to draw on existing measures and items.
Possibly one of the most important sources of information for new PROs is the target audience: the patients themselves. One way to consult them is through focus groups. Focus groups are a well-established exploratory qualitative research approach that can elicit and direct discussion on topics related to a person’s experience with a given phenomenon. Focus groups also draw on the collective experience, wisdom, group dynamics, and synergy of individual members who come together around a common goal. They can be very efficient and economical, since you can gather many participants in one room for a brief period of time, and the results can be a rich source of information as well as a starting point for more in-depth work to follow.
In addition to focus groups, individual interviews are also commonly used. These allow for deeper exploration of different concepts and issues. Individual interviews afford an opportunity to probe further and to explore sensitive topics that participants may have moderated or withheld in focus groups because of social influences or the stigma attached to discussing certain topics. Individual interviews can also be used to corroborate exploratory findings from other sources, such as the literature, expert interviews, or focus groups themselves.
When the literature review, expert interviews, focus groups, individual interviews, and any other surveys have been completed, the next step is to triangulate those data sources. By triangulate I mean to compare, contrast, and integrate them, making sure that each source is confirmed by the others while also contributing unique variance to the phenomenon you’re exploring. This can be done through a qualitative process or a quantitative process, and most likely it will occur through some form of mixed methods analysis that combines qualitative and quantitative strategies. The end result is a robust, meaningful, measurable concept or set of concepts.
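As a rough illustration of the confirmation side of triangulation, the Python sketch below assumes each data source has already been coded into a set of code labels and tallies which sources support each code. The function name `triangulate`, the `min_sources` cutoff, and the example codes are all hypothetical; a simple count like this does not replace the qualitative judgment described above.

```python
from collections import defaultdict


def triangulate(codes_by_source: dict[str, set[str]], min_sources: int = 2):
    """Split codes into those supported by >= min_sources sources and the rest.

    Hypothetical helper: assumes every source has already been reduced
    to a set of code labels by the team's coding procedure.
    """
    support = defaultdict(set)  # code -> set of sources that produced it
    for source, codes in codes_by_source.items():
        for code in codes:
            support[code].add(source)

    corroborated = {c: sorted(s) for c, s in support.items() if len(s) >= min_sources}
    single_source = {c: sorted(s) for c, s in support.items() if len(s) < min_sources}
    return corroborated, single_source


# Illustrative (invented) codes from three data sources.
sources = {
    "literature": {"breathlessness", "fatigue", "activity limitation"},
    "experts": {"breathlessness", "activity limitation", "anxiety"},
    "focus_groups": {"breathlessness", "anxiety", "slowing down"},
}
corroborated, single_source = triangulate(sources)
print(sorted(corroborated))  # codes confirmed by at least two sources
print(single_source)         # codes that only one source contributed
```

Single-source codes are returned rather than dropped, since a source that adds something the others miss may be contributing exactly the unique variance the triangulation is meant to capture.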
As I mentioned briefly, in this process we begin by examining our different data sources in a systematic way. Typically this follows a theoretically based qualitative or mixed methods analysis approach: constantly comparing sources with one another and with the literature, resulting in a multi-level coding and theme-creation process. I’m not going to endorse any particular qualitative analysis approach or coding procedure, but there are several available, and before this process begins it’s good either to receive appropriate training in them or to bring onto your team someone who is versed in theoretically based qualitative or mixed methods approaches.
Throughout the process, inter-rater agreement should be measured, which can be done in a variety of ways. Two of the most common are calculating kappa coefficients of agreement and calculating the percentage of agreement between raters.
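For two raters who each assign one code per text segment, percentage agreement and Cohen’s kappa can be computed as in the minimal Python sketch below. Kappa corrects the observed agreement p_o for the agreement p_e expected by chance from each rater’s code frequencies: kappa = (p_o - p_e) / (1 - p_e). The function names and example codes are hypothetical, and a study with more than two raters would need a multi-rater statistic such as Fleiss’ kappa instead.

```python
from collections import Counter


def percent_agreement(rater1: list[str], rater2: list[str]) -> float:
    """Proportion of segments to which both raters assigned the same code."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must code the same, non-empty set of segments")
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Cohen's kappa for two raters: agreement corrected for chance."""
    p_o = percent_agreement(rater1, rater2)
    n = len(rater1)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: for each code, product of the two raters' marginal proportions.
    p_e = sum(counts1[c] * counts2[c] for c in counts1.keys() | counts2.keys()) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters always use the same single code")
    return (p_o - p_e) / (1 - p_e)


# Invented codes assigned by two raters to the same ten transcript segments.
rater_a = ["fatigue", "pain", "fatigue", "mood", "pain", "fatigue", "mood", "pain", "fatigue", "mood"]
rater_b = ["fatigue", "pain", "mood", "mood", "pain", "fatigue", "mood", "fatigue", "fatigue", "mood"]
print(percent_agreement(rater_a, rater_b))       # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.697
```

The gap between the two numbers shows why kappa is often preferred: 80% raw agreement shrinks once the agreement the raters would reach by chance (34% here) is taken out.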
In addition to inter-rater agreement, data saturation should be documented throughout. Data saturation is the point at which no new codes are created from your data: when you look at your findings, you can’t get anything new from them. Document when you believe data saturation occurred and why, so that future researchers and reviewers can get some insight into how long it took, how many interviews were needed, and what criteria led you to believe you had reached saturation.
Inter-rater agreement serves as a check of reliability, and data saturation as a check of validity; they are the qualitative and mixed methods equivalents of those concepts.
So, when you feel confident that you’ve identified a set of core concepts from the data, the next step is to organize them in a way that lets you review the relationships between concepts and perhaps identify the ones for which you actually want to create a new PRO measure.
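One way to make the saturation record concrete is to log how many new codes each successive interview contributes. The Python sketch below assumes a hypothetical stopping rule (saturation is declared once `run_length` consecutive interviews add no new codes) and reports the last interview that did contribute something new. The rule, its default of 3, the function names, and the example codes are illustrative assumptions, not criteria from the talk.

```python
from typing import Optional


def new_codes_per_interview(interviews: list[set[str]]) -> list[int]:
    """Number of codes each interview adds that no earlier interview produced."""
    seen: set[str] = set()
    counts = []
    for codes in interviews:
        new = codes - seen
        counts.append(len(new))
        seen |= new
    return counts


def saturation_point(interviews: list[set[str]], run_length: int = 3) -> Optional[int]:
    """1-based index of the last interview that added a new code, provided it is
    followed by `run_length` consecutive interviews adding none; otherwise None.

    Hypothetical stopping rule chosen for illustration.
    """
    counts = new_codes_per_interview(interviews)
    streak = 0
    for i, n_new in enumerate(counts):
        streak = streak + 1 if n_new == 0 else 0
        if streak == run_length:
            # Zero-new interviews sit at 0-based positions i-run_length+1 .. i,
            # so the last contributing interview is 0-based i-run_length (1-based i-run_length+1).
            return i - run_length + 1
    return None


interviews = [
    {"breathlessness", "fatigue", "worry"},
    {"fatigue", "slowing down"},
    {"breathlessness", "avoiding stairs"},
    {"worry"},
    {"avoiding stairs", "fatigue"},
    {"slowing down"},
]
print(new_codes_per_interview(interviews))  # [3, 1, 1, 0, 0, 0]
print(saturation_point(interviews))         # 3
```

Reporting the per-interview counts together with the stated rule gives reviewers both how many interviews were needed and the criterion used to call saturation.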
One way to do this is to create a conceptual model or a theoretical representation that defines the concepts of interest, their interfaces, and their possible determinants. So, it’s a visual representation of a complex set of interrelationships of variables and it can help you really hone in on the areas of greatest importance for your new measure.
I’m now showing you something we published in the journal Value in Health: a modified version of Wilson and Cleary’s quality-of-life conceptual model. Based on our qualitative and mixed methods work in dyspnea, we modified the model to highlight some of the most important concepts that people with dyspnea experience.
I’m now going to focus your attention on the red box, or red rectangle, where you can see the areas that we, and patients, were most interested in with respect to their experiences. On the left-hand side are dyspnea symptoms, which included intensity, frequency, and duration, as well as the functional limitations caused by dyspnea, namely time extension and task avoidance.
We also created measures of the emotional response to dyspnea, shown at the top, as well as checklists of different characteristics of the environment, shown at the bottom. But for the most part, we were interested in measuring what’s inside the middle of the box. This slide serves to show that when you analyze your data and organize it in a meaningful way, you’re likely to end up with more concepts and variables than you’re actually planning on or interested in measuring. That’s okay, because each box helps tell part of the story, and the model helps you prioritize the most important concepts, the ones you should focus the rest of your attention on.
We’re going to transition now from identifying core concepts to turning some of them into items. In this slide you can see that, in terms of the anatomy of an item, we typically break it into three parts: the item context, the stem, and the response options. The item context tells the respondent what time frame or context to think about when responding to the item. In this case, it’s “Please indicate how true each statement has been for you during the past 7 days.” The item stem is the item itself: a simple, easily understood statement of whatever it is you’re measuring. In this case, it’s “I have nausea.” The response options can vary depending on the kind of question, but in this case it’s a Likert scale from “Not at all” to “Very much.”
With the fundamental structure of an item in mind, these next several slides focus on item-writing guidelines. The first is that the context should be relevant to the concept. It’s important to choose an item context that is relevant to the concept being measured, whether it’s the past 24 hours, 7 days, 2 weeks, or past month. This can be identified from the literature, from previous measures, or by exploring it during focus groups or individual interviews, to determine what time frame patients should think back over when considering how much they’ve been affected by whatever you’re measuring.
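The three-part anatomy maps naturally onto a small record type. In this Python sketch the class and field names are hypothetical; only the context, the stem, and the two endpoint labels come from the slide, while the three intermediate response labels and the 0 to 4 numbering are filled in as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PROItem:
    """One PRO item split into its three parts (hypothetical structure)."""
    context: str                       # time frame / framing shown to the respondent
    stem: str                          # the statement being rated
    response_options: tuple[str, ...]  # ordered labels, lowest to highest

    def render(self) -> str:
        """Plain-text layout: context, stem, then numbered response options."""
        options = "  ".join(f"{score}={label}" for score, label in enumerate(self.response_options))
        return f"{self.context}\n{self.stem}\n{options}"


nausea = PROItem(
    context="Please indicate how true each statement has been for you during the past 7 days.",
    stem="I have nausea",
    # Endpoints from the slide; the three middle labels are assumed.
    response_options=("Not at all", "A little bit", "Somewhat", "Quite a bit", "Very much"),
)
print(nausea.render())
```

Keeping the context separate from the stem lets one recall period be shared across a block of items, which matches the advice that follows on choosing a context relevant to the concept.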
Another item writing guideline is to use simple universal language, making sure that items are clear and unambiguous as much as possible; language that is simple and appropriate for the target population; and avoiding colloquialisms and activities that might not be familiar across different age groups, ethnic groups, or cultures.
Another item-writing guideline is to make items as specific as possible, asking about something specific rather than something general. For example, “I enjoy sports” versus “I enjoy watching college football.” Part of this is avoiding so-called double-barreled questions, which ask more than one question at once. For example: “Do you approve or disapprove of abortion in cases of incest or threats to the mother’s health?” This item is probably better off split into two or possibly even three separate questions.
Another item-writing guideline is to pay attention to phrasing: whenever possible, avoid negatively phrased items, and write items that require very little cognitive processing. We want the item to be quickly understood and answered by the participant without their having to sit and weigh it for too long. We also want to keep the grammar to a simple past or present tense if possible. As an example of negatively versus positively phrased items, compare “I don’t have symptoms of nausea” with “I have symptoms of nausea.” If the item says “I don’t have symptoms of nausea” and you answer “Strongly disagree” or “Disagree,” you have to take an extra mental step to work through the negative, whereas positively phrased statements can be processed more quickly and with greater ease.
Another guideline is to key the response options to the type of item you’re asking. Items can be of many types, from opinions, knowledge, and frequency of events or behaviors to ratings, and it’s important to make sure that the response options you use are the best fit for that type of item. For example, if you have an opinion-type item, you don’t necessarily want to use frequency-type response options.
Once items have been written or compiled from other sources, the next step is typically to put them before expert reviewers. These may be content experts in a particular field, physicians with a great deal of knowledge about the clinical issue or disorder, or psychometric and measurement experts who know how to write the best items.
We typically have experts review the items and rate each one according to how relevant it is or how prevalent it is in a given population of patients. If they flag items as irrelevant, we set those items aside for further review. We don’t omit them offhand unless an item is just horribly written; instead, we put them aside for further review and discussion.
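A simple way to operationalize “set aside for further review, don’t omit” is to average each item’s expert relevance ratings and route low-scoring items to a review list rather than deleting them. The 1 to 4 rating scale, the 2.5 cutoff, and the item IDs in this Python sketch are hypothetical choices; a real study would set and document its own rules.

```python
from statistics import mean


def triage_items(ratings: dict[str, list[int]], cutoff: float = 2.5):
    """Split items into 'retain' and 'set aside for review' by mean relevance rating.

    Nothing is deleted here: low-rated items only move to the review list,
    where the team discusses them before any decision to exclude.
    """
    retain, review = [], []
    for item_id, scores in ratings.items():
        if mean(scores) >= cutoff:
            retain.append(item_id)
        else:
            review.append(item_id)
    return retain, review


# Hypothetical ratings from three experts on a 1 (not relevant) to 4 (highly relevant) scale.
ratings = {"NAU01": [4, 4, 3], "NAU02": [2, 1, 2], "NAU03": [3, 2, 3]}
retain, review = triage_items(ratings)
print(retain)  # ['NAU01', 'NAU03']
print(review)  # ['NAU02']
```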
It’s also very important to document every decision made about these items, including whether they will be retained or excluded and whether and how they will be revised. At least one person should track and note all of these changes during this multistage process, so that at the end anyone can go back and review the historical account and evolution of each item.
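The decision record described above can be as simple as an append-only log. In this Python sketch the class names, stage labels, allowed decisions, and example entries are all hypothetical; the point is that each entry captures the stage, the decision, the resulting wording, and the rationale, so an item’s evolution can be replayed later.

```python
from dataclasses import dataclass, field
from datetime import date

ALLOWED_DECISIONS = {"retain", "revise", "exclude", "set aside"}


@dataclass(frozen=True)
class HistoryEntry:
    item_id: str
    stage: str       # e.g. "expert review, round 1" or "cognitive interviews, round 1"
    decision: str    # one of ALLOWED_DECISIONS
    wording: str     # item text after this decision
    rationale: str
    recorded_on: date = field(default_factory=date.today)


class ItemHistoryLog:
    """Append-only record of item decisions; entries are never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[HistoryEntry] = []

    def record(self, entry: HistoryEntry) -> None:
        if entry.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {entry.decision!r}")
        self._entries.append(entry)

    def history(self, item_id: str) -> list[HistoryEntry]:
        """All entries for one item, in the order they were recorded."""
        return [e for e in self._entries if e.item_id == item_id]


log = ItemHistoryLog()
log.record(HistoryEntry("NAU02", "expert review, round 1", "set aside",
                        "I feel queasy", "Low relevance ratings; discuss as a team"))
log.record(HistoryEntry("NAU02", "cognitive interviews, round 1", "revise",
                        "I feel sick to my stomach", "Patients paraphrased 'queasy' inconsistently"))
for entry in log.history("NAU02"):
    print(entry.stage, "->", entry.decision, ":", entry.wording)
```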
In addition to expert review, it’s also very important to put the items back in front of patients to ensure that the content, response scales, and instructions are understood as intended. These cognitive interviews assess item comprehension, give us a better understanding of the processes patients use to retrieve information and make decisions, reveal the influence of social desirability, and show why patients choose the responses they do.
In our cognitive interview protocols, we typically ask patients to paraphrase the item in their own words. We might ask them to define terminology used in the item and to describe any lack of clarity or confusion about the appropriateness of the item or their answer. We ask them how confident they are in their ability to provide an accurate answer, to describe how they arrived at their answer (for example, how they got to “Strongly disagree” rather than “Disagree”), and what the difference between those two options would be in their own terms.
So, really it’s a chance for an open-ended yet structured discussion with patients after they answer the question, asking them to tell us in their own words how they arrived at their answer and what was going through their mind as they did so. As with the expert item review, any changes or modifications to items made during cognitive interviews should be duly noted and kept in the item history document.
Once feedback has been gathered from experts and patients, the team reconvenes to review the items one by one, deciding which to retain and which to exclude. During this process, rules should be established for why an item is kept or excluded, and any modifications made to an item should be recorded.
After that, a second round of expert item review and a second round of patient cognitive interviews should be conducted with the newest set of items. This process should continue for as many rounds as necessary until the final items have been decided upon.
So, once a set of items has been finalized and is ready to take to validation field testing, another very helpful activity is to create a PRO conceptual framework. This framework can help illustrate the anticipated associations between the items within a PRO tool and their respective domains. It can clearly identify the concepts that are important to patients and how they should be measured and it represents the goals of treatment as concepts that are important in a specific disease and treatment context with a clear description of treatment benefit.
This is an example of a PRO conceptual framework that our group created in hormone-refractory prostate cancer. It shows the overall concept on the left-hand side, which is broken down into the three primary domains we’re interested in measuring. Those domains connect to the set of yellow boxes, which are items from the survey tool itself, so you can see which items connect with which domains and what the relevant outcomes are for those items. In essence, this is a conceptual map, or framework, that lets anyone look at the measure and understand the areas it is intended to measure.
So, in sum, I hope this talk provided some definitions and context for what PROs are and why they are important for clinical research; reviewed some challenges to PROs and offered some best practices for creating them; and, finally, showed how to create a PRO conceptual model and a PRO conceptual framework to help organize, display, and understand the relations among the most important concepts, as well as how those concepts are operationalized through the items themselves.
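A conceptual framework can also be kept as data alongside the instrument, which makes it easy to check that every item is accounted for. In this Python sketch the domain names and item IDs are placeholders (the talk does not name the three domains), and flagging items that map to more than one domain is an assumed review rule rather than a requirement.

```python
def check_framework(framework: dict[str, list[str]], instrument_items: list[str]):
    """Cross-check a domain -> items framework against the items actually in the tool.

    Returns the item -> domains map, items in the tool that no domain claims,
    and items claimed by more than one domain (flagged for team review).
    """
    item_to_domains: dict[str, list[str]] = {}
    for domain, items in framework.items():
        for item in items:
            item_to_domains.setdefault(item, []).append(domain)

    unmapped = [item for item in instrument_items if item not in item_to_domains]
    multi_mapped = {item: ds for item, ds in item_to_domains.items() if len(ds) > 1}
    return item_to_domains, unmapped, multi_mapped


# Placeholder framework: three domains, six items in the instrument.
framework = {
    "Domain A": ["Q1", "Q2"],
    "Domain B": ["Q3", "Q4"],
    "Domain C": ["Q5", "Q2"],
}
mapping, unmapped, multi_mapped = check_framework(framework, ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"])
print(unmapped)      # ['Q6']
print(multi_mapped)  # {'Q2': ['Domain A', 'Domain C']}
```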