
# ASU Health & Medical Odds Ratio for Lorcaserin Producing Questions

## Description

Do the risk benefit assignment using the attached information. Remember to answer ALL of the questions! I know this may be a difficult assignment, but it will go quickly if you set up the formulas and put in the numbers.

Go back to the PowerPoint and use the formulas provided. The assignment is relatively straightforward if you use them.

Please show your work rather than just the end number so I can better assess how you did it.

### Assignment Week 2: Risk Benefit

Age-, gender- and weight-matched patients were treated with lorcaserin, a new drug for weight loss, or placebo, in conjunction with a diet and exercise program. The tables below summarize results from different trials. A newer drug, semaglutide, was also studied.

Table 1 shows weight loss results for patients who completed the trial.

| Outcome | Placebo | Lorcaserin |
|---|---|---|
| Weight loss > 10% body weight | 243 | 748 |
| Total participants completing trial | 5083 | 5135 |

| Outcome | Placebo | Semaglutide |
|---|---|---|
| Weight loss > 10% body weight | 12 | 68 |
| Total participants completing trial | 655 | 1306 |

Table 2 shows some adverse events.

| Adverse event | Placebo (lorcaserin trial) | Lorcaserin | Placebo (semaglutide trial) | Semaglutide |
|---|---|---|---|---|
| Headache | 15 | 37 | 81 | 198 |
| Nausea | 19 | 35 | 114 | 544 |
| Suicidal ideation | 11 | 21 | 83 | 124 |
| Total participants | 5992 | 5995 | 655 | 1306 |

Table 3 shows outcomes for cardiovascular events for lorcaserin and semaglutide, another new weight loss drug.

| Outcome | Placebo | Lorcaserin |
|---|---|---|
| Cardiovascular event | 369 | 364 |
| Total | 6000 | 6000 |

| Outcome | Placebo | Semaglutide |
|---|---|---|
| Cardiovascular event | 70 | 107 |
| Total | 655 | 1306 |

Analyze the benefits and risks of lorcaserin and semaglutide compared to placebo. In particular, answer the following questions:

1. What is the odds ratio for lorcaserin producing greater than 10% weight loss?
2. What is the odds ratio for semaglutide producing greater than 10% weight loss?
3. What is the relative risk (RR) for each drug (lorcaserin and semaglutide) for:
   a. Headache: RR =
   b. Nausea: RR =
   c. Suicidal ideation: RR =
4. What is the relative risk or relative risk reduction (RRR) for major cardiovascular events for each drug (lorcaserin and semaglutide)?
5. Do you think the benefits of lorcaserin outweigh the risks, given that the odds ratio for myocardial infarction is 1.44 for a patient with a BMI over 30, which all of these patients had at the beginning of the study?
6. What about the patients treated with semaglutide, who also had a BMI over 30 at the beginning of the study?
7. Which drug would you choose, or do you think neither has a good enough risk-benefit ratio to be used?
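For readers setting up the formulas, the two measures the questions ask for can be sketched in Python. This is a minimal illustration and not necessarily the form given in the course PowerPoint: the function names `odds_ratio` and `relative_risk` are hypothetical, the odds ratio uses the standard definition (odds of the outcome on the drug divided by the odds on placebo), and the relative risk follows Appendix 1 of the attached CMAJ article. Only the lorcaserin weight-loss and headache rows are used as examples.

```python
def odds_ratio(events_drug, total_drug, events_placebo, total_placebo):
    """Odds of the outcome on the drug divided by the odds on placebo."""
    odds_drug = events_drug / (total_drug - events_drug)
    odds_placebo = events_placebo / (total_placebo - events_placebo)
    return odds_drug / odds_placebo


def relative_risk(events_drug, total_drug, events_placebo, total_placebo):
    """Event rate on the drug divided by the event rate on placebo."""
    return (events_drug / total_drug) / (events_placebo / total_placebo)


# Table 1, lorcaserin: more than 10% weight loss among trial completers.
or_weight_loss = odds_ratio(748, 5135, 243, 5083)

# Table 2, lorcaserin: headache.
rr_headache = relative_risk(37, 5995, 15, 5992)

print(f"OR (weight loss > 10%): {or_weight_loss:.2f}")
print(f"RR (headache): {rr_headache:.2f}")
```

Substituting the semaglutide columns into the same two calls covers the remaining rows; showing the intermediate odds and event rates is what "show your work" asks for.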
### CMAJ 2005: Tips for Learners of Evidence-Based Medicine: A 5-Part Series

- Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. Can Med Assoc J 2004;171:353–358.
- Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). Can Med Assoc J 2004;171:611–615.
- McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic). Can Med Assoc J 2004;171:1369–1373.
- Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results. Can Med Assoc J 2005;172:661–665.
- Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based medicine: 5. The effect of spectrum of disease on the performance of diagnostic tests. Can Med Assoc J 2005;172:385–390.

### Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat

Review / Synthèse

Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

See related article, page 347.

Physicians, patients and policy-makers are influenced not only by the results of studies but also by how authors present the results [1–4]. Depending on which measures of effect authors choose, the impact of an intervention may appear very large or quite small, even though the underlying data are the same. In this article we present 3 measures of effect (relative risk reduction, absolute risk reduction and number needed to treat) in a fashion designed to help clinicians understand and use them.
We have organized the article as a series of “tips” or exercises. This means that you, the reader, will have to do some work in the course of reading this article (we are assuming that most readers are practitioners, as opposed to researchers and educators). The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians [5,6]. A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/4/353/DC1.

DOI:10.1503/cmaj.1021197

#### Clinician learners’ objectives

Understanding risk and risk reduction

- Learn how to determine control and treatment event rates in published studies.
- Learn how to determine relative and absolute risk reductions from published studies.
- Understand how relative and absolute risk reductions usually apply to different populations.

Balancing benefits and adverse effects in individual patients

- Learn how to use a known relative risk reduction to estimate the risk of an event for a patient undergoing treatment, given an estimate of that patient’s risk of the event without treatment.
- Learn how to use absolute risk reductions to assess whether the benefits of therapy outweigh its harms.

Calculating and using number needed to treat

- Develop an understanding of the concept of number needed to treat (NNT) and how it is calculated.
- Learn how to interpret the NNT and develop an understanding of how the “threshold NNT” varies depending on the patient’s values and preferences, the severity of possible outcomes and the adverse effects (harms) of therapy.

#### Tip 1: Understanding risk and risk reduction

You can calculate relative and absolute risk reductions using simple mathematical formulas (see Appendix 1). However, you might find it easier to understand the concepts through visual presentation. Fig.
1A presents data from a hypothetical trial of a new drug for acute myocardial infarction, showing the 30-day mortality rate in a group of patients at high risk for the adverse event (e.g., elderly patients with congestive heart failure and anterior wall infarction). On the basis of information in Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in people not taking the new drug and those who are taking it.)

We can describe the difference in mortality (event) rates in both relative and absolute terms. In this case, these high-risk patients had a relative risk reduction of 25% and an absolute risk reduction of 10%.

Now, let’s consider Fig. 1B, which shows the results of a second hypothetical trial of the same new drug, but in a patient population with a lower risk for the outcome (e.g., younger patients with uncomplicated inferior wall myocardial infarction). Looking at Fig. 1B, how would you describe the effect of the new drug?

The relative risk reduction with the new drug remains at 25%, but the event rate is lower in both groups, and hence the absolute risk reduction is only 2.5%.

Although the relative risk reduction might be similar across different risk groups (a safe assumption in many if not most cases [7,8]), the absolute gains, represented by absolute risk reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains constant.

These phenomena may be factors in the design of drug trials. For example, a drug may be tested in severely affected people in whom the absolute risk reduction is likely to be impressive, but is subsequently marketed for use by less severely affected patients, in whom the absolute risk reduction will be substantially less.

The bottom line

Relative risk reduction is often more impressive than absolute risk reduction. Furthermore, the lower the event rate in the control group, the larger the difference between relative risk reduction and absolute risk reduction.

Risk and risk reduction: definitions

- Event rate: the number of people experiencing an event as a proportion of the number of people in the population.
- Relative risk reduction: the difference in event rates between 2 groups, expressed as a proportion of the event rate in the untreated group; usually constant across populations with different risks [7,8].
- Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the underlying risk of an event in the individual patient.

Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial infarction. The bars represent the 30-day mortality rate in different groups of patients with acute myocardial infarction and heart failure. A: Trial involving patients at high risk for the adverse outcome. B: Trials involving a group of patients at high risk for the adverse outcome and another group of patients at low risk for the adverse outcome. (Vertical axis: risk for outcome of interest, %. Bars: placebo and treatment for trial 1, high-risk patients, and trial 2, low-risk patients.)

Panel A annotations: Among high-risk patients in trial 1, the event rate in the control group (placebo) is 40 per 100 patients, and the event rate in the treatment group is 30 per 100 patients. Absolute risk reduction (also called the risk difference) is the simple difference in the event rates (40% – 30% = 10%). Relative risk reduction is the difference between the event rates in relative terms. Here, the event rate in the treatment group is 25% less than the event rate in the control group (i.e., the 10% absolute difference expressed as a proportion of the control rate is 10/40 or 25% less).

Panel B annotations: Among low-risk patients in trial 2, the event rate in the control group (placebo) is only 10%. If the treatment is just as effective in these low-risk patients, what event rate can we expect in the treatment group? The event rate in the treated group would be 25% less than in the control group or 7.5%. Therefore, the absolute risk reduction for the low-risk patients (second pair of columns) is only 2.5%, even though the relative risk reduction is the same as for the high-risk patients (first pair of columns).

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

CMAJ • AUG. 17, 2004; 171 (4). © 2004 Canadian Medical Association or its licensors.

#### Tip 2: Balancing benefits and adverse effects in individual patients

In prescribing medications or other treatments, physicians consider both the potential benefits and the potential harms. We have just demonstrated that the benefits of treatment (presented as absolute risk reductions) will generally be greater in patients at higher risk of adverse outcomes than in patients at lower risk of adverse outcomes.
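The Fig. 1 arithmetic behind that statement can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the event rates are the ones quoted for the two hypothetical trials (40% and 30% for high-risk patients, 10% for low-risk controls).

```python
def absolute_risk_reduction(control_rate, treatment_rate):
    """Simple difference between the control and treatment event rates."""
    return control_rate - treatment_rate


def relative_risk_reduction(control_rate, treatment_rate):
    """Absolute difference expressed as a proportion of the control rate."""
    return (control_rate - treatment_rate) / control_rate


# Trial 1 (high-risk patients): 40% on placebo, 30% on treatment.
high_arr = absolute_risk_reduction(0.40, 0.30)   # about 0.10
high_rrr = relative_risk_reduction(0.40, 0.30)   # about 0.25

# Trial 2 (low-risk patients): 10% on placebo, same relative effect assumed.
low_control = 0.10
low_treatment = low_control * (1 - high_rrr)     # about 0.075
low_arr = absolute_risk_reduction(low_control, low_treatment)  # about 0.025

print(f"High risk: ARR {high_arr:.1%}, RRR {high_rrr:.0%}")
print(f"Low risk:  ARR {low_arr:.1%}, RRR {high_rrr:.0%}")
```

Holding the relative risk reduction fixed while lowering the control rate is what makes the absolute benefit shrink from 10% to 2.5%.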
You must now incorporate the possibility of harm into your decision-making.

First, you need to quantify the potential benefits. Assume you are managing 2 patients for high blood pressure and are considering the use of a new antihypertensive drug, drug X, for which the relative risk reduction for stroke over 3 years is 33%, according to published randomized controlled trials.

Pat is a 69-year-old woman whose blood pressure during a routine examination is 170/100 mm Hg; her blood pressure remains unchanged when you see her again 3 weeks later. She is otherwise well and has no history of cardiovascular or cerebrovascular disease. You assess her risk of stroke at about 1% (or 1 per 100) per year [9].

Dorothy is also 69 years of age, and her blood pressure is the same as Pat’s, 170/100 mm Hg; however, because she had a stroke recently, you assess her risk of subsequent stroke as higher than Pat’s, perhaps 10% per year [10].

One way of determining the potential benefit of a new treatment is to complete a benefit table such as Table 1A. To do this, insert your estimated 3-year event rates for Pat and Dorothy, and then apply the relative risk reduction (33%) expected if they take drug X. It is clear from Table 1A that the absolute risk reduction for patients at higher risk (such as Dorothy) is much greater than for those at lower risk (such as Pat).

Now, you need to factor the potential harms (adverse effects associated with using the drug) into the clinical decision. In the clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in patients who received the drug (relative risk of 3). A population-based study has reported the risk of severe gastric bleeding for women in your patients’ age group at about 0.1% per year (regardless of their risk of stroke). These data can now be added to the table to allow a more balanced assessment of the benefits and harms that could arise from treatment (Table 1B).
Considering the results of this process, would you give drug X to Pat, to Dorothy or to both? In making your decisions, remember that there is not necessarily one “right answer” here. Your analysis might go something like this:

Pat will experience a small benefit (absolute risk reduction over 3 years of about 1%), but this will be considerably offset by the increased risk of gastric bleeding (absolute risk increase over 3 years of 0.6%). The potential benefit for Dorothy (absolute risk reduction over 3 years of about 10%) is much greater than the increased risk of harm (absolute risk increase over 3 years of 0.6%). Therefore, the benefit of treatment is likely to be greater for Dorothy (who is at higher risk of stroke) than for Pat (who is at lower risk).

Assessment of the balance between benefits and harms depends on the value that patients place on reducing their risk of stroke in relation to the increased risk of gastric bleeding. Many patients might be much more concerned about the former than the latter.

Table 1A: Benefit table*

| Patient group | 3-yr event rate for stroke, no treatment (%) | 3-yr event rate for stroke, with treatment (drug X) (%) | Absolute risk reduction (no treatment – treatment), % |
|---|---|---|---|
| At lower risk (e.g., Pat) | 3 | 2 | 1 |
| At higher risk (e.g., Dorothy) | 30 | 20 | 10 |

*Based on data from a randomized controlled trial of drug X, which reported a 33% relative risk reduction for the outcome (stroke) over 3 years.
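The benefit and harm arithmetic for Pat and Dorothy can be sketched as follows. This is a minimal illustration with hypothetical names (`treated_rate`, `summary`); it follows the article's simplification of multiplying the annual risk by 3 to get a 3-year risk, and it applies the 33% relative risk reduction exactly, so the results differ slightly from the rounded table values.

```python
def treated_rate(baseline_rate, relative_risk):
    """Event rate on treatment, given the untreated rate and a relative risk."""
    return baseline_rate * relative_risk


STROKE_RR = 1 - 0.33        # 33% relative risk reduction reported for drug X
BLEED_RR = 3.0              # 3-fold increase in severe gastric bleeding
BLEED_BASELINE = 3 * 0.1    # 0.1% per year over 3 years, in percent

# 3-year untreated stroke risk in percent: 3 times the annual estimate.
patients = {"Pat": 3 * 1.0, "Dorothy": 3 * 10.0}

summary = {}
for name, stroke_baseline in patients.items():
    stroke_arr = stroke_baseline - treated_rate(stroke_baseline, STROKE_RR)
    bleed_ari = treated_rate(BLEED_BASELINE, BLEED_RR) - BLEED_BASELINE
    summary[name] = (stroke_arr, bleed_ari)
    print(f"{name}: stroke ARR {stroke_arr:.1f}%, bleeding ARI {bleed_ari:.1f}%")
```

The bleeding risk increase is the same for both patients, so the comparison turns entirely on how large the stroke benefit is for each of them.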
Table 1B: Benefit and harm table*

| Patient group | Stroke, no treatment (3-yr event rate, %) | Stroke, with treatment (drug X) (%) | Absolute risk reduction (no treatment – treatment), % | Severe gastric bleeding, no treatment (3-yr event rate, %) | Severe gastric bleeding, with treatment (drug X) (%) | Absolute risk increase (treatment – no treatment), % |
|---|---|---|---|---|---|---|
| At lower risk (e.g., Pat) | 3 | 2 | 1 | 0.3 | 0.9 | 0.6 |
| At higher risk (e.g., Dorothy) | 30 | 20 | 10 | 0.3 | 0.9 | 0.6 |

*Based on data from randomized controlled trials of drug X reporting a 33% relative risk reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect (severe gastric bleeding) over the same period.

The bottom line

When available, trial data regarding relative risk reductions (or increases), combined with estimates of baseline (untreated) risk in individual patients, provide the basis for clinicians to balance the benefits and harms of therapy for their patients.

#### Tip 3: Calculating and using number needed to treat

Number needed to treat: definitions

- Number needed to treat: the number of patients who would have to receive the treatment for 1 of them to benefit; calculated as 100 divided by the absolute risk reduction expressed as a percentage (or 1 divided by the absolute risk reduction expressed as a proportion; see Appendix 1).
- Number needed to harm: the number of patients who would have to receive the treatment for 1 of them to experience an adverse effect; calculated as 100 divided by the absolute risk increase expressed as a percentage (or 1 divided by the absolute risk increase expressed as a proportion).

Some physicians use another measure of risk and benefit, the number needed to treat (NNT), in considering the consequences of treating or not treating. The NNT is the number of patients to whom a clinician would need to administer a particular treatment to prevent 1 patient from having an adverse outcome over a predefined period of time.
(It also reflects the likelihood that a particular patient to whom treatment is administered will benefit from it.) If, for example, the NNT for a treatment is 10, the practitioner would have to give the treatment to 10 patients to prevent 1 patient from having the adverse outcome over the defined period, and each patient who received the treatment would have a 1 in 10 chance of being a beneficiary.

If the absolute risk reduction is large, you need to treat only a small number of patients to observe a benefit in at least some of them. Conversely, if the absolute risk reduction is small, you must treat many people to observe a benefit in just a few.

An analogous calculation to the one used to determine the NNT can be used to determine the number of patients who would have to be treated for 1 patient to experience an adverse event. This is the number needed to harm (NNH), which is the inverse of the absolute risk increase.

How comfortable are you with estimating the NNT for a given treatment? For example, consider the following questions:

- How many 60-year-old patients with hypertension would you have to treat with diuretics for a period of 5 years to prevent 1 death?
- How many people with myocardial infarction would you have to treat with β-blockers for 2 years to prevent 1 death?
- How many people with acute myocardial infarction would you have to treat with streptokinase to prevent 1 person from dying in the next 5 weeks?

Compare your answers with estimates derived from published studies (Table 2). How accurate were your estimates? Are you surprised by the size of the NNT values?

Physicians often experience problems in this type of exercise, usually because they are unfamiliar with the calculation of NNT. Here is one way to think about it. If a disease has a mortality rate of 100% without treatment and therapy reduces that mortality rate to 50%, how many people would you need to treat to prevent 1 death?
From the numbers given, you can probably figure out that treating 100 patients with the otherwise fatal disease results in 50 survivors. This is equivalent to 1 out of every 2 treated. Since all were destined to die, the NNT to prevent 1 death is 2. The formula reflected in this calculation is as follows: the NNT to prevent 1 adverse outcome equals the inverse of the absolute risk reduction. Table 3 illustrates this concept further. Note that, if the absolute risk reduction is presented as a percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same answer, so use whichever you find easier.

Table 2: Benefit table for patients with cardiovascular problems

| Clinical question | Event rate, control group (%) | Event rate, treatment group (%) | ARR, % | NNT |
|---|---|---|---|---|
| What is the reduction in risk of stroke within 5 years among 60-year-old patients with hypertension who are treated with diuretics? [11] | 2.9 | 1.9 | 1.00 | 100 |
| What is the reduction in risk of death within 2 years after MI among 60-year-old patients treated with β-blockers? [12] | 9.8 | 7.3 | 2.50 | 40 |
| What is the reduction in risk of death within 5 weeks after acute MI among 60-year-old patients treated with streptokinase? [13] | 12.0 | 9.2 | 2.80 | 36 |

Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.

Table 3: Calculation of NNT from absolute risk reduction*

| Form of absolute risk reduction | Calculation of NNT | Example |
|---|---|---|
| Percentage (e.g., 2.8%) | 100/ARR | 100/2.8 = 36 |
| Proportion (e.g., 0.028) | 1/ARR | 1/0.028 = 36 |

*Using absolute risk reduction in last row of Table 2 [13].

It can be challenging for clinicians to estimate the baseline risks for specific populations. For example, some physicians may have little idea of the risk of stroke over 5 years among patients with hypertension.
Physicians may also overestimate the effect of treatment, which leads them to ascribe larger absolute risk reductions and smaller NNT values than are actually the case [14].

Now that you know how to determine the NNT from the absolute risk reduction, you must also consider whether the NNT is reasonable. In other words, what is the maximum NNT that you and your patients will accept as justifying the benefits and harms of therapy? This is referred to as the threshold NNT [15]. If the calculated NNT is above the threshold, the benefits are not large enough (or the risk of harm is too great) to warrant initiating the therapy. Determinants of the threshold NNT include the patient’s own values and preferences, the severity of the outcome that would be prevented, and the costs and side effects of the intervention. Thus, the threshold NNT will almost certainly be different for different patients, and there is no simple answer to the question of when an NNT is sufficiently low to justify initiating treatment.

The bottom line

NNT is a concise, clinically useful presentation of the effect of an intervention. You can easily calculate it from the absolute risk reduction (just remember to check whether the absolute risk reduction is presented as a percentage or a proportion and use a numerator of 100 or 1 accordingly). Be careful not to overestimate the effect of treatments (i.e., use a value of absolute risk reduction that is too high) and thus underestimate the NNT.

#### Conclusions

Clinicians seeking to apply clinical evidence to the care of individual patients need to understand and be able to calculate relative risk reduction, absolute risk reduction and NNT from data presented in clinical trials and systematic reviews. We have described and defined these concepts and presented tabular tools and equations to help clinicians overcome common pitfalls in acquiring these skills.

This article has been peer reviewed.
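As a recap of Tip 3, the NNT calculation can be sketched in Python using the event rates from Table 2. The function name `number_needed_to_treat` and the dictionary layout are hypothetical, and rounding to the nearest whole number is an assumption chosen because it reproduces the values printed in the table.

```python
def number_needed_to_treat(control_rate_pct, treatment_rate_pct):
    """NNT = 100 / ARR when the ARR is expressed as a percentage."""
    arr_pct = control_rate_pct - treatment_rate_pct
    if arr_pct <= 0:
        raise ValueError("no absolute risk reduction, so NNT is undefined")
    return 100 / arr_pct


# Event rates (%) from Table 2: (control group, treatment group).
table_2 = {
    "diuretics": (2.9, 1.9),
    "beta-blockers": (9.8, 7.3),
    "streptokinase": (12.0, 9.2),
}

nnt = {drug: number_needed_to_treat(*rates) for drug, rates in table_2.items()}
for drug, value in nnt.items():
    print(f"{drug}: NNT about {round(value)}")

# The fatal-disease thought experiment: mortality falls from 100% to 50%.
nnt_fatal_disease = number_needed_to_treat(100, 50)
```

The number needed to harm follows the same pattern with the absolute risk increase in place of the absolute risk reduction.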
From the School of Public Health, University of Sydney, Sydney, Australia (Barratt); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Department of Medicine, University of British Columbia, Vancouver, BC (Hatala); Mount Sinai Medical Center, New York, NY (McGinn); the Department of Internal Medicine, University of the Philippines College of Medicine, Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Alexandra Barratt contributed tip 2, drafted the manuscript, coordinated input from coauthors and reviewers and from field-testing and revised all drafts. Peter Wyer edited drafts and provided guidance in developing the final format. Rose Hatala contributed tip 1, coordinated the internal review process and provided comments throughout development of the manuscript. Thomas McGinn contributed tip 3 and provided comments throughout development of the manuscript. Antonio Dans reviewed all drafts and provided comments throughout development of the manuscript. Sheri Keitz conducted field-testing of the tips and contributed material from the field-testing to the manuscript. Virginia Moyer reviewed and contributed to the final version of the manuscript. Gordon Guyatt helped to write the manuscript (as an editor and coauthor).

#### References

1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of relative and absolute risk. J Gen Intern Med 1993;8:543-8.
2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results are summarized can affect treatment decisions. Am J Med 1992;92:121-4.
3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method of reporting trial results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916-21.
4. Fahey T, Griffiths S, Peters TJ. Evidence based purchasing: understanding results of clinical trials and systematic reviews. BMJ 1995;311:1056-60.
5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of association. In: Guyatt G, Rennie D, editors. The users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Publications; 2002. p. 351-68.
6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series. CMAJ 2004;171(4):347-8.
7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the effect of the control rate as a predictor of treatment efficacy in meta-analysis of clinical trials. Stat Med 1998;17:1923-42.
8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number needed to treat? An empirical study of summary effect measures in meta-analyses. Int J Epidemiol 2002;31:72-6.
9. SHEP Cooperative Research Group. Prevention of stroke by anti-hypertensive drug treatment in older persons with isolated systolic hypertension. Final results of the Systolic Hypertension in the Elderly Program (SHEP). JAMA 1991;265:3255-64.
10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) of 75 mg aspirin as secondary prophylaxis after cerebrovascular events. Lancet 1991;338:1345-9.
11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert SR. Health outcomes associated with antihypertensive therapies used as first-line agents. A systematic review and meta-analysis. JAMA 1997;277:739-45.
12. β-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. I. Mortality results. JAMA 1982;247:1707-14.
13. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988;2:349-60.
14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number needed to treat: a clinically useful nomogram in its proper context. BMJ 1996;312:426-9.
15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective treatment be used? Derivation of the threshold number needed to treat and the minimum event rate for treatment. J Clin Epidemiol 2001;54:253-62.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net

Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director), Columbia University College of Physicians and Surgeons, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), Department of Medicine, University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field-test coordinator), Durham Veterans Affairs Medical Center and Duke University, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B.
Newman, University of California, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Appendix 1: Formulas for commonly used measures of therapeutic effect

| Measure of effect | Formula |
|---|---|
| Relative risk | (Event rate in intervention group) ÷ (event rate in control group) |
| Relative risk reduction | 1 – relative risk, or (absolute risk reduction) ÷ (event rate in control group) |
| Absolute risk reduction | (Event rate in control group) – (event rate in intervention group) |
| Number needed to treat | 1 ÷ (absolute risk reduction) |

### Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals)

Review / Synthèse

Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C.
Wyer, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

DOI:10.1503/cmaj.1031667

In the first article in this series [1], we presented an approach to understanding how to estimate a treatment’s effectiveness that covered relative risk reduction, absolute risk reduction and number needed to treat. But how precise are these estimates of treatment effect?

In reading the results of clinical trials, clinicians often come across 2 related but different statistical measures of an estimate’s precision: p values and confidence intervals. The p value describes how often apparent differences in treatment effect that are as large as or larger than those observed in a particular trial will occur in a long run of identical trials if in fact no true effect exists. If the observed differences are sufficiently unlikely to occur by chance alone, investigators reject the hypothesis that there is no effect. For example, consider a randomized trial comparing diuretics with placebo that finds a 25% relative risk reduction for stroke with a p value of 0.04. This p value means that, if diuretics were in fact no different in effectiveness than placebo, we would expect, by the play of chance alone, to observe a reduction (or increase) in relative risk of 25% or more in 4 out of 100 identical trials.

Although they are useful for investigators planning how large a study needs to be to demonstrate a particular magnitude of effect, p values fail to provide clinicians and patients with the information they most need, i.e., the range of values within which the true effect is likely to reside. However, confidence intervals provide exactly that information in a form that pertains directly to the process of deciding whether to administer a therapy to patients. If the range of possible true effects encompassed by the confidence interval is overly wide, the clinician may choose to administer the therapy only selectively or not at all.
Confidence intervals are therefore the topic of this article. For a nontechnical explanation of p values and their limitations, we refer interested readers to the Users’ Guides to the Medical Literature [2].

As with the first article in this series [1], we present the information as a series of “tips” or exercises. This means that you, the reader, will have to do some work in the course of reading the article. The tips we present here have been adapted from approaches developed by educators experienced in teaching evidence-based medicine skills to clinicians [2-4]. A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/6/611/DC1.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/6/611/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

CMAJ • SEPT. 14, 2004; 171 (6). © 2004 Canadian Medical Association or its licensors

Clinician learners’ objectives

Making confidence intervals intuitive
• Understand the dynamic relation between confidence intervals and sample size.

Interpreting confidence intervals
• Understand how the confidence intervals around estimates of treatment effect can affect therapeutic decisions.

Estimating confidence intervals for extreme proportions
• Learn a shortcut for estimating the upper limit of the 95% confidence intervals for proportions with very small numerators and for proportions with numerators very close to the corresponding denominators.

Tip 1: Making confidence intervals intuitive

Imagine a hypothetical series of 5 trials (of equal duration but different sample sizes) in which investigators have experimented with treatments for patients who have a particular condition (elevated low-density lipoprotein cholesterol) to determine whether a drug (a novel cholesterol-lowering agent) would work better than a placebo to prevent strokes (Table 1A). The smallest trial enrolled only 8 patients, and the largest enrolled 2000 patients, and half of the patients in each trial underwent the experimental treatment. Now imagine that all of the trials showed a relative risk reduction for the treatment group of 50% (meaning that patients in the drug treatment group were only half as likely as those in the placebo group to have a stroke). In each individual trial, how confident can we be that the true value of the relative risk reduction is important for patients (i.e., “patient-important”) [5]? If you were to look at the studies individually, which ones would lead you to recommend the treatment unequivocally to your patients?

Most clinicians might intuitively guess that we could be more confident in the results of the larger trials. Why is this? In the absence of bias or systematic error, the results of a trial can be interpreted as an estimate of the true magnitude of effect that would occur if all possible eligible patients had been included. When only a few of these patients are included, the play of chance alone may lead to a result that is quite different from the true value. Confidence intervals are a numeric measure of the range within which such variation is likely to occur. The 95% confidence intervals that we often see in biomedical publications represent the range within which we are likely to find the underlying true treatment effect.

To gain a better appreciation of confidence intervals, go back to Table 1A (don’t look yet at Table 1B!) and take a guess at what you think the confidence intervals might be for the 5 trials presented.
In a moment you’ll see how your estimates compare to 95% confidence intervals calculated using a formula, but for now, try figuring out intervals that you intuitively feel to be appropriate.

Table 1A: Relative risk and relative risk reduction observed in 5 successively larger hypothetical trials

| Control event rate | Treatment event rate | Relative risk, % | Relative risk reduction, %* |
| --- | --- | --- | --- |
| 2/4 | 1/4 | 50 | 50 |
| 10/20 | 5/20 | 50 | 50 |
| 20/40 | 10/40 | 50 | 50 |
| 50/100 | 25/100 | 50 | 50 |
| 500/1000 | 250/1000 | 50 | 50 |

*Calculated as the absolute difference between the control and treatment event rates (expressed as a fraction or a percentage), divided by the control event rate. In the first row in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control event rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be (3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate were 75% and the treatment event rate were 25%, the relative risk reduction would be (75% – 25%) ÷ 75% = 67%.

Now, consider the first trial, in which 2 out of 4 patients who receive the control intervention and 1 out of 4 patients who receive the experimental treatment suffer a stroke. The risk in the treatment group is half that in the control group, which gives us a relative risk of 50% and a relative risk reduction of 50% (see Table 1A) [1,6]. Given the substantial relative risk reduction, would you be ready to recommend this treatment to a patient? Before you answer this question, consider whether it is plausible, with so few patients in the study, that the investigators might just have gotten lucky and the true treatment effect is really a 50% increase in relative risk. In other words, is it plausible that the true event rate in the group that received treatment was 3 out of 4 instead of 1 out of 4?
If you accept that this large, harmful effect might represent the underlying truth, would you also accept that a relative risk reduction of 90%, i.e., a very large benefit of treatment, is consistent with the experimental data in these few patients? To the extent that these suggestions are plausible, we can intuitively create a range of plausible truth of “-50% to 90%” surrounding the relative risk reduction of 50% that was actually observed. Now, do this for each of the other 4 trials. In the trial with 20 patients in each group, 10 of those in the control group suffered a stroke, as did 5 of those in the treatment group. Both the relative risk and the relative risk reduction are again 50%. Do you still consider it plausible that the true event rate in the treatment group is 15 out of 20 rather than 5 out of 20 (the same proportions as we considered in the smaller trial)? If not, what about 12 out of 20? The latter would represent a 20% increase in risk over the control rate (12/20 v. 10/20). A true relative risk reduction of 90% may still be plausible, given the observed results and the numbers of patients involved. In short, given this larger number of patients and the lower chance of a “bad sample,” the “range of plausible truth” around the observed relative risk reduction of 50% might be narrower, perhaps from a relative risk increase of 20% (represented as –20%) to a relative risk reduction of 90%. You can develop similar intuitively derived confidence intervals for the larger trials. 
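To see how sample size drives the width of the interval, the intuitive ranges above can be set beside a formula-based one. The sketch below is a minimal illustration using the event counts from Table 1A. It assumes a normal approximation on the logarithm of the relative risk, which is a common textbook method and not necessarily the one used by the authors' statistical software, so its output should not be expected to match a statistical package exactly, least of all for the smallest trials. The function name `rrr_with_ci` is hypothetical.

```python
import math


def rrr_with_ci(control_events, control_n, treat_events, treat_n, z=1.96):
    """Return (RRR, lower, upper) in percent.

    Assumption: 95% limits come from a normal approximation on log(RR).
    Negative values mean an increase in risk relative to control.
    """
    control_rate = control_events / control_n
    treat_rate = treat_events / treat_n
    rr = treat_rate / control_rate

    # Standard error of log(RR) for two independent proportions.
    se_log_rr = math.sqrt(
        1 / treat_events - 1 / treat_n + 1 / control_events - 1 / control_n
    )
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)

    # RRR = 1 - RR, so the upper RR limit gives the lower RRR limit.
    return 100 * (1 - rr), 100 * (1 - rr_high), 100 * (1 - rr_low)


# (control events, control n, treatment events, treatment n) from Table 1A
TRIALS = [
    (2, 4, 1, 4),
    (10, 20, 5, 20),
    (20, 40, 10, 40),
    (50, 100, 25, 100),
    (500, 1000, 250, 1000),
]

for trial in TRIALS:
    rrr, low, high = rrr_with_ci(*trial)
    print(f"{trial[0]}/{trial[1]} v. {trial[2]}/{trial[3]}: "
          f"RRR {rrr:.0f}% (95% CI {low:.1f} to {high:.1f})")
```

Every trial has the same point estimate of 50%, but the interval narrows steadily as the number of patients and events grows, which is the pattern the exercise asks you to anticipate.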
We’ve done this in Table 1B, which also shows the 95% confidence intervals that we calculated using a statistical program called StatsDirect (available commercially through www.statsdirect.com). You can see that in some instances we intuitively overestimated or underestimated the intervals relative to those we derived using the statistical formulas.

Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively larger hypothetical trials

| Control event rate | Treatment event rate | Relative risk, % | Relative risk reduction, % | Intuitive CI around relative risk reduction, %* | Calculated 95% CI around relative risk reduction, %*† |
| --- | --- | --- | --- | --- | --- |
| 2/4 | 1/4 | 50 | 50 | –50 to 90 | –174 to 92 |
| 10/20 | 5/20 | 50 | 50 | –20 to 90 | –14 to 79.5 |
| 20/40 | 10/40 | 50 | 50 | 0 to 90 | 9.5 to 73.4 |
| 50/100 | 25/100 | 50 | 50 | 20 to 80 | 26.8 to 66.4 |
| 500/1000 | 250/1000 | 50 | 50 | 40 to 60 | 43.5 to 55.9 |

*Negative values represent an increase in risk relative to control. See text for further explanation.
†Calculated by statistical software.

The bottom line
Confidence intervals inform clinicians about the range within which the true treatment effect might plausibly lie, given the trial data. Greater precision (narrower confidence intervals) results from larger sample sizes and consequent larger number of events. Statisticians (and statistical software) can calculate 95% confidence intervals around any estimate of treatment effect.

Tip 2: Interpreting confidence intervals

You should now have an understanding of the relation between the width of the confidence interval around a measure of outcome in a clinical trial and the number of participants and events in that study. You are ready to consider whether a study is sufficiently large, and the resulting confidence intervals sufficiently narrow, to reach a definitive conclusion about recommending the therapy, after taking into account your patient’s values, preferences and circumstances.

The concept of a minimally important treatment effect proves useful in considering the issue of when a study is large enough and has therefore generated confidence intervals that are narrow enough to recommend for or against the therapy. This concept requires the clinician to think about the smallest amount of benefit that would justify therapy.

Consider a set of hypothetical trials. Fig. 1A displays the results of trial 1. The uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. For the medical condition being investigated, assume that a 1% absolute risk reduction is the smallest benefit that patients would consider to outweigh the downsides of therapy. Given the information in Fig. 1A, would you recommend this treatment to your patients if the point estimate represented the truth? What if the upper boundary of the confidence interval represented the truth? Or the lower boundary? For all 3 of these questions, the answer is yes, provided that 1% is in fact the smallest patient-important difference. Thus, the trial is definitive and allows a strong inference about the treatment decision.

[Figure 1, panels A (trial 1), B (trials 1 and 2) and C (trials 3 and 4). Horizontal axis: % absolute risk reduction, from –5 to 5, with the directions labelled “Treatment harms” and “Treatment helps”.]

Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients would consider important enough to warrant undergoing treatment. In each case, the uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. See text for further explanation.

In the case of trial 2 (see Fig. 1B), would your patients choose to undergo the treatment if either the point estimate or the upper boundary of the confidence interval represented the true effect? What about the lower boundary? The answer regarding the lower boundary is no, because the effect is less than the smallest difference that patients would consider large enough for them to undergo the treatment. Although trial 2 shows a “positive” result (i.e., the confidence interval does not encompass zero), the sample size was inadequate and the result remains compatible with risk reductions below the minimal patient-important difference.

When a study result is positive, you can determine whether the sample size was adequate by checking the lower boundary of the confidence interval, the smallest plausible treatment effect compatible with the results. If this value is greater than the smallest difference your patients would consider important, the sample size is adequate and the trial result definitive. However, if the lower boundary falls below the smallest patient-important difference, leaving patients uncertain as to whether taking the treatment is in their best interest, the trial is not definitive. The sample size is inadequate, and further trials are required.

What happens when the confidence interval for the effect of a therapy includes zero (where zero means “no effect” and hence a negative result)? For studies with negative results (those that do not exclude a true treatment effect of zero) you must focus on the other end of the confidence interval, that representing the largest plausible treatment effect consistent with the trial data. You must consider whether the upper boundary of the confidence interval falls below the smallest difference that patients might consider important. If so, the sample size is adequate, and the trial is definitively negative (see trial 3 in Fig. 1C).
Conversely, if the upper boundary exceeds the smallest patient-important difference, then the trial is not definitively negative, and more trials with larger sample sizes are needed (see trial 4 in Fig. 1C).

The bottom line
To determine whether a trial with a positive result is sufficiently large, clinicians should focus on the lower boundary of the confidence interval and determine if it is greater than the smallest treatment benefit that patients would consider important enough to warrant taking the treatment. For studies with a negative result, clinicians should examine the upper boundary of the confidence interval to determine if this value is lower than the smallest treatment benefit that patients would consider important enough to warrant taking the treatment. In either case, if the confidence interval overlaps the smallest treatment benefit that is important to patients, then the study is not definitive and a larger study is needed.

Tip 3: Estimating confidence intervals for extreme proportions

When reviewing journal articles, readers often encounter proportions with small numerators or with numerators very close in size to the denominators. Both situations raise the same issue. For example, an article might assert that a treatment is safe because no serious complications occurred in the 20 patients who received it; another might claim near-perfect sensitivity for a test that correctly identified 29 out of 30 cases of a disease. However, in many cases such articles do not present confidence intervals for these proportions.

Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for proportions with 0 in the numerator

| n | Observed proportion | 3/n | Upper limit of 95% CI |
| --- | --- | --- | --- |
| 20 | 0/20 | 3/20 | 0.15 or 15% |
| 100 | 0/100 | 3/100 | 0.03 or 3% |
| 300 | 0/300 | 3/300 | 0.01 or 1% |
| 1000 | 0/1000 | 3/1000 | 0.003 or 0.3% |
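The 3/n values in Table 2 can be checked against an exact calculation. The following is a minimal sketch, assuming the upper limit is defined as the event rate p at which observing zero events in n subjects would happen only 5% of the time, i.e. (1 − p)^n = 0.05. The function names are illustrative only.

```python
def rule_of_three(n):
    """Shortcut upper 95% limit for an event rate when 0 of n subjects had the event."""
    return 3 / n


def exact_upper_limit(n, alpha=0.05):
    """Solve (1 - p)**n = alpha for p (assumed definition of the upper limit)."""
    return 1 - alpha ** (1 / n)


for n in (20, 100, 300, 1000):
    print(f"n={n:5d}  3/n={rule_of_three(n):.4f}  exact={exact_upper_limit(n):.4f}")
```

Under this assumption the shortcut sits slightly above the exact value at every sample size in the table, so it errs on the cautious side when used to judge a claim of safety.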
The first step of this tip is to learn the “rule of 3” for zero numerators [7], and the next step is to learn an extension (which might be called the “rule of 5, 7, 9 and 10”) for numerators of 1, 2, 3 and 4 [8].

Consider the following example. Twenty people undergo surgery, and none suffer serious complications. Does this result allow us to be confident that the true complication rate is very low, say less than 5% (1 out of 20)? What about 10% (2 out of 20)? You will probably appreciate that if the true complication rate were 5% (1 in 20), it wouldn’t be that unusual to observe no complications in a sample of 20, but for increasingly higher true rates, the chances of observing no complications in a sample of 20 get increasingly smaller. What we are after is the upper limit of a 95% confidence interval for the proportion 0/20. The following is a simple rule for calculating this upper limit: if an event occurs 0 times in n subjects, the upper boundary of the 95% confidence interval for the event rate is about 3/n (Table 2).

You can use the same formula when the observed proportion is 100%, by translating 100% into its complement. For example, imagine that the authors of a study on a diagnostic test report 100% sensitivity when the test is performed for 20 patients who have the disease. That means that the test identified all 20 with the disease as positive and identified none as falsely negative. You would like to know how low the sensitivity of the test could be, given that it was 100% for a sample of 20 patients. Using the 3/n rule for the proportion of false negatives (0 out of 20), we find that the proportion of false negatives could be as high as 15% (3 out of 20). Subtract this result from 100% to obtain the lower limit of the 95% confidence interval for the sensitivity (in this example, 85%).

What if the numerator is not zero but is still very small? There is a shortcut rule for small numerators other than zero (i.e., 1, 2, 3 or 4) (Table 3). For example, out of 20 people receiving surgery imagine that 1 person suffers a serious complication, yielding an observed proportion of 1/20 or 5%. Using the corresponding value from Table 3 (i.e., 5) and the sample size, we find that the upper limit of the 95% confidence interval will be about 5/20 or 25%. If 2 of the 20 (10%) had suffered complications, the upper limit would be about 7/20, or 35%.

Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*

| Observed numerator | Numerator for calculating approximate upper limit of 95% CI |
| --- | --- |
| 0 | 3 |
| 1 | 5 |
| 2 | 7 |
| 3 | 9 |
| 4 | 10 |

*For any observed numerator listed in the left hand column, divide the corresponding numerator in the right hand column by the number of study subjects to get the approximate upper limit of the 95% CI. For example, if the sample size is 15 and the observed numerator is 3, the upper limit of the 95% confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.

The bottom line
Although statisticians (and statistical software) can calculate 95% confidence intervals, clinicians can readily estimate the upper boundary of confidence intervals for proportions with very small numerators. These estimates highlight the greater precision attained with larger sample sizes and help to calibrate intuitively derived confidence intervals.

Conclusions
Clinicians need to understand and interpret confidence intervals to properly use research results in making decisions. They can use thresholds, based on differences that patients are likely to consider important, to interpret confidence intervals and to judge whether the results are definitive or whether a larger study (with more patients and events) is necessary. For proportions with extremely small numerators, a simple rule is available for estimating the upper limit of the confidence interval.

This article has been peer reviewed.
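The threshold-based reading of confidence intervals described in Tip 2 can be written as a small decision rule. This is a sketch only. The function name `interpret_trial` and the example intervals are hypothetical, because the numeric boundaries of the trials in the figure are not given in the text. Effects are expressed as absolute risk reduction in percentage points, and 1% is taken as the smallest patient-important benefit, as in the article's example.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important=1.0):
    """Classify a trial from the 95% CI of its absolute risk reduction (in %).

    Mirrors the Tip 2 reasoning only: a positive result is judged by its
    lower boundary, a negative result (CI includes zero) by its upper boundary.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")

    if ci_lower > 0:
        # Positive result: the interval excludes "no effect".
        if ci_lower > smallest_important:
            return "positive, definitive"
        return "positive, not definitive (larger trial needed)"

    # Negative result: the interval does not exclude a true effect of zero.
    if ci_upper < smallest_important:
        return "negative, definitive"
    return "negative, not definitive (larger trial needed)"


# Hypothetical intervals chosen to illustrate the four situations.
EXAMPLES = {
    "narrow, well above threshold": (2.0, 4.0),
    "excludes zero, crosses threshold": (0.5, 9.0),
    "includes zero, below threshold": (-1.0, 0.8),
    "includes zero, crosses threshold": (-2.0, 3.0),
}

for label, (low, high) in EXAMPLES.items():
    print(f"{label}: {interpret_trial(low, high)}")
```

The rule deliberately takes the smallest patient-important difference as an input, since that threshold depends on the condition, the treatment's downsides and the patient's values.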
From the Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minn. (Montori); the Hospital Medicine Unit, Division of General Medicine, Emory University, Atlanta, Ga. (Kleinbart); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Victor Montori, as principal author, decided on the structure and flow of the article, and oversaw and contributed to the writing of the manuscript. Jennifer Kleinbart reviewed the manuscript at all phases of development and contributed to the writing of tip 1. Thomas Newman developed the original idea for tip 3 and reviewed the manuscript at all phases of development. Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Virginia Moyer reviewed and revised the final draft of the manuscript to improve clarity and style. Gordon Guyatt developed the original ideas for tips 1 and 2, reviewed the manuscript at all phases of development, contributed to the writing as coauthor, and reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content as general editor.

References
1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.
2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results: hypothesis testing. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 329-38.
3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results: confidence intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 339-49.
4. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series [editorial]. CMAJ 2004;171(4):347-8.
5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the center: in our practice, and in our use of language. ACP J Club 2004;140:A11-2.
6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 351-68.
7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA 1983;249:1743-5.
8. Newman TB. If almost nothing goes wrong, is almost everything all right? [letter]. JAMA 1995;274:1013.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net

Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C.
Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Articles to date in this series
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.

Correction

In part 2 of the series “Tips for learners of evidence-based medicine” [1] the information in Fig. 1 did not fully correspond with the information provided in the text. Specifically, the data for hypothetical trial 2 in Fig. 1B should have been centred at 5% absolute risk reduction, as described in the text; instead, the figure showed trial 2 as being centred at about 6.5% absolute risk reduction. The corrected figure is presented here.

[Corrected Figure 1, panels A (trial 1), B (trials 1 and 2) and C (trials 3 and 4). Horizontal axis: % absolute risk reduction, from –5 to 5, with the directions labelled “Treatment harms” and “Treatment helps”.]

Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients would consider important enough to warrant undergoing treatment. In each case, the uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. See the text [1] for further explanation.

Reference
1. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.

DOI:10.1503/cmaj.1041761

Correspondence

Online access to a for-profit CMAJ

Wayne Kondro, quoting CMA Secretary-General Bill Tholl, reports that “Physicians will continue to receive their free subscription to CMAJ as a benefit of association membership ‘for the foreseeable future’” after CMA Publications is sold to CMA Holdings in January 2004 [1]. That’s all to the good, but what then of CMAJ’s worldwide readers? Will access to CMAJ remain free for all online users, despite the shift to for-profit status? I found it strange that this issue was not addressed in Kondro’s news article.

Adam L. Scheffler, Independent researcher, Chicago, Ill.

Reference
1. Kondro W. CMAJ enters for-profit market. CMAJ 2004;171(11):1334.

DOI:10.1503/cmaj.1041759

[Editor’s note] CMAJ’s editors have addressed the topic of open access in this issue’s Editorial (see page 149).

DOI:10.1503/cmaj.1041760

CMAJ • JAN. 18, 2005; 172 (2)

Review

Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic)

Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

DOI:10.1503/cmaj.1031981

Imagine that you’re a busy family physician and that you’ve found a rare free moment to scan the recent literature. Reviewing your preferred digest of abstracts, you notice a study comparing emergency physicians’ interpretation of chest radiographs with radiologists’ interpretations [1]. The article catches your eye because you have frequently found that your own reading of a radiograph differs from both the official radiologist reading and an unofficial reading by a different radiologist, and you’ve wondered about the extent of this disagreement and its implications. Looking at the abstract, you find that the authors have reported the extent of agreement using the κ statistic. You recall that κ stands for “kappa” and that you have encountered this measure of agreement before, but your grasp of its meaning remains tentative. You therefore choose to take a quick glance at the authors’ conclusions as reported in the abstract and to defer downloading and reviewing the full text of the article.

Practitioners, such as the family physician just described, may benefit from understanding measures of observer variability. For many studies in the medical literature, clinician readers will be interested in the extent of agreement among multiple observers. For example, do the investigators in a clinical study agree on the presence or absence of physical, radiographic or laboratory findings?
Do investigators involved in a systematic overview agree on the validity of an article, or on whether the article should be included in the analysis? In perusing these types of studies, where investigators are interested in quantifying agreement, clinicians will often come across the kappa statistic.

In this article we present tips aimed at helping clinical learners to use the concepts of kappa when applying diagnostic tests in practice. The tips presented here have been adapted from approaches developed by educators experienced in teaching evidence-based medicine skills to clinicians [2]. A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1.

Clinician learners’ objectives

Defining the importance of kappa
• Understand the difference between measuring agreement and measuring agreement beyond chance.
• Understand the implications of different values of kappa.

Calculating kappa
• Understand the basics of how the kappa score is calculated.
• Understand the importance of “chance agreement” in estimating kappa.

Calculating chance agreement
• Understand how to calculate the kappa score given different distributions of positive and negative results.
• Understand that the more extreme the distributions of positive and negative results, the greater the agreement that will occur by chance alone.
• Understand how to calculate chance agreement, agreement beyond chance and kappa for any set of assessments by 2 observers.

Tip 1: Defining the importance of kappa

A common stumbling block for clinicians is the basic concept of agreement beyond chance and, in turn, the importance of correcting for chance agreement. People making a decision on the basis of presence or absence of an element of the physical examination, such as Murphy’s sign, will sometimes agree simply by chance.
The kappa statistic corrects for this chance agreement and tells us how much of the possible agreement over and above chance the reviewers have achieved.

A simple example should help to clarify the importance of correcting for chance agreement. Two radiologists independently read the same 100 mammograms. Reader 1 is having a bad day and reads all the films as negative without looking at them in great detail. Reader 2 reads the films more carefully and identifies 4 of the 100 mammograms as positive (suspicious for malignancy). How would you characterize the level of agreement between these 2 radiologists? The percent agreement between them is 96%, even though one of the readers has, on cursory review, decided to call all of the results negative. Hence, measuring the simple percent agreement overestimates the degree of clinically important agreement in a fashion that is misleading. The role of kappa is to indicate how much the 2 observers agree beyond the level of agreement that could be expected by chance.

Table 1 presents a rating system that is commonly used as a guideline for evaluating kappa scores. Purely to illustrate the range of kappa scores that readers can expect to encounter, Table 2 gives some examples of commonly reported assessments and the kappa scores that resulted when investigators studied their reproducibility.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

CMAJ • NOV. 23, 2004; 171 (11). © 2004 Canadian Medical Association or its licensors

The bottom line
If clinicians neglect the possibility of chance agreement, they will come to misleading conclusions about the reproducibility of clinical tests.
The kappa statistic allows us to measure agreement above and beyond that expected by chance alone. Examples of kappa scores for frequently ordered tests sometimes show surprisingly poor levels of agreement beyond chance.

Table 1: Qualitative classification of kappa values as degree of agreement beyond chance³

| Kappa value | Degree of agreement beyond chance |
|---|---|
| 0 | None |
| 0–0.2 | Slight |
| 0.2–0.4 | Fair |
| 0.4–0.6 | Moderate |
| 0.6–0.8 | Substantial |
| 0.8–1.0 | Almost perfect |

Table 2: Representative kappa values for common tests and clinical assessments

| Assessment | Kappa value |
|---|---|
| Interpretation of T wave changes on an exercise stress test⁴ | 0.25 |
| Presence of jugular venous distension⁵ | 0.56 |
| Detection of alcohol dependence using CAGE questionnaire⁶ | 0.75 |
| Presence of goitre⁷ | 0.82–0.95 |
| Bone marrow interpretation by hematologist⁸ | 0.84 |
| Straight leg raising test⁹ | 0.82 |
| Diagnosis of pulmonary embolus by helical CT¹⁰ | 0.82 |
| Diagnosis of lower extremity arterial disease by arteriography¹¹ | 0.39–0.64 |

### Tip 2: Calculating kappa

What is the maximum potential for agreement between 2 observers doing a clinical assessment, such as presence or absence of Murphy’s sign in patients with abdominal pain? In Fig. 1, the upper horizontal bar represents 100% agreement between 2 observers. For the hypothetical situation represented in the figure, the estimated chance agreement between the 2 observers is 50%. This would occur if, for example, each of the 2 observers randomly called half of the assessments positive. Given this information, what is the possible agreement beyond chance? The vertical line in Fig. 1 intersects the horizontal bars at the 50% point that we identified as the expected agreement by chance. All agreement to the right of this line corresponds to agreement beyond chance. Hence the maximum agreement beyond chance is 50% (100% – 50%).

The other number you need to calculate the kappa score is the degree of agreement beyond chance. The observed agreement, as shown by the lower horizontal bar in Fig. 1, is 75%, so the degree of agreement beyond chance is 25% (75% – 50%). Kappa is calculated as the observed agreement beyond chance (25%) divided by the maximum agreement beyond chance (50%); here, kappa is 0.50.

[Figure 1 labels: agreement expected by chance, 50%; observed agreement, 75%; observed agreement above chance, 25%; possible agreement above chance, 50%; kappa = 25/50 = 0.5 (moderate agreement)]

Fig. 1: Two observers independently assess the presence or absence of a finding or outcome. Each observer determines that the finding is present in exactly 50% of the subjects. Their assessments agree in 75% of the cases. The yellow horizontal bar represents potential agreement (100%), and the turquoise bar represents actual agreement. The portion of each coloured bar that lies to the left of the dotted vertical line represents the agreement expected by chance (50%). The observed agreement above chance is half of the possible agreement above chance. The ratio of these 2 numbers is the kappa score.

Tips for EBM learners: kappa statistic

**The bottom line**

Kappa allows us to measure agreement above and beyond that expected by chance alone. We calculate kappa by estimating the chance agreement and then comparing the observed agreement beyond chance with the maximum possible agreement beyond chance.

### Tip 3: Calculating chance agreement

A conceptual understanding of kappa may still leave the actual calculations a mystery. The following example is intended for those who desire a more complete understanding of the kappa statistic. Let us assume that 2 hopeless clinicians are assessing the presence of Murphy’s sign in a group of patients. They have no idea what they are doing, and their evaluations are no better than blind guesses.
Let us say they are each guessing the presence and absence of Murphy’s sign in a 50:50 ratio: half the time they guess that Murphy’s sign is present, and the other half that it is absent. If you were completing a 2 × 2 table, with these 2 clinicians evaluating the same 100 patients, how would the cells, on average, get filled in? Fig. 2 represents the completed 2 × 2 table. Guessing at random, the 2 hopeless clinicians have agreed on the assessments of 50% of the patients. How did we arrive at the numbers shown in the table? According to the laws of chance, each clinician guesses that half of the 50 patients assessed as positive by the other clinician (i.e., 25 patients) have Murphy’s sign.

How would this exercise work if the same 2 hopeless clinicians were to randomly guess that 60% of the patients had a positive result for Murphy’s sign? Fig. 3 provides the answer in this situation. The clinicians would agree for 52 of the 100 patients (or 52% of the time) and would disagree for 48 of the patients. In a similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often the observer makes the diagnosis), you can figure out how often the observers will, on average, agree by chance alone (as delineated in Table 3).

At this point, we have demonstrated 2 things. First, even if the reviewers have no idea what they are doing, there will be substantial agreement by chance alone. Second, the magnitude of the agreement by chance increases as the proportion of positive (or negative) assessments increases.

But how can we calculate kappa when the clinicians whose assessments are being compared are no longer “hopeless,” in other words, when their assessments reflect a level of expertise that one might actually encounter in practice? It’s not very hard. Let’s take a simple example, returning to the premise that each of the 2 clinicians assesses Murphy’s sign as being present in 50% of the patients.
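The chance-agreement figures quoted for the 2 guessing clinicians (50% when each calls half the patients positive, 52% when each calls 60% positive, and the further values in Table 3) follow from a simple product rule. The following is a minimal Python sketch of that arithmetic; the function name `chance_agreement` is ours for illustration and does not come from the article, and the sketch assumes the 2 observers' calls are independent.

```python
def chance_agreement(p1_positive, p2_positive=None):
    """Expected proportion of agreement between 2 observers who guess independently.

    p1_positive, p2_positive: proportion of patients each observer calls positive.
    If only one proportion is given, both observers are assumed to use it.
    """
    if p2_positive is None:
        p2_positive = p1_positive
    both_positive = p1_positive * p2_positive              # e.g. 0.6 * 0.6 = 0.36
    both_negative = (1 - p1_positive) * (1 - p2_positive)  # e.g. 0.4 * 0.4 = 0.16
    return both_positive + both_negative


# Proportions of positive calls used in the text and in Table 3.
for pct in (50, 60, 70, 80, 90):
    agreement = chance_agreement(pct / 100) * 100
    print(f"{pct}% positive calls -> {agreement:.0f}% agreement by chance")
```

The same function accepts 2 different proportions, for observers who do not call the finding positive equally often.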
Here, we assume that the 2 clinicians now have some knowledge of Murphy’s sign and their assessments are no longer random. Each decides that 50% of the patients have Murphy’s sign and 50% do not, but they still don’t agree on every patient. Rather, for 40 patients they agree that Murphy’s sign is present, and for 40 patients they agree that Murphy’s sign is absent. Thus, they agree on the diagnosis for 80% of the patients, and they disagree for 20% of the patients (see Fig. 4A). How do we calculate the kappa score in this situation?

Recall that if each clinician found that 50% of the patients had Murphy’s sign but their decision about the presence of the sign in each patient was random, the clinicians would be in agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in Fig. 2), chance agreement would be 50%, and maximum agreement beyond chance would also be 50%. The no-longer-hopeless clinicians’ agreement on 80% of the patients is therefore 30% above chance. Kappa is a comparison of the observed agreement above chance with the maximum agreement above chance: 30%/50% = 60% of the possible agreement above chance, which gives these clinicians a kappa of 0.6, as shown in Fig. 4B.

| Clinician 1 | Clinician 2: sign present | Clinician 2: sign absent | Total |
|---|---|---|---|
| Sign present | 25 | 25 | 50 |
| Sign absent | 25 | 25 | 50 |
| Total | 50 | 50 | |

Fig. 2: Agreement table for 2 hopeless clinicians who randomly guess whether Murphy’s sign is present or absent in 100 patients with abdominal pain. Each clinician determines that half of the patients have a positive result. The numbers in each box reflect the number of patients in each agreement category.

| Clinician 1 | Clinician 2: sign present | Clinician 2: sign absent | Total |
|---|---|---|---|
| Sign present | 36 | 24 | 60 |
| Sign absent | 24 | 16 | 40 |
| Total | 60 | 40 | |

Fig. 3: As in Fig. 2, the 2 clinicians again guess at random whether Murphy’s sign is present or absent. However, each clinician now guesses that the sign is present in 60 of the 100 patients. Under these circumstances, of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is present in 60%; 60% of 60 is 36 patients. Of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is absent in 40%; 40% of 60 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is present in 60%; 60% of 40 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is absent in 40%; 40% of 40 is 16 patients.

Table 3: Chance agreement when 2 observers randomly assign positive and negative results, for successively higher rates of a positive call

| Proportion positive (%) | Agreement by chance (%) |
|---|---|
| 50 | 50 |
| 60 | 52 |
| 70 | 58 |
| 80 | 68 |
| 90 | 82 |

A

| Clinician 1 | Clinician 2: sign present | Clinician 2: sign absent | Total |
|---|---|---|---|
| Sign present | 40 | 10 | 50 |
| Sign absent | 10 | 40 | 50 |
| Total | 50 | 50 | |

B

| Clinician 1 | Clinician 2: sign present | Clinician 2: sign absent | Total |
|---|---|---|---|
| Sign present | 40 (25) | 10 (25) | 50 |
| Sign absent | 10 (25) | 40 (25) | 50 |
| Total | 50 | 50 | |

κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance) = (80% – 50%) ÷ (100% – 50%) = 30% ÷ 50% = 0.6

Fig. 4: Two clinicians who have been trained to assess Murphy’s sign in patients with abdominal pain do an actual assessment on 100 patients. A: A 2 × 2 table reflecting actual agreement between the 2 clinicians. B: A 2 × 2 table illustrating the correct approach to determining the kappa score. The numbers in parentheses correspond to the results that would be expected were each clinician randomly guessing that half of the patients had a positive result (as in Fig. 2).

Formula for calculating kappa

(Observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)

Another way of expressing this formula:

(Observed agreement beyond chance) ÷ (maximum possible agreement beyond chance)

Hence, to calculate kappa when only 2 alternatives are possible (e.g., presence or absence of a finding), you need just 2 numbers: the percentage of patients that the 2 assessors agreed on and the expected agreement by chance. Both can be determined by constructing a 2 × 2 table exactly as illustrated above.

**The bottom line**

Chance agreement is not always 50%; rather, it varies from one clinical situation to another. When the prevalence of a disease or outcome is low, 2 observers will guess that most patients are normal and the symptom of the disease is absent. This situation will lead to a high percentage of agreement simply by chance. When the prevalence is high, there will also be high apparent agreement, with most patients judged to exhibit the symptom. Kappa measures the agreement after correcting for this variable degree of chance agreement.

### Conclusions

Armed with this understanding of kappa as a measure of agreement between different observers, you are able to return to the study of agreement in chest radiography interpretations between emergency physicians and radiologists¹ in a more informed fashion. You learn from the abstract that the kappa score for overall agreement between the 2 classes of practitioners was 0.40, with a 95% confidence interval ranging from 0.35 to 0.46. This means that the agreement between emergency physicians and radiologists represented 40% of the potentially achievable agreement beyond chance. You understand that this kappa score would be conventionally considered to represent fair to moderate agreement but is inferior to many of the kappa values listed in Table 2. You are now much more confident about going to the full text of the article to review the methods and assess the clinical applicability of the results to your own patients.

The ability to understand measures of variability in data presented in clinical trials and systematic reviews is an important skill for clinicians.
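For readers who want to check the arithmetic, the kappa formula given in Tip 3 for 2 observers and a present/absent finding can be sketched in Python as follows. This is an illustrative sketch only: the function name `kappa_2x2` and its argument names are ours, not the article's, and raising an error when chance agreement is 100% is our assumption about how to handle that undefined case.

```python
def kappa_2x2(both_present, first_only, second_only, both_absent):
    """Return (observed, chance, kappa) for 2 observers rating a present/absent finding.

    both_present, both_absent: patients on whom the 2 observers agree.
    first_only, second_only: patients called positive by only one of the observers.
    """
    total = both_present + first_only + second_only + both_absent
    observed = (both_present + both_absent) / total

    # Marginal totals: how often each observer calls the finding present or absent.
    first_present = both_present + first_only
    second_present = both_present + second_only
    first_absent = total - first_present
    second_absent = total - second_present

    # Agreement expected by chance if the 2 observers' calls were independent.
    chance = (first_present * second_present + first_absent * second_absent) / total**2

    if chance == 1:
        # Assumption: treat this degenerate case as undefined rather than return a number.
        raise ValueError("kappa is undefined when chance agreement is 100%")
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


examples = {
    # Fig. 4A: agree "present" for 40 patients, agree "absent" for 40, disagree on 20.
    "Fig. 4A": (40, 10, 10, 40),
    # Tip 1 mammograms: reader 1 calls all 100 films negative, reader 2 calls 4 positive.
    "Mammograms": (0, 0, 4, 96),
}
for name, cells in examples.items():
    observed, chance, kappa = kappa_2x2(*cells)
    print(f"{name}: observed {observed:.0%}, chance {chance:.0%}, kappa {kappa:.2f}")
```

For the Fig. 4A counts the sketch reproduces the kappa of 0.6 worked out in the text. For the mammogram example it gives a kappa of 0 despite 96% raw agreement, because the agreement expected by chance is also 96%.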
We have presented a series of tips developed and used by experienced teachers of evidence-based medicine for the purpose of facilitating such understanding.

This article has been peer reviewed.

From the Department of Medicine, Division of General Internal Medicine (McGinn), and the Department of Geriatrics (Leipzig), Mount Sinai Medical Center, New York, NY; the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Thomas McGinn developed the original idea for tips 1 and 2 and, as principal author, oversaw and contributed to the writing of the manuscript. Thomas Newman and Rosanne Leipzig reviewed the manuscript at all phases of development and contributed to the writing as coauthors. Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Gordon Guyatt developed the original idea for tip 3, reviewed the manuscript at all phases of development, contributed to the writing as a coauthor, and, as general editor, reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content.

### References

1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs in the emergency department: Is the radiologist really necessary? Postgrad Med J 2003;79:214-7.
2. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al.
Tips for learning and teaching evidence-based medicine: introduction to the series [editorial]. CMAJ 2004;171(4):347-8.
3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol 1987;126:161-9.
4. Blackburn H. The exercise electrocardiogram: differences in interpretation. Report of a technical group on exercise electrocardiography. Am J Cardiol 1968;21:871-80.
5. Cook DJ. Clinical assessment of central venous pressure in the critically ill. Am J Med Sci 1990;299:175-8.
6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between CAGE interviews and written CAGE questionnaires? Alcohol Clin Exp Res 2000;24:733-6.
7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two general practices in Great Britain. BMJ 1963;1:29-34.
8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9.
9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award in clinical sciences. Reproducibility of physical signs in low-back pain. Spine 1989;14:908-18.
10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et al. Performance of helical computed tomography in unselected outpatients with suspected pulmonary embolism. Ann Intern Med 2001;135:88-97.
11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ. Interobserver variation in interpretation of arteriography and management of severe lower leg arterial disease. Eur J Vasc Endovasc Surg 2001;21:417-22.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 914 738-9368; pwyer@att.net

Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C.
Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Articles to date in this series

- Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.
- Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.

Review / Synthèse

Tips for learners of evidence-based medicine: 4.
Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results

Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

DOI:10.1503/cmaj.1031920

Clinicians wishing to quickly answer a clinical question may seek a systematic review, rather than searching for primary articles. Such a review is also called a meta-analysis when the investigators have used statistical techniques to combine results across studies. Databases useful for this purpose include the Cochrane Library (www.thecochranelibrary.com) and the ACP Journal Club (www.acpjc.org; use the search term “review”), both of which are available through personal or institutional subscription. Clinicians can use systematic reviews to guide clinical practice if they are able to understand and interpret the results.

Systematic reviews differ from traditional reviews in that they are usually confined to a single focused question, which serves as the basis for systematic searching, selection and critical evaluation of the relevant research.¹ Authors of systematic reviews use explicit methods to minimize bias and consider using statistical techniques to combine the results of individual studies. When appropriate, such pooling allows a more precise estimate of the magnitude of benefit or harm of a therapy. It may also increase the applicability of the result to a broader range of patient populations.

Clinicians encountering a meta-analysis frequently find the pooling process mysterious. Specifically, they wonder how authors decide whether the ranges of patients, interventions and outcomes are too broad to sensibly pool the results of the primary studies.

In this article we present an approach to evaluating potentially important differences in the results of individual studies being considered for a meta-analysis.
These differences are frequently referred to as heterogeneity.¹ Our discussion focuses on the qualitative, rather than the statistical, assessment of heterogeneity (see Box 1).

Two concepts are commonly implied in the assessment of heterogeneity. The first is an assessment for heterogeneity within 4 key elements of the design of the original studies: the patients, interventions, outcomes and methods. This assessment bears on the question of whether pooling the results is at all sensible. The second concept relates to assessing heterogeneity among the results of the original studies. Even if the study designs are similar, the researchers must decide whether it is useful to combine the primary studies’ results.

Our discussion assumes a basic familiarity with how investigators present the magnitude²,³ and precision⁴ of treatment effects in individual randomized trials. The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.¹,⁵,⁶ A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/172/5/661/DC1.

### Clinician learners’ objectives

Qualitative assessment of the design of primary studies

- Understand the concepts of heterogeneity of study design among the individual studies included in a systematic review.

Qualitative assessment of the results of primary studies

- Understand how to qualitatively determine the appropriateness of pooling estimates of effect from the individual studies by assessing (1) the degree of overlap of the confidence intervals around these point estimates of effect and (2) the disparity between the point estimates themselves.
- Understand how to estimate the “true” value of the estimate of effect from a graphic display of the results of individual studies.
Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/172/5/661/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

CMAJ • MAR. 1, 2005; 172 (5) © 2005 CMA Media Inc. or its licensors. Hatala et al.

Box 1: Statistical assessments of heterogeneity

Meta-analysts typically use 2 statistical approaches to evaluate the extent of variability in results between studies: Cochran’s Q test and the I² statistic.

Cochran’s Q test

- Cochran’s Q test is the traditional test for heterogeneity. It begins with the null hypothesis that all of the apparent variability is due to chance. That is, the true underlying magnitude of effect (whether measured with a relative risk, an odds ratio or a risk difference) is the same across studies.
- The test then generates a probability, based on a χ² distribution, that differences in results between studies as extreme as or more extreme than those observed could occur simply by chance.
- If the p value is low (say, less than 0.1) investigators should look hard for possible explanations of variability in results between studies (including differences in patients, interventions, measurement of outcomes and study design).
- As the p value gets very low (less than 0.01) we may be increasingly uncomfortable about using single best estimates of treatment effects.
- The traditional test for heterogeneity is limited, in that it may be underpowered (when studies have included few patients it may be difficult to reject the null hypothesis even if it is false) or overpowered (when sample sizes are very large, small and unimportant differences in magnitude of effect may nevertheless generate low p values).
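The box does not give formulas, so the following Python sketch is offered only as an illustration of how Cochran’s Q, and the I² statistic that the box turns to next, are commonly computed. It assumes the usual inverse-variance (fixed-effect) form of Q and the definition I² = (Q – df) ÷ Q, truncated at 0, neither of which is stated in this article. The function name and the study data (log relative risks with standard errors) are hypothetical. The p value for Q is deliberately left out; it would come from a χ² distribution with (number of studies – 1) degrees of freedom, typically via a statistics library.

```python
def cochran_q_and_i2(effects, standard_errors):
    """Return (pooled_effect, Q, degrees_of_freedom, I2_percent) for a set of studies.

    effects: per-study effect estimates on an additive scale (e.g. log relative risk).
    standard_errors: standard error of each estimate.
    """
    weights = [1 / se**2 for se in standard_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations of each study from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of the variability beyond what chance alone would produce.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, df, i2


# Hypothetical data: 4 studies with similar results, then 4 with disparate results.
se = [0.10, 0.12, 0.15, 0.11]
for label, effects in (
    ("similar", [-0.30, -0.25, -0.35, -0.28]),
    ("disparate", [-0.80, 0.10, -0.40, 0.30]),
):
    pooled, q, df, i2 = cochran_q_and_i2(effects, se)
    print(f"{label}: pooled {pooled:.2f}, Q {q:.2f} on {df} df, I2 {i2:.0f}%")
```

In the first hypothetical set Q is smaller than its degrees of freedom, so I² is 0%; in the second, Q is far larger and I² is high. I² is printed here as a percentage, whereas the rule of thumb quoted in the box expresses the same thresholds as proportions (0.25 and 0.5).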
I² statistic

- The I² statistic, the second approach to measuring heterogeneity, attempts to deal with potential underpowering or overpowering. I² provides an estimate of the percentage of variability in results across studies that is likely due to true differences in treatment effect, as opposed to chance.
- When I² is 0%, chance provides a satisfactory explanation for the variability we have observed, and we are more likely to be comfortable with a single pooled estimate of treatment effect.
- As I² increases, we get increasingly uncomfortable with a single pooled estimate, and the need to look for explanations of variability other than chance becomes more compelling.
- For example, one rule of thumb characterizes I² of less than 0.25 as low heterogeneity, 0.25 to 0.5 as moderate heterogeneity and over 0.5 as high heterogeneity.

### Tip 1: Qualitative assessment of the design of primary studies

Consider the following 3 hypothetical systematic reviews. For which of these systematic reviews does it make sense to combine the primary studies?

- A systematic review of all therapies for all types of cancer, intended to generate a single estimate of the impact of these therapies on mortality.
- A systematic review that examines the effect of different antibiotics, such as tetracyclines, penicillins and chloramphenicol, on improvement in peak expiratory flow rates and days of illness in patients with acute exacerbation of obstructive lung disease, including chronic bronchitis and emphysema.⁷
- A systematic review of the effectiveness of tissue plasminogen activator (tPA) compared with no treatment or placebo in reducing mortality among patients with acute myocardial infarction.⁸

Most clinicians would instinctively reject the first of these proposed reviews as overly broad but would be comfortable with the idea of combining the results of trials relevant to the third question. What about the second review?
What aspects of the primary studies must be similar to justify combining their results in this systematic review? Table 1 lists features that would be relevant to the question considered in the second review and categorizes them according to the 4 key elements of study design: the patients, interventions, outcomes and methods of the primary studies. Combining results is appropriate when the biology is such that across the range of patients, interventions, outcomes and study methods, one can anticipate more or less the same magnitude of treatment effect. In other words, the judgement as to whether the primary studies are similar enough to be combined in a systematic review is based on whether the underlying pathophysiology would predict a similar treatment effect across the range of patients, interventions, outcomes and study methods of the primary studies.

Table 1: Relevant features of study design to be considered when deciding whether to pool studies in a systematic review (for a review examining the effect of antibiotics in patients with obstructive lung disease)

| Patients | Interventions | Outcomes | Study methods |
|---|---|---|---|
| Patient age | Same antibiotic in all studies | Death | All randomized trials |
| Patient sex | Same class of antibiotic in all studies | Peak expiratory flow | Only blinded randomized trials |
| Type of lung disease (e.g., emphysema, chronic bronchitis) | Comparison of antibiotic with placebo | Forced expiratory volume in the first second | Cohort studies |
| | Comparison of one antibiotic with another | | |

Tips for EBM learners: heterogeneity

If you think back to the first systematic review (all therapies for all cancers) you probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).
If you were inclined to reject pooling the results of the studies to be considered in the second systematic review, you might have reasoned that we would expect substantially different effects with different antibiotics, different infecting agents or different underlying lung pathology. If you were inclined to accept pooling of results in this review, you might argue that the antibiotics used in the different studies are all effective against the most common organisms underlying pulmonary exacerbations. You might also assert that the biology of an acute exacerbation of an obstructive lung disease (e.g., inflammation) is similar, despite variability in the underlying pathology. In other words, we would expect more or less the same effect across agents and across patients. Finally, you probably accepted the validity of pooling results for the third systematic review (tPA for myocardial infarction) because you consider that the mechanism of myocardial infarction is relatively constant across a broad range of patients.

**The bottom line**

- Similarity in the aspects of primary study design outlined in Table 1 (patients, interventions, outcomes, study methods) guides the decision as to whether it makes sense to combine the results of primary studies in a systematic review.
- The range of characteristics of the primary studies across which it is sensible to combine results is a matter of judgment based on the researcher’s understanding of the underlying biology of the disease.

### Tip 2: Qualitative assessment of the results of primary studies

You should now understand that combining the results of different studies is sensible only when we expect more or less the same magnitude of treatment effects across the range of patients, interventions and outcomes that the investigators have included in their systematic review. However, even when we are confident of the similarity in design among the individual studies, we may still wonder whether the results of the studies should be pooled. The following graphic demonstration shows how to qualitatively assess the results of the primary studies to decide if meta-analysis (i.e., statistical pooling) is appropriate. You can find discussions of quantitative, or statistical, approaches to the assessment of heterogeneity elsewhere (see Box 1 or Higgins and associates⁹).

Consider the results of the studies in 2 hypothetical systematic reviews (Fig. 1A and Fig. 1B). The central vertical line, labelled “no difference,” represents a treatment effect of 0. This would be equivalent to a risk ratio or relative risk of 1 or an absolute or relative risk reduction of 0.² Values to the left of the “no difference” line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies represented in the figures, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect. For which systematic review does it make sense to combine results? Decide on the answer to this question before you read on.

[Figure 1, panels A and B; horizontal axis: favours new treatment | no difference | favours control]

Fig. 1: Results of the studies in 2 hypothetical systematic reviews. The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.

You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?

[Figure 2; horizontal axis: favours new treatment | no difference | favours control]

Fig. 2: Point estimates and confidence intervals for 4 studies. Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.

[Figure 3; horizontal axis: favours new treatment | no difference | favours control; bottom row: pooled estimate of underlying effect]

Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B. The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.

Before you answer this question, consider the studies represented in Fig. 2. Here, the point estimates of 2 studies are on the “favours new treatment” side of the “no difference” line, and the point estimates of 2 other studies are on the “favours control” side. However, all 4 point estimates are very close to the “no difference” line, and, in this case, investigators doing a systematic review will be satisfied that it is appropriate to pool the results. Therefore, it is not the position of the point estimates relative to the “no difference” line that determines the appropriateness of pooling.
There are 2 criteria for not combining the results of studies in a meta-analysis: highly disparate point estimates and confidence intervals with little overlap, both of which are exemplified by Fig. 1A.

When pooling is appropriate on the basis of these criteria, where is the best estimate of the underlying magnitude of effect likely to be? Look again at Fig. 1B and make a guess. Now look at Fig. 3. The pooled estimate at the bottom of Fig. 3 is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials. It provides our best guess as to the underlying treatment effect. Of course, we cannot actually know the “truth” and must be content with potentially misleading estimates. The intent of a meta-analysis is to include enough studies to narrow the confidence interval around the resulting pooled estimate sufficiently to provide estimates of benefit for our patients in which we can be confident. Thus, our best estimate of the truth will lie in the area of overlap among the confidence intervals around the point estimates of treatment effect presented in the primary studies.

What is the clinician to do when presented with results such as those in Fig. 1A? If the investigators have done a good job of planning and executing the meta-analysis, they will provide some assistance [6]. Before examining the study results in detail, they will have generated a priori hypotheses to explain the heterogeneity in magnitude of effect across studies that they are liable to encounter. These hypotheses will include differences in patients (effects may be larger in sicker patients), in interventions (larger doses may result in larger effects), in outcomes (longer follow-up may diminish the magnitude of effect) and in study design (methodologically weaker studies may generate larger effects).
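As a rough illustration of how one such a priori hypothesis might be explored, the sketch below groups studies by a hypothesised explanatory factor and compares the spread of point estimates within each group with the spread across all studies. Everything in it is an assumption made for illustration: the study labels, the "high"/"low" dose grouping, the effect values (negative favours the new treatment) and the helper names are invented, and comparing spreads this way is a simple descriptive check, not a formal subgroup analysis.

```python
from collections import defaultdict

# Hypothetical studies: (label, dose group, point estimate of treatment effect).
studies = [
    ("A", "high", -0.30),
    ("B", "high", -0.26),
    ("C", "low", -0.06),
    ("D", "low", -0.02),
]


def spread(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)


def spread_by_group(rows):
    """Spread of point estimates within each level of the grouping factor."""
    groups = defaultdict(list)
    for _label, group, estimate in rows:
        groups[group].append(estimate)
    return {group: spread(vals) for group, vals in groups.items()}


overall = spread([estimate for _, _, estimate in studies])
within = spread_by_group(studies)

print("overall spread:", round(overall, 2))
for group, value in within.items():
    print(f"spread within {group}-dose studies:", round(value, 2))
```

If the within-group spreads are much smaller than the overall spread, as in this invented data, the grouping factor is a candidate explanation for the variability, subject to the cautions about subgroup analyses that the article raises.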
The investigators will then have examined the extent to which these hypotheses can explain the differences in magnitude of effect across studies. These subgroup analyses may be misleading, but if they meet 7 criteria suggested elsewhere [10] (see Box 2), they may provide credible and satisfying explanations for the variability in results.

The bottom line

• Readers can decide for themselves whether there is clinically important heterogeneity among the results of primary studies through a qualitative assessment of the graphic results. This assessment is based on the amount …

Box 2: Questions to ask when evaluating a subgroup analysis in a … [10]