Rating the strength of scientific evidence: relevance for quality improvement programs

Kathleen N. Lohr


International Journal for Quality in Health Care, Volume 16, Issue 1, 1 February 2004, Pages 9–18, https://doi.org/10.1093/intqhc/mzh005


Abstract

Objectives. To summarize an extensive review of systems for grading the quality of research articles and rating the strength of bodies of evidence, and to highlight for health professionals and decision-makers concerned with quality measurement and improvement the available ‘best practices’ tools by which these steps can be accomplished.

Design. Drawing on an extensive review of checklists, questionnaires, and other tools in the field of evidence-based practice, this paper discusses clinical, management, and policy rationales for rating strength of evidence in a quality improvement context, and documents best practices methods for these tasks.

Results. After review of 121 systems for grading the quality of articles, 19 systems, mostly study design specific, met a priori scientific standards for grading systematic reviews, randomized controlled trials, observational studies, and diagnostic tests; eight systems (of 40 reviewed) met similar standards for rating the overall strength of evidence. All can be used as is or adapted for particular types of evidence reports or systematic reviews.

Conclusions. Formally grading study quality and rating overall strength of evidence, using sound instruments and procedures, can produce reasonable levels of confidence about the science base for parts of quality improvement programs. With such information, health care professionals and administrators concerned with quality improvement can understand better the level of science (versus only clinical consensus or opinion) that supports practice guidelines, review criteria, and assessments that feed into quality assurance and improvement programs. New systems are appearing and research is needed to confirm the conceptual and practical underpinnings of these grading and rating systems, but the need for those developing systematic reviews, practice guidelines, and quality or audit criteria to understand and undertake these steps is becoming increasingly clear.

Keywords: clinical practice guidelines, evidence-based practice, quality improvement, quality of care, strength of evidence

Issue Section: Examining the Evidence

Introduction

Around the globe, a ‘trend to evidence’ appears to motivate the search for answers to markedly disparate questions about the costs and quality of health care, access to care, risk factors for disease, social determinants of health, and indeed about the air we breathe and the food we eat. We look for solutions to problems of rare or genetic disorders, seek guidance on the safest, most effective treatments for everything from the common cold to childhood cancers, and expect to be informed about the ‘best’ (or ‘worst’) hospitals and doctors in our cities and towns. The call is strong for science to help stave off premature death, needless disability, and wasteful expenditures of personal or government money.

In making informed choices about health care, people increasingly seek credible evidence. Such evidence reflects ‘empirical observations...of real events, [that is,] systematic observations using rigorous experimental designs or nonsystematic observations (e.g. experience)...not revelations, dreams, or ancient texts’ [1]. For situations as different as clinical care, policy-making, dispute resolution, and law [2,3], evidence needs to be seen as both relevant and reliable; science and collected bodies of evidence, however, need to be tempered by clinical acumen and political realities. In addressing issues of the quality of health care, that is, ‘the degree to which health services for individuals and populations increase the likelihood of desired health outcomes and are consistent with current professional knowledge’ ([4], p. 21), this mix of science and art is crucial.

Quality assessment and improvement activities rest heavily on clinical practice guidelines (CPGs) and review and audit criteria. CPGs (‘systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances’ [5], p. 27) can improve health professionals’ knowledge by providing information and recommendations about appropriate and needed services for all aspects of patient management: screening and prevention, diagnosis, treatment, rehabilitation, palliation, and end-of-life care. When kept updated as technologies change, CPGs also influence attitudes about standards of care and, over time, shift practice patterns to make care more efficient and effective, thereby enhancing the value received for health care outlays. Moreover, evidence-based guidelines constitute a major element of quality assurance, quality improvement, medical audit, and similar activities for many health care settings: inpatient or residential (e.g. hospitals, nursing homes), outpatient (e.g. offices, ambulatory clinics, and private homes), and emergency departments or clinics. Users can convert them into medical review criteria to assess care generally in these settings or to target specific kinds of services, providers, settings, or patient populations for in-depth review [2,6].

Evidence-based practice brings pertinent, trustworthy information into this equation by systematically acquiring, analyzing, and transferring research findings into clinical, management, and policy arenas.
The process involves:

- developing the question in a way that can be answered by a systematic review: specifying the populations, settings, problems, interventions, and outcomes of interest;
- stating criteria for eligibility (inclusion and exclusion) of literature to be considered before conducting literature searches, so as to avoid bias introduced by arbitrarily including or excluding certain studies;
- searching the literature to capture all the evidence about the question of interest;
- reviewing abstracts of publications to determine initial eligibility of studies;
- reviewing retained studies to determine final eligibility;
- abstracting data on these studies into evidence tables;
- determining the quality of studies and the overall strength of evidence;
- synthesizing and combining data from evidence tables, and deciding whether quantitative analyses (i.e. meta-analysis) are warranted; and
- writing a draft review, subjecting it to peer review, editing and revising, and producing the final review.

A minimal sketch of this workflow as a checklist appears below, after the three points that follow.

This paper examines one evidence-based process, rating the quality and strength of evidence, to argue three points:

1. The confidence that those wishing to mount credible quality improvement (QI) efforts can assign to evidence rests in part on the quality of individual research efforts and the overall strength of those bodies of evidence; with such assurance, they can distinguish more clearly between good and bad information and between evidence and mere opinion.

2. Formal efforts to grade study quality and rate the strength of evidence can produce a reasonable level of confidence about that evidence.

3. Tools that meet acceptable scientific standards can facilitate these grading and rating steps.
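The review workflow listed above amounts to an ordered checklist. The sketch below is illustrative only: the step wording paraphrases the list above, and the constant and function names are invented for this example rather than drawn from the paper or any published tool.

```python
# Illustrative only: the review workflow above expressed as an ordered checklist.
# Step wording paraphrases the paper; all names are invented for this sketch.

REVIEW_STEPS = [
    "Frame the question (populations, settings, problems, interventions, outcomes)",
    "State inclusion/exclusion criteria before searching, to limit selection bias",
    "Search the literature for all evidence on the question",
    "Screen abstracts for initial eligibility",
    "Review retained studies for final eligibility",
    "Abstract data from retained studies into evidence tables",
    "Grade the quality of studies and rate the overall strength of evidence",
    "Synthesize data; decide whether meta-analysis is warranted",
    "Draft the review, obtain peer review, revise, and produce the final report",
]


def next_step(completed):
    """Return the first step not yet completed, or None when the review is done."""
    for step in REVIEW_STEPS:
        if step not in completed:
            return step
    return None


if __name__ == "__main__":
    done = set(REVIEW_STEPS[:2])  # question framed, eligibility criteria stated
    print("Next step:", next_step(done))
```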

Evidence and evidence-based practice

Evidence-based practice

Evidence-based medicine is ‘the integration of best research evidence with clinical expertise and patient values’ [7]. In clinical applications, providers use the best evidence available to decide, together with their patients, on suitable options for care. Such evidence comes from different types of studies conducted in various patient groups or populations. The emphasis is on melding scientific evidence of the highest caliber with sensitive appreciation of patients’ values and preferences, blending the science and art of medicine. One challenge for practitioners is that most medical recommendations today refer to groups of patients (‘women over age 50’), and they may or may not apply to a particular woman with a particular medical history and set of cultural values. Moreover, when evidence for an intervention is relatively weak, e.g. benefits and harms of prostate-specific antigen screening for prostate cancer [8] or the value of universal screening of newborns for hearing loss to improve long-term language outcomes [9], patients and providers are likely to give more emphasis to patients’ values and treatment costs. When evidence is strong, e.g. use of aspirin to prevent heart attacks, especially in high-risk patients [10], the value of screening for colorectal cancer [11], or the payoff from stopping smoking [12], patients’ values may carry less weight in treatment decisions, although their preferences for different outcomes always need to be taken into account.

Even though health care management and administration is moving into an evidence-based environment (see, for example, Evidence-Based Healthcare, available at http://www.hbuk.co.uk/journals/ebhc), executives concerned with implementing proven or innovative QI programs face similar challenges. Numerous for-profit and non-profit organizations help hospitals, group practices, delivery systems, and large health plans implement and evaluate approaches to change organizational structures and behaviors to improve clinical and patient outcomes, enhance patient safety, attain better cost and cost-effectiveness goals, and address the ‘business case for quality’ question [13]. Other enterprises create evidence-based prescription information tools and web content with consumer health information. Yet other institutions focus on practice guidelines (e.g. http://www.guidelines.gov; http://medicine.ucsf.edu/resources/guidelines). In Europe, BIOMED-supported activities are a related effort to develop a tool for assessing guidelines (http://www.cordis.lu/biomed/home.html). Inventories of process and outcome measures add yet another dimension to these activities (http://www.qualitymeasures.ahrq.gov). Faster adoption of useful innovations, including QI programs, is seen as a particularly critical endeavor [14].

In all these arenas, sound evidence is critical. Evidence-based recommendations that take into account benefits and harms of health interventions give those responsible for QI planning and decisions grounds for adopting some technologies or programs and abandoning others, although the proposition that research can have a direct influence on such decision-making can be questioned [15–18]. The next frontier may lie in finding ways to organize knowledge bases better, or to set up independent centers or other efforts to support data collection, research, analysis, and modeling specifically pertinent to QI programs [19–22].

The nature of desirable evidence

QI programs need information across the entire spectrum of biomedical, clinical, and health services research. Good evidence, applicable to all patients and care settings, is not available for much of medicine today. Perhaps no more than half, or even one-third, of services are supported by compelling evidence that benefits outweigh harms. Millenson claims, citing work from Williamson in the late 1970s [23], that ‘[m]ore than half of all medical treatments, and perhaps as many as 85 percent, have never been validated by clinical trials’ ([24], p. 15). According to an expert committee of the US Institute of Medicine, only about 4% of all services are supported by strong evidence and modest to strong clinical consensus, and more than 50% of services have very weak or no evidence ([5], Tables 1 and 2). Although clinical and health services research have escalated in the intervening years, so have the technological armamentarium and the spectrum of disease, suggesting that major gaps remain for research to fill and that major challenges lie ahead for the development of systematic reviews on clinical and health care delivery topics.

Table 1 Domains in the criteria for evaluating four types of systems to grade the quality of individual studies

Systematic reviews: Study question; Search strategy; Inclusion and exclusion criteria; Interventions; Outcomes; Data extraction; Study quality and validity; Data synthesis and analysis; Results; Discussion; Funding or sponsorship.

Randomized controlled trials: Study question; Study population; Randomization; Blinding; Interventions; Outcomes; Statistical analysis; Results; Discussion; Funding or sponsorship.

Observational studies: Study question; Study population; Comparability of subjects; Exposure or intervention; Outcome measures; Statistical analysis; Results; Discussion; Funding or sponsorship.

Diagnostic test studies: Study population; Adequate description of test; Appropriate reference standard; Blinded comparison of test and standard; Avoidance of verification bias.

Source: West et al. (2002) [26]. In the original table, italics indicate elements of critical importance in evaluating grading systems according to empirical validation research or standard epidemiological methods; these critical elements correspond to the column headings of Tables 3a–3d.

Table 2 Criteria for evaluating systems to rate the strength of bodies of evidence

Quality: the aggregate of quality ratings for individual studies, predicated on the extent to which bias was minimized.

Quantity: numbers of studies, sample size or power, and magnitude of effect.

Consistency: for any given topic, the extent to which similar findings are reported using similar and different study designs.

Source: West et al. (2002) [26].

In this context, the absence of evidence about benefits (or harms) is not the same as evidence of no benefit (or harm). For deciding whether to render a medical service or cover a new technology, clinicians, administrators, guideline developers, and even patients must be alert to this distinction. ‘No evidence’ is a reason for caution in reaching judgments and clinical or policy decisions and for postponing definitive steps. In contrast, ‘evidence of no positive (or negative) impact’ may be a solid reason for taking conclusive steps in favor of or against a medical service.

Evidence, even when available, is rarely definitive. The level of confidence that one might have in evidence turns on the underlying robustness of the research and the analyses done to synthesize that research. Users can, and of course often do, arrive at their own judgments about the soundness of practice guidelines or technology assessments and the science underpinning their conclusions and recommendations. Such judgments may differ considerably in the sophistication and lack of bias with which they were made, for any number of reasons: disputing which evidence is appropriate for assessment in the first place; examining only some of the evidence; disagreeing as to whether factors such as patient satisfaction and cost should be explicitly included in the assessment of the effectiveness of a diagnostic test or treatment; and differing in conclusions about the quality of the evidence.

Without consensus on what constitutes sufficient evidence of acceptable quality, such disagreement is not surprising, but it can lead to public concern either that the evidence on many issues is ‘bad’ or that the experts somehow represent a collection of special interests and ought not wholly to be trusted. For that reason, groups producing systematic reviews, as the underpinnings to guidelines or quality and audit review criteria, are likely to be in the best position to evaluate the strength of the evidence they are assembling and analyzing. Nonetheless, they must be transparent about how they reached such judgments in the first place. Explicitly evaluating the quality of research studies and judging the strength of bodies of evidence is a central, inseparable part of this process.

Grading quality and rating the strength of evidence

Defining quality and strength in evidence-based practice terms

Grading the quality of individual studies and rating the strength of the body of evidence comprising those studies are the two linked topics for the remainder of this paper. Quality, in this context, is ‘the extent to which all aspects of a study’s design and conduct can be shown to protect against systematic bias, nonsystematic bias, and inferential error’ ([25], p. 472). An expanded view holds that quality concerns the extent to which a study’s design, conduct, and analysis have minimized biases in selecting subjects and measuring both outcomes and differences in the study groups other than the factors being studied that might influence the results [26]. In practical terms, one can grade studies only by examining the details that articles in the peer-reviewed literature provide. If studies are incompletely or inaccurately documented, they are likely to be downgraded in quality (perhaps fairly, perhaps not). New guidelines from international groups provide clear instructions on how systematic reviews (QUOROM), randomized controlled trials (CONSORT), observational studies (MOOSE), and studies of diagnostic test accuracy (STARD) ought to be reported [27–30]. These statements are not, however, direct tools for evaluating the quality of studies.

Strength of evidence has a similar range of definitions, all taking into account the size, credibility, and robustness of the combined studies on a given topic. It ‘incorporates judgments of study quality [and] includes how confident one is that a finding is true and whether the same finding has been detected by others using different studies or different people’ [26]. ‘Closeness to the truth’, ‘size of the effect’, and ‘applicability (usefulness in...clinical practice)’ are the concepts used by some evidence-based experts to convey the idea of strength of evidence [7]. The US Preventive Services Task Force, for example, holds that the strength of evidence applies to linkages in an analytic framework for a clinical question that might run from screening to confirmatory diagnosis, treatment, intermediate outcomes (e.g. biophysical measures), and ultimately patient outcomes (e.g. survival, functioning, emotional well-being, and satisfaction) [31]. Criteria for judging evidentiary strength involve internal validity (the extent to which studies yield valid information about the populations and settings in which they were conducted), external validity (the extent to which studies are relevant and can be generalized to broader patient populations of interest), and coherence or consistency (the extent to which the body of evidence makes sense, given the underlying model for the clinical situation).

Strength of evidence needs to be distinguished from the magnitude of effect or impact reported in research papers. How solid we believe a body of evidence is ought not to be confused with how dramatic the effects and outcomes have been. Very robust evidence in favor of small effects of clinical interventions may prove more telling in QI decision-making than weak evidence about ostensibly spectacular findings. Cutting across these considerations is the frequency or rarity of benefits or harms. Holding the amount or explanatory power of the evidence constant, weighing common small benefits against rare but catastrophic harms is a difficult, and sometimes subjective, tradeoff.
Both conceptually and practically, quality and strength are related, albeit hierarchical, ideas. One must grade the quality of individual studies before one can draw affirmative conclusions about the strength of the aggregated evidence. These steps feed directly into grading health care recommendations relevant to QI programs. Although this paper confines itself to study quality and strength of evidence, this link to assigning levels of confidence in recommendations is a straightforward and important one. For example, the USPSTF clearly explains its methods in a linked model that runs from grading studies to assessing strength of evidence to grading its recommendations [31]. GRADE is a new international effort related to reporting requirements that aims to develop a comprehensive approach to grading evidence and guideline recommendations (Andy Oxman, Norwegian Directorate for Health and Social Welfare, Oslo, personal communication, 6 May 2003). In summary, grading studies and rating the strength of evidence matter because they can: clarify how certain one can be about research results and, thus, about conclusions, decisions, or recommendations drawn from that research; identify and perhaps alleviate problems of potential bias in the literature; and make transparent how taking quality of studies and strength of evidence into account affects aggregate findings and decisions to be made from those findings.

Methods

General approach

The US Agency for Healthcare Research and Quality (AHRQ) plays a significant role in evidence-based practice through its Evidence-based Practice Center (EPC) program and in quality of care [32]. In 1999, the US Congress directed AHRQ to examine systems to rate the strength of the scientific evidence underlying health care practices, research recommendations, and technology assessments and to make such methods or systems widely available. To fulfil this congressional charge, AHRQ commissioned the RTI International-University of North Carolina (RTI-UNC) EPC to produce an extensive evidence report that would: (i) describe systems that rate the quality of evidence in individual studies or grade the strength of entire bodies of evidence concerned with a single scientific question; and (ii) provide guidance on ‘best practices’ in this field today. Completing this work required establishing criteria for judging systems for grading quality and rating strength of evidence, identifying such systems from the world literature and internet sites, evaluating the systems against these criteria, and judging which systems passed sufficient muster that they might be characterized as best practices.

We conducted extensive literature searches of MEDLINE for articles published between 1995 and 2000 and sought further information from existing bibliographies, other sources including websites of several international organizations, and our expert panel advisers. In all, we reviewed 1602 publication abstracts. We developed and refined sets of evaluation criteria, which covered attributes and domains that reflect accepted principles of health research and epidemiology, relying on empirical research in the peer-reviewed literature and standard epidemiological texts. In addition, we relied extensively on members of an international technical panel comprising seasoned researchers and noted experts in evidence-based practice to provide feedback on our overall approach, including specification of our evaluation criteria.

We developed and completed descriptive tables, similar to evidence tables, by which to compare and characterize existing systems, using the attributes and domains that we believed any acceptable instrument for these purposes ought to cover. After determining which grading and rating systems adequately covered the domains of interest (i.e. tools that fully or partially met the evaluation criteria), we identified those systems that we believed could be used more or less ‘as is’ (or easily adapted) and displayed this information in tabular form. These methods are described in detail elsewhere [26].

Grading study quality

For evaluating systems related to grading the quality of individual studies, the RTI-UNC EPC team defined domains for four types of research: systematic reviews (including ones that statistically combine data from individual studies), randomized controlled trials (RCTs), observational studies (which include a wide array of nonexperimental or quasi-experimental designs both with and without control or comparison groups), and investigations of diagnostic tests. As listed in Table 1, we specified both desirable domains and, of those, domains considered absolutely critical for a grading scheme to be regarded as acceptable (the latter are identified by italics in the original table and appear as the column headings of Tables 3a–3d). For example, for RCTs, adequate statement of the study question is a desirable domain that a grading scheme should cover, but adequate description of study population, randomization, and blinding are critical domains.
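In effect, the check ‘does a candidate grading system cover every critical domain for its study design?’ reduces to a set comparison. The sketch below is illustrative only: the critical domains are restated from the column headings of Tables 3b and 3d, while the dictionary, function, and example checklist are invented for this sketch and are not part of the EPC instruments.

```python
# Illustrative sketch. Critical domains are restated from Tables 3b and 3d;
# the data structure and coverage check are invented for this example.

CRITICAL_DOMAINS = {
    "randomized controlled trial": {
        "study population", "randomization", "blinding", "interventions",
        "outcomes", "statistical analysis", "funding",
    },
    "diagnostic test study": {
        "study population", "adequate description of test",
        "appropriate reference standard",
        "blinded comparison of test and reference",
        "avoidance of verification bias",
    },
}


def uncovered_critical_domains(design, domains_covered):
    """Return the critical domains a candidate grading system fails to address."""
    return CRITICAL_DOMAINS[design] - set(domains_covered)


# Example: a hypothetical RCT checklist that omits blinding and funding.
missing = uncovered_critical_domains(
    "randomized controlled trial",
    {"study population", "randomization", "interventions",
     "outcomes", "statistical analysis"},
)
print(sorted(missing))  # ['blinding', 'funding']
```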

Rating strength of evidence

To evaluate schemes to rate the strength of a body of evidence, we specified three sets of aggregate criteria (Table 2) that combine key aspects of the design, conduct, and analysis of multiple studies on a given topic. The quality of evidence is essentially a summation of the direct grading of individual articles. The quantity of evidence concerns several variables, including the number of studies, their sample size or power, and the magnitude of the effects (benefits and harms) estimated in those studies. Finally, the coherence or consistency of results reflects the extent to which studies report findings of similar magnitude and direction, or report discrepant findings that nonetheless can be explained adequately by biological, population, setting, or other characteristics.
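As a rough illustration of how these three domains feed into the evaluation reported in the Results, a rating system can be classified by whether it addresses quality, quantity, and consistency fully, partially, or not at all. The sketch below is illustrative only: the domains come from Table 2 and the coverage levels mirror the fully met / partially met / not met judgments described later, but the class and function names are invented here and do not belong to any of the instruments reviewed.

```python
# Illustrative sketch. Domains come from Table 2; the three coverage levels mirror
# the "fully met / partially met / not met" judgments described in the Results.
from enum import Enum


class Coverage(Enum):
    FULL = "fully addressed"
    PARTIAL = "partially addressed"
    NONE = "not addressed or no information"


STRENGTH_DOMAINS = ("quality", "quantity", "consistency")


def classify_rating_system(coverage):
    """Summarize how a strength-of-evidence rating system covers the three domains."""
    scores = [coverage.get(domain, Coverage.NONE) for domain in STRENGTH_DOMAINS]
    if all(score is Coverage.FULL for score in scores):
        return "fully addresses all three domains"
    if all(score is not Coverage.NONE for score in scores):
        return "addresses all three domains at least partially"
    return "does not address all three domains"


print(classify_rating_system({
    "quality": Coverage.FULL,
    "quantity": Coverage.FULL,
    "consistency": Coverage.PARTIAL,
}))  # -> addresses all three domains at least partially
```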

Report preparation

The EPC team completed its evaluation and prepared a draft evidence report that was subjected to extensive external peer review, revised the report accordingly, and submitted the final version to AHRQ. Subsequently, AHRQ organized a 1-day invitational conference of quality of care and other experts to discuss the ramifications of the report and avenues for dissemination to numerous audiences concerned with various aspects of health care delivery, including quality improvement. This paper was developed in response to the group’s general recommendations.

Results

Grading study quality

The EPC investigators assessed 121 grading systems against the domain-specific criteria specified a priori for systematic reviews, RCTs, observational studies, and diagnostic test studies and assigned scores of fully met, partially met, or not met (or no information). From these objective comparisons, the team classified 19 generic scales or checklists as ones that can be used in producing systematic evidence reviews, technology assessments, or other QI-related materials [33–51]. Tables 3a–3d depict the extent to which they met the evaluation criteria.

Table 3a Evaluation of systems to grade the quality of systematic reviews

Critical domains in the evaluation criteria: Study question; Search strategy; Inclusion/exclusion; Data extraction; Study quality; Data synthesis/analysis; Funding.

Instruments evaluated: Irwig et al. (1994) [51]; Sacks et al. (1996) [33]; Auperin et al. (1997) [34]; Barnes and Bero (1998) [35]; Khan et al. (2000) [36].

In the original table, each instrument is scored against each critical domain as yes, partial, or not met/no information; these domain-by-domain scores are not reproduced here. Source: West et al. (2002) [26].

Table 3b Evaluation of systems to grade the quality of randomized controlled trials

Critical domains in the evaluation criteria: Study population; Randomization; Blinding; Interventions; Outcomes; Statistical analysis; Funding.

Instruments evaluated: Chalmers et al. (1981) [37]¹; Liberati et al. (1986) [38]¹; Reisch et al. (1989) [39]²; van der Heijden et al. (1996) [40]¹; de Vet et al. (1997) [41]¹; Sindhu et al. (1997) [42]¹; Downs and Black (1998) [43]²; Harbour and Miller (2001) [44]².

¹ Instruments for RCTs only. ² Instruments for both RCTs and observational studies.

In the original table, each instrument is scored against each critical domain as yes, partial, or not met/no information; these domain-by-domain scores are not reproduced here. Source: West et al. (2002) [26].

Table 3c Evaluation of systems to grade the quality of observational studies

Critical domains in the evaluation criteria: Comparability of subjects; Exposure/intervention; Outcome measure; Statistical analysis; Funding.

Instruments evaluated: Reisch et al. (1989) [39]¹; Spitzer et al. (1990) [45]²; Goodman et al. (1994) [46]²; Downs and Black (1998) [43]¹; Zaza et al. (2000) [47]²; Harbour and Miller (2001) [44]¹.

¹ Instruments for both RCTs and observational studies. ² Instruments for observational studies only.

In the original table, each instrument is scored against each critical domain as yes, partial, or not met/no information; these domain-by-domain scores are not reproduced here. Source: West et al. (2002) [26].

Table 3d Evaluation of systems to grade the quality of diagnostic test studies

Critical domains in the evaluation criteria: Study population; Adequate description of test; Appropriate reference standard; Blinded comparison of test and reference; Avoidance of verification bias.

Instruments evaluated: Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests (1996) [48]; Lijmer et al. (1999) [49]; National Health and Medical Research Council (2000) [50].

In the original table, each instrument is scored against each critical domain as yes, partial, or not met/no information; these domain-by-domain scores are not reproduced here. Source: West et al. (2002) [26].

Rating strength of evidence

After evaluating 40 systems for rating strength against the quality, quantity, and consistency criteria, we identified eight instruments that fully addressed all three domains for rating the strength of a body of evidence (Table 4) [31,52–58]. The team also identified an additional nine approaches that incorporated the three domains either fully or partially [7,36,44,59–64].

Table 4 Evaluation of systems to rate strength of bodies of evidence

Domains: Quality; Quantity; Consistency.

Systems: Gyorkos et al. (1994) [52]; Clarke and Oxman (1999) [53]; West et al. (1999) [54]; Briss et al. (2000) [55]; Greer et al. (2000) [56]; Guyatt et al. (2000) [57]; NHS Research and Development Centre of Evidence-Based Medicine (2001) [58]; Harris et al. (2001) [31].

As noted in the text, each of these systems fully addressed the quality, quantity, and consistency domains; the domain-by-domain scores in the original table are not reproduced here. Source: West et al. (2002) [26].

Discussion

Tools to draw on

Grading studies and rating strength of evidence can be done, and done well, with existing systems. For incorporating study quality and strength of evidence evaluations in systematic reviews, evidence reports, or technology assessments, groups can comfortably use one or more of these systems as a starting point. The EPC’s technical report describes and discusses the systems in more detail, because potential users need to take feasibility, ease of application, and certain other properties of these tools into account in selecting among them. The core conclusion remains: these systems constitute an acceptable set of tools available today for this critical step in developing products applicable to QI initiatives.

Agreement in principle about these ideas across scientists in several countries attests to the sturdiness of the core elements and concepts for assessing quality of studies and strength of evidence. Outcome measures, for example, are thought to be adequate when they are reliable (giving roughly the same answers when administered twice in short order), valid (measuring what they purport to measure), and clinically sensible. The factor of funding and sponsorship has been empirically validated more than once.

No one best approach

The EPC team offered other conclusions and observations about the state of the art, and science, of these tasks. Possibly most important is that there is no one ‘best approach’. Acceptable methods for grading the quality of studies must take the original study design into account; approaches suitable for RCTs or observational studies will not be applicable for diagnostic tests, for instance. Even systems that are said to be applicable to both RCTs and observational research may prove difficult to use and may yield less precise or reliable judgments than desired.

RCTs minimize selection bias, an important potential problem in observational studies. However, effectiveness and observational studies usually have larger total numbers of subjects and reflect more culturally, ethnically, and socially diverse patient populations and practice settings. No system for evaluating either quality or strength, no matter how good it seems to be, can completely resolve the inherent tension between these strengths (or weaknesses) of efficacy and effectiveness research. Users should match the topic and types of studies under review to an appropriate grading tool; one size will not fit all.

Future research, development, and evaluation

Even with these various rating and grading systems on the shelf, those in the QI world need to appreciate the work still needed to develop additional tools, provide better advice on how to use existing tools, and generate empirical documentation of the reliability and validity of new or extant systems. The extent to which these grading and rating steps influence guideline conclusions and recommendations needs to be evaluated. Until these research gaps are bridged, those wishing to produce authoritative systematic reviews, technology assessments, or QI and audit criteria will be hindered in their efforts. Future studies should: (i) address technical measurement issues; (ii) clarify the applicability of different systems to new, different, or less traditional clinical or policy topics; (iii) determine what factors make a difference in final quality scores for individual studies and, by extension, in judgments about the strength of bodies of evidence; and (iv) possibly most important, ascertain the impact of this process on conclusions, recommendations for QI programs, and ultimate health and policy outcomes [26].

Clinicians, managers, and QI leaders all face escalating demands on their time in an environment of increasingly complex decision-making like that reflected in Figure 1. Sorting out the science that enables practitioners, QI experts, and the public to make informed decisions is time-consuming and substantively challenging, given the accelerating pace of scientific discovery and the production of peer- and non-peer-reviewed literature. They can turn to evidence-based systematic reviews, guidelines, and recommendations for help, but they must have confidence in this information base if they are to proceed with conviction and authority and if they are to be held accountable for the resulting clinical or policy choices they make.

Figure 1 The environment for decision-making for quality improvement. Adapted with permission [65].

Two critical tasks in developing defensible evidence-based reviews, which form the basis of practice guidelines, quality review and audit criteria, and similar materials, are to grade the quality of individual studies and then to rate the strength of the overall body of evidence. When evidence-based reviews and recommendations incorporate these steps, decision-makers from the national policy level to the individual physician–patient relationship can have greater assurance that their choices will be well-informed, well-grounded, and appropriate to the challenges ahead.

References

1. Eddy DM. The use of evidence and cost effectiveness by the courts: how can it improve health care? J Health Polit Policy Law 2001; 26: 387–408.
2. Mulrow CD, Lohr KN. Proof and policy from medical research evidence. J Health Polit Policy Law 2001; 26: 249–266.
3. Kassirer JP, Cecil JS. Inconsistency in evidentiary standards for medical testimony: disorder in the courts. J Am Med Assoc 2002; 288: 1382–1387.
4. Lohr KN (ed). Medicare: A Strategy for Quality Assurance. Institute of Medicine. Washington, DC: National Academy Press, 1990.
5. Field MJ, Lohr KN (eds). Guidelines for Clinical Practice: From Development to Use. Washington, DC: National Academy Press, 1992.
6. Lohr KN, Eleazar K, Mauskopf J. Health policy issues and applications for evidence-based medicine and clinical practice guidelines. Health Policy 1998; 46: 1–19.
7. Sackett DL, Straus SE, Richardson WS et al. Evidence-Based Medicine: How to Practice and Teach EBM. 2nd edn. London: Churchill Livingstone, 2000.
8. Harris R, Lohr KN. Screening for prostate cancer: an update of the evidence. Ann Intern Med 2002; 137: 917–929.
9. Thompson DC, McPhillips H, Davis RL, Lieu TL, Homer CJ, Helfand M. Universal newborn hearing screenings: summary of evidence. J Am Med Assoc 2001; 286: 2000–2010.
10. Hayden M, Pignone M, Phillips C, Mulrow C. Aspirin for the primary prevention of cardiovascular events: a summary of the evidence for the U.S. Preventive Services Task Force. Ann Intern Med 2002; 136: 161–172.
11. Pignone M, Rich M, Teutsch SM, Berg AO, Lohr KN. Screening for colorectal cancer in adults at average risk: a summary of the evidence for the U.S. Preventive Services Task Force. Ann Intern Med 2002; 137: 132–141.
12. Surgeon General’s Report. Reducing the Health Consequences of Smoking – 25 Years of Progress. 1989. Available at: http://www.cdc.gov/tobacco/sgr_1989./index.htm (last accessed 10 November 2003).
13. Leatherman S, Berwick D, Iles D et al. The business case for quality: case studies and an analysis. Health Aff (Millwood) 2003; 22: 17–30.
14. Berwick DM. Disseminating innovations in health care. J Am Med Assoc 2003; 289: 1969–1975.
15. Black N. Evidence based policy: proceed with care. Br Med J 2001; 323: 275–279.
16. Donald A. Research must be taken seriously. Br Med J 2001; 323: 278–279.
17. Walshe K, Rundall T. Evidence-based management: from theory to practice in health care. Milbank Mem Fund Q 2001; 79: 429–457.
18. Lavis JN, Ross SE, Hurley JE et al. Examining the role of health services research in public policymaking. Milbank Q 2002; 80: 125–154.
19. Ham C, Hunter DJ, Robinson R. Evidence based policymaking. Br Med J 1995; 310: 71–72.
20. Gray JAM. Evidence-Based Healthcare. London: Churchill Livingstone, 1999.
21. Woolf SH. The need for perspective in evidence-based medicine. J Am Med Assoc 1999; 282: 2358–2365.
22. Sturm R. Evidence-based health policy versus evidence-based medicine. Psychiatr Serv 2002; 53: 1499.
23. Williamson JW. Assessing and Improving Health Care Outcomes: The Health Accounting Approach to Quality Assurance. Cambridge, MA: Ballinger Publishing Co., 1978.
24. Millenson MM. Beyond the Managed Care Backlash. Medicine in the Information Age. PPI Policy Report No. 1. Washington, DC: Progressive Policy Institute, 1997.
25. Lohr KN, Carey TS. Assessing ‘best evidence’: issues in grading the quality of studies for systematic reviews. Jt Comm J Qual Improv 1999; 25: 470–479.
26. West SL, King V, Carey TS et al. Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment No. 47. AHRQ Publication No. 02-E016. Rockville, MD: Agency for Healthcare Research and Quality, 2002.
27. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Quality of Reporting of Meta-analyses. Lancet 1999; 354: 1896–1900.
28. Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. J Am Med Assoc 2001; 285: 1987–1991.
29. Stroup DF, Berlin JA, Morton SC et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting. Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group. J Am Med Assoc 2000; 283: 2008–2012.
30. The STARD Group. The STARD Initiative – Towards Complete and Accurate Reporting of Studies on Diagnostic Accuracy. November 2001. Available at: http://www.consort-statement.org/stardstatement.htm (2 December 2002, date last accessed).
31. Harris RP, Helfand M, Woolf SH et al. Current methods of the US Preventive Services Task Force: a review of the process. Am J Prev Med 2001; 20: 21–35.
32. Hurtado MP, Swift EK, Corrigan JC (eds). Envisioning the National Health Care Quality Report. Institute of Medicine. Washington, DC: National Academy Press, 2001.
33. Sacks HS, Reitman D, Pagano D, Kupelnick B. Meta-analysis: an update. Mt Sinai J Med 1996; 63: 216–224.
34. Auperin A, Pignon JP, Poynard T. Review article: critical review of meta-analyses of randomized clinical trials in hepatogastroenterology. Alimentary Pharmacol Ther 1997; 11: 215–225.
35. Barnes DE, Bero LA. Why review articles on the health effects of passive smoking reach different conclusions. J Am Med Assoc 1998; 279: 1566–1570.
36. Khan KS, Ter Riet G, Glanville J, Sowden AJ, Kleijnen J. Undertaking Systematic Reviews of Research on Effectiveness. CRD’s Guidance for Carrying Out or Commissioning Reviews. York, UK: University of York, NHS Centre for Reviews and Dissemination, 2000.
37. Chalmers TC, Smith H Jr, Blackburn B et al. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981; 2: 31–49.
38. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol 1986; 4: 942–951.
39. Reisch JS, Tyson JE, Mize SG. Aid to the evaluation of therapeutic studies. Pediatrics 1989; 84: 815–827.
40. van der Heijden GJ, van der Windt DA, Kleijnen J, Koes BW, Bouter LM. Steroid injections for shoulder disorders: a systematic review of randomized clinical trials. Br J Gen Pract 1996; 46: 309–316.
41. de Vet HCW, de Bie RA, van der Heijden GJMG, Verhagen AP, Sijpkes P, Kipschild PG. Systematic reviews on the basis of methodological criteria. Physiotherapy 1997; 83: 284–289.
42. Sindhu F, Carpenter L, Seers K. Development of a tool to rate the quality assessment of randomized controlled trials using a Delphi technique. J Adv Nurs 1997; 25: 1262–1268.
43. Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health 1998; 52: 377–384.
44. Harbour R, Miller J. A new system for grading recommendations in evidence based guidelines. Br Med J 2001; 323: 334–336.
45. Spitzer WO, Lawrence V, Dales R et al. Links between passive smoking and disease: a best-evidence synthesis. A report of the Working Group on Passive Smoking. Clin Invest Med 1990; 13: 17–42; discussion 43–46.
46. Goodman SN, Berlin J, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med 1994; 121: 11–21.
47. Zaza S, Wright-De Aguero LK, Briss PA et al. Data collection instrument and procedure for systematic reviews in the Guide to Community Preventive Services. Task Force on Community Preventive Services. Am J Prev Med 2000; 18: 44–74.
48. The Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests. Recommended Methods. Updated 6 June 1996. Available at: http://www.cochrane.org/cochrane/sadtdoc1.htm (last accessed 10 November 2003).
49. Lijmer JG, Mol BW, Heisterkamp S et al. Empirical evidence of design-related bias in studies of diagnostic tests. J Am Med Assoc 1999; 282: 1061–1066.
50. National Health and Medical Research Council. How to Review the Evidence: Systematic Identification and Review of the Scientific Literature. Canberra, Australia: NHMRC, 2000.
51. Irwig L, Tosteson AN, Gatsonis C et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994; 120: 667–676.
52. Gyorkos TW, Tannenbaum TN, Abrahamowicz M et al. An approach to the development of practice guidelines for community health interventions. Can J Public Health 1994; 85 (suppl. 1): S8–S13.
53. Clarke M, Oxman AD. Cochrane Reviewer’s Handbook 4.0. The Cochrane Collaboration, 1999; Issue 1.
54. West SL, Garbutt JC, Carey TS et al. Pharmacotherapy for Alcohol Dependence. Evidence Report/Technology Assessment No. 5. AHCPR Publication No. 99-E004. Rockville, MD: Agency for Health Care Policy and Research, 1999.
55. Briss PA, Zaza S, Pappaioanou M et al. Developing an evidence-based guide to community preventive services – methods. The Task Force on Community Preventive Services. Am J Prev Med 2000; 18: 35–43.
56. Greer N, Mosser G, Logan G, Halaas GW. A practical approach to evidence grading. Jt Comm J Qual Improv 2000; 26: 700–712.
57. Guyatt GH, Haynes RB, Jaeschke RZ et al. Users’ guides to the medical literature: XXV. Evidence-based medicine: principles for applying the users’ guides to patient care. Evidence-Based Medicine Working Group. J Am Med Assoc 2000; 284: 1290–1296.
58. NHS Research and Development Centre of Evidence-Based Medicine. Levels of Evidence. Available at: http://www.york.ac.uk/inst/crd/ (12 January 2001, date last accessed).
59. How to read clinical journals: IV. To determine etiology or causation. Can Med Assoc J 1981; 124: 985–990.
60. Guyatt GH, Cook DJ, Sackett DL, Eckman M, Pauker S. Grades of recommendation for antithrombotic agents. Chest 1998; 114: 441S–444S.
61. Guyatt GH, Sackett DL, Sinclair JC, Hayward R, Cook DJ, Cook RJ. Users’ guides to the medical literature. IX. A method for grading health care recommendations. Evidence-Based Medicine Working Group. J Am Med Assoc 1995; 274: 1800–1804.
62. van Tulder MW, Koes BW, Bouter LM. Conservative treatment of acute and chronic nonspecific low back pain. A systematic review of randomized controlled trials of the most common interventions. Spine 1997; 22: 2128–2156.
63. Hoogendoorn WE, van Poppel MN, Bongers PM, Koes BW, Bouter LM. Physical load during work and leisure time as risk factors for back pain. Scand J Work Environ Health 1999; 25: 387–403.
64. Ariens GA, van Mechelen W, Bongers PM, Bouter LM, van der Wal G. Physical risk factors for neck pain. Scand J Work Environ Health 2000; 26: 7–19.
65. Haynes RB, Devereaux PJ, Guyatt GH. Clinical expertise in the era of evidence-based medicine and patient choice. ACP J Club 2002; 136: A11–A14.

© International Society for Quality in Health Care and Oxford University Press 2004; all rights reserved
