Why should clinicians care about Bayesian methods?
Introduction
Medical researchers are heavy users of inferential methods. Leading medical journals are replete with clinical and epidemiological studies whose conclusions are backed by quantitative inferential concepts such as p-values and confidence intervals. Small wonder, then, that there is growing interest in Bayesian inference, which is said to have a number of key advantages over conventional methods (see, e.g. Lilford and Braunholtz, 1996; Bland and Altman, 1998; Spiegelhalter et al., 1999).
Yet despite the increasing awareness of their existence, the actual application of Bayesian methods remains very much the exception rather than the rule among working clinicians. There are many potential explanations for this, from concern about the much-discussed use of prior belief in Bayesian analysis, to the simple reluctance of old dogs to learn new (statistical) tricks.
I would argue, however, that the fundamental reason why Bayesian methods have yet to become widely adopted is that they have yet to pass the “cost–benefit analysis” any hard-pressed professional performs when hearing of some new technique: will the cost of acquiring the necessary expertise be compensated for by real practical benefit? Many potential users of Bayesian methods, it would seem, have concluded that the effort of switching from the familiar old frequentist methods to Bayesian techniques is unlikely to give sufficient return on (intellectual) investment.
Of the two factors involved in such a cost–benefit analysis, the cost of adopting Bayesian techniques can certainly seem high to non-specialists: the typical introductory text on Bayesian inference has a mathematical content well above that of the equivalent texts for conventional statistics. Yet I would argue that this is not the principal reason why Bayesian techniques have yet to become widely adopted: their mathematical basis is within the reach of anyone — given the motivation. Rather, I suspect the chief reason is the failure of Bayesians to provide a ready answer to a rather obvious question: if the standard methods of inference are so awful, how come the whole scientific enterprise has not collapsed around our ears?
In what follows, I suggest that while the failings of conventional inference are real enough, they typically have either little practical impact on the assessment of evidence, or are easily explained away as the result of other statistical deficiencies. As such, avoiding them hardly provides much motivation for switching to Bayesian methods.
If the “stick” of supposedly dire consequences of using frequentist methods has not succeeded in promoting the widespread use of Bayesian methods, what might? In this paper, I suggest a possible “carrot”, which focuses instead on one of the major practical advantages of Bayesian methods: their ability to take explicit, quantitative account of extant knowledge. Specifically, I outline a Bayesian method of assessing quantitatively the credibility of a new research finding. The technique is easy to apply, and produces a measure of credibility whose interpretation is both straightforward and transparent. As such, it may provide a gentle “entrée” to Bayesian modes of thought, which allows the non-specialist to appreciate their power without losing all contact with familiar frequentist methods.
I begin, however, by reviewing briefly the concerns raised over the years about frequentist inference, and why these concerns have had so little impact among working clinicians.
Five reasons to fret about frequentist methods
Frequentist concepts such as p-values have faced criticism almost since the advent of “significance testing” in the 1920s. The criticisms have ranged from qualms about the interpretation and arbitrariness of the well-known p=0.05 criterion for significance, through concern that p-values exaggerate real “significance”, to claims that frequentist methods conceal the ineluctable presence of subjectivity in inference. Indeed, so sustained has this criticism been that one wonders how frequentist
Confidence intervals: better than p-values?
So far, I have focused primarily on the deficiencies of p-values, used to evaluate the significance of point-null hypotheses. However, criticism of p-values is not a uniquely Bayesian sport: frequentists have themselves argued against the use of p-values, on the grounds that they fail to convey information about effect size or sample size. It is entirely possible for researchers to claim a “statistically significant” effect and yet for the effect to be so small as to have no significance in the
What is required to change old habits?
Thus far I have outlined the principal reasons put forward by advocates of Bayesian methods for worrying about conventional statistical inference. I have also suggested why these failings have had so little impact outside the statistical community: while they are real and of conceptual importance, they have not led to inferential meltdown in routine research.
If Cassandra-like warnings of the dire consequences of using frequentist methods have not convinced working clinicians of the merits of
Credibility: a suitable case for Bayesian treatment
While the outcomes of medical research are now routinely stated quantitatively via such measures as odds ratios, this is rarely the case for their inherent credibility — that is, the level to which the finding is both plausible in the light of current knowledge, and backed by persuasive weight of evidence. Even where a finding is controversial or unexpected, assessments of its credibility almost always consist of broad-brush qualitative arguments based on previous research or experience. Such
Derivation of the critical prior interval (CPI)
To fix ideas, I shall henceforth focus on log-normally distributed ORs as the means of summarising both the new data and the assessment of credibility. In deriving the Critical Prior Interval, I shall also assume this interval to be symmetrical in log-odds about a mean value of zero. The credibility assessment is thus being made on the basis of “cautious reasonable scepticism” (Kass and Greenhouse, 1989), consistent with the judiciously sceptical stance conventionally adopted in scientific
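Under the assumptions just stated (a log-normally distributed OR, and a sceptical prior that is normal on the log-odds scale and centred on ln OR = 0), the Critical Prior Interval has a simple closed form; the sketch below is one algebraic route to it, not a verbatim transcription of the paper's derivation, and the function name is illustrative.

```python
import math

def critical_prior_interval(ci_low, ci_high):
    """Sceptical limit S of the critical prior interval (1/S, S).

    Assumes the 95% CI (ci_low, ci_high) for the odds ratio is
    log-normal (i.e. normal on the log-odds scale) and excludes
    OR = 1, and that the sceptical prior is normal on log-odds
    and centred on zero.

    On the log scale the data give mean m = (lnU + lnL)/2 and
    standard error s = (lnU - lnL)/(2 * 1.96).  A zero-centred
    prior with variance v makes the posterior 95% interval just
    reach OR = 1 when v = (1.96 * s**2)**2 / (m**2 - (1.96*s)**2);
    the 1.96 factors cancel in the resulting prior limit, leaving

        S = exp((lnU - lnL)**2 / (4 * sqrt(lnL * lnU)))
    """
    lnL, lnU = math.log(ci_low), math.log(ci_high)
    if lnL * lnU <= 0:
        raise ValueError("CI must exclude OR = 1 (a 'significant' result)")
    s_limit = math.exp((lnU - lnL) ** 2 / (4.0 * math.sqrt(lnL * lnU)))
    return 1.0 / s_limit, s_limit
```

For instance, `critical_prior_interval(0.4, 0.7)` gives roughly (0.87, 1.15): only a sceptic whose prior 95% interval for the OR is at least this tightly confined around 1 could dismiss such a finding, so a narrow CPI signals a credible result.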
Credibility assessment: a worked example
As an example of credibility assessment in action, consider a recent case-control study of the impact of simple lifestyle modifications on cardiovascular risk (Spencer et al., 1999). This found that among men aged 27–64, taking non-vigorous exercise and avoiding added salt were both associated with statistically significant reductions in the risk of acute myocardial infarction (AMI). For non-vigorous exercise, an OR of 0.5 with a standard 95% CI of (0.4, 0.7) was found, while avoidance of added salt
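To make the arithmetic behind the example concrete, the following sketch applies the standard normal-normal conjugate update on the log-odds scale to the non-vigorous exercise figures (OR 0.5, 95% CI 0.4–0.7), using the critical sceptical prior implied by the derivation in the previous section. Variable names are illustrative, not the paper's notation.

```python
import math

# Non-vigorous exercise: OR = 0.5, 95% CI (0.4, 0.7) (Spencer et al., 1999)
ci_low, ci_high = 0.4, 0.7
lnL, lnU = math.log(ci_low), math.log(ci_high)

# Data summarised on the log-odds scale: mean and standard error
m = 0.5 * (lnL + lnU)
s = (lnU - lnL) / (2 * 1.96)

# Critical sceptical prior: normal, centred on ln OR = 0, with the
# variance at which the posterior 95% interval just reaches OR = 1
prior_var = (1.96 * s**2) ** 2 / (m**2 - (1.96 * s) ** 2)

# Standard normal-normal conjugate update
post_var = prior_var * s**2 / (prior_var + s**2)
post_mean = m * prior_var / (prior_var + s**2)

# Posterior 95% interval for the OR: its upper limit sits at ~1.0,
# confirming this prior is exactly sceptical enough to nullify the
# finding's statistical significance
post_ci = (math.exp(post_mean - 1.96 * math.sqrt(post_var)),
           math.exp(post_mean + 1.96 * math.sqrt(post_var)))
```

Any prior even slightly more open to a real effect (a larger prior variance) leaves the whole posterior interval below 1, so the finding survives a considerable degree of scepticism.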
Conclusion
The relatively minor inroads that Bayesian analysis has made into mainstream medical journals are pretty disheartening for those who (like the author) believe it has much to offer working clinicians. I have argued here that the blame lies principally with the strategy adopted by many Bayesians (including, again, myself) for convincing others to take up Bayesian methods: Cassandra-like warnings of the terrible fate that awaits those who persist in their frequentist habits. In this paper, I have
Acknowledgements
It is a pleasure to thank David Spiegelhalter, Dennis Lindley and Mark Selinger for their constructive comments on an early draft, and Iain Chalmers, Richard Lilford and Catherine Elsworth for valuable discussions.
References (23)
- Berger, J.O., Delampady, M., 1987. Testing precise hypotheses. Statist. Sci.
- Berger, J.O., Sellke, T., 1987. Testing a point null hypothesis: the irreconcilability of P-values and evidence. J. Amer. Statist. Assoc.
- Berlin, J.A., Colditz, G.A., 1990. A meta-analysis of physical activity in the prevention of coronary heart disease. Amer. J. Epidemiol.
- Bland, J.M., Altman, D.G., 1998. Bayesians and frequentists. British Med. J.
- Bourke, G.J., et al., 1985. Interpretation and Uses of Medical Statistics.
- Edwards, W., Lindman, H., Savage, L.J., 1963. Bayesian statistical inference for psychological research. Psychol. Rev.
- Egger, M., Schneider, M., Davey Smith, G., 1998. Meta-analysis: spurious precision? Meta-analysis of observational studies. British Med. J.
- Good, I.J., 1950. Probability and the Weighing of Evidence.
- He, J., et al., 1999. Dietary sodium intake and subsequent risk of cardiovascular disease in overweight adults. J. Amer. Med. Assoc.
- et al., 1993. Starting Statistics in Psychology and Education.
- Hines, T.M., Comprehensive review of biorhythm theory. Psychol. Rep.