The Enlightenment philosopher David Hume wrote, “in our reasonings concerning matter of fact, there are all imaginable degrees of assurance, from the highest certainty to the lowest species of moral evidence. A wise man, therefore, proportions his belief to the evidence.” As I wrote in my last essay, the scientific method, built on evidence, has advanced human knowledge of the natural world by leaps and bounds. Science is the new arbiter of truth. In medicine, approximately $180 billion is spent annually in the United States to generate evidence for medical truths. While on the order of a million medical papers are published each year, quantity does not equal quality. Broadly, the quality of evidence depends on how directly it demonstrates the effectiveness of the treatment at hand and on how resistant it is to bias and confounding. In pursuit of these goals, the evidence-based medicine (EBM) movement has developed a pyramid that sorts evidence into ranked categories. In this scheme, double-blinded randomized controlled trials (RCTs) are considered the gold standard of medical research and evidence collection. Even large amounts of evidence from lower rungs of the hierarchy cannot supersede the evidence from an RCT.
In an RCT, a study population is divided into two or more groups at random, and an intervention is allocated to one group and a placebo (future essay) to the control group. Theoretically, this process of randomization protects against the pitfalls associated with bias and confounding. It distributes characteristics at random between the groups, so that differences in outcome between the groups can be attributed to the specific intervention being tested. The probability that a difference at least as large as the one observed would arise by chance alone, if the intervention had no effect, is quantified by the p-value. In reality, all of these methodologies are subject to the vagaries of chance and the influences of industry. While a p-value < 0.05 is the conventional threshold for statistical significance, it also implies that roughly 1 in 20 RCTs of a treatment with no real effect will still show a positive result by chance. More insidiously, according to respected opinions and estimates, industry biases are the rule rather than the exception, and positive results from a study should be expected. Using Bayes’ rule, the philosopher of medicine Jacob Stegenga states we “ought to have a high prior probability of that evidence” and “when presented with evidence for the hypothesis we ought to have a low estimation of the likelihood of that hypothesis.” In other words, evidence that was expected whether or not the treatment works should move our beliefs very little. Positive results from industry-funded RCTs are the rule and should be viewed with more than average skepticism.
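To make the Bayesian point above concrete, here is a minimal sketch of Bayes’ rule applied to a positive trial result. Every number in it is an assumption chosen for illustration, not an estimate from the literature: `prior` is how plausible the treatment was before the trial, `power` is the chance a trial detects a real effect, and `false_positive_rate` is the chance a trial reports a positive result when the treatment does not work (0.05 if chance is the only source of error, higher if bias is at play).

```python
def posterior_given_positive(prior, power=0.8, false_positive_rate=0.05):
    """P(treatment works | trial is positive), computed with Bayes' rule.

    All inputs are illustrative assumptions, not measured quantities.
    """
    true_positive = power * prior                        # works AND trial is positive
    false_positive = false_positive_rate * (1.0 - prior)  # does not work BUT trial is positive
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Chance alone (p < 0.05) at several assumed priors.
    for prior in (0.5, 0.1, 0.01):
        print(f"prior={prior:.2f}  posterior={posterior_given_positive(prior):.3f}")

    # Same prior of 0.1, but a hypothetical biased process that reports
    # a positive result half the time even when the treatment does not work.
    biased = posterior_given_positive(0.1, false_positive_rate=0.5)
    print(f"prior=0.10 with bias  posterior={biased:.3f}")
```

Under these assumed numbers, a positive result for a treatment with a prior of 0.1 raises its probability to about 0.64 when chance is the only source of error, but only to about 0.15 when positive results are expected half the time regardless. The more a positive result is expected in advance, the less it should change our minds.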
Although RCTs do not suffer from the pitfalls of confounding, they do suffer from the pitfalls of external validity, that is, errant extrapolations. By definition, RCTs are performed in controlled, protocolled, and structured environments with a narrow set of inclusion and exclusion criteria. They are also often performed on small samples (which is why the results of many studies are pooled in a meta-analysis). All of these factors, in effect, reduce their validity for the general population living in a messy, wild, and unprotocolled world. In fact, in some ways, the better the RCT is methodologically and the cleaner its results in a controlled setting, the less applicable those results are to real-life clinical scenarios. As the philosopher of science Nancy Cartwright says, the inferential chasms from “it works somewhere” to “it works everywhere” and “it works for me” (future essay) cannot be minimized or dismissed. If an RCT shows an effect in a controlled setting with narrow inclusion criteria and broad exclusion criteria, does it mean that the results can be generalized to a population with a near infinite variation in the combination of traits, living in a largely uncontrolled and messy world?
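The gap between “it works somewhere” and “it works everywhere” can be illustrated with a small arithmetic sketch. The subgroup names, effect sizes, and population shares below are entirely hypothetical; in practice the effect in excluded groups is exactly what a trial does not tell us. The sketch only shows how an average effect shrinks when the treated population differs from the enrolled one.

```python
def average_effect(effects, shares):
    """Weighted mean treatment effect for a population with the given subgroup shares."""
    total = sum(shares.values())
    return sum(effects[group] * share for group, share in shares.items()) / total


# Hypothetical absolute risk reductions by subgroup (assumed for illustration only).
effects = {
    "meets_inclusion_criteria": 0.10,
    "excluded_comorbidities": 0.02,
    "excluded_elderly": 0.00,
}

# The trial enrolls only patients who meet the inclusion criteria.
trial_population = {"meets_inclusion_criteria": 1.0}

# A hypothetical clinic sees a much broader mix of patients.
clinic_population = {
    "meets_inclusion_criteria": 0.40,
    "excluded_comorbidities": 0.35,
    "excluded_elderly": 0.25,
}

trial_effect = average_effect(effects, trial_population)
clinic_effect = average_effect(effects, clinic_population)

if __name__ == "__main__":
    print(f"effect in trial population:  {trial_effect:.3f}")
    print(f"effect in clinic population: {clinic_effect:.3f}")
```

With these made-up numbers, the trial reports a risk reduction of 0.10 while the clinic population would see less than half of that on average, and an individual in an excluded group might see none at all.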
The pursuit of evidence in the service of medical truths is a multibillion-dollar industry that generates on the order of a million research papers annually. Parsing this research so that we can “proportion our beliefs to the evidence” is beyond the cognitive capacity of any single person. Therefore, out of necessity, the EBM movement has created the evidence pyramid to rank the quality and usability of evidence. Nonetheless, this pyramid is at best instrumentally, contextually (future essay), or heuristically useful. Reliable and usable evidence can be generated from any rung of the hierarchy, and using the evidence generated at each level has its trade-offs. Controlled and randomized studies can reduce the effects of confounding but often do not translate to real-world messiness. Thus, generalizations to populations or individuals must be made with care and rigor. In contrast, retrospective evidence (next essay) can suffer from confounding, but it can yield invaluable evidence on medical interventions gathered during routine clinical care and may more accurately reflect the general population seeking treatment for a particular condition.
Readings: