Research is read more than it is written.
It is a truth fundamental to research that new knowledge is advanced through the acquisition and interpretation of existing knowledge.
There are numerous estimates of the number of research papers published every year, with one claim of 30,000 journals serving two million articles a year. Sci-Hub, an open-access research paper repository, lists over 83 million papers. Quacquarelli Symonds, a company specialising in higher-education analysis, considered 5,500 research institutes for their 2021 World University Rankings. Some scientists publish more than 70 papers a year, a frequency of about one every five days.
Clearly, not all of these can be of equal quality, nor are all likely to stand much scrutiny. The challenge is how to quickly evaluate research claims and assess whether the results can inform your own work.
How does a scientist read and evaluate research?
Any scholarly work rests on the quality of the data informing the analysis. The objective of this lesson is to learn how to read and review a research paper, with specific focus on sample randomisation, and assessing a level of confidence using core statistical and logical techniques.
The two papers we will refer to are these:
Budoff, Matthew J., Deepak L. Bhatt, April Kinninger, Suvasini Lakshmanan, Joseph B. Muhlestein, Viet T. Le, Heidi T. May, et al. 2020. “Effect of Icosapent Ethyl on Progression of Coronary Atherosclerosis in Patients with Elevated Triglycerides on Statin Therapy: Final Results of the EVAPORATE Trial.” European Heart Journal. doi:10.1093/eurheartj/ehaa652.
Packer, Milton, Stefan D. Anker, Javed Butler, et al. 2020. “Cardiovascular and Renal Outcomes with Empagliflozin in Heart Failure.” New England Journal of Medicine. doi:10.1056/NEJMoa2022190.
Don't worry if you don't understand them on first read. Our objective is to learn how to evaluate research even when you are not immersed in the topic.
In late 2017, while on a cruise vacation which stopped at the French Caribbean island of Martinique, a British couple reported a burning rash on their buttocks. The medical team on the ship prescribed antibiotics and an antifungal agent, which did nothing to improve their symptoms. On return to the UK, their conditions had worsened and they went to a hospital in Cambridge for further treatment. There they were diagnosed as suffering from cutaneous larva migrans, a condition caused by nematode hookworms of the Ancylostomatidae family.
Their condition had worsened by then, and the hookworms had spread to their lungs, causing a persistent cough, shortness of breath and pain. They were prescribed ivermectin, and - since the diagnosis was relatively unusual - their doctors asked if they could write up the case as a study for the British Medical Journal. Approval was granted, and the paper appeared as Cutaneous larva migrans with pulmonary involvement \cite{maslin_cutaneous_2018}. The paper included a number of photographs of the couple's buttocks to facilitate diagnosis.
Almost immediately, the study was picked up and presented by British tabloid newspapers, including the Daily Mail (warning: graphic content). This, obviously, horrified the couple and they contacted the BMJ to ask that the paper be retracted.
The BMJ considered the request, and retracted the paper:
With no admission of liability, BMJ has removed this article voluntarily at the request of the patient concerned.
Given the legitimate medical interest in well-founded research that supports diagnosis of a risky condition, should a patient have the right to retract permission they have already granted?
What are the limits and considerations for privacy and anonymity?
Researchers, particularly those working with human subjects, have a duty of care towards their stakeholders that arises from rights they, or their stakeholders, may hold.
Negative rights are also called rights of non-interference; the right to act, think or speak without being interrupted or interfered with. These are rights which oblige others not to do things to the rights-possessor. Privacy, the right to deny others knowledge of certain information, falls within negative rights. \cite{baggini_ethics_2007}
Conversely, a positive right imposes a duty upon others, such as protection from harm or - as in a medical environment - a duty of care. These are then rights of obligation.
The intersection of these rights may be codified in research practice, but different countries have different legislative environments.
As researchers, to whom or what do we owe these rights?
We may differentiate between moral agents and moral subjects.
A moral agent is one who is able to act morally or immorally, who can choose between these paths of action, and who can be judged to have acted well or badly on moral grounds. Computers are not moral agents, even if they are programmed to do good or evil things, because they have no control - no agency - over their choices. Similarly, a 2015 US court case, ruling on whether animals can be subjected to medical experiments, judged that chimpanzees are property with no writ of habeas corpus. That does not mean that nothing else has rights, just that such entities lack agency.
A moral subject is something, or someone, for which things can go well or badly, which has welfare interests, and whose interests a moral agent has a responsibility to consider.
Is a geographical feature a moral subject? To the extent that a mountain, a valley, or a gorge has no point of view, no. However, that does not mean that destruction of such an object won't impact other moral agents or moral subjects.
In most cases, moral agents and subjects are interchangeable. A moral agent is always also a moral subject, but not all moral subjects are moral agents. If a baby were to topple a vase which fell on a person's head and killed them, one wouldn't declare that the baby had made a conscious decision to do so. A baby has no conscious understanding of right or wrong so is not capable of moral agency, yet it is a moral subject.
This consideration may be governed by statute with legal consequences for ignoring them, but ethics must go further than codified law - especially when researchers work at the outer boundary of knowledge and codified practice - and consider both the ethical and moral case for an action.
What is legal may not be moral and what is moral may not be legal.
In May 2020, Rio Tinto, a mining company, destroyed the Juukan Gorge, a 46,000-year-old Aboriginal heritage site, to clear way for a mine.
They had full legal permission to do so. However, they ignored the pleas of the traditional owners of the gorge, the Puutu Kunti Kurrama and Pinikura people, causing them quantifiable harm. Rio Tinto themselves conducted a research survey at the Juukan Gorge in 2014 "that gathered more than 7,000 artefacts, including a plaited belt made from human hair that DNA testing revealed belonged to the direct ancestors of PKKP alive today, and tools and grinding stones which showed those tools had been in use far earlier than archeologists previously believed."
Rio Tinto had a legal right to destroy the heritage site, but did they have a moral one? And even if they had sympathy with the plea, they may believe that the value of the ore they extract from the site justifies its destruction.
These are means/ends trade-offs and it is essential that they be evaluated together so as to arrive at a valid conclusion as to a course of action.
In 2012, Facebook conducted a week-long experiment on 689,000 of its users \cite{kramer_experimental_2014}. People were randomly assigned to one of two groups which received either more positive or more negative stories in their news feed than they would regularly receive. People who received more positive stories tended to post more positively themselves, while the converse occurred for those exposed to more negative stories. The research suggests that social media creates a social contagion, acting to transfer emotions to others and reinforce those emotions.
None of the participants in the study were aware of their involvement, and none had even been approached to request formal consent.
Writing in When and Why Is Research without Consent Permissible?, Gelinas et al. set out two conditions under which consent requirements may be waived.
This is an express claim that the ends, the research outcomes, outweigh the means, the consent required of the participants. Clearly, though, you'd want to be extremely clear that whoever is considering the means/ends be completely impartial, or that each party has some representation.
Follow-up questions are: who decides if no rights are harmed, or if the infringement is minor? What constitutes a minor harm, and who decides - and on what grounds - that the social value is greater than any harm caused?
Any review would need to be conducted by impartial agents and consider all of the actions and potential consequences.
The authors declared that Facebook's existing user terms gave this consent, and that the non-Facebook employees leading the study didn't have access to individual user information anyway.
The researchers claimed what they did was ethical because the data were anonymous. They addressed negative rights - that moral subjects have a right to privacy - but not positive rights - that moral subjects have a right not to be harmed. They ignored whether it was ethical to deliberately manipulate the emotional state of users without their consent \cite{shaw_facebooks_2015}. "The main result is that exposure to negative posts made people feel worse; if valid, this means that hundreds of thousands of people were made less happy by the study."
Unnecessary involvement in a study can cause harm to healthy and unhealthy people. The least one would hope from any research is that participants - subjects - are informed and consent to their participation, and that they are not harmed as a result of their consent.
Shaw's review highlights that the number of participants in the Facebook study was far in excess of that required to produce a statistically meaningful result, implying that far more people, of unpredictable backgrounds, were affected than needed to be. The sample would also have included children under the age of 13 and people particularly susceptible to anxiety. Shaw also argues that the research itself may be invalid, since its design and methods produced a tiny effect size, meaning the results carry little statistical meaning: a large risk of harm relative to a low return in social value.
Perhaps this research really is of such pressing importance that harm to participants could be ignored, although this isn't supported by the quality of the results, but this was not considered during ethical review or by the researchers themselves.
This leads us back to the BMJ decision. Clearly, the authors had received consent from their patients. However, the patients themselves expressed fear of harm after publication. It is easy to see how this fear could be realised. We know the date (late 2017), the location of the cruise ship (Martinique), the approximate age and ethnicity of the patients (from the photographs, even of their posteriors), the hospital where they were attended (Cambridge), and the names of the physicians who attended them. It wouldn't be particularly difficult for their friends and colleagues to figure out it was them, and at that point they are no longer anonymous.
The potential for harm to their privacy and to their dignity is obvious. It could be argued that retracting the paper after international news coverage is a bit late, but the principle is important too.
Protocols of randomized trials specify inclusion and exclusion criteria to determine the population under study. Exclusion criteria typically focus on identifying subjects who might be harmed by the study intervention, those for whom benefit is doubtful and those who are unlikely to provide useful data. Inclusion criteria tend to focus on risk: all trials identify the population at risk for the study event, some trials additionally specify criteria to define a study population at high-risk. \cite{vickers_selecting_2006}
People may give their permission to participate in a research study, but it isn't indefinite or all-encompassing. They may expect discomfort, but not active harm. A person's rights as a moral subject - both positive and negative - are critical to ensuring valid and repeatable research, but also as a marker to the overall approach of the researchers themselves.
Researchers who ensure their subjects' rights are considered throughout the research process are also ensuring the validity of their study. Randomisation - of which more in 2.2 Curation - relies on anonymity. If agents of the research process have any bias - any expectations of behaviour or effect from certain categories of patients - they may, consciously or not, steer their measurements towards an expected result.
These biases can be measured in the results and may render research conclusions unverifiable and invalid. The risks of considering your sample population in isolation from other factors extant to your study may compromise your research. For example, an inclusion study of only smokers at high risk for developing lung cancer may – in a group of 100 – have a disproportionately high (or low) number of people who are simultaneously obese. If adverse events are concentrated in the obese study participants then your results may be biased upward or downward, depending on the sample bias.
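The smoker/obesity example above can be sketched in a short simulation. All rates here are hypothetical, chosen only to illustrate how an unrepresentative obesity share in the sample shifts the observed adverse-event rate away from the population value:

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def adverse_event_rate(n, obese_fraction, p_obese=0.30, p_lean=0.10):
    """Simulate n high-risk smokers; obese participants carry a higher
    (hypothetical) adverse-event probability. Returns the observed rate."""
    events = 0
    for _ in range(n):
        is_obese = random.random() < obese_fraction
        p = p_obese if is_obese else p_lean
        events += random.random() < p
    return events / n

# A representative sample (say 20% obese) against a skewed one (60% obese):
representative = adverse_event_rate(10_000, obese_fraction=0.20)
skewed = adverse_event_rate(10_000, obese_fraction=0.60)
# The skewed sample overstates the event rate for the target population.
```

The point is not the specific numbers but that the confounding variable (obesity), not the variable under study (smoking), drives the difference between the two estimates.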
The tension between the difficulty of recruiting, and maintaining, study participants, and the need to ensure a representative sample often confounds results.
These seven criteria require consideration to ensure that patients’ interests are accounted for during recruitment into a study, and during and after their participation (\cite{emanuel_what_2000} with text in this section via the NIH). While this is focused on health research, the reality is that these concerns are true for any study involving moral subjects.
Answers to the research question should contribute to scientific understanding of health or improve our ways of preventing, treating, or caring for those with a given disease. Only if society will gain useful knowledge — which requires sharing results, negative, positive and neutral — can exposing moral subjects to the risk and burden of research be justified.
A study should be designed in a way that will get an understandable answer to a valuable research question. This includes considering whether the question researchers are asking is answerable, whether the research methods are valid and feasible, and whether the study is designed with a clear scientific objective using accepted principles, methods, and reliable practices. It is also important that the statistical plan have sufficient power to test the objective definitively and to support the data analysis. Invalid research is unethical because it wastes resources and exposes subjects to risk for no purpose.
The primary basis for recruiting and enrolling groups and individuals should be the scientific goals of the study — not vulnerability, privilege, or other factors unrelated to the purposes of the study. Consistent with the scientific purpose, people should be chosen in a way that minimizes risks and enhances benefits to individuals and society. Groups and individuals who accept the risks and burdens of research should be in a position to enjoy its benefits, and those who may benefit should share some of the risks and burdens. Specific groups or individuals (for example, women or children) should not be excluded from the opportunity to participate in research without a good scientific reason or a particular susceptibility to risk.
Uncertainty about the degree of risks and benefits associated with a drug, device, or procedure being tested is inherent in clinical research — otherwise there would be little point to doing the research. And by definition, there is more uncertainty about risks and benefits in early-phase research than in later research. Depending on the particulars of a study, research risks might be trivial or serious, might cause transient discomfort or long-term changes. Risks can be physical (death, disability, infection), psychological (depression, anxiety), economic (job loss), or social (for example, discrimination or stigma from participating in a certain trial). Has everything been done to minimize the risks and inconvenience to research subjects, to maximize the potential benefits, and to determine that the potential benefits to individuals and society are proportionate to, or outweigh, the risks? Research volunteers often receive some health services and benefits in the course of participating, yet the purpose of clinical research is not to provide health services.
These benefits may risk compromising informed consent in populations with poor healthcare provision, since their incentive to participate may be to secure healthcare rather than to support the study.
To minimize potential conflicts of interest and make sure a study is ethically acceptable before it even starts, an independent review panel with no vested interest in the particular study should review the proposal and ask important questions, including: Are those conducting the trial sufficiently free of bias? Is the study doing all it can to protect research volunteers? Has the trial been ethically designed and is the risk–benefit ratio favourable?
For research to be ethical, most agree that individuals should make their own decision about whether they want to participate or continue participating in research. This is done through a process of informed consent in which individuals:
There are exceptions to the need for informed consent from the individual — for example, in the case of a child, of an adult with severe Alzheimer's, of an adult rendered unconscious by head trauma, or of someone with limited mental capacity. Ensuring that the individual's research participation is consistent with his or her values and interests usually entails empowering a proxy decision maker to decide about participation, usually based on what research decision the subject would have made, if doing so were possible.
Individuals should be treated with respect from the time they are approached for possible participation — even if they refuse enrollment in a study — throughout their participation and after their participation ends. This includes:
Individuals participating in a study are often at their most vulnerable, exposing information about themselves that they may find difficult to share even with those closest to them. A researcher who forgets that moral subjects require ethical consideration is likely to neglect other critical aspects of their research.
In lesson 1, we considered whether the research question itself could be considered ethical. To this we add a second:
When assessing whether research is valid, if a researcher ignores the rights of their study subjects, what other ethical considerations have they ignored?
The role of a data scientist is to support in producing an answer to a research question that is robust, stands up to scrutiny, and is supported by ethical measurement data acquired during the study process.
A published paper is a subset of the information curated and the methods employed during a process. Any review - if it is thorough - will raise queries that may not have ready answers. Researchers may need to follow up on mundane requests for more detail about assumptions in definitions or methods, or respond to a regulatory audit triggered by a notifiable event if a participant experiences harm. You may even be asked for individual consents from study participants.
The strength and organisation of those answers will rely entirely on how data was acquired and curated throughout the project. Questions that are not asked at the outset of a study are very difficult to answer after the study has been completed and published. Memories fade, equipment degrades, calibration shifts, circumstances change.
The data management lifecycle seeks not only to document what happened, but why and how decisions were made about what to capture.
While the process for creating, maintaining and archiving new data is often presented as a cycle, it is better thought of as a spiral: each iteration builds on the last, resulting in more information. However, the effectiveness of each step is defined by the needs of its users, and its relevance in terms of the process or events it reflects.
Before data can be collected, there are a range of things which must be known:
The creator of the data would best know what the data are about and should assign keywords as descriptors. These data about data are called metadata. The term is ambiguous, as it is used for two fundamentally different concepts:
Descriptive metadata permits discovery of the object. Structural metadata permits the data to be applied, interpreted, analysed, restructured, and linked to other, similar, datasets. Metadata can permit interoperability between different systems. An agreed-upon structure for querying the aboutness of a data series can permit unrelated software systems to find and use remote data.
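The two kinds of metadata can be sketched as plain records. The field names below are illustrative, loosely following Dublin Core-style descriptive elements; the dataset itself is hypothetical:

```python
# Descriptive metadata: the "aboutness" of the dataset, for discovery.
descriptive = {
    "title": "Ward temperature readings, 2020",
    "description": "Hourly air-temperature measurements from hospital wards",
    "subject": ["temperature", "hospital", "time-series"],  # keywords
    "publisher": "Example Research Institute",
}

# Structural metadata: how the data are organised, for interpretation.
structural = {
    "columns": [
        {"name": "timestamp", "type": "datetime", "format": "ISO 8601"},
        {"name": "ward_id", "type": "string"},
        {"name": "temp_c", "type": "float", "unit": "degrees Celsius"},
    ],
}

# Descriptive metadata lets others *find* the dataset; structural metadata
# lets software *use* it, e.g. to join on ward_id with another table.
```

An agreed schema for both kinds is what makes interoperability possible: unrelated systems can query the descriptive record to discover the data, then read the structural record to consume it.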
Beyond metadata, there are also mechanisms for the structuring of relationships between hierarchies of keywords. These are known as ontologies and, along with metadata, can be used to accurately define and permit discovery of data.
Adding metadata to existing data resources can be a labour-intensive and expensive process. This may become a barrier to implementing a comprehensive knowledge management system.
Generic and commonly-used metadata schemas include Dublin Core and DataCite. These generic types of descriptive metadata include:
| REQUIRED | RECOMMENDED | OPTIONAL |
| --- | --- | --- |
| Title | Tag(s) | Last update |
| Description | Terms of use | Update frequency |
| Theme(s) | Contact email | Geographical coverage |
| Publishing body | Temporal coverage | |
| Validity | | |
| Related resources | | |
| Regulations | | |
Metadata standards differ from authority to authority (and from institution to institution) and need to be agreed as part of your protocol development. Some of the questions you will need to answer during protocol development include:
The answers to these questions will define the data management mechanism, as well as requirements for collecting and collating data during the study.
This part of the process is where the data are transcribed, translated, checked, validated, cleaned and managed.
This presents the greatest risk for data consistency. Any format change or manipulation, or even copying a file from one system to another, introduces the potential for data corruption. Similarly, it also increases the potential for data - whether erroneous or not - to be accidentally released to users or the public before it is ready.
This is also known as data wrangling or Extract-Transform-Load (ETL).
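One simple guard against silent corruption when files move between systems is to record a checksum before the copy and verify it afterwards. A minimal sketch (file paths are hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record the checksum at the source; verify it at the destination, e.g.:
# if sha256_of("copy/measurements.csv") != recorded_checksum:
#     raise RuntimeError("file changed in transit - investigate before loading")
```

Any transformation step (format change, transcription, cleaning) can be bracketed the same way, with checksums of inputs and outputs logged as part of the audit trail.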
Analysis is the reason data are collected: it is where data are interpreted, combined with other datasets in meta-analysis, and shaped into the story you wish to tell from the data. Since the purpose of analysis is to inform collective or individual behaviour, influence policy, or support economic activity, amongst much else, publication must include release of the evidence which informs it.
Data need to be preserved from corruption, as well as remaining available for later use once the initial analysis is complete. Long-term storage requires that the metadata be well defined and genuinely useful, to ensure that understanding what the data describe is still possible long after their initial collection.
Data may end up being stored in multiple formats or across multiple systems.
It is essential that primacy is established (i.e. which dataset has priority over the others) and that the various formats are kept in alignment. It is critical that your software system maintain an audit trail. This will support and facilitate data quality management, any post-study surveillance requirements, as well as your regulatory application process.
It is beyond the terms of any course to recommend a specific software suite to manage your data requirements, but there are a host of software systems and services which can support the archival process. \cite{chait_technical_2014}.
The class of software used in clinical research is known as a Clinical Data Management System (CDMS). Most CDMS are commercial and proprietary, but a number are open source. If your institution already has such a system, your path is clear: learn it and use it. If you are in the position where you will need to secure and implement your own, then a combination of budget and technical requirements will need to be considered. Any of the main systems will support your needs, but you will be responsible for ensuring that it supports compliance with the protocols you have agreed.
Even where data are only released within an institution - and not to the public - there will always be others who will want to use your data, or would benefit from it if they knew it existed. The greatest inefficiency in data management arises when research is repeated because a different department needs the same data but did not know it already existed.
Release is not simply about making data available to others, it is also about creating a predictable process for that release. Regularly collected data (such as inflation rates) need a predictable release cycle since many companies base their investment decisions on the availability of such information. Publishing a release calendar for your stakeholders (and sticking to it) permits them to plan their own analysis, or response to your analysis.
Access implies that you need a centralised database which is accessible to your stakeholders. In the case of open data, how will data be moved from internal servers to a public repository?
Responsibility for this process needs to be assigned and measured.
Once data are released, the question arises as to how long they will be available. Research data should, ordinarily, be available in perpetuity. Time-series data become more useful the longer they have been collected. Suddenly removing data can cause tremendous disruption, and if appropriate systems and support have not been put in place, retention can become a very expensive problem.
Importantly, in order to support reuse, clear copyright and licensing which permit data to be freely reused for any purpose is essential.
Datasets can become very large and may only be accessed infrequently. This can become problematic for long-term storage. A process of archival - where data can be stored more cheaply but still be accessible - may need to be considered.
If a person told you that their mobile phone can handle 2.5 megawatts of power, or that a critical temperature sensor on a large industrial smelter reported an 80°C drop in a second, or that a patient had just lost 6 litres of blood and was in need of urgent care ... you should be skeptical.
Small electronic circuits, like those in mobile phones, handle only a few watts, not megawatts. Conservation of energy means a large block of metal, like a smelter, can't shed that much energy in seconds; the sensor is more likely to be faulty. And a human body contains only about 5 litres of blood, so anyone who has lost 6 litres and is still alive would be a miracle.
The data management process is there to support data collection, but it is the responsibility of the data scientist to sanity check data going in. There are physical and technical limits to measurements and you need to have a comprehensive understanding of your research domain to ensure that measurements reflect reality.
Sensors can need calibration. Software can contain "fat finger" errors, where significant digits are accidentally added to scaling factors or constants. Data can be recorded incorrectly at the point of collection.
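Such errors can be caught at ingestion with range checks against physically plausible bounds. A minimal sketch; the bounds below are illustrative, not clinical or engineering reference values:

```python
def check_plausible(value, low, high, label):
    """Reject measurements outside physically plausible bounds rather than
    silently accepting them into the dataset."""
    if not (low <= value <= high):
        raise ValueError(
            f"{label}={value} outside plausible range [{low}, {high}]"
        )
    return value

# Illustrative domain limits drawn from the examples above:
check_plausible(2.5, 0.0, 10.0, "phone_power_watts")   # passes
check_plausible(0.8, 0.0, 5.0, "blood_loss_litres")    # passes
# check_plausible(6.0, 0.0, 5.0, "blood_loss_litres")  # would raise ValueError
```

Whether an implausible value should halt the pipeline or merely be flagged for review is a protocol decision; the important thing is that it never enters the analysis unexamined.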
The least one expects from data management systems and analytical scripts is that they do not contribute to the likely sources of error. The best way to get a handle on what can go wrong in your software systems, and how best to catch these issues, is to test it with synthetic data. These are data which are similar to that which you will use in your study but which are produced deterministically by an algorithm.
Our synthetic data could simply be a list of random numbers, but it is better to generate a simulacrum of the data we expect to collect. Natural data can contain weird outliers, but your controlled data - unless you deliberately introduce them - should not. This permits you to test your systems and see that they don't introduce anything unexpected.
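As a sketch, deterministic synthetic readings can be pushed through a processing step to confirm the step does not distort them. The distribution parameters and the `pipeline` function are stand-ins for whatever your real study and ETL code would be:

```python
import random

random.seed(42)  # deterministic: reruns produce identical "measurements"

# Simulacrum of body-temperature readings: plausible values, no outliers.
synthetic = [random.gauss(36.8, 0.4) for _ in range(1000)]

def pipeline(readings):
    """Stand-in for a real ETL step; here it should be a harmless rounding.
    A real pipeline would be tested the same way."""
    return [round(r, 2) for r in readings]

processed = pipeline(synthetic)

# The pipeline must not add, drop, or distort values.
assert len(processed) == len(synthetic)
assert min(processed) >= min(synthetic) - 0.01
assert max(processed) <= max(synthetic) + 0.01
```

If the pipeline introduces a value outside the range you generated, you know the error came from your systems, not from nature.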
The case study presented in the BMJ would not be considered a high-powered study since it relates the experiences of two people who are also married to each other. This is a single data point and considered anecdotal. The plural of anecdote is not evidence so, while such a single-point case study can strongly suggest a direction for future research, it doesn't of itself constitute evidence. This is another indication of why the BMJ were comfortable retracting it.
Any studies based on a limited number of data points can suffer from any or all of the following \cite{downey_think_2014}:
For research conclusions to be meaningful, the study process must be repeatable by anyone who follows the same methodology. We need a formal and defined method for collecting data \cite{vu_introductory_2020}.
A population is the entirety of a set of subjects which could be included in an area of study. For example, all people susceptible to diabetes, all buffalo on a migration route, or all plants pollinated by a single species of bees in a meadow. There may be practical or ethical reasons not to acquire information from each member of that population, and so we select a subset - a sample - of that population.
To be statistically meaningful, that subset needs to be representative of the characteristics of the total population. There are an enormous number of ways that sampling can go wrong, and many assumptions that go into specifying the necessary sample size required to produce a representative result.
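The simplest defensible scheme is simple random sampling without replacement, where every member of the sampling frame has an equal chance of selection. A minimal sketch with a hypothetical participant list (the seed is there only to make the example reproducible):

```python
import random

random.seed(7)

# A hypothetical sampling frame: the full enumerated population.
population = [f"participant_{i:04d}" for i in range(10_000)]

# Simple random sampling without replacement:
sample = random.sample(population, k=500)

assert len(sample) == len(set(sample))  # no participant drawn twice
```

Real studies usually need more than this (stratification, weighting, power calculations to choose `k`), but every such refinement still rests on a well-defined frame and a genuinely random draw from it.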
Randomisation of the sample selection process is critical to ensure representation and avoid accidentally introducing bias. Even with randomisation, it's easy to inadvertently bias a sample:
Similarly, through the research part of a study, it is critical to maintain a lack of bias. An expectation of particular results - confirmation bias - can lead to distortion of measurements.
Research begins with a hypothesis: a proposed explanation, on the basis of limited evidence, as to how something may occur. That limited evidence provides grounds to define explanatory and response variables.
Until the completion of the study, we have no idea if the explanatory variables have any connection with the response variables. For example, does a terrorist attack in a city centre cause businesses in other cities to move to the suburbs? Our explanatory variables will be a terrorist attack and our response variables will be commercial vacancy, but we have no idea if they're connected.
Observational studies are performed in a way that does not interfere with the behaviour of study subjects, permitting associations to naturally arise over time. These associations may strongly correlate with our explanatory and response variables, but they aren't necessarily causative.
There are two types of observational studies:
Experimental studies are performed to find out if a variable causes a response, that is, whether it is causative. We divide our population sample into experimental groups and control groups. The control is treated almost as an observational study, while the experiment experiences an intervention in the explanatory variable.
Our hypothesis becomes a set of tests:
Here it is critical that our experimental and control groups are randomly assigned and representative. Usually, we also want to ensure that none of the groups, nor those studying the groups or applying interventions, are aware of who is in which group: all agents and subjects are blinded so as to avoid any bias in measurement, experiment, or intervention.
The conclusion of our study also requires that we have defined endpoints; that we have stated up-front what conditions must be met to conclude our measurements. If we change them during the study, we're changing the experiment in a non-deterministic way. We could decide arbitrarily and so bias our results.
When we intervene in this way it should be clear that the potential for unethical or immoral actions is great. If we want to know whether terrorism causes office vacancy, there is no ethical way to intervene. You can't blow up a building just to see if people in a different city decide to move to a less congested location. And if we want to test a new heart drug and state that the drug "fails" if the average treatment time exceeds 20 days, then change that to 40 days mid-way through the study, we can be accused of favouring our experimental hypothesis.
The nature of an experiment will impose its own ethical requirements and necessity for impartial review. Documenting our plan, our methodology, and our results is critical to avoiding bias and producing a valid, repeatable result.
Any experimental research on a sampled population featuring different study groups, each experiencing different experimental effects, can be summarised like this:
$$ \text{About the same } \to \text{Different} $$

Randomisation of assignment to the groups ensures that statistical methods, including estimates and errors associated with those estimates, are reliable.
Our research depends on our ability to identify appropriate randomisation techniques to sample from study populations.
We can demonstrate this by generating a predictable set of data to work with.
Synthetic data can be as simple as a list of random numbers produced according to some distribution pattern, or as complex as a procedurally-generated simulation using machine learning techniques \cite{dahmen_synsys_2019}. As this course progresses, we will learn ever-more complex methods of producing synthetic data. For now, we will generate a profile for a patient at a hospital.
Our research question for this lesson asks us to consider patients receiving care for long-term coronary-related diseases. These are a subset of all patients and we can define inclusion and exclusion criteria for acceptance of patients in our study sample.
Defining our fields and metadata:
Finally, we can include participants if they meet at least one of the criteria (weight >10% above the set weight, smokes, family history, disease) but exclude them if they meet all four. The exclusion of those with all the criteria may seem odd, but we also don't want patients who may die during the study purely as a consequence of being particularly frail. In a small sample, such deaths may act as confounding variables, biasing the result towards a particular hypothesis. We want to know if our intervention has an effect, and this effect must stand out clearly. Patients who are too ill may also be too ill for us to measure.
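A minimal sketch of this inclusion/exclusion logic (the function and argument names are illustrative, not taken from any study protocol):

```python
def is_included(smokes, family_history, disease, overweight):
    """Include patients with at least one risk criterion, but exclude
    those presenting all four, who may be too frail to measure."""
    criteria = [smokes, family_history, disease, overweight]
    return any(criteria) and not all(criteria)

print(is_included(True, False, False, False))   # one criterion: included
print(is_included(False, False, False, False))  # no criteria: excluded
print(is_included(True, True, True, True))      # all criteria: excluded
```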
We will use numpy, matplotlib and pandas during this lesson. If you haven't already, ensure they are installed (review Lesson 1 if you need to) and let's begin to code.
# Import the libraries we're going to use. If any of these are not installed, you can run
# `pip install` to install them into your development environment
from matplotlib import pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import random, uuid
Before we get too much further, let's define what is meant by a normal distribution. This is a type of continuous probability distribution for a real-valued random variable.
Consider, for example, a set of heights. We talk about someone being of "average" height and, while there are particularly short or tall people, most people are likely to be somewhere near the average height. That distribution is normal. If a group were particularly short or tall, it may be statistically interesting and tell you about that population.
We can create and visualise a synthetic normal distribution using numpy's random.normal function:
# Generate random normal distribution for an average height of 160 cm
# https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
heights = np.random.normal(loc=160, scale=3, size=10000).astype(int)
# Plot
axis = np.arange(start=min(heights), stop = max(heights) + 1)
plt.hist(heights, bins = axis)
(array([2.000e+00, 1.000e+00, 1.000e+01, 2.100e+01, 5.400e+01, 1.440e+02, 2.400e+02, 4.130e+02, 6.710e+02, 8.760e+02, 1.185e+03, 1.328e+03, 1.378e+03, 1.159e+03, 8.920e+02, 6.920e+02, 4.630e+02, 2.450e+02, 1.220e+02, 5.400e+01, 3.400e+01, 1.400e+01, 2.000e+00]), array([148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171]), <BarContainer object of 23 artists>)
The chart is a histogram and displays data density. The x-axis consists of bins for a height range, which we created with the axis variable producing a set of buckets (150 to 151, 151 to 152, etc.). Where a value falls on the boundary of a bin (e.g. 151), it is assigned to the bin that starts at that value; only the final bin includes both of its edges. The y-axis is a count of the number of observations falling within each bin range.
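You can confirm the boundary behaviour directly with numpy's histogram function, which matplotlib uses for its binning:

```python
import numpy as np

# Bins are half-open intervals: [150, 151) and [151, 152].
# A boundary value of 151 is counted in the bin that starts at 151 ...
counts, edges = np.histogram([151], bins=[150, 151, 152])
print(counts)  # [0 1]
# ... while the final bin is closed on both sides, so 152 also lands in it
counts, edges = np.histogram([152], bins=[150, 151, 152])
print(counts)  # [0 1]
```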
A symmetrical distribution like this is normal, but they can also be skewed in various ways that we will get to later.
We're going to create a normally-distributed patient population. Age isn't usually normally distributed (in Western Europe it tends towards a greater number of older people, and in emerging market countries tends to be younger), but a hospital population does tend to be a subset of older people.
# Define a function to generate a random patient population
# This returns a list of dictionaries, where each dictionary
# is a set of fields defining a patient.
def create_patient_population(**kwargs):
"""
Return a population for a group of patients defined by having:
id: a unique anonymous string
age: normal distribution
gender: boolean
weight: normal distribution, varies for male or female
smokes: boolean
family_history: boolean, has family history
disease: boolean, has the disease
color: group membership indicator
included: meets inclusion criteria
coordinates: random coordinates for plotting on a scatter plot, not meaningful
Args:
total: int, default 10000
male_weight: int, default 100
female_weight: int, default 85
groups: int, default 4
Returns:
list of dicts
"""
patient_total = kwargs.get("total", 10000)
male_loc_weight = kwargs.get("male_weight", 100)
female_loc_weight = kwargs.get("female_weight", 85)
colour_list = [F"C{n}" for n in range(kwargs.get("groups", 4))]
# generate the colour coordinates to plot on a grid
cc = int(len(colour_list)/2)
# if the count is uneven
if len(colour_list)%2: cc = int((len(colour_list)-1)/2)
cc_coords = [i+1 for i in range(cc)]*2
# And add in the extra group
if len(colour_list)%2: cc_coords.append(cc+1)
cc_coords.sort()
colour_coordinates = {
F"C{n}": ((20*c, 20) if n%2 else (20*c, 40)) for n, c in enumerate(cc_coords)
}
distance = 20
# Create the patient population
random_age = np.random.normal(loc=55, scale=3, size=patient_total).astype(int)
patient_population = []
for age in random_age:
# Create patient fields
gender = random.choice(["Male", "Female"])
loc_weight = female_loc_weight
if gender == "Male": loc_weight = male_loc_weight
weight = int(np.random.normal(loc=loc_weight, scale=3))
smokes = random.choice([True, False])
family_history = random.choice([True, False])
disease = random.choice([True, False])
color = random.choice(colour_list)
# Inclusion criteria
included = any([smokes, family_history, disease, weight >= loc_weight * 1.1])
# Exclusion criteria
if all([smokes, family_history, disease, weight >= loc_weight * 1.1]):
included = False
patient_population.append({
"id": str(uuid.uuid4()), # Generate a random unique string as a patient ID
"age": age,
"weight": weight,
"gender": gender,
"smokes": smokes,
"family_history": family_history,
"disease": disease,
"color": color,
"coordinates": (
np.random.uniform(colour_coordinates[color][0], colour_coordinates[color][0] + distance),
np.random.uniform(colour_coordinates[color][1], colour_coordinates[color][1] + distance)
),
"included": included,
"selected": False
})
# Shuffle the ordered list to create a random population
random.shuffle(patient_population)
return patient_population
%time patient_population = create_patient_population()
Wall time: 123 ms
A scatter plot uses Cartesian coordinates, usually to display the relationship between two variables. A third variable can be shown by changing the colour or size of the point data on the chart. It's frequently a useful way to investigate correlation between variables, though remember that correlation does not imply causation.
When we created our random patient population, you'll note we created a random set of coordinates for them as well, including a color field so that we could draw them.
# Draw our included patients as a scatter plot
def get_coordinates(population_list):
"""
Restructure a population list into a set of coordinate lists for plotting in numpy.
Returns:
x, y, c lists of values
"""
x = [p["coordinates"][0] for p in population_list if p.get("included", True)]
y = [p["coordinates"][1] for p in population_list if p.get("included", True)]
c = [p["color"] for p in population_list if p.get("included", True)]
return x, y, c
plt.figure(figsize=(6,6))
plt.axis("off")
# Plot them all by first converting the list of dicts to a list of values
x, y, c = get_coordinates(patient_population)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c, alpha=0.3)
plt.show()
Our population is randomly distributed across the plane. If this were a hall, then we're looking down on people standing around. Each colour represents patients from a different hospital. Hospitals may have similar patients, or they may focus on particularly difficult diseases; we don't know.
We want to select a random - but representative - sample of this population to run our experiment.
Let's assume we wanted to pick 80 people. There are four sampling approaches we could take.
Simple random sampling where we simply choose people from the whole population irrespective of any group they may belong to.
plt.figure(figsize=(6,6))
plt.axis("off")
# Plot them all by first converting the list of dicts to a list of values
x, y, c = get_coordinates(patient_population)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c, alpha=0.1)
# Get a simple random selection of the total population using random.sample
# https://docs.python.org/3/library/random.html#random.sample
sample_patients = random.sample(patient_population, 80)
x, y, c = get_coordinates(sample_patients)
plt.scatter(x, y, s=50, facecolors=c, edgecolors=c)
plt.show()
Randomisation in this way means we haven't got much control over the distribution between groups. We may want to ensure that we have equal sample sizes in each group.
Stratified sampling overcomes this concern by randomly sampling from each group. In this case, we decide to have 20 samples from each for a total of 80.
plt.figure(figsize=(6,6))
plt.axis("off")
# Plot them all by first converting the list of dicts to a list of values
x, y, c = get_coordinates(patient_population)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c, alpha=0.1)
# There are four groups, so let's randomly select 20 patients from each
for i in range(4):
sample_patient_group = list(filter(lambda d: d["color"] == F"C{i}", patient_population))
sample_patients = random.sample(sample_patient_group, 20)
x, y, c = get_coordinates(sample_patients)
plt.scatter(x, y, s=50, facecolors=c, edgecolors=c)
plt.show()
Hopefully you realise how difficult it is to tell these things by eye, and the density of the point data means that you could easily have points overlapping each other, making visual assessment even more challenging.
Now, maybe we have problems with some of these hospitals. Maybe they're difficult to get to. Maybe you haven't received approval to work with them. Or maybe you want to go further and create subsets even within these groups. This is where clustering comes in.
Cluster sampling requires us to split the population into many groups. We then randomly sample a fixed number of clusters and include all patients in each sampled cluster. This could be as simple as all the patients attending a particular clinic during a specific week. Multistage sampling goes further: first randomly selecting clusters, then randomly sampling patients from within each selected cluster.
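Pure cluster sampling can be sketched like this (a self-contained toy population, not the patient_population from earlier):

```python
import random

# A toy population of 200 patients spread evenly across four hospitals
population = [{"id": i, "hospital": F"C{i % 4}"} for i in range(200)]

# Randomly choose two of the four hospitals (the clusters) ...
chosen = random.sample(["C0", "C1", "C2", "C3"], 2)
# ... and include *every* patient attending those hospitals
cluster_sample = [p for p in population if p["hospital"] in chosen]
print(len(cluster_sample))  # 100: all patients from the two chosen clusters
```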
Let's demonstrate that with multistage sampling from two hospitals.
plt.figure(figsize=(6,6))
plt.axis("off")
# Plot them all by first converting the list of dicts to a list of values
x, y, c = get_coordinates(patient_population)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c, alpha=0.1)
# Select two of the four groups and randomly sample 40 patients from each
for i in range(4):
if i in [0,3]:
sample_patient_group = list(filter(lambda d: d["color"] == F"C{i}", patient_population))
sample_patients = random.sample(sample_patient_group, 40)
x, y, c = get_coordinates(sample_patients)
plt.scatter(x, y, s=50, facecolors=c, edgecolors=c)
plt.show()
Randomised experiments are usually structured around five principles \cite{dietz_openintro_2015}:
Continuing our example with our synthetic data, we can select two groups, divide them into blocks, and then randomly assign them to either the treatment or control arms of a study:
# Let's assume the two colours we select each represent a specific group
# perhaps one is low-risk and the other is high-risk
low_risk_group = list(filter(lambda d: d["color"] == F"C0", patient_population))
low_risk_patients = random.sample(low_risk_group, 40)
high_risk_group = list(filter(lambda d: d["color"] == F"C3", patient_population))
high_risk_patients = random.sample(high_risk_group, 40)
# This all-risk patient sample represents our study participants
all_risk_patients = low_risk_patients + high_risk_patients
# Let's recreate the random coordinates and plot them:
for p in all_risk_patients:
p["coordinates"] = (
np.random.uniform(),
np.random.uniform()
)
plt.figure(figsize=(4,4))
plt.xticks(color="none")
plt.yticks(color="none")
plt.title("Sample Population")
# Plot them all by first converting the list of dicts to a list of values
x, y, c = get_coordinates(all_risk_patients)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c)
plt.show()
# Create blocks (yes, I know they were in blocks when we started, but let's imagine they weren't)
low_risk_group = list(filter(lambda d: d["color"] == F"C0", all_risk_patients))
high_risk_group = list(filter(lambda d: d["color"] == F"C3", all_risk_patients))
# Split each group in half and create new 'treatment' and 'control' groups
# There are 40 members in each group, so ...
control_group = low_risk_group[:20] + high_risk_group[:20]
treatment_group = low_risk_group[20:] + high_risk_group[20:]
plt.figure(figsize=(6,3))
# Plot them all by first converting the list of dicts to a list of values
plt.subplot(1, 2, 1)
plt.xticks(color="none")
plt.yticks(color="none")
x, y, c = get_coordinates(control_group)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c)
plt.title("Control Group")
# Plot them all by first converting the list of dicts to a list of values
plt.subplot(1, 2, 2)
plt.xticks(color="none")
plt.yticks(color="none")
x, y, c = get_coordinates(treatment_group)
plt.scatter(x, y, s=50, facecolors="none", edgecolors=c)
plt.title("Treatment Group")
plt.show()
The distribution is presented below. Since it is a small sample size, you should expect a fair amount of variation between the two groups. How much would you think would be too much? How would you go about figuring that out?
def get_age_distribution(population_list):
"""
Restructure a population list into a set of distribution lists for plotting in numpy.
Returns:
x, y, c lists of values
"""
x = [p["age"] for p in population_list]
bins = np.arange(start=min(x), stop = max(x) + 1)
return x, bins
plt.figure(figsize=(10,5))
# Plot them all by first converting the list of dicts to a list of values
plt.subplot(1, 2, 1)
control_age, bins = get_age_distribution(control_group)
plt.hist(control_age, bins = bins)
plt.title("Control Group")
# Plot them all by first converting the list of dicts to a list of values
plt.subplot(1, 2, 2)
treatment_age, bins = get_age_distribution(treatment_group)
plt.hist(treatment_age, bins = bins)
plt.title("Treatment Group")
plt.show()
print(F"Control mean: {sum(control_age)/len(control_age):.2f}")
print(F"Treatment mean: {sum(treatment_age)/len(treatment_age):.2f}")
Control mean: 54.27 Treatment mean: 54.88
We allocated each patient a random id on selection and knew nothing about the patients, other than where they were sampled from, during the randomisation process. We sampled from each cluster, then mixed the clusters together to create a single random sample, then randomly allocated patients to our experimental groups. There is no information in the patient record which would indicate what arm of the experiment they're in.
If the randomisation process has been done correctly, you should be able to flip a coin to pick which group becomes the control and which the treatment arm, and it should have no consequence for the conduct or results of the experiment. Conversely, if results change depending on which group is assigned to which arm, you have some bias during the randomisation process, or during the experiment itself, which nullifies the value of your experiment.
Data curation must deliver statistically valid randomised study groups. Only then can the experiment be run and - once complete - can the data be presented for analysis.
We start any analysis by exploring our data to understand its shape, distribution and variance.
The mean (or average) is a measure of the centre of the distribution of a data series. This is written as $\bar{x}$ or $\mu$ (mu) and is the sum of all of the observations divided by the number of observations:
$$ \bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} $$

where $x_1, x_2 ... x_n$ represent the $n$ observed values.
Earlier, we looked at histograms which are used to visualise data density. Data can be divided into bins. Age, for example, is commonly divided by decile (0 to 10, 10 to 20, etc) where each observation increments the count of its appropriate bin (53 falls inside the 50 to 60 bin).
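For example, binning a handful of ages by decade with numpy:

```python
import numpy as np

ages = [7, 23, 35, 38, 53, 53, 67, 88]
decades = np.arange(0, 101, 10)  # bin edges at 0, 10, 20, ..., 100
counts, edges = np.histogram(ages, bins=decades)
# Both observations of 53 increment the 50 to 60 bin
print({int(e): int(c) for e, c in zip(edges[:-1], counts)})
```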
These distributions can be normal as shown earlier, where the observations are symmetric about the central axis, or they can be skewed. Data which trail off in a particular direction are said to have long tails. If the tail falls on the left side of the chart it is left skewed and if the tail is on the right it is right skewed.
Viewing our data gives us an understanding of what our sample population, and our research observations, look like. If you want to run the next example, remember to install scipy (pip install scipy):
# A demonstration of skewness, using skewnorm to randomly generate data
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html
from scipy.stats import skewnorm
fig, axs = plt.subplots(1, 3, figsize=(16,4))
skewset = [[-6, "Left skewed"], [0, "Symmetric"], [6, "Right skewed"]]
for n, [a, lbl] in enumerate(skewset):
x = np.linspace(skewnorm.ppf(0.01, a),
skewnorm.ppf(0.99, a), 100)
rv = skewnorm(a)
axs[n].plot(x, rv.pdf(x), "k-", lw=2, label=lbl)
vals = skewnorm.ppf([0.001, 0.5, 0.999], a)
r = skewnorm.rvs(a, size=1000)
axs[n].hist(r, density=True, histtype="stepfilled", alpha=0.2)
axs[n].legend(loc="best", frameon=False)
plt.show()
Histograms can also be used to identify modes, which are distinct peaks in the distribution. The above charts all have a single peak and are unimodal. Distributions can be bimodal (two peaks) or multimodal (more than two peaks). This isn't rigorously defined, but is something you should be aware of when evaluating your data.
The distance of an observation from its mean ($\bar{x}$) is its deviation. The deviation of the $a^{th}$ observation from the mean is:
$$ x_a - \bar{x} = v_a $$

Squaring the deviations, to remove negative signs, and dividing their sum by $n-1$ gives the sample variance ($s^2$):
$$ s^2 = \frac{v_1^2 + v_2^2 + ... + v_n^2}{n-1} $$

The standard deviation ($s$) is defined as the square root of the variance:
$$ s = \sqrt{s^2} $$

This is useful when assessing how close the data are to the mean: variance is the average squared distance from the mean, and standard deviation is the square root of that variance.
These symbols are used to describe the sample of a population. For describing the population itself we use sigma ($\sigma$): $\sigma^2$ for population variance, and $\sigma$ for the population standard deviation. All going well during randomisation, your sample should be similar to the population, but that isn't always the case.
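We can check these formulas against numpy, which distinguishes sample from population statistics through its ddof (delta degrees of freedom) parameter: ddof=1 divides by $n-1$ for the sample statistics, ddof=0 by $n$ for the population statistics.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = x.sum() / len(x)                 # x-bar
deviations = x - mean                   # v_a = x_a - x-bar
sample_variance = (deviations ** 2).sum() / (len(x) - 1)
sample_sd = sample_variance ** 0.5
# These match numpy's ddof=1 (sample) calculations
print(sample_variance, np.var(x, ddof=1))
print(sample_sd, np.std(x, ddof=1))
```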
The numbers themselves can be ambiguous since vastly different datasets can have the same means and standard deviations.
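A quick demonstration: a normal and a uniform distribution scaled to the same mean and standard deviation are indistinguishable by these two summary statistics, yet their histograms look completely different.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bell = rng.normal(loc=0, scale=1, size=n)
# A uniform distribution on [-sqrt(3), sqrt(3)] also has mean 0 and sd 1
flat = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)
print(round(bell.mean(), 2), round(flat.mean(), 2))  # both close to 0.0
print(round(bell.std(), 2), round(flat.std(), 2))    # both close to 1.0
```

Plot a histogram of each and the bell curve versus the flat top is obvious; the summary statistics alone hide it.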
# Draw a distribution chart shading the areas under the chart
# for the first and second standard deviations
# Derived from https://pythonforundergradengineers.com/plotting-normal-curve-with-python.html
from scipy.stats import norm
# Define constants for a standard normal distribution
mu = 0 # mean
sigma = 1 # standard deviation
# Calculate the distribution
x1 = np.arange(-1, 1, 0.001) # within the first sd
x2 = np.arange(-2, 2, 0.001) # within the second sd
x_all = np.arange(-10, 10, 0.001) # all the data
y1 = norm.pdf(x1, mu, sigma)
y2 = norm.pdf(x2, mu, sigma)
y_all = norm.pdf(x_all, mu, sigma)
# Draw the chart
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_all,y_all, color="C0")
ax.fill_between(x1,y1,0, alpha=0.3, color="C0")
ax.fill_between(x2,y2,0, alpha=0.2, color="C0")
ax.fill_between(x_all,y_all,0, alpha=0.1, color="C0")
ax.set_xlim([-4,4])
ax.set_yticklabels([])
plt.show()
We now have enough theory to begin reading our research papers for this lesson and evaluate them for reliability.
Peer review is the process of subjecting scholarly research, work or ideas to the scrutiny of others who, ordinarily, are drawn from amongst the producer's peers.
Historically, such a review takes place prior to publication, and approval acts as a gatekeeping function. This act of pre-publication review approval has created an unfortunate form of moral hazard in that simply being published in a prestigious journal is seen as sufficient for results to be taken seriously.
A serious effort is being made by scholars to have research reviewed post-publication.
Ending pre-publication review may remove the conferral of quality within the traditional system, thus eliminating the prestige associated with the simple act of publishing. Removal of this barrier may result in an increase of the quality of published work, as it eliminates the cachet of publishing for its own sake. Readers will also know that there is no filter, so they must interpret anything they read with a healthy dose of skepticism, thereby naturally restoring the culture of doubt to scientific practice. \cite{crane_peer_2018, brembs_reliable_2019, stern_proposal_2019}
This course will focus on public and transparent post-publication and informal research review. PubPeer is an online application permitting community-based public post-publication peer review. They offer a useful guide on how to perform a review:
In most cases you will not have access to the researchers' underlying data. That does not mean you cannot test their results, but it does mean you cannot ordinarily attempt to correct any errors you find. You may know something does not make sense, or is incorrect, but not be able to assess the research data to recreate the results.
This may seem daunting, but the best way to learn how to conduct your own research is to review the work of others. That way you also gain confidence as you realise how simple and inadvertent errors crop up regularly, and learn how to pre-emptively identify these issues throughout your own work.
Even the most experienced researchers make mistakes and good-faith reviews will usually be met with gratitude and corrections. Everyone gains from public review.
Research requires specialist domain knowledge, but it is built on fundamental skills and methods common to any scientist in any field. Just because you don't know the specific morphology of coronary atherosclerosis doesn't mean your skills in statistics or ethics are less valid. You may not be able to review everything, or even very much, in a paper but the little you can review could indicate underlying issues.
It is critical, and an act of good faith towards your peers, that you stay within your area of competence. Validating arithmetic or statistics is one thing, declaring superior knowledge of, for example, organic chemistry when you've never studied the subject is reaching too far.
We know enough to evaluate the quality of the sample randomisation relied upon for experimental measurements. If the basics of patient randomisation are wrong, or bias has crept into measurements, you don't need to worry too much about the rest. If the fundamentals are correct, then you can hand the paper over to domain experts safe in the knowledge that at least the foundations are solid.
The analysis which follows is inspired by Darrel Francis, Professor of Cardiology at Imperial College London, who conducted a public review of both papers.
Our first paper was produced by researchers at the Lundquist Institute for Biomedical Innovation in the US.
Budoff, Matthew J., Deepak L. Bhatt, April Kinninger, Suvasini Lakshmanan, Joseph B. Muhlestein, Viet T. Le, Heidi T. May, et al. 2020. “Effect of Icosapent Ethyl on Progression of Coronary Atherosclerosis in Patients with Elevated Triglycerides on Statin Therapy: Final Results of the EVAPORATE Trial.” European Heart Journal. doi:10.1093/eurheartj/ehaa652.
The paper abstract summarises the research and lets us know what to expect from the body of the work:
Aims: Despite the effects of statins in reducing cardiovascular events and slowing progression of coronary atherosclerosis, significant cardiovascular (CV) risk remains. Icosapent ethyl (IPE), a highly purified eicosapentaenoic acid ethyl ester, added to a statin was shown to reduce initial CV events by 25% and total CV events by 32% in the REDUCE-IT trial, with the mechanisms of benefit not yet fully explained. The EVAPORATE trial sought to determine whether IPE 4 g/day, as an adjunct to diet and statin therapy, would result in a greater change from baseline in plaque volume, measured by serial multidetector computed tomography (MDCT), than placebo in statin-treated patients.
The researchers are looking at plaques - deposits that form in the coronary arteries - and assessing whether a specific treatment reduces plaque volume in comparison with a placebo. Note that patients continue to be "statin-treated": placebo does not mean no treatment. Both arms of the trial will continue to receive the same care, except for one difference - whether or not they receive the active chemical molecule being assessed.
We already, therefore, know to expect a treatment arm and a control arm in the methods. Adequate randomisation between these two arms, and an indistinguishable but inert placebo, are required.
Methods and results: A total of 80 patients were enrolled in this randomized, double-blind, placebo-controlled trial.
Is 80 patients a lot? It depends on the rarity of the condition and the scale of the difference expected between the treatment and control arms at the end of the trial. However, from the paper we know that - for two groups - we'd expect about 40 patients per group making up the 80-patient sample.
Patients had to have coronary atherosclerosis as documented by MDCT (one or more angiographic stenoses with ≥20% narrowing), be on statin therapy, and have persistently elevated triglyceride (TG) levels.
These are the inclusion criteria. This clarity will help domain experts assess whether these are the appropriate patients to include.
Patients underwent an interim scan at 9 months and a final scan at 18 months with coronary computed tomographic angiography.
The endpoint of the study was treatment after 18 months with an interim scan at 9 months to evaluate if anything had happened. This is a long-term condition and the researchers hope the condition improves. The scan at 9 months is there, amongst other things, to test if the experiment caused the treatment arm to get worse - which may necessitate stopping the trial - or improve so much that the trial would be stopped so as to ensure those in the control arm also got the new treatment.
In this case, the trial continued to its endpoint.
The pre-specified primary endpoint was change in low-attenuation plaque (LAP) volume at 18 months between IPE and placebo groups. Baseline demographics, vitals, and laboratory results were not significantly different between the IPE and placebo groups; the median TG level was 259.1 ± 78.1 mg/dL. There was a significant reduction in the primary endpoint as IPE reduced LAP plaque volume by 17%, while in the placebo group LAP plaque volume more than doubled (+109%) (P = 0.0061). There were significant differences in rates of progression between IPE and placebo at study end involving other plaque volumes including fibrous, and fibrofatty (FF) plaque volumes which regressed in the IPE group and progressed in the placebo group (P < 0.01 for all). When further adjusted for age, sex, diabetes status, hypertension, and baseline TG, plaque volume changes between groups remained significantly different, P < 0.01. Only dense calcium did not show a significant difference between groups in multivariable modelling (P = 0.053).
This is the core of the experimental results. We may not know what all of these terms mean, but they're all numbers and we should expect tables of these data for each of the treatment and control arms. We can review these data and test whether they are as expected and support the conclusions.
Conclusions: Icosapent ethyl demonstrated significant regression of LAP volume on MDCT compared with placebo over 18 months. EVAPORATE provides important mechanistic data on plaque characteristics that may have relevance to the REDUCE-IT results and clinical use of IPE.
The Results section contains the tables of interest. Note that, of the 80 patients who joined the trial, only 68 completed it: 31 in the treatment arm and 37 in the placebo arm. Table 1 profiles these patients, and that is the lowest resolution we will get, so as to preserve participant anonymity. You should see that we could create synthetic data to mimic this profile as well.
Table 2 is the core area of interest:
Look for the familiar terms:
The placebo arm is called Placebo and the treatment arm is called IPE. For now, we don't have the context to understand many of the other terms, but that doesn't mean we can't perform analysis.
We can start with our expectations from these data. We know it is a randomised trial, so we expect the two groups to have a similar profile at the beginning and - given the claims in the conclusions - that the IPE group should be meaningfully better than the Placebo at the end.
The authors encourage this view: "Baseline demographics, vitals, and laboratory results were not significantly different between the IPE and placebo groups."
$$ \text{About the same } \to \text{Different} $$

Except ...
Can you see something weird? The two means at baseline are quite different. We can draw this to get an idea of how different.
# Baseline variance between study arms
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Placebo
mu_p = 4.1
sd_p = 1.8
x_p = np.arange(-5, 15, 0.1)
y_p = norm.pdf(x_p, mu_p, sd_p)
# Treatment (IPE)
mu_t = 5.0
sd_t = 1.8
x_t = np.arange(-5, 15, 0.1)
y_t = norm.pdf(x_t, mu_t, sd_t)
# Draw the chart
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x_p, y_p, color="C0", label="Placebo")
ax.plot(x_t, y_t, color="C1", label="IPE")
ax.legend(loc="best", frameon=False)
plt.show()
We know that, for a symmetrical normal distribution, about 68% of a sample falls within one standard deviation of the mean. How far apart are the means of the two groups?
$$ 5.0 - 4.1 = 0.9 $$

$$ \frac{0.9}{1.8} = 0.5 $$

The means are half a standard deviation apart. Given the groups are quite small, a difference of this size increases the risk from outliers (confounding variables), means the groups are not comparable in terms of disease morphology and treatment, and may mean we can't read much into the results.
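This back-of-the-envelope check can be written out. A minimal sketch, using the means and shared standard deviation read from Table 2 (the variable names are ours, not the paper's):

```python
from scipy.stats import norm

# Means and (shared) standard deviation read from Table 2
mu_placebo, mu_ipe = 4.1, 5.0
sd = 1.8

# Standardised mean difference (Cohen's d with a shared SD)
d = (mu_ipe - mu_placebo) / sd
print(f"standardised difference: {d:.2f}")  # 0.50

# Overlap of two normal curves whose means are d standard deviations apart
overlap = 2 * norm.cdf(-abs(d) / 2)
print(f"distribution overlap: {overlap:.0%}")  # roughly 80%
```

Two nominally identical randomised groups should overlap almost completely at baseline; a standardised difference of half a deviation is a visible gap.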
Confirm for yourself that these variations are present in each of the plaque types as well (and notice that the Calcification type is less different).
It's hard to reconcile these differences with their claim that "Baseline demographics, vitals, and laboratory results were not significantly different between the IPE and placebo groups."
And how different are they on conclusion of the trial?
# Follow-up variance between study arms
# Placebo
mu_p = 4.6
sd_p = 1.4
x_p = np.arange(-5, 15, 0.1)
y_p = norm.pdf(x_p, mu_p, sd_p)
# Treatment (IPE)
mu_t = 4.5
sd_t = 1.8
x_t = np.arange(-5, 15, 0.1)
y_t = norm.pdf(x_t, mu_t, sd_t)
# Draw the chart
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_p,y_p, color="C0", label="Placebo")
ax.plot(x_t,y_t, color="C1", label="IPE")
ax.legend(loc="best", frameon=False)
plt.show()
The conclusion seems to be that the placebo patients got slightly worse, and the treatment patients got slightly better, and now they're almost the same.
But a trial is supposed to consider the best alternative to a new treatment. The patients in the placebo arm should not have gotten worse. There are questions here that need answers and which we cannot find in the paper.
We shouldn't speculate, but we can ask: what do we think may have happened? We know from the paper that the patients were randomised: "multi-centre, randomized, double‐blind, placebo‐controlled trial". Either the randomisation didn't work, or the measurements were biased in some way. But biased measurements imply a lack of blinding.
In the section on Plaque quantification we are told that an automated imaging system quantified the plaque volume but that: "Once automated software had completed the vessel trace, an expert reader manually corrected areas of misregistration." In other words, a human manually adjusted the values scored by the computer. Could this person have inadvertently biased the plaque volumes in the two groups? We cannot know, but it is at least one potential source of error.
Besides the meaningful difference between the two arms, we also note that the patients in the placebo group got worse. For low-attenuation plaques we can spot the following:
# Variance between baseline and follow-up for low-attenuation plaques
data = {
    "placebo": {
        # Baseline
        "mu_b": 0.8,
        "sd_b": 1.5,
        "lbl_b": "Placebo at Baseline",
        # Follow-up
        "mu_e": 1.6,
        "sd_e": 1.8,
        "lbl_e": "Placebo at Follow-up",
    },
    "treatment": {
        # Baseline
        "mu_b": 1.9,
        "sd_b": 1.8,
        "lbl_b": "IPE at Baseline",
        # Follow-up
        "mu_e": 1.6,
        "sd_e": 1.7,
        "lbl_e": "IPE at Follow-up",
    },
}
fig, ax = plt.subplots(figsize=(9, 6))
for n, arm in enumerate(data.values()):
    x_base = np.arange(-5, 9, 0.1)
    y_base = norm.pdf(x_base, arm["mu_b"], arm["sd_b"])
    ax.plot(x_base, y_base, color=f"C{n}", alpha=0.5, label=arm["lbl_b"])
    x_end = np.arange(-5, 9, 0.1)
    y_end = norm.pdf(x_end, arm["mu_e"], arm["sd_e"])
    ax.plot(x_end, y_end, color=f"C{n}", label=arm["lbl_e"])
ax.legend(loc="best", frameon=False)
plt.show()
There is a clear and unambiguous worsening of the placebo group that exceeds any improvement in the treatment arm. The placebo was chosen because its physical properties, as a mineral oil, made it indistinguishable from the treatment, but could it have inadvertently made patients in that group worse off?
There are other concerns we could consider (1, 2, 3). However, at this stage, how confident do you feel that the study shows what it claims to show? That the new treatment is better than the current best alternative?
You can have a look at how others are reviewing the paper at PubPeer. There is also comment on Medscape:
In an interview, Steven Nissen, MD, who is chair of cardiovascular medicine at the Cleveland Clinic, Cleveland, Ohio, and has been among the critics of the mineral oil placebo, also questioned the plaque progression over the 18 months. "I've published more than a dozen regression/progression trials, and we have never seen anything like this in a placebo group, ever," he said. "If this was a clean placebo, why would this happen in a short amount of time?"
Domain experts are expressing doubts, but you didn't need to be one to evaluate this paper.
Our second paper is a collaboration between 36 researchers:
Packer Milton, Anker Stefan D., Butler Javed et al., 2020. ``Cardiovascular and Renal Outcomes with Empagliflozin in Heart Failure'', New England Journal of Medicine, August 2020. doi:10.1056/NEJMoa2022190
As before, let's review the abstract before diving into the body of the paper:
Background: Sodium–glucose cotransporter 2 (SGLT2) inhibitors reduce the risk of hospitalization for heart failure in patients regardless of the presence or absence of diabetes. More evidence is needed regarding the effects of these drugs in patients across the broad spectrum of heart failure, including those with a markedly reduced ejection fraction.
Methods: In this double-blind trial, we randomly assigned 3730 patients with class II, III, or IV heart failure and an ejection fraction of 40% or less to receive empagliflozin (10 mg once daily) or placebo, in addition to recommended therapy. The primary outcome was a composite of cardiovascular death or hospitalization for worsening heart failure.
Here we have a double-blinded trial with a much larger sample population: 3,730 patients, and we would expect about 1,850 patients in each group. Once again, the treatment molecule - empagliflozin - is the only factor differentiating the two groups which continue to receive the standard recommended therapy.
A primary outcome event - the main measurement - is cardiovascular death or hospitalisation as a result of a worsening heart condition. This is a heart-related study and is conditional on heart-related event data.
Results: During a median of 16 months, a primary outcome event occurred in 361 of 1863 patients (19.4%) in the empagliflozin group and in 462 of 1867 patients (24.7%) in the placebo group (hazard ratio for cardiovascular death or hospitalization for heart failure, 0.75; 95% confidence interval [CI], 0.65 to 0.86; P<0.001). The effect of empagliflozin on the primary outcome was consistent in patients regardless of the presence or absence of diabetes. The total number of hospitalizations for heart failure was lower in the empagliflozin group than in the placebo group (hazard ratio, 0.70; 95% CI, 0.58 to 0.85; P<0.001). The annual rate of decline in the estimated glomerular filtration rate was slower in the empagliflozin group than in the placebo group (–0.55 vs. –2.28 ml per minute per 1.73 m2 of body-surface area per year, P<0.001), and empagliflozin-treated patients had a lower risk of serious renal outcomes. Uncomplicated genital tract infection was reported more frequently with empagliflozin.
The time-frame of the trial was not fixed, so the authors report a median time-frame for a primary outcome event of 16 months. They report differences between treatment and control for a variety of different patient outcomes, all indicating that treatment improves patient outcomes.
Treatment has a side-effect.
Conclusions: Among patients receiving recommended therapy for heart failure, those in the empagliflozin group had a lower risk of cardiovascular death or hospitalization for heart failure than those in the placebo group, regardless of the presence or absence of diabetes.
All in all, a good outcome for the hypothesis that empagliflozin should be prescribed to patients at risk of heart failure. Given your experience with the IPE paper, how reliable are these claims?
The study endpoint is not stated in so many words, but we can derive it. Instead of fixing a time period for the study, as in the IPE paper, here the endpoint is either cardiovascular death or heart failure hospitalisation - i.e. a coronary-related event.
The objective of any study is to limit confounding variables. If the purpose of a study is to measure the effectiveness of a new treatment on patient outcomes then the most important thing to measure is patient outcomes. Setting an arbitrary time-limit may mean that healthcare events fall outside the measurement period and aren't recorded. This will bias the results.
Death is the hardest of endpoints, but not all patients will die. Most patients receiving treatment will get better (or, at least, be able to manage their conditions), although they are at high risk of hospitalisation given the nature of heart failure. So hospitalisation becomes a soft endpoint.
This ensures the study will end (otherwise, waiting for all patients to die would make the study unhelpfully long). It also massively reduces the sample size we'd need to ensure a statistically valid result. The absolute number of people who die will be significantly less than the number who are hospitalised. From a patient's perspective, death is certainly the worst outcome, but frequent hospitalisation is life-altering. By including these two events in the study, the number of patients recruited can be reduced, and your chance of getting people to participate improves (imagine recruiting people with the promise the study ends only when they die).
By removing confounding variables, the study authors improve their opportunity to record a meaningful result. From a research perspective, they increase the statistical power of their result on a smaller sample.
You could ask: why not include all-cause mortality? What happens if the treatment causes some catastrophic side-effect unrelated to heart failure? Such occurrences are notifiable events; they trigger a review regardless. The authors included the side-effect of genital tract infection, so it's not as if such things are being ignored; they are just not a primary outcome event.
On page 3 of the report, Statistical Analysis, the authors state: "We determined that a target number of 841 adjudicated primary outcome events would provide a power of 90% to detect a 20% lower relative risk of the primary outcome in the empagliflozin group than in the placebo group at a two-sided alpha level of 0.05. Assuming an annual incidence of the primary outcome of at least 15% per year in the placebo group and a recruitment period of 18 months, we established a planned enrollment of 2850 patients, with the option of increasing the enrollment to 4000 patients if the accumulation of primary outcome events was slower than expected." The entirety of this section is worth reading to see how they estimated results which may require early termination of the study as well.
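We can sanity-check that target of 841 events. The paper doesn't state which formula the authors used, but Schoenfeld's approximation for the number of events needed in a two-arm survival trial is a standard choice, and it lands close to their figure:

```python
import math
from scipy.stats import norm

alpha, power = 0.05, 0.90
hr = 0.80  # a 20% lower relative risk in the treatment group

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided alpha of 0.05
z_beta = norm.ppf(power)           # 90% power

# Schoenfeld's approximation, assuming 1:1 allocation between arms
events = 4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2
print(f"required events: {events:.0f}")  # ~844, close to the stated 841
```

The small gap between ~844 and 841 would come down to rounding and the exact method the authors used, which we can't recover from the paper.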
Next, what is the event rate ratio: the number of events in the treatment group divided by the number in the control group?

$$ \frac{n_t}{n_c} \approx 0.8 $$

This is sanity-checking rather than specific accuracy. Use the event-related percentages and mentally divide 19.4/24.7 (or 20/25). We're studying people sampled from a general population with all its variability. You're looking for a clear signal of efficacy. These approaches to quick 'n dirty mental arithmetic help you read the paper fluidly and test whether conclusions stated in the text are reasonable.
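The mental arithmetic checks out against the exact event counts from the abstract:

```python
# Event counts from the Results section of the abstract
events_treatment, n_treatment = 361, 1863  # empagliflozin group
events_placebo, n_placebo = 462, 1867     # placebo group

rate_t = events_treatment / n_treatment   # ~19.4%
rate_p = events_placebo / n_placebo       # ~24.7%
ratio = rate_t / rate_p
print(f"{rate_t:.1%} / {rate_p:.1%} = {ratio:.2f}")  # 19.4% / 24.7% = 0.78
```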
Reading further, you see the hazard ratio for a primary outcome event is 0.75 which is close to our sanity-check approximation of 0.8.
Let's review the study groups:
Unlike in the IPE paper, we are not presented with a standard deviation. Instead we get a point estimate of the hazard ratio between the two groups, with a range given as a 95% confidence interval.
We can't chart a mean and standard deviation for the baseline and endpoint stages of the treatment and control groups because those data are not given. The endpoint for the IPE study was a time-period, with measurement of the change in volume of coronary plaques. You could then measure plaques in each of the groups at the beginning and end of the trial.
In the empagliflozin study, the baseline is that none of the participants are either dead or in hospital. This is binary: inclusion required that they had not yet experienced either endpoint. Therefore there's no value in assessing these groups separately. We must conduct our assessment on the variation between the groups on conclusion of the study.
Similarly, these endpoints are far easier to quantify. Unlike measuring the volume of a plaque, there is no measurement error to worry about: a patient either is, or is not, dead or in hospital at the end of their participation. Much less chance of ambiguity in that result.
The 95% confidence interval tells us that the true hazard ratio - the ratio of how likely either group was to experience a primary outcome event - plausibly lies between 0.65 and 0.86.
If there were no difference between the two groups, the hazard ratio would be 1. The ratio is treatment to placebo. As the ratio shifts towards zero, patients in the treatment group experience progressively less risk than those in the control group. If it were greater than 1, those in the placebo group would have the more favourable outcomes.
They draw this in the following figure:
They've saved us some trouble on chart drawing. These are called box plots, and they summarise data by showing the middle 50% of the data (the "box" part of the plot) and then an upper and lower "whisker" which capture the rest of the data, giving a visual idea of the distribution in a small space. In this chart, the size of each box has been scaled to represent the size of the group represented by the ratio. That is clear in something like age, where most participants are above the age of 65.
This figure also gives a clear view of variance along the hazard ratio continuum. Patient outcome improvement is unambiguous.
Slightly beyond the theory covered in this lesson, but: how do you tell if the effect measured wasn't entirely due to chance?
The $p$-value will be introduced in future lessons, but briefly: it is the probability of observing a result at least as extreme as this one if the null hypothesis (no treatment effect) were true. The smaller the $p$-value, the less plausible the null hypothesis. The authors state that $p < 0.001$, which is very low.
However, we can ask questions about blinding and about whether the endpoint matters. Would changes to these criteria affect the outcome of the study?
In Effect of Study Design on the Reported Effect of Cardiac Resynchronization Therapy, Jabbour et al. considered whether observational, randomised-but-unblinded, or randomised-and-blinded trials influenced study findings \cite{jabbou_effect_2015}.
They summarise their findings as follows:
This figure is similar to the empagliflozin study, presented with similar mean and distribution ranges. The charts used here are violin plots, a subtle variation on the box plot.
It should be obvious that the key factor is not the choice of endpoint but the study design. Randomisation with blinding is key. Observational studies are no better than unblinded studies (for these clinical trials, anyway). The authors didn't test double-blinding, so only the patients were unaware of their treatment status. It would have been interesting to see this study repeated with a double-blind group included.
Our analysis - with the skills learned to this point - can only take us so far. As with the IPE study, we're not questioning the values themselves reported in the study, merely if they're internally consistent and if the textual analysis is aligned to the reported data.
In the IPE study, the authors claimed that the groups were the same at baseline and changed over the duration of the study. This was not aligned with their own data.
In the empagliflozin study, the data and analysis are aligned, and - from a review perspective - their assumptions, data and analysis are better presented.
There's more we could consider. For instance, are the side-effects of treatment predictable? If you considered the way empagliflozin works (described in the paper) you may be able to work this out.
There has been less online review of this paper (and less controversy), but you can keep an eye on PubPeer and read Prof Francis' public review.
Concluding our review of randomisation and blinding in the papers, how do you feel about each? Does either paper feel as if it rests on a solid foundation of randomisation and blinding across its two treatment groups? Which paper would you be more comfortable recommending, if any?
The objective of a published journal article is to persuade the authors' peers of the validity of their work. The easier you make that, the more likely it is to be read, and the more persuasive it will be.
Where a reader comes away confused or hesitant, the work has failed. One reason authors may deliberately reduce the clarity of their presentation is precisely that they lack confidence in their conclusions.
Well-designed charts and tables make any differences between data and conclusions obvious. They are not only an aid to understanding for the reader, but a meaningful part of the analytical process for the researchers. The IPE paper has no charts, while the empagliflozin does. The latter also includes a better description of methods and assumptions.
Clarity is key to validation.
Histograms and scatter charts are useful for analysis, but they are also bulky and presenting them in a way that aids comparisons is difficult. The charts we saw used in the empagliflozin and study design papers are a form of distribution chart.
Here's how you can draw them. We'll use Seaborn, an extension to Matplotlib that offers shortcuts to standardised statistical charts, and a more modern design. If you haven't already installed it, pip install seaborn will do the job.
Seaborn has a repository of sample data which we will use for demonstrating the various functions. We're not going to analyse these data. They are useful only for demonstration. In future lessons, we'll use these charts along with case study data.
# Import Seaborn, and sample data to demonstrate a box plot
# https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot
import seaborn as sns
sns.set(style="whitegrid")
# Load our sample data ... we're not going to analyse it
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])
A box plot is a standardised way of displaying a statistical distribution based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
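A minimal sketch of that summary computed directly with NumPy (the sample values are ours, chosen to include one deliberate outlier):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # 50 is a deliberate outlier

# The five-number summary
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)  # 1.0 3.25 5.5 7.75 50.0

# Tukey's common convention: whiskers extend 1.5 * IQR beyond the quartiles,
# and anything further out is drawn as an individual outlier point
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [50]
```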
The box plot itself is constructed of two parts, the box spanning the IQR, and a set of whiskers indicating the minimum and maximum. There are variations here:
Any data observations which fall outside this maximum and minimum range are presented as points. These are the outliers. These extreme measurements offer a range of insight:
Box plots are extremely concise ways of presenting data and permit comparisons between quite complex observations.
ax = sns.boxplot(x="total_bill", y="day", hue="smoker",
                 data=tips, palette="Set3")
If you want to go further and show the observation distribution on the box plots, you can combine a swarm plot with a box plot. The swarm points have jitter applied, so their positions on the chart are not directly meaningful, but they give a sense of how the measurements were distributed, as well as the relative positioning of the outliers.
ax = sns.boxplot(x="day", y="total_bill", data=tips)
ax = sns.swarmplot(x="day", y="total_bill", data=tips, color=".25")
Violin plots are similar to the box plot. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.
As before, we'll use sample data from Seaborn to demonstrate how to draw them.
# Import Seaborn, and sample data to demonstrate a violin plot
# https://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot
import seaborn as sns
sns.set(style="whitegrid")
# We're using the same dataset
tips = sns.load_dataset("tips")
ax = sns.violinplot(x=tips["total_bill"])
# Comparison of nested groupings by two categorical variables
ax = sns.violinplot(x="total_bill", y="day", hue="smoker",
                    data=tips, palette="muted")
These approaches to presenting data distributions encourage engagement and support review of the means, standard deviations and variances presented in your tables. Not everyone finds it easy to read a table of numbers and instantly tell how they relate to each other.
If you have a strong point to make, reinforce it with these and learn how to read them. If you want to make a claim that two randomised sample distributions are the same and can physically see they look different on a chart, then you really need to review your analysis or assumptions.
Ultimately, it is for journal authors to make their case to others, not for everyone else to simply agree. The case-studies here took years of research and measurement to produce. It takes seconds to create a set of easy-to-read pleasant-looking charts that vastly improve understanding of what can be a complex and data-intensive study.
If you have a strong set of research results to present, reinforce it with good presentation, don't undermine it by burying detail in the text.
Ethics: Research is performed by moral agents undertaking activities which may affect moral subjects such that their rights - both positive and negative - must be considered within the context of legal and moral means and ends.
Researchers, particularly those working with human subjects, have a duty of care towards their stakeholders that arises from rights they, or their stakeholders, may hold. Following Emanuel, Wendler and Grady, there are seven criteria against which we may assess moral subject participation in research: social or scientific value, scientific validity, fair subject selection, a favourable risk-benefit ratio, independent review, informed consent, and respect for potential and enrolled subjects.
Curation: The data management process is there to support data collection, but it is the responsibility of the data scientist to sanity check data going in. There are physical and technical limits to measurements and you need to have a comprehensive understanding of your research domain to ensure that measurements reflect reality.
Randomisation of the sample selection process is critical to ensure representation and avoid accidentally introducing bias. There are five factors to consider:
Analysis: We start any analysis by exploring our data to understand its shape, distribution and variance. The mean is a measure of the centre of the distribution of a data series. This is written as $\bar{x}$ or $\mu$ (mu) and is the sum of all of the observations divided by the number of observations. These distributions can be normal, where the observations are symmetric about the central axis, or they can be skewed. The distance of an observation from its mean is its deviation. The standard deviation is defined as the square root of the variance.
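These definitions translate directly into code. A quick first-principles sketch (the observations are an arbitrary example):

```python
import math

observations = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: the sum of all observations divided by their number
mean = sum(observations) / len(observations)

# Deviation: the distance of each observation from the mean
deviations = [x - mean for x in observations]

# Variance: the mean of the squared deviations (population form)
variance = sum(d ** 2 for d in deviations) / len(observations)

# Standard deviation: the square root of the variance
sd = math.sqrt(variance)
print(mean, variance, sd)  # 5.0 4.0 2.0
```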
Peer review is the process of subjecting scholarly research, work or ideas to the scrutiny of others who, ordinarily, are drawn from amongst the producer's peers.
Presentation: A box plot is a standardised way of displaying a statistical distribution based on a five-number summary. If you have a strong set of research results to present, reinforce it with good presentation, don't undermine it by burying detail in the text.
Conduct your own peer review on any of the publications listed here:
As your skills develop, you will gain confidence in ever-deeper review but, for now, focus on randomisation and distribution.
You are also welcome to conduct your own review. There are a host of publication repositories which support redistribution of published research. Here are a few and you can explore them to run your own reviews:
If you're searching, remember to use the American spelling for randomization.
(Budoff, Bhatt et al., 2020) Budoff Matthew J., Bhatt Deepak L., Kinninger April et al., ``Effect of icosapent ethyl on progression of coronary atherosclerosis in patients with elevated triglycerides on statin therapy: final results of the EVAPORATE trial'', European Heart Journal, August 2020. online
(Packer, Anker et al., 2020) Packer Milton, Anker Stefan D., Butler Javed et al., ``Cardiovascular and Renal Outcomes with Empagliflozin in Heart Failure'', New England Journal of Medicine, August 2020. online
(Maslin and Wallace, 2018) Maslin Douglas and Wallace Marc, ``Cutaneous larva migrans with pulmonary involvement'', Case Reports, vol. 2018, February 2018. online
(Baggini and Fosl, 2007) Julian Baggini and Peter Fosl, ``The Ethics Toolkit'', 2007. online
(Kramer, Guillory et al., 2014) Kramer Adam D. I., Guillory Jamie E. and Hancock Jeffrey T., ``Experimental evidence of massive-scale emotional contagion through social networks'', Proceedings of the National Academy of Sciences, vol. 111, number 24, pp. 8788--8790, June 2014. online
(Shaw, 2015) Shaw David, ``Facebook’s flawed emotion experiment: Antisocial research on social network users'', Research Ethics, May 2015. online
(Vickers, Kramer et al., 2006) Vickers Andrew J., Kramer Barry S. and Baker Stuart G., ``Selecting patients for randomized trials: a systematic approach based on risk group'', Trials, vol. 7, number 1, pp. 30, October 2006. online
(Emanuel, Wendler et al., 2000) Emanuel Ezekiel J., Wendler David and Grady Christine, ``What Makes Clinical Research Ethical?'', JAMA, vol. 283, number 20, pp. 2701--2711, May 2000. online
(Chait, 2014) Gavin Chait, ``Technical assessment of open data platforms for national statistical organisations'', World Bank Group, December 2014. online
(Downey, 2014) Allen B. Downey, ``Think Stats 2 - Exploratory Data Analysis in Python'', 2014. online
(Vu and Harrington, 2020) Julie Vu and David Harrington, ``Introductory Statistics for the Life and Biomedical Sciences'', July 2020. online
(Dahmen and Cook, 2019) Dahmen Jessamyn and Cook Diane, ``SynSys: A Synthetic Data Generation System for Healthcare Applications'', Sensors (Basel, Switzerland), vol. 19, number 5, March 2019. online
(Dietz, Barr et al., 2015) David M Dietz, Christopher D Barr and Mine Çetinkaya-Rundel, ``OpenIntro Statistics'', 2015. online
(Crane and Martin, 2018) Crane Harry and Martin Ryan, ``In peer review we (don't) trust: How peer review's filtering poses a systemic risk to science'', Researchers.One, September 2018. online
(Brembs, 2019) Brembs Björn, ``Reliable novelty: New should not trump true'', PLoS Biology, vol. 17, number 2, February 2019. online
(Stern and O’Shea, 2019) Stern Bodo M. and O’Shea Erin K., ``A proposal for the future of scientific publishing in the life sciences'', PLoS Biology, vol. 17, number 2, February 2019. online
(Jabbour, Shun‐Shin et al., 2015) Jabbour Richard J., Shun‐Shin Matthew J., Finegold Judith A. et al., ``Effect of Study Design on the Reported Effect of Cardiac Resynchronization Therapy (CRT) on Quantitative Physiological Measures: Stratified Meta‐Analysis in Narrow‐QRS Heart Failure and Implications for Planning Future Studies'', Journal of the American Heart Association, vol. 4, number 1, pp. e000896, May 2015. online