Here’s a deconstruction of the elements:
Attributable: The generated data must be linkable to the people, systems, equipment, etc. that generated it (the source of the data), as well as to a discrete point in time.
Error source example: If a password is used to create recipes, a single password can be shared by multiple people, so the data cannot be attributed to a single operator.
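The attributability requirement can be sketched in code. This is a minimal illustration, not any real system's API: every account name and function here is hypothetical. The idea is that each record carries an individual user identity and a capture timestamp, and generic shared accounts are refused at the point of entry.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical generic logins that cannot attribute an action to one person.
SHARED_ACCOUNTS = {"operator", "lab", "admin"}

@dataclass(frozen=True)
class AttributableRecord:
    value: str              # the recorded observation or entry
    user_id: str            # the individual who generated the data
    recorded_at: datetime   # the discrete point in time of capture

def record_entry(value: str, user_id: str) -> AttributableRecord:
    """Create a record attributed to a single, named individual."""
    if user_id in SHARED_ACCOUNTS:
        raise ValueError(
            f"Shared account '{user_id}' cannot attribute data to one operator"
        )
    return AttributableRecord(value, user_id, datetime.now(timezone.utc))
```

In a real system the individual identity would come from authenticated login rather than a function argument, but the principle is the same: one credential, one person.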
Legible: Can the data be read and understood years after recording? This can refer to certain pens (ink) or printers (e.g., thermal) whose output fades over time with exposure to the environment. It can also refer to colorful, euphemistic language and convoluted syntax that can obscure meaning over time.
Error source example: Certain printers produce printouts that fade or become illegible within just a few weeks or months. This makes the selection of data capture methods critical, so that you do not end up with unreadable records several years down the road.
Contemporaneous: This means connected, or together, in time: data must be captured at the same time as their corresponding operation. Future dating (or timestamping) is strictly prohibited, as is retroactive dating, because in either case the record is not contemporaneous with the activity it documents.
Error source example: Pre-populating batch record fields with information about something expected to happen in the near future is a major source of errors. Perhaps the filter lot numbers have changed, making the initial entry incorrect. Because this type of activity is not contemporaneous, it is prohibited.
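A contemporaneity check can be sketched as a simple validation rule: the recording time must fall within a small tolerance of the operation time, rejecting both pre-populated (future-dated) and retroactive (back-dated) entries. The tolerance value and function name below are assumptions for illustration only:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance: how much clock difference we accept before an
# entry is considered non-contemporaneous with the operation it documents.
MAX_SKEW = timedelta(minutes=5)

def validate_contemporaneous(event_time: datetime, recorded_at: datetime) -> None:
    """Reject entries recorded before the operation (pre-population)
    or too long after it (retroactive documentation)."""
    delta = recorded_at - event_time
    if delta < -MAX_SKEW:
        raise ValueError("Future-dated entry: recorded before the operation occurred")
    if delta > MAX_SKEW:
        raise ValueError("Back-dated entry: recorded too long after the operation")
```

Real systems would also need trusted, synchronized clocks for this check to mean anything; a validation rule is only as good as the timestamps it compares.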
Original: Original records (laboratory notebooks, batch records, study reports, etc.) should be kept rather than copies. Entering the system through “backdoors” to modify data after it has been recorded violates data originality and is prohibited.
Error source example: Manufacturing or lab notes written with the intent of later transcribing them as the official record would violate the originality of the source data.
Accurate: The recorded data should reflect the reality of what happened during a given activity. In addition, if changes have been made, the accuracy clause requires documentation of those changes, tracing them back to the original information to preserve the accurate nature of the data. Electronic recording of data must take place within systems that have accuracy and verification checks. This is also one of the reasons measuring equipment needs to be calibrated: to keep an accurate source record of what is being measured.
Error source example: Removing or erasing data in order to record something that changed during operations violates this clause, because the activities are not recorded as performed.
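The traceable-correction requirement can be sketched as well: instead of overwriting a value, an amendment records the old value, the reason, the person, and the time, so every change traces back to the original entry. All class and field names below are hypothetical, not from any real records system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Correction:
    old_value: str
    new_value: str
    reason: str
    corrected_by: str
    corrected_at: datetime

@dataclass
class DataPoint:
    value: str
    history: list = field(default_factory=list)  # trail back to the original

    def correct(self, new_value: str, reason: str, user: str) -> None:
        """Amend the value without erasing the original entry."""
        self.history.append(
            Correction(self.value, new_value, reason, user,
                       datetime.now(timezone.utc))
        )
        self.value = new_value
```

The key design choice is that there is no delete or overwrite path at all: the only way to change a value is through a method that preserves what it replaced.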
Complete: One of the “+” elements added to the original ALCOA; all records must have an audit trail showing any changes, and nothing may be deleted or lost.
Error source example: Performing a new QC lab test in which a secondary result may override or replace the original (first) result obtained.
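One way to sketch the completeness requirement is an append-only result log: a retest adds a second result alongside the first rather than replacing it, so the full history is always available for review. The names here are illustrative, not from any real LIMS:

```python
class ResultLog:
    """Append-only log of QC results: a retest adds a result;
    it can never replace or delete an earlier one."""

    def __init__(self):
        self._results = []  # no removal or overwrite methods exist

    def add(self, test_id: str, value: float, analyst: str) -> None:
        self._results.append(
            {"test": test_id, "value": value, "analyst": analyst}
        )

    def all_results(self, test_id: str) -> list:
        """Return every result ever recorded for a test, in order."""
        return [r for r in self._results if r["test"] == test_id]
```

Whether the second result supersedes the first for batch-release purposes is then a documented quality decision made on top of the complete record, not an edit to the record itself.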
The past eight years have produced the highest number of data integrity warning letters in history. Part of this is due to greater scrutiny than ever before, but these problems nevertheless exist and continue to grow in number.
In one specific industry example, analysts at an analytical lab had system access to delete and overwrite data; FDA investigators found about 36 deleted data files or folders in the recycle bin.
Further examples from recent FDA warning letters to various organizations include the following1:
- “Your company has failed to maintain proper controls on computer or related systems to ensure that only authorized personnel make changes to master production and control records or other records (21 CFR 211.68(b)).”
- “FDA analyses of research data generated at these companies and submitted in several applications … found significant instances of misconduct and violations of federal regulations, resulting in the submission of invalid study data to the FDA.”
- “Your failure to preserve study records as required by FDA regulations significantly compromises the validity and integrity of the data collected at your site.”
- “The batch record documented one employee performing multiple production steps, such as measuring containers and bulk reconciliation on two separate dates, and a second employee documenting activity verification. However, the second employee (verifier) stated to our investigator that they were not working when these steps were documented to be performed.”
- “Your quality system has not sufficiently ensured the accuracy and integrity of the data to support the safety, effectiveness and quality of the drugs you produce. Without accurate data, you cannot make accurate decisions regarding batch release, product stability, and other issues that are fundamental to ongoing quality assurance.”
Our world is inundated with more data than ever before in the history of civilization, by a wide margin. It is estimated that the creation, storage, and retrieval of data and information represents a third of all revenue generated in the world. But herein lies the problem: when the volume and velocity of data scale up quickly, it often happens without enough attention being paid to its veracity, the “truth” of the data. This fact, along with the sheer volume of appalling data analysis constantly being performed on tortured datasets around the world every day, has led to the saying that we are “drowning in data, but thirsty for information”.
When we perform analysis on data that has been collected incorrectly, we begin to model the error within that data, magnifying it and leading to incorrect conclusions. In many industries, this leads to simple decision-making errors. But in healthcare and pharma, misapplied analysis of the wrong data can lead to conclusions that ultimately cause public harm. For example, in clinical trials, data lacking integrity can directly lead to committing a Type I or Type II error, both of which would be invisible to the data analyst if the underlying integrity of the data were unknown.
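As a toy illustration with entirely invented numbers, consider a hypothetical potency assay measured on an uncalibrated instrument that reads systematically low. The true batch conforms to specification, the recorded data fails it, and nothing in the recorded dataset alone reveals the error:

```python
from statistics import mean

# Hypothetical acceptance criterion: batches pass if mean potency >= 95.0 units.
SPEC_LIMIT = 95.0

# What actually happened in the process (unknowable from the records alone):
true_measurements = [96.1, 95.8, 96.4, 95.9, 96.2]

# An uncalibrated instrument reads 1.5 units low, corrupting every record.
bias = -1.5
recorded = [m + bias for m in true_measurements]

passes_true = mean(true_measurements) >= SPEC_LIMIT   # the batch truly conforms
passes_recorded = mean(recorded) >= SPEC_LIMIT        # the records say it fails
```

Here the recorded data drives a false rejection; with the opposite bias it would drive a false acceptance. Either way, the analyst working only from the recorded values has no way to detect the problem, which is exactly why calibration and data integrity are prerequisites for trustworthy analysis.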
According to IDC estimates, 64 zettabytes (64 trillion gigabytes) of data were created or replicated in 2020, compared with an estimated 45 zettabytes in 2019 and a predicted 140 zettabytes in 2024. Of all these mountains of data, only about 10% was tagged data deemed useful for analysis or as input into AI/ML.
Now compare this to an interesting general review of business data published in Harvard Business Review by Nagle et al.: only 3% of the data reviewed in their study was rated “acceptable”, meaning the other 97% fell on the wrong side of the scale (“unacceptable”).2 This means that AI models are often trained on big datasets riddled with unacceptable levels of data quality errors.
Artificial intelligence (AI) models often draw incorrect conclusions because of a lack of data integrity. This should not be surprising: machine learning algorithms live and die on the quality of their datasets, so errors in the data will necessarily degrade their results. As we move to more and more AI/ML/DL analytics in health data, including molecular discovery and development, clinical trial analytics, and so on, the integrity of the underlying data is, and will continue to be, key to the future of healthcare.
Data without integrity is like a wrapped empty box: it looks good on the outside, but it has no value. Data lacking integrity are really just numbers, and unreliable ones at that. The great theoretical physicist Wolfgang Pauli once lamented, “Not only is that not right, it’s not even wrong.”* He was expressing frustration with careless and incorrect thinking, and also pointing out that there are different levels of error: being so wrong that you don’t even know you’re wrong is one of the worst kinds, the kind of data error that cannot be falsified. And so it is with data of unknown source or origin analyzed daily around the world: can we trust the results? That is the ultimate data integrity question facing the FDA and other international regulatory authorities.
* Pauli’s original formulation was: “Das ist nicht nur nicht richtig, es ist nicht einmal falsch!”
- US Food and Drug Administration. (2022). Warning letters. https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/warning-letters
- Nagle, T., Redman, T.C., and Sammon, D. (2017). Only 3% of company data meets basic quality standards. Harvard Business Review.
Ben Locwin is an executive and careful consumer of global data and analytics. Whether it is clinical trial analysis, drug development or manufacturing data, analysis intended for socio-political endeavors, or everyday basic data and information forced upon us through media sources, the credibility and integrity of the data matters. We have a right to expect more from our data, and we can help ensure this by requiring and enforcing the integrity of the data sources and their collection practices.