One sign of successful research is the publication of its results in a respected journal, which makes them known to the broader scientific community. Our colleague, researcher Rudolf Wittner, recently achieved this with an article in the prestigious journal Scientific Data, published by Nature Publishing Group.
The paper, entitled "Lightweight Distributed Provenance Model for Complex Real-world Environments", addresses the use of provenance information to increase the reproducibility of research in the life sciences.
Automated Generation of Reliable and Trustworthy Documentation
One of the main characteristics of modern research is that research objects are typically transferred between organizations. One example is the collection and processing of biological material, from which data is subsequently generated. This data can be integrated with data from other sources and further processed. The individual steps of the entire process are typically carried out by different types of organizations, such as hospitals, biobanks, analytical laboratories, universities, computing centers, or private companies (for example, pharmaceutical companies), between which the research objects are transferred.
The current problem is verifying the origin and quality of the transferred objects, whether biological samples, data, or software tools. Documentation of the individual parts of these objects' life cycle is created separately and at different times (the delay can be up to several years). As a result, it is error-prone and often incomplete, untraceable, or missing altogether. One way to solve this problem is the automated generation of reliable and trustworthy documentation of the entire process, otherwise known as provenance information. This is exactly the goal of the research conducted by Rudolf Wittner and his colleagues.
In the published work, they propose a data model for documenting experiments in the life sciences that enables the creation of distributed provenance for relevant processes and related objects. The data model also enables the integration of provenance from various heterogeneous sources. Its main characteristic is that the resulting provenance can be searched using a unified algorithm, which was not possible before. The proposed model also takes into account cases in which parts of the provenance do not exist. Such situations may arise, for example, if a given organization does not create documentation according to the proposed model or if its documentation suddenly ends.
Other features of the model include the ability to describe both digital and physical objects, as well as domain and technology independence. Thanks to its simplicity, it can be used in a wide range of areas, from sampling biological material to data processing or training an artificial intelligence model. An important part of the published work is a proposed procedure for provenance versioning and management, which forms the basis for ensuring the authenticity, integrity, and non-repudiation of provenance, and thus its credibility. The presented data model also covers the protection of sensitive data. Such data need not concern only donors and their health status; it can also be operational information, for example, regarding the transport of pathogens (infectious substances).
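To make the idea of searching distributed provenance more concrete, here is a minimal sketch in Python. It is not the model from the paper: the `Record` structure, the `trace_origin` function, and all identifiers in the example data are hypothetical and chosen only for illustration. The sketch assumes each organization keeps its own list of records and that every record points to at most one predecessor object. It walks backward from an object and reports whether the chain is complete or stops at a missing provenance part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One organization's provenance entry for an object (hypothetical structure)."""
    object_id: str
    organization: str
    derived_from: Optional[str]  # id of the predecessor object, None for an origin


def trace_origin(object_id, stores):
    """Walk backward through per-organization record stores.

    Returns (chain, complete): the records found, newest first, and a flag
    that is False when the walk hit an object with no provenance record.
    """
    # Merge the separately kept stores into a single lookup by object id.
    index = {rec.object_id: rec for store in stores.values() for rec in store}
    chain = []
    seen = set()
    current = object_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        rec = index.get(current)
        if rec is None:
            # A provenance part is missing: return what is known so far.
            return chain, False
        chain.append(rec)
        current = rec.derived_from
    return chain, True


# Illustrative data only: three organizations, one gap ("aliquot-9" is undocumented).
stores = {
    "hospital": [Record("sample-1", "hospital", None)],
    "biobank": [Record("aliquot-1", "biobank", "sample-1")],
    "lab": [
        Record("image-1", "lab", "aliquot-1"),
        Record("image-2", "lab", "aliquot-9"),
    ],
}

full_chain, full_ok = trace_origin("image-1", stores)
partial_chain, partial_ok = trace_origin("image-2", stores)
```

In this toy data, the walk from `image-1` reaches the original sample at the hospital, while the walk from `image-2` stops after one record because its predecessor was never documented.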
A Model Applicable in Various Areas of Biomedical Research
Although the data model in the published work is prototyped for a specific example from the field of digital pathology, the goal is its broad application in various areas of the life sciences. The model is already being applied in the BY-COVID project, which aims to design a platform for integrating and processing research and health data related to viral diseases, including, for example, COVID-19 or monkeypox. The plan is to verify the model's applicability in various areas of biomedical research, such as genetic data processing, biological material, or tissue engineering. The data model will also be applied in several other areas. Since the model is also undergoing standardization within the International Organization for Standardization (ISO) for the field of biotechnology, it has great potential for future application in industry.
Creating complete documentation of the relevant research processes could bring other benefits in addition to verifying the quality, origin, and suitability (so-called fitness for purpose) of the transferred objects. Depending on the content of the resulting provenance, it might be possible to use it, for example, to determine the propagation of errors in research (for example, if an error is detected in the collection of biological material, we want to know which other objects it affects), to trace the original donors of biological material in case of incidental findings regarding their state of health, or to determine the affected data when consent to the processing of personal data is revoked or updated.
Rudolf Wittner is now in his third year of researching provenance information. Unsurprisingly, he still feels that he is only at the beginning of the journey. If you are interested in learning more about his research, you can listen to a recently recorded podcast in which he talks about the work with his dissertation supervisor, Petr Holub.
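The error-propagation use case can be illustrated with a small sketch, assuming the provenance can be reduced to simple "derived from" pairs between objects. This is an illustration only, not the procedure from the paper; the function `affected_objects` and the object identifiers are hypothetical. Given one faulty object, a breadth-first traversal collects everything derived from it, directly or indirectly.

```python
from collections import defaultdict, deque


def affected_objects(derivations, faulty):
    """Return every object derived, directly or indirectly, from `faulty`.

    `derivations` is an iterable of (source, derived) pairs.
    """
    # Index the derivation pairs by source object.
    children = defaultdict(list)
    for source, derived in derivations:
        children[source].append(derived)

    affected = set()
    queue = deque([faulty])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in affected and child != faulty:
                affected.add(child)
                queue.append(child)
    return affected


# Illustrative data only: two samples, with "sample-1" found to be faulty.
derivations = [
    ("sample-1", "aliquot-1"),
    ("sample-1", "aliquot-2"),
    ("aliquot-1", "image-1"),
    ("image-1", "dataset-1"),
    ("sample-2", "aliquot-3"),
]

impacted = affected_objects(derivations, "sample-1")
```

Here `impacted` contains the two aliquots, the image, and the dataset that descend from `sample-1`, while `aliquot-3` is left out because it comes from a different sample. The same traversal could in principle serve the consent case, starting from the data of the donor whose consent changed.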
RNDr. Rudolf Wittner
Researcher and developer from the cybersecurity and data management division, a MU Faculty of Informatics graduate in the field of information technology security, and a current Ph.D. candidate. He focuses on research in the field of provenance information.