Lecture

Enabling data-science in the laboratory with open formats and infrastructure

  • 09.04.2024 at 15:30 - 16:00
  • ICM Saal 4b
  • Language: English
  • Type: Lecture

Lecture description

Data science, and more specifically the impressive progress in machine learning and artificial intelligence, has rekindled the interest of analytical laboratories in taking data gathering and reuse more seriously. Analytical data play a major role in chemical, food, and pharmaceutical product development and are collected at various points throughout a product’s lifecycle. These data are often gathered only to calculate a narrow set of parameters that are filed as an electronic document, and are then not used again. This is unfortunate because analytical data contain richer information than what they are typically reduced to. In chromatography, for example, the chromatograms themselves, unknown peaks, and mass spectrometry (MS) data, to name a few, constitute a vast resource that could be used to generate additional value.

One of the main challenges in reusing laboratory data for other purposes, such as data science projects, is the process of extracting, transforming, and loading the data (ETL). The data from analytical instruments are typically stored in application-specific formats that are not compatible with common data analytics tools. To make them compatible, the data must be extracted using either the instrument software's built-in exports or custom tools that can parse the application files. However, different vendors have different formats and capabilities, so this step requires significant customization. After extraction, the data must be aggregated and stored in a central repository that can be accessed by data analytics software. This repository is an important part of the infrastructure that must be set up and maintained. Therefore, enabling reuse of analytical data usually requires dedicated resources that are not available to most small organizations or academic labs.
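The extract, transform, and load steps described above can be sketched as follows. This is only a minimal illustration, not the pipeline presented in the lecture: the CSV export, its column names, the sample values, and the `peaks` table are all hypothetical, and SQLite stands in for whatever central repository a laboratory would actually use.

```python
import csv
import io
import sqlite3

# Hypothetical text export from an instrument's software.
# Column names and values are assumptions made for illustration.
EXPORT = """sample_id,peak_name,retention_time_min,area
S-001,caffeine,3.42,15230.5
S-001,unknown,5.10,420.0
S-002,caffeine,3.40,14875.2
"""


def extract(text):
    """Extract: parse the exported text into one dictionary per peak."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert string fields into typed tuples."""
    return [
        (
            row["sample_id"],
            row["peak_name"],
            float(row["retention_time_min"]),
            float(row["area"]),
        )
        for row in rows
    ]


def load(conn, records):
    """Load: store the records in a central table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS peaks ("
        "sample_id TEXT, peak_name TEXT, "
        "retention_time_min REAL, area REAL)"
    )
    conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=EXPORT):
    """Run all three steps against the given database connection."""
    load(conn, transform(extract(text)))
    return conn


if __name__ == "__main__":
    connection = run_etl(sqlite3.connect(":memory:"))
    for record in connection.execute("SELECT * FROM peaks"):
        print(record)
```

In practice the `extract` step is where vendor-specific customization accumulates, since each application format needs its own parser or export routine.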

Open data formats can make these workflows more efficient by creating a common standard for instruments in the same category across different vendors [1]. In this way, open data formats can reduce the need to customize the same pipeline for each vendor-specific variation. In this contribution, I provide a summary of open data formats in the chromatography field, their features, such as model and representation, and the problems they can address. I also show a prototype application that covers the entire data flow from analytical results to a central repository that allows direct queries into the data. Using this database, I illustrate interactive data exploration with standard data analytics tools to show the potential of data science applications when laboratory data are stored and processed with that purpose in mind.
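As a rough illustration of what a direct query into a central repository might look like, the sketch below summarizes unknown peaks across samples. It does not reproduce the prototype shown in the lecture: the in-memory SQLite database, the `peaks` table, and all sample values are assumptions chosen only to make the example runnable.

```python
import sqlite3


def build_demo_repository():
    """Create a small in-memory stand-in for a central repository.

    The schema and the values are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peaks ("
        "sample_id TEXT, peak_name TEXT, "
        "retention_time_min REAL, area REAL)"
    )
    conn.executemany(
        "INSERT INTO peaks VALUES (?, ?, ?, ?)",
        [
            ("S-001", "caffeine", 3.42, 15230.5),
            ("S-001", "unknown", 5.10, 420.0),
            ("S-002", "caffeine", 3.40, 14875.2),
            ("S-002", "unknown", 5.12, 610.0),
            ("S-003", "caffeine", 3.41, 15010.0),
        ],
    )
    conn.commit()
    return conn


def unknown_peak_summary(conn):
    """Return (count, mean area) of peaks labelled 'unknown' in all samples."""
    return conn.execute(
        "SELECT COUNT(*), AVG(area) FROM peaks WHERE peak_name = 'unknown'"
    ).fetchone()


if __name__ == "__main__":
    repository = build_demo_repository()
    count, mean_area = unknown_peak_summary(repository)
    print(f"unknown peaks: {count}, mean area: {mean_area:.1f}")
```

Because the peaks from all samples sit in one table, a question that spans many analyses becomes a single query rather than a manual review of individual reports.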

References
[1] D. Rauh et al., Pure Appl. Chem. 2022, 94, 725–736.