CERN Science for Open Data

Project goal

The main objective of the CERN Science for Open Data (CS4OD) project is to define and implement common principles, best practices, and tools for data management, data analysis and reproducibility of results. These are to be applied across different research communities and based on open-access policies.

CS4OD is building an integrated platform which will provide users with a broad catalogue of cutting-edge tools and services for data management, analysis and reproducibility. These tools have been developed either at CERN or through established open-source initiatives. Examples include Zenodo, REANA, SWAN, and JupyterLab.

The platform is designed to:

  1. Provide transparent and effective “data stewardship” for publicly accessible data in multi-domain fields.
  2. Adapt to different user profiles, enabling researchers with different backgrounds to benefit more easily from the latest technologies, enhancing reproducibility and contributing to open science.
  3. Enable participants to contribute, share, access, and manage data from multiple heterogeneous sources with permanent unique identifiers.
  4. Support the design and execution of data-curation and data-analysis pipelines using integrated tools and services, on different (local or cloud) hardware resources.

The first release of the platform has been deployed at https://cs4od-platform.web.cern.ch/.

R&D topic
Applications in other disciplines
Project coordinator(s)
Alberto Di Meglio, Tim Smith
Team members
Alexander Ioannidis (project manager), Anna Ferrari, Ivan Knezevic, Ines Pinto Perreira Da Cruz, Nihal Ezgi Yuceturk, Jose Benito Gonzalez Lopez
Collaborator liaison(s)
Ilaria Capua, Luca Mantegazza, Elio Borgonovi, Benedetta Pongiglione, Claudio Buongiorno Sottoriva, Massimiliano Di Cagno, Massimo Pugliese, Vladimiro Guarnaccia, Peter Grübling

Project background

Global crises, like the COVID-19 pandemic, have highlighted the need to increase the pace at which data is collected, organised, analysed and shared at large scale. This is vital for supporting rapid, informed and accountable response mechanisms from governments and other organisations. Achieving this will play an important role in addressing critical and urgent medical, social, economic and educational challenges.

There is a recognised difficulty in implementing large-scale, cross-disciplinary investigations that are able to access large amounts of data from multiple sources. For such investigations to be effective, barriers related to data management, governance, access, scalability and reproducibility must be overcome.

Today, different research groups use different data, different assumptions, different models and different methods. This means they can come to conclusions that cannot be objectively challenged because other research teams do not have access to the same information and cannot reproduce the work. And, in the case of successful research, it can be difficult for other teams to build upon it further.

CERN has a long, proven track record in open science and in implementing and managing large-scale, data-driven operations. In collaboration with international initiatives and projects, CERN engineers and physicists have developed efficient strategies for managing data at scale, as well as tools for supporting such strategies. Optimised and efficient systems, combined with experience in implementing distributed systems and a strong culture of openness and of sharing ideas, software and data, make CERN an ideal partner for implementing multi-disciplinary data-driven research projects based on open-access data and open-source tools.

Project timeline

The project started in March 2021 and is set to run for three years.

Year 1: Analysis of use cases, requirements, technology and functional gaps. A minimum-viable-product prototype will be tested with early users.

Year 2: Iterative integration of functions, tools, and best practices.

Year 3: A public beta version will be released and extended to address additional use cases. It will be deployed on infrastructures outside CERN.

Recent progress

During 2021, experts from CERN openlab and the CERN IT department collaborated with researchers at the One Health Center of Excellence in Florida, US, as well as at Bocconi University and Milano-Bicocca University, both in Milan, Italy. Together, we have defined the initial requirements for the data-management and computing infrastructure. An initial set of pilot use cases is being investigated for the design of the first release of the CS4OD platform. These include: analysis of excess mortality related to COVID-19, resistance to antibiotics, taxonomy of plant diseases, analysis of cancer patients’ data, and classification of Parkinson’s disease.

Next steps

Integration with additional analysis frameworks (e.g. TensorFlow) and distributed computing frameworks (e.g. OpenFL) will be investigated in Q1 2022.

Publications

    D. Patsidis, A. Ferrari, Platform for Reproducible Analyses. Published on Zenodo, 2021. cern.ch/go/6lHw

Presentations

    A. Di Meglio, M. Manset, A Social-Technological Platform for Making Sense of (Medical) Data (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2020. cern.ch/go/Mb8X
    A. Ferrari, I. Knezevic, A. Ioannidis, J. B. G. Lopez, CERN Science for Open Data – The CS4OD Project (10 March). Presented at CERN openlab Technical Workshop, Geneva, 2021. cern.ch/go/b8n8
    A. Ferrari, I. Knezevic, D. Patsidis, A. Di Meglio, A. Ioannidis, I. P. P. Da Cruz, N. E. Yuceturk, J. B. G. Lopez, T. Roun, T. Smith, CERN openlab / CERN Science 4 Open Data (CS4OD): use cases for the life science (18 October). Presented at ExaHealth, Geneva, 2021. cern.ch/go/QL9C
    D. Patsidis, Data Platform for collection, storage, integration, analysis and distribution (6 September). Presented at CERN openlab Summer Student Lightning Talk, Geneva, 2021. cern.ch/go/wt7v