Project goal

We are exploring scalable solutions that can satisfy a variety of needs related to the analysis of both physics data and data from industrial control systems. Big-data technologies like Apache Spark show great potential for speeding up existing analysis procedures.

Through this project, we are working to optimise the analytics solutions at CERN in the following areas: data integration, data ingestion and transformation, performance, scalability, benchmarking, resource management, data visualisation, and hardware utilisation.

R&D topic
R&D Topic 3: Machine learning and data analytics
Project coordinator(s)
Maria Girone and Luca Canali
Technical team members
Evangelos Motesnitsalis, Ian Fisk, Matteo Cremonesi, Viktor Khristenko, Jim Pivarski, Bianny Bian (Intel), Radhika Rangarajan (Intel)
Collaborator liaison(s)
Claudio Bellini (Intel), Illia Cremer (Intel), Oliver Gutsche (Fermilab), Marco Manca (SCimPULSE), Mike Reiss (Intel)

Project background

This project is split into four main areas of work:

  • Accelerator controls: CERN runs a large number of industrial control systems based on SCADA (supervisory control and data acquisition) tools, programmable logic controllers (PLCs), and similar technologies. We are working on a proof-of-concept system to process the controls data using big-data platforms, such as Apache Spark.
  • Physics data analysis: The LHC experiments continue to produce valuable physics data, offering numerous possibilities for new discoveries to be made. We are working on benchmarking ROOT, the CERN-created data-processing framework used for LHC physics data.
  • Physics data reduction: Physics data reduction plays a vital role in ensuring that researchers are able to gain valuable insights from the vast amounts of data produced by the LHC experiments. Our goal is to develop a new system — using industry-standard big-data tools — for filtering many petabytes of heterogeneous collision data to create manageable, but rich, datasets of a few terabytes for analysis (a simple illustration follows this list).

  • Personalised medicine, epidemiology and diagnostics: We are also planning to work with the SCimPULSE Foundation to explore how big-data-analysis technologies can be applied to collected medical data, thus informing efforts to improve practices related to safety and prevention.
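To make the data-reduction idea concrete, below is a minimal sketch in Scala of a Spark job that cuts an event DataFrame down to the events and columns an analysis actually needs. It is illustrative only: the input path, column names and selection cut are hypothetical, not the project's actual schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object ReductionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("data-reduction-sketch")
          .getOrCreate()

        // Hypothetical input: collision events already exposed as a DataFrame.
        val events = spark.read.parquet("/data/events.parquet")

        // Keep only the events and columns the analysis needs; selections
        // like this are what reduce petabytes of collision data to terabytes.
        val reduced = events
          .filter(col("nMuons") >= 2)                    // hypothetical cut
          .select("runNumber", "eventNumber", "muon_pt") // hypothetical columns

        reduced.write.mode("overwrite").parquet("/data/reduced.parquet")
        spark.stop()
      }
    }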

Recent progress

In 2017, we made significant progress in the area of physics data analysis. We first made the format used for ROOT files accessible as Spark SQL DataFrames, so as to avoid having to perform format conversions. We then connected the Hadoop-related systems to EOS, CERN's existing storage system, thus making it possible to perform physics analysis without having to move large amounts of data into the Hadoop Distributed File System (HDFS). We also created the first fully functioning analysis examples using Apache Spark; these were tested using 1 TB of open data from the CMS experiment.
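The sketch below outlines what this combination enables: reading a ROOT file in place from EOS into a Spark SQL DataFrame. It assumes the spark-root data source and the Hadoop-XRootD connector are available on the Spark classpath; the EOS path shown is illustrative.

    import org.apache.spark.sql.SparkSession

    object RootOverEos {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("root-over-eos-sketch")
          .getOrCreate()

        // Read a ROOT file directly from EOS (via the Hadoop-XRootD connector)
        // into a DataFrame, with no format conversion and no copy into HDFS.
        val events = spark.read
          .format("org.dianahep.sparkroot")
          .load("root://eospublic.cern.ch//eos/opendata/cms/example/file.root")

        events.printSchema()  // the ROOT tree structure, as a Spark schema
        println(s"events: ${events.count()}")
        spark.stop()
      }
    }

Reading the data where it already lives avoids keeping a duplicate copy in HDFS, which matters at LHC data volumes.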

Progress was also made in the area of physics data reduction: Fermilab, the USA’s premier particle physics and accelerator laboratory, joined CERN openlab in November. Researchers from the laboratory will collaborate with members of the CMS experiment and the CERN IT Department on efforts to improve technologies related to physics data reduction. More information on the scope of this project — as well as plans for proceeding — can be found in a news article published on the CERN openlab website.

Finally, in terms of the accelerator control systems, in 2017 we began testing Apache Kudu as a candidate solution for storing the controls data.
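As an indication of the kind of ingestion path being evaluated, here is a minimal sketch that writes a toy controls time series into Kudu through the kudu-spark data source. The Kudu master address, table name and schema are hypothetical, and the table is assumed to already exist.

    import org.apache.spark.sql.SparkSession

    object ControlsToKudu {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("controls-to-kudu-sketch")
          .getOrCreate()
        import spark.implicits._

        // Toy controls signals: (device, timestamp in ms, value).
        val signals = Seq(
          ("CRYO.TT891A", 1512554400000L, 1.9),
          ("CRYO.TT891A", 1512554401000L, 2.1)
        ).toDF("device", "ts", "value")

        // Append into an existing Kudu table; the master address and
        // table name are placeholders.
        signals.write
          .options(Map(
            "kudu.master" -> "kudu-master.cern.ch:7051",
            "kudu.table"  -> "accelerator_controls_signals"))
          .mode("append")
          .format("org.apache.kudu.spark.kudu")
          .save()

        spark.stop()
      }
    }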

Next steps

Work will be carried out in all four areas next year. In terms of the physics analysis work described above, our next step will be to investigate scaling options for larger inputs; we aim to scale testing up to a sample of 1 PB of open data during 2018. We also plan to investigate the possibility of running Spark over OpenStack, capitalising on the capabilities offered by Intel® CoFluent™ technology for cluster simulation.

Publications

  • O. Gutsche et al., CMS Analysis and Data Reduction with Apache Spark (21 August). Proceedings of the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017), Seattle, 2017. http://cern.ch/go/H6Xj

Presentations

  • O. Gutsche, Status of CMS Big Data Project (4 April), Presented at R&D meeting of CMS Spring Offline and Computing Week 2017, Geneva, 2017. http://cern.ch/go/hBC6
  • O. Gutsche, Data Analytics in Physics Data Reduction (27 April), Presented at CERN openlab workshop on Machine Learning and Data Analytics, Geneva, 2017. http://cern.ch/go/8JNM
  • M. Cremonesi, Infrastructure for Large Scale HEP data analysis (11 May), Presented at DS@HEP 2017 at Fermilab, Illinois, 2017. http://cern.ch/go/tL6c
  • S. Sehrish, A path toward HEP data analysis using high performance computing (11 May), Presented at DS@HEP 2017 at Fermilab, Illinois, 2017. http://cern.ch/go/S9tD
  • O. Gutsche, Status and Plans of the CMS Big Data Project (29 May), Presented at CERN Database Futures Workshop, Geneva, 2017. http://cern.ch/go/C7TJ
  • O. Gutsche, CMS Analysis and Data Reduction with Apache Spark (22 August), Presented at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017), Seattle, 2017. http://cern.ch/go/JTm7
  • E. Motesnitsalis, Intel Big Data Analytics (21 September), Presented at CERN openlab Open Day, Geneva, 2017. http://cern.ch/go/8MM6
  • E. Motesnitsalis et al., Physics Data Analytics and Data Reduction with Apache Spark (10 October), Presented at Extremely Large Database Conference ‘XLDB’ 2017, Clermont-Ferrand, 2017. http://cern.ch/go/l9LJ
  • V. Khristenko, HEP Data Processing with Apache Spark (6 December), Presented at CERN Hadoop User Forum, Geneva, 2017. http://cern.ch/go/D7x6