Intel big-data analytics

Project goal

At CERN, researchers continually explore the latest scalable solutions for tackling the laboratory's data challenges, related both to physics analysis and to machine analytics. This project aims to help optimise analytics solutions in terms of data integration, data ingestion, data transformation, performance, scalability, benchmarking, resource management, data visualisation, and hardware utilisation.

R&D topic
Machine learning and data analytics
Project coordinator(s)
Luca Canali, Maria Girone, Eric Grancher
Team members
Evangelos Motesnitsalis, Viktor Kozlovszky, Viktor Khristenko, Matteo Migliorini, Vasileios Dimakopoulos, Matteo Cremonesi, Oliver Gutsche, Bian Bianny, Klaus-Dieter Oertel
Collaborator liaison(s)
Claudio Bellini, Mike Riess


Project background

The LHC experiments continue to produce large amounts of physics data, which offer numerous possibilities for new discoveries. Big-data technologies, such as Apache Spark, hold great potential for helping us to optimise our existing physics data-processing procedures, as well as our solutions for industrial control and online processing. Through this project, we are working to design and optimise solutions based on open-source big-data technologies. This work is being carried out in collaboration with Intel, the CMS experiment, the CERN IT department, the Fermi National Accelerator Laboratory (Fermilab) in the United States, and DIANA/HEP (a collaborative endeavour to develop state-of-the-art software tools for high-energy physics experiments).

Recent progress

In 2018, the project mostly focused on use cases related to the processing of physics data at scale. In particular, we built on two key data-engineering challenges tackled in the previous year: the development of a mature Hadoop-XRootD Connector library, which makes it possible to read files from CERN's EOS storage system, and of the Spark-ROOT library, which makes it possible to read ROOT files into Spark DataFrames (ROOT is an object-oriented program and library developed at CERN that provides tools for big-data processing, statistical analysis, visualisation, and storage). We were able to produce, scale up, and optimise physics data-processing workloads on Apache Spark and test them with over one petabyte of open data from the CMS experiment.
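Running these workloads for real requires a Spark cluster with the Hadoop-XRootD and Spark-ROOT libraries, but the filter/project/aggregate pattern that such a data-reduction job applies to event data can be sketched in plain Python. The event records and field names below are hypothetical illustrations, not the actual CMS data schema:

```python
# Plain-Python sketch of the filter/select/aggregate shape of a
# Spark-based physics data reduction. In the real workload, `events`
# would be a Spark DataFrame read from ROOT files on EOS, and each
# step would be a distributed DataFrame operation.

events = [
    {"run": 1, "n_muons": 2, "muon_pt": [45.0, 31.2]},
    {"run": 1, "n_muons": 1, "muon_pt": [12.4]},
    {"run": 2, "n_muons": 2, "muon_pt": [60.1, 25.9]},
]

# Filter: keep only events passing a selection cut (here: >= 2 muons).
selected = [e for e in events if e["n_muons"] >= 2]

# Project: reduce each surviving event to the leading-muon pT.
leading_pt = [max(e["muon_pt"]) for e in selected]

# Aggregate: a summary statistic over the reduced dataset.
mean_leading_pt = sum(leading_pt) / len(leading_pt)
print(len(selected), round(mean_leading_pt, 2))  # → 2 52.55
```

The point of the reduction is that each stage shrinks the data volume, so a petabyte-scale input can be distilled into a dataset small enough for interactive analysis.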

In addition, we worked to address challenges related to the application of machine-learning solutions on physics data, using Intel BigDL (a distributed deep-learning library for Apache Spark) alongside a combination of Keras (an open-source neural network library) and TensorFlow (an open-source machine-learning framework). This led to promising results. The compatibility of the developed workloads with popular open-source analytics and machine-learning frameworks makes them very appealing, with various analysis groups from the CMS experiment choosing to carry out further development of these solutions.
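BigDL trains networks on Spark in a data-parallel fashion: each partition computes a gradient on its shard of the data, and the shard gradients are aggregated before each parameter update. The toy sketch below mimics that pattern on a single machine with a one-parameter logistic model; the synthetic data, shard count, and hyperparameters are all hypothetical and chosen only to keep the example self-contained:

```python
# Single-machine sketch of data-parallel training: per-shard gradients
# are computed independently (the "map" on executors) and averaged
# (the "reduce") before the synchronous parameter update.
import math
import random

random.seed(0)
# Synthetic labelled data: label 1 when x > 0.5, else 0.
data = [(x, 1 if x > 0.5 else 0) for x in [random.random() for _ in range(200)]]

# Emulate Spark partitions by slicing the dataset into four shards.
shards = [data[i::4] for i in range(4)]

w, b = 0.0, 0.0   # parameters of a 1-D logistic-regression "model"
lr = 1.0          # learning rate

def shard_gradient(shard, w, b):
    """Mean gradient of the logistic loss over one shard."""
    gw = gb = 0.0
    for x, y in shard:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    n = len(shard)
    return gw / n, gb / n

for _ in range(500):
    grads = [shard_gradient(s, w, b) for s in shards]  # per-partition work
    gw = sum(g[0] for g in grads) / len(grads)         # aggregate gradients
    gb = sum(g[1] for g in grads) / len(grads)
    w -= lr * gw
    b -= lr * gb

# The trained model should separate the two classes well.
correct = sum(
    (1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5) == (y == 1) for x, y in data
)
print(correct / len(data))
```

Because every shard's gradient contributes to one synchronous update, the result is equivalent to full-batch training on the combined data, which is what makes the scheme attractive for scaling training across a Spark cluster.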

Next steps

We will repeat the workload tests on virtualised/containerised cloud-native infrastructure orchestrated with Kubernetes. This will include running tests both at CERN and on public clouds.

Furthermore, we plan to extend the techniques developed in this project to tackle additional workloads. For example, we will work to address more complex physics data-processing challenges, such as use cases related to machine learning for online (streaming) data processing.




Publications

    O. Gutsche et al., CMS Analysis and Data Reduction with Apache Spark, in Proceedings of the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017) (21 August), Seattle, 2017.

Presentations

    O. Gutsche, Status of CMS Big Data Project (4 April), presented at R&D meeting of CMS Spring Offline and Computing Week 2017, Geneva, 2017.
    O. Gutsche, Data Analytics in Physics Data Reduction (27 April), presented at CERN openlab workshop on Machine Learning and Data Analytics, Geneva, 2017.
    M. Cremonesi, Infrastructure for Large Scale HEP Data Analysis (11 May), presented at DS@HEP 2017 at Fermilab, Illinois, 2017.
    S. Sehrish, A Path toward HEP Data Analysis Using High Performance Computing (11 May), presented at DS@HEP 2017 at Fermilab, Illinois, 2017.
    O. Gutsche, Status and Plans of the CMS Big Data Project (29 May), presented at CERN Database Futures Workshop, Geneva, 2017.
    O. Gutsche, CMS Analysis and Data Reduction with Apache Spark (22 August), presented at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017), Seattle, 2017.
    E. Motesnitsalis, Intel Big Data Analytics (21 September), presented at CERN openlab Open Day, Geneva, 2017.
    E. Motesnitsalis et al., Physics Data Analytics and Data Reduction with Apache Spark (10 October), presented at the Extremely Large Databases Conference (XLDB) 2017, Clermont-Ferrand, 2017.
    V. Khristenko, HEP Data Processing with Apache Spark (6 December), presented at CERN Hadoop User Forum, Geneva, 2017.
    E. Motesnitsalis, Hadoop and Spark Services at CERN (19 April), presented at DataWorks Summit, Berlin, 2018.
    E. Motesnitsalis, From Collision to Discovery: Physics Analysis with Apache Spark (April), presented at CERN Spring Campus, Riga, 2018.
    V. Khristenko, Physics Analysis with Apache Spark in the CERN Hadoop Service and DEEP-EST Environment (25 May), presented at IT Technical Forum, Geneva, 2018.
    E. Motesnitsalis, From Collision to Discovery: Physics Analysis with Apache Spark (7 August), presented at IT Lectures, CERN openlab Summer Student Programme, Geneva, 2018.
    E. Motesnitsalis, Big Data at CERN (20 September), presented at the Second International PhD School on Open Science Cloud, Perugia, 2018.
    M. Cremonesi et al., Using Big Data Technologies for HEP Analysis (July), presented at the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2018), Sofia, 2018.