Data analytics in the cloud

Project goal

This project tests and prototypes solutions that combine data engineering with machine-learning and deep-learning tools. These solutions run on cloud resources, in particular resources and tools from Oracle Cloud Infrastructure (OCI), and address a number of use cases of interest to CERN’s community. Notably, this activity makes it possible to compare the performance, maturity, and stability of solutions deployed on CERN’s own infrastructure with those of equivalent deployments on OCI.

R&D topic
Machine learning and data analytics
Project coordinator(s)
Eva Dafonte Perez, Eric Grancher
Team members
Luca Canali, Riccardo Castellotti
Collaborator liaison(s)
Barry Gleeson, Vincent Leocorbo, Don Mowbray, Cristobal Pedregal-Martin, David Ebert, Dmitrij Dolgušin

Collaborators
Oracle

Project background

Big-data tools — particularly related to data engineering and machine learning — are evolving rapidly. As these tools reach maturity and are adopted more broadly, new opportunities are arising for extracting value out of large data sets.

Recent years have seen growing interest from the physics community in machine learning and deep learning. One important activity in this area has been the development of pipelines for real-time classification of particle-collision events recorded by the detectors of the LHC experiments. Filtering events using so-called “trigger” systems is set to become increasingly complex as upgrades to the LHC increase the rate of particle collisions.

Recent progress

In 2019, we tested and deployed data-analytics and machine-learning workloads of interest to CERN on OCI. Testing began with the deployment of Apache Spark on Kubernetes, using OCI resources.
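
For illustration, a minimal PySpark session targeting such a Kubernetes deployment might be configured as follows. This is a sketch only: the API-server URL, container image, and executor count are placeholders, not the project’s actual settings.

    from pyspark.sql import SparkSession

    # Minimal sketch: start Spark against a Kubernetes cluster
    # (for example, one provisioned on OCI). All values are placeholders.
    spark = (
        SparkSession.builder
        .master("k8s://https://<k8s-api-server>:6443")
        .appName("cern-oci-analytics")
        .config("spark.kubernetes.container.image", "example/spark-py:latest")
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )

    spark.range(1000).count()  # simple check that executors come up
    spark.stop()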

During this initial phase, we successfully deployed two workloads for processing physics data at scale:

•    Reduction of big data from the CMS experiment: This use case consists of running data-reduction workloads on data from particle collisions. Its goal is to demonstrate the scalability of a data-reduction workflow that processes ROOT files using Apache Spark (see the sketch after this list).

•    Spark deep-learning trigger: This use case entails the deployment of a full data-preparation and machine-learning pipeline (with 4.5 TB of ROOT data) using Apache Spark and TensorFlow.
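
The sketch below gives a flavour of the data-reduction workflow in PySpark. It is a simplified illustration: it assumes a ROOT data source (such as spark-root or Laurelin) is available on the classpath, and the bucket path, branch names, and selection cut are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-data-reduction").getOrCreate()

    # Read collision events stored as ROOT files. The "root" format
    # requires a ROOT data source on the classpath (e.g. spark-root
    # or Laurelin); the path and column names below are illustrative.
    events = spark.read.format("root").load(
        "oci://bucket@namespace/cms/events/*.root")

    # Data reduction: keep only the branches and events the analysis
    # needs (hypothetical column names and cut).
    reduced = (
        events
        .select("muons", "electrons", "missing_et")
        .where(F.col("missing_et") > 20.0)
    )

    # Persist a compact columnar copy; the same pattern feeds the
    # data-preparation stage of the deep-learning pipeline above.
    reduced.write.mode("overwrite").parquet(
        "oci://bucket@namespace/cms/reduced/")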

This activity has led to a number of improvements. In particular, we improved the open-source connector that exposes OCI Object Storage through the Hadoop Distributed File System (HDFS) API: we made it compatible with recent versions of Spark and developed a mechanism for distributing workloads.
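
For reference, a Spark session can address OCI Object Storage through this connector roughly as sketched below. The fs.oci property names follow the connector’s documentation; every value shown is a placeholder.

    from pyspark.sql import SparkSession

    # Sketch: configure the OCI HDFS connector (which must be on the
    # classpath) so that oci:// paths resolve to OCI Object Storage.
    spark = (
        SparkSession.builder
        .appName("oci-object-storage-access")
        .config("spark.hadoop.fs.oci.client.auth.tenantId", "<tenant-ocid>")
        .config("spark.hadoop.fs.oci.client.auth.userId", "<user-ocid>")
        .config("spark.hadoop.fs.oci.client.auth.fingerprint", "<key-fingerprint>")
        .config("spark.hadoop.fs.oci.client.auth.pemfilepath", "/path/to/api_key.pem")
        .getOrCreate()
    )

    # Buckets then behave like any Hadoop-compatible filesystem:
    df = spark.read.parquet("oci://bucket@namespace/cms/reduced/")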

Next steps

In 2020, the project’s focus will broaden to include improving user interfaces and ease of adoption. We will develop a proof-of-concept integration of CERN’s analytics platform (SWAN) with OCI resources.

Publications

    M. Bień, Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure, Zenodo, 2019. cern.ch/go/lhH9
    M. Migliorini, R. Castellotti, L. Canali, M. Zanetti, Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, arXiv:1909.10389 [cs.DC], 2019. cern.ch/go/8CpQ
    T. Nguyen et al., Topology classification with deep learning to improve real-time event selection at the LHC, 2018. cern.ch/go/8trZ

Presentations

    L. Canali, Big Data in HEP: Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark (24 September). Presented at IXPUG 2019 Annual Conference, Geneva, 2019. cern.ch/go/6pr6
    L. Canali, Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo (16 October). Presented at Spark Summit Europe, Amsterdam, 2019. cern.ch/go/xp77