
Data Analytics

Oracle Analytics as a Service

The main objective of this set of projects is to support the research and development (R&D) activities for implementing the CERN Data Analytics as a Service (DAaaS) infrastructure. This infrastructure aims to: (1) integrate the existing analytics developments; (2) centralise and standardise the complex data analytics needs of CERN’s research and engineering community; (3) deliver real-time and batch data analytics and information discovery capabilities; (4) offer storage for large volumes of structured and unstructured data; (5) provide transparent access and Extract, Transform and Load (ETL) mechanisms for the various mission-critical existing data repositories.



Siemens Industrial Control and Monitoring

The control systems used by CERN’s technical infrastructures produce enormous amounts of data related to both the systems they control and their own internal state. This project focuses on ways to handle these large datasets and extract insights that can lead to improved operational efficiency. The work is arranged into two main areas:

  • The data flow from WinCC Open Architecture (a SCADA tool widely used at CERN) to a high-performance storage system
  • The analysis of the stored data with Siemens analytics tools

In the area of WinCC Open Architecture (WinCC OA), work is being carried out to develop a generic archiver into which different systems can be plugged. For the analysis of the stored data, work is being carried out to improve the detection of faulty sensor measurements, to enable better measurement of the performance of control processes, and to develop a new alarm system for flooding detection.


IDT Trigger and Data Analytics 

The operation of the IT infrastructure at CERN relies on significant and continuous streams of monitoring and logging data, which are aggregated and stored in central repositories. The main repositories are based on a Hadoop cluster deployed on commodity hardware. This cluster gives hardware experts, system administrators and service managers a convenient framework for large-scale data processing using Apache Spark.

The question under study is whether an alternative Hadoop deployment, on a cluster with a low-latency RapidIO interconnect, will provide sufficient throughput for useful near-line processing of the accumulated IT monitoring and logging data.


Yandex Data Popularity at LHCb

Data collected by the LHCb experiment is stored in the form of multiple datasets (files) on tapes and disks in the LHCb data storage grid. The storage systems used within this grid vary in terms of their cost, energy consumption, and access speed. The goal of this project is to design, develop, and deploy a ‘data popularity estimator service’ that would analyse the usage history of each dataset, predict future usage patterns, and provide an optimal scheme for data placement and movement.
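
As a rough illustration of the idea only (not the method used by the project), the sketch below predicts the future usage of each dataset with an exponentially weighted moving average of its weekly access counts, then greedily keeps the datasets with the highest predicted accesses per gigabyte on disk. The function names, the `alpha` smoothing factor, the two-tier disk/tape model and the sample data are all assumptions made for this example. A greedy fill is a simple heuristic and is not guaranteed to be optimal.

```python
def predict_accesses(weekly_counts, alpha=0.5):
    """Predict next week's accesses with an exponentially weighted moving average.

    weekly_counts is ordered oldest first. alpha (an assumed tuning
    parameter) controls how strongly recent weeks dominate the estimate.
    """
    estimate = 0.0
    for count in weekly_counts:
        estimate = alpha * count + (1 - alpha) * estimate
    return estimate


def place_datasets(datasets, disk_capacity_gb, alpha=0.5):
    """Assign each dataset to 'disk' or 'tape'.

    datasets maps a name to (size_gb, weekly_counts), with size_gb > 0.
    Datasets are ranked by predicted accesses per gigabyte and placed on
    disk, in that order, for as long as they fit in the remaining capacity.
    """
    ranked = sorted(
        datasets.items(),
        key=lambda item: predict_accesses(item[1][1], alpha) / item[1][0],
        reverse=True,
    )
    placement = {}
    free_gb = disk_capacity_gb
    for name, (size_gb, _counts) in ranked:
        if size_gb <= free_gb:
            placement[name] = "disk"
            free_gb -= size_gb
        else:
            placement[name] = "tape"
    return placement


# Hypothetical usage history: two datasets of equal size, one recently popular.
history = {
    "recently_used": (100, [0, 0, 8]),
    "formerly_used": (100, [8, 0, 0]),
}
placement = place_datasets(history, disk_capacity_gb=100)
```

With only 100 GB of disk in this toy setup, the recently accessed dataset stays on disk and the other is moved to tape. A real service would also have to weigh the cost and energy consumption of each storage system, as described above.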


Yandex Anomaly Detection in LHCb Online Data Processing

Ensuring data quality is essential for the LHCb experiment. Checks are done in several steps, both offline and online. Monitoring is based on continuous comparison of histograms with references, which have to be regularly updated by experts. The aim of this project is to create a novel, autonomous data-collection monitoring service that is capable of identifying deviations from normal operational modes. It will also help the personnel responsible for data-quality monitoring to explore the underlying reasons for such deviations, thus reducing the amount of ‘spoiled data’ that may erroneously be stored for further analysis.
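
The comparison of a monitoring histogram with its reference can be sketched as follows. This is a minimal example of one common approach, a chi-square statistic per bin with Poisson uncertainties on both histograms. It is not necessarily the test used by LHCb. The function names and the `threshold` value are assumptions chosen for illustration.

```python
def chi2_per_bin(observed, reference):
    """Chi-square per compared bin between two histograms of raw bin counts.

    The reference is scaled to the total of the observed histogram, and each
    bin is weighted by the combined Poisson variance of both histograms.
    Bins that are empty in both histograms are skipped.
    """
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    n_obs = sum(observed)
    n_ref = sum(reference)
    if n_obs == 0 or n_ref == 0:
        raise ValueError("histograms must not be empty")
    scale = n_obs / n_ref
    chi2 = 0.0
    compared_bins = 0
    for obs, ref in zip(observed, reference):
        variance = obs + scale * scale * ref
        if variance == 0:
            continue
        chi2 += (obs - scale * ref) ** 2 / variance
        compared_bins += 1
    return chi2 / compared_bins


def deviates_from_reference(observed, reference, threshold=3.0):
    """Flag a histogram whose chi-square per bin exceeds an assumed threshold."""
    return chi2_per_bin(observed, reference) > threshold


# Hypothetical bin counts: the first histogram has the reference shape,
# the second has its shape reversed.
reference_counts = [20, 40, 60]
same_shape = [10, 20, 30]
reversed_shape = [30, 20, 10]
ok = deviates_from_reference(same_shape, [10, 20, 30])
flagged = deviates_from_reference(reversed_shape, [10, 20, 30])
```

A fixed comparison like this still depends on an up-to-date reference, which is the manual step that the autonomous monitoring service described above aims to reduce.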
