Yandex data popularity and anomaly detection
The goal of the project is to improve LHCb operations by means of machine learning. There are two particular areas of focus: (1) data certification and (2) data management.
Data certification relies on the comparison of histograms with references, which have to be regularly updated by experts. Over the past two years, we have worked to create a novel, autonomous monitoring service for data collection. The service identifies deviations from the normal operational mode, helping the personnel responsible for monitoring data quality to find the reason for such deviations.
The data collected by the LHCb experiment is stored across multiple datasets on both disks and tapes in the LHCb data storage grid. The storage systems used vary in terms of cost, energy consumption, and speed of use. We are also continuing our work on creating a ‘data popularity estimator service’ to analyse the usage history of each dataset, predict future usage patterns, and provide an optimal scheme for data placement and movement.
We investigated how machine-learning approaches can be used to automatically detect anomalies in the data collected at the LHCb experiment. Our anomaly-detection algorithm was embedded into Monet, the LHCb experiment’s web-based monitoring system for data quality. This helped operators to identify when subsystems of the detector behave abnormally. When trained on sample data, our algorithm showed high accuracy. However, as the scope of these samples was rather limited, there is still work for us to do to improve the generalisability of our algorithm.
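To make the idea of flagging deviations from a reference concrete, here is a minimal sketch in Python. It is not the algorithm embedded in Monet, which is not described here. It assumes each run is summarised by one histogram of bin counts, learns a per-bin mean and spread from runs labelled as good, and scores a new run by its root-mean-square deviation in units of that spread. All function names, the `min_spread` floor and the `threshold` value are hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def normalise(hist):
    """Scale bin counts so they sum to 1, making runs of different size comparable."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return [count / total for count in hist]


def fit_reference(good_runs, min_spread=1e-3):
    """Learn a per-bin mean and spread from histograms of runs labelled as good.

    min_spread is an assumed floor that stops a bin with almost no observed
    variation from dominating the score.
    """
    shapes = [normalise(h) for h in good_runs]
    bins = list(zip(*shapes))  # transpose: one tuple of values per bin
    means = [mean(b) for b in bins]
    spreads = [max(pstdev(b), min_spread) for b in bins]
    return means, spreads


def anomaly_score(hist, means, spreads):
    """Root-mean-square z-score of the run's histogram shape against the reference."""
    shape = normalise(hist)
    z_squared = [((x - m) / s) ** 2 for x, m, s in zip(shape, means, spreads)]
    return math.sqrt(sum(z_squared) / len(z_squared))


def is_anomalous(hist, means, spreads, threshold=3.0):
    """Flag the run if its score exceeds an assumed threshold."""
    return anomaly_score(hist, means, spreads) > threshold


# Toy data: four good runs and two new runs to check.
good_runs = [[100, 200, 100], [120, 230, 110], [90, 210, 105], [105, 190, 95]]
means, spreads = fit_reference(good_runs)
typical_run = [101, 202, 99]
distorted_run = [300, 50, 50]
```

In this sketch the reference is learned from labelled runs rather than maintained by hand, which is the kind of expert effort an autonomous service aims to reduce. With so few training runs the spreads are poorly estimated, which mirrors the generalisability concern raised above.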
We also carried out work to identify the least popular LHCb datasets in the storage system, so that they can be removed from fast, expensive storage media. In the first half of the year, we published a paper showing that our algorithm offers significant improvement over regular caching techniques. It is now being used in production at LHCb.
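As a rough illustration of popularity-based placement, the sketch below ranks datasets by a predicted popularity and selects the least popular ones for removal from disk until a requested amount of space is freed. It is not the published algorithm. The predictor is a plain exponentially weighted average of weekly access counts, swapped in for a learned model, and the `Dataset` fields, the `alpha` parameter and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Dataset:
    name: str
    size_tb: float
    weekly_accesses: List[int]  # access counts per week, oldest first


def predicted_popularity(ds: Dataset, alpha: float = 0.5) -> float:
    """Exponentially weighted average of accesses; recent weeks weigh more."""
    score = 0.0
    for count in ds.weekly_accesses:
        score = alpha * count + (1 - alpha) * score
    return score


def select_for_removal(datasets: List[Dataset],
                       space_to_free_tb: float) -> Tuple[List[str], float]:
    """Pick the least popular datasets until enough disk space is freed."""
    ranked = sorted(datasets, key=predicted_popularity)
    chosen: List[str] = []
    freed = 0.0
    for ds in ranked:
        if freed >= space_to_free_tb:
            break
        chosen.append(ds.name)
        freed += ds.size_tb
    return chosen, freed


# Toy usage history: A was popular but has gone quiet, B is never read,
# C has become popular only recently.
datasets = [
    Dataset("A", 10.0, [50, 40, 0, 0]),
    Dataset("B", 5.0, [0, 0, 0, 0]),
    Dataset("C", 8.0, [1, 2, 30, 60]),
]
chosen, freed = select_for_removal(datasets, space_to_free_tb=12.0)
```

Unlike a cache that evicts purely on the time of last access, a score built from the whole usage history separates a dataset whose use is declining from one whose use is rising, even when both were read recently.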
This work is close to completion. We are now keen to understand how the algorithms can be generalised and used at other experiments, potentially both at the LHC and beyond.
- M. Hushchyn et al., GRID Storage Optimization in Transparent and User-Friendly Way for LHCb Datasets, Journal of Physics: Conference Series, Vol. 898 (2017). cern.ch/go/8vtQ
- A. Ustyuzhanin, Yandex Data Popularity and Anomaly Detection at LHCb, presented at CERN openlab Open Day, Geneva, 21 September 2017. cern.ch/go/T7Df