Evaluation of Power CPU architecture for deep learning

Project goal

We are investigating the performance of distributed learning and low-latency inference of generative adversarial networks (GANs) for simulating detector response to particle-collision events. The performance of a deep neural network is being evaluated on a cluster consisting of IBM Power CPUs (with GPUs) installed at CERN.

R&D topic
Machine learning and data analytics
Project coordinator(s)
Maria Girone and Federico Carminati
Team members
Sofia Vallecorsa
Collaborator liaison(s)
Eric Aquaronne, Lionel Clavien

Project background

GANs could greatly reduce the need for detailed Monte Carlo (MC) simulations when generating particle showers. Detailed MC is computationally expensive, so replacing it could improve the overall performance of simulations in high-energy physics.

Using the large data sets obtained from MC-simulated physics events, the GAN learns to generate events that mimic these simulated events. Once an acceptable accuracy is achieved, the trained GAN can replace the classical MC simulation code with an inference invocation of the GAN.

Recent progress

In accordance with the concept of data-parallel distributed learning, we trained a GAN model on a total of twelve GPUs, distributed over the three nodes that comprise the test Power cluster. Each GPU ingests a unique part of the physics data set for training the model.
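The idea that each GPU ingests a unique part of the data set can be sketched in a few lines of Python. This is only an illustration, not the project's actual data loader: the function name `shard_indices`, the strided split, the even division into four GPUs per node and the data-set size are all assumptions made for the example.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices owned by one worker (strided split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Worker `rank` takes samples rank, rank + world_size, rank + 2*world_size, ...
    return list(range(rank, num_samples, world_size))


NODES = 3            # from the text: three nodes in the test cluster
GPUS_PER_NODE = 4    # assumption: the twelve GPUs are split evenly
WORLD_SIZE = NODES * GPUS_PER_NODE
NUM_SAMPLES = 1000   # hypothetical data-set size, for illustration only

shards = [shard_indices(NUM_SAMPLES, rank, WORLD_SIZE) for rank in range(WORLD_SIZE)]

for rank, shard in enumerate(shards):
    node, local_gpu = divmod(rank, GPUS_PER_NODE)
    print(f"node {node} gpu {local_gpu}: {len(shard)} samples")
```

With this split no sample is seen by two workers in the same pass, and the shard sizes differ by at most one sample.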

The model we benchmarked is called ‘3DGAN’. It uses three-dimensional convolutions to simulate the energy patterns deposited by particles travelling through high-granularity calorimeters (part of the experiments’ detectors). More details can be found on the page about the fast-simulation project. To distribute the training workload across multiple nodes, 3DGAN uses an MPI-based tool called Horovod. Running on the test cluster, we achieved excellent scaling performance and reduced the training time by an order of magnitude.
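Data-parallel tools such as Horovod keep the model replicas in sync by averaging the gradients computed by all workers before each weight update. The sketch below simulates that averaging step in plain Python on toy numbers. It does not use the Horovod API, and the helper names `allreduce_mean` and `sgd_step`, as well as the weights, gradients and learning rate, are hypothetical.

```python
def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise (simulated allreduce)."""
    n = len(worker_grads)
    return [sum(values) / n for values in zip(*worker_grads)]


def sgd_step(weights, grad, lr):
    """Apply one plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Every replica starts from the same weights (toy values).
weights = [1.0, -2.0]

# Each worker computes a different gradient from its own data shard (toy values).
worker_grads = [
    [0.5, 1.0],
    [1.5, 3.0],
    [1.0, 2.0],
]

# All workers receive the same averaged gradient ...
avg_grad = allreduce_mean(worker_grads)

# ... so applying the same update keeps every replica identical.
replicas = [sgd_step(weights, avg_grad, lr=0.1) for _ in worker_grads]
print(avg_grad, replicas[0])
```

Because every replica applies the same averaged gradient, the copies of the model stay identical after each step, which is what allows the work to be spread over many GPUs.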

As planned, work also began in 2019 to prototype a deep-learning approach for the offline reconstruction of events at DUNE, a new neutrino experiment that will be built in the United States. Initial work focused on developing a model, based on a combination of convolutional and graph networks, to reduce the noise in the raw data produced by the detector. Preliminary results on MC-simulated data are very promising.

Next steps

We will further optimise our noise-reduction model for the DUNE data, testing its performance on real data collected from ProtoDUNE, a prototype experiment built at CERN. We will also investigate the feasibility of running the model on FPGAs in low-latency environments for real-time applications.

We then plan to extend this approach to several other steps in the data-processing chain. In the longer term, our goal is to develop a tool capable of processing the raw data from DUNE, thus making it possible to replace the entire offline reconstruction approach.


Presentations

    A. Hesam, Evaluating IBM POWER Architecture for Deep Learning in High-Energy Physics (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2018. cern.ch/go/7BsK
    D. H. Cámpora Pérez, ML based RICH reconstruction (8 May). Presented at Computing Challenges meeting, Geneva, 2018. cern.ch/go/xwr7
    D. H. Cámpora Pérez, Millions of circles per second. RICH at LHCb at CERN (7 June). Presented as a seminar in the University of Seville, Seville, 2018.