Applied Multi-Disciplinary AI on High-Performance Computing

In collaboration with the Simons Foundation

The project aims to enable scalable and future-proof machine-learning workflows for large-scale scientific data analysis, with a focus on high-energy physics event reconstruction. It addresses both computing and algorithmic challenges, optimising data access and storage for efficient training and inference, while developing machine-learned reconstruction methods that are accurate, scalable, and adaptable to evolving detectors and computing architectures.

Overview

This project comprises two use cases: ‘Ceph scaling strategies for machine learning workloads’ and ‘HPC-accelerated AI optimisation’. Machine learning applications continue to grow in complexity, with larger models placing unprecedented demands on data storage and access. Software-defined storage solutions, such as the POSIX-compliant Ceph File System, are well suited to large-scale, distributed ML workflows, but storage access speed and coordination overheads remain key limitations. The first use case therefore focuses on optimising Ceph resource usage to improve performance for large-scale training and inference. In parallel, the second use case aims to modernise high-energy physics event reconstruction by replacing hand-crafted heuristic algorithms with scalable machine-learned approaches. Building on the Machine-Learned Particle Flow (MLPF) framework, it explores supervised, self-supervised, and foundation-model techniques to deliver accurate, efficient, and adaptable reconstruction suited to future detectors and computing architectures.

Highlights in 2025

At the start of the project, reliable benchmarking of the storage system was established. Dedicated hardware was provisioned to measure raw performance, while supporting software layers were benchmarked to quantify overheads introduced at each level of the storage stack and to control error sources. Under controlled synthetic workloads, both Ceph’s internal metrics and client-side I/O measurements were used to study the impact of configurable system parameters. In parallel, detailed performance-profiling capabilities were enabled to identify and analyse suboptimal code behaviour.

Methods to emulate storage-access patterns typical of machine-learning workloads, including those used at CERN, were systematically reviewed. Particular emphasis was placed on tools that minimise computational requirements, reducing interference with benchmarking results and dependence on hardware with limited availability. Suitable tooling was identified, addressing a key early project objective.
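As an illustration of what such emulation can look like, the sketch below replays the I/O side of one training epoch: every sample is read once, in shuffled order and grouped into batches, with no model computation at all. It is a minimal sketch under stated assumptions, not the tooling selected by the project. The function names, the single-file fixed-size record layout, and the parameter values are all hypothetical. Because the demo reads a freshly written temporary file, the operating system's page cache will serve most reads, so the reported throughput says nothing about a real storage backend.

```python
import os
import random
import tempfile
import time


def make_dataset(path, n_samples, sample_size):
    """Write n_samples fixed-size records of random bytes to one file (hypothetical layout)."""
    with open(path, "wb") as f:
        for _ in range(n_samples):
            f.write(os.urandom(sample_size))


def emulate_epoch(path, n_samples, sample_size, batch_size, seed=0):
    """Read every sample once in shuffled order, batch by batch.

    Only I/O is exercised; nothing is computed on the data.
    Returns (bytes_read, elapsed_seconds, n_batches).
    """
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # per-epoch shuffle, as a data loader would do
    bytes_read = 0
    n_batches = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for i in range(0, n_samples, batch_size):
            for idx in order[i:i + batch_size]:
                f.seek(idx * sample_size)  # random access to one record
                bytes_read += len(f.read(sample_size))
            n_batches += 1
    elapsed = time.perf_counter() - start
    return bytes_read, elapsed, n_batches


def run_demo():
    """Run one emulated epoch on a small temporary dataset."""
    n_samples, sample_size, batch_size = 256, 4096, 32
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "samples.bin")
        make_dataset(path, n_samples, sample_size)
        return emulate_epoch(path, n_samples, sample_size, batch_size)


if __name__ == "__main__":
    total, seconds, batches = run_demo()
    print(f"read {total} bytes in {batches} batches over {seconds:.4f} s")
```

Pointing `path` at a file on the storage system under test, and varying `sample_size` and `batch_size`, would give a rough client-side view of how the access pattern behaves without needing accelerators.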

In 2025, the foundations for an end-to-end self-supervised learning workflow were also established, building on experience from Machine-Learned Particle Flow (MLPF). A shared repository and data-processing pipeline were developed, including dataset pre-processing tools and a full PyTorch-based training pipeline. Initial studies explored multiple self-supervised approaches, alongside supervised baselines for comparison.
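To make the idea of self-supervised learning concrete, the sketch below shows one common pretext task, masked-feature reconstruction: part of the input is hidden and the training target is the hidden part itself, so no labels are required. This is an assumption chosen purely for illustration and is not necessarily among the approaches the project studied. It is written in plain Python rather than PyTorch to stay self-contained, and the per-hit feature layout and all function names are hypothetical.

```python
import random


def mask_features(hits, mask_fraction=0.25, mask_value=0.0, seed=0):
    """Hide a random subset of hits and return (masked_hits, targets).

    hits: list of per-hit feature lists, e.g. [energy, x, y] (hypothetical layout).
    targets: dict mapping each masked index to its original features.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(hits) * mask_fraction))
    masked_idx = set(rng.sample(range(len(hits)), n_mask))
    masked = [
        [mask_value] * len(h) if i in masked_idx else list(h)
        for i, h in enumerate(hits)
    ]
    targets = {i: list(hits[i]) for i in masked_idx}
    return masked, targets


def reconstruction_loss(predictions, targets):
    """Mean squared error evaluated on the masked positions only."""
    total, count = 0.0, 0
    for i, target in targets.items():
        for p, t in zip(predictions[i], target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy event: 8 hits with 3 features each (made-up values).
hits = [[float(i + 1), float(i) * 0.5, 1.0] for i in range(8)]
masked, targets = mask_features(hits)

# A perfect model would reproduce the hidden features exactly.
perfect_loss = reconstruction_loss(hits, targets)
# A model that just echoes the masked input is penalised.
echo_loss = reconstruction_loss(masked, targets)
```

In a real pipeline the predictions would come from a trainable model, and the learned representation could then be compared against a supervised baseline on a downstream task.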

Next Steps

The project will continue next year by characterising the performance of more complex storage cluster configurations, with a focus on horizontal scaling. A range of ML workloads will be tested using the benchmarking methodology developed earlier. Their performance under conventional Ceph scaling strategies will be analysed to identify promising approaches, and profiling will guide the development of optimisations.

The next phase of the project will also consolidate the self-supervised learning (SSL) effort and assess its potential across a range of reconstruction-relevant tasks. Work will continue to explore and compare different clustering, SSL, and foundation-model approaches for learning meaningful representations directly from low-level detector data, while refining the data formats, workflows, and evaluation strategies.

Publications & Presentations

Wasowski, R. M., & Bocchi, E. (2025, March 4). Applied Multi-Disciplinary AI on High-Performance Computing [Conference presentation]. 2025 CERN openlab Technical Workshop, Meyrin, Switzerland. https://indi.to/f4nM8

Wulff, E. (2025, March 4). Hyper-Parameter Optimisation on HPC systems [Conference presentation]. 2025 CERN openlab Technical Workshop, Meyrin, Switzerland. https://indico.cern.ch/event/1440389/contributions/6364347

Technical Team

Eric Wulff, Joosep Pata, Radomir Wasowski, Enrico Bocchi, Abhishek Lekshmanan

Project Coordinator

Eric Wulff

Collaboration Liaisons

Ian Fisk, Andras Pataki, Mariel Pettee