Designing and operating distributed data infrastructures and computing centres poses challenges in areas such as networking, architecture, storage, databases, and cloud. These challenges are amplified at the extremely large scales required by major scientific endeavours. CERN is evaluating different models for increasing computing and data-storage capacity, in order to accommodate the growing needs of the LHC experiments over the next decade. All models present different technological challenges. In addition to increasing the on-premises capacity of the systems used for traditional types of data processing and storage, CERN is exploring a number of complementary distributed architectures and specialised capabilities offered by cloud and HPC infrastructures. These will add heterogeneity and flexibility to the data centres, and should enable advances in resource optimisation.


Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST)

Project goal

The main focus of the project is to build a new kind of system that makes it possible to run a wide range of applications with differing requirements efficiently on new types of high-performance computing (HPC) resources. From machine-learning applications to traditional HPC applications, the goal is to build an environment capable of accommodating workloads that pose completely different challenges for the system.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Maria Girone
Team members
Viktor Khristenko

Collaborators

Project background

DEEP-EST is an EC-funded project that launched in 2017, following on from the successful DEEP and DEEP-ER projects. The project involves 27 partners in more than 10 countries and is coordinated by the Jülich Supercomputing Centre at Forschungszentrum Jülich in Germany.

Overall, the goal is to create a modular supercomputer that best fits the requirements of diverse, increasingly complex, and newly emerging applications. The innovative modular supercomputer architecture creates a unique HPC system by coupling various compute modules according to the building-block principle: each module is tailored to the needs of a specific group of applications, with all modules together behaving as a single machine.

CERN, in particular the CMS experiment, participates by providing one of the applications used to evaluate this new supercomputing architecture.

Recent progress

During 2019, the prototype was assembled and initial tests began. The prototype consists of three compute modules: the cluster module (CM), the extreme scale booster (ESB), and the data-analytics module (DAM).

  • Applications requiring high single-thread performance are targeted to run on the CM nodes, where Intel Skylake processors provide general-purpose performance and energy efficiency.
  • The architecture of the ESB nodes is tailored to highly scalable HPC software stacks capable of exploiting the enormous parallelism provided by Nvidia V100 GPUs.
  • Flexibility, large memory capacities using Intel Optane Memory technology, and varied acceleration capabilities (provided by an Intel Stratix 10 FPGA and an Nvidia V100 GPU on each node) are key features of the DAM; they make it an ideal platform for data-intensive and machine-learning applications.

We ported the software used at CMS to reconstruct particle-collision events in the hadronic and electromagnetic calorimeters. We then optimised these workloads to run on Nvidia V100 GPUs, comparing their performance against that of the CPU-based systems currently used in production.
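
As a rough illustration of the kind of GPU-versus-CPU comparison involved (the actual reconstruction code lives in CMSSW and is written in C++ and CUDA, so this is only a toy sketch), the following Python snippet times a made-up "calibrate and sum" operation over fake calorimeter hits on the CPU with NumPy and on a GPU with CuPy. The array size and the operation are invented placeholders, and a CUDA-capable node with CuPy installed is assumed.

    # Toy GPU-vs-CPU timing sketch; not CMSSW code. The "hits" array and the
    # calibration factor below are invented placeholders.
    import time

    import numpy as np
    import cupy as cp  # assumes a CUDA-capable node with CuPy installed

    hits = np.random.rand(10_000_000).astype(np.float32)  # fake calorimeter hits

    t0 = time.perf_counter()
    cpu_total = np.sum(hits * 1.02)          # toy "calibrate and sum" on the CPU
    cpu_time = time.perf_counter() - t0

    hits_gpu = cp.asarray(hits)              # copy the data to the GPU once
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    gpu_total = cp.sum(hits_gpu * 1.02)      # same toy operation on the GPU
    cp.cuda.Stream.null.synchronize()        # wait for the kernel to finish
    gpu_time = time.perf_counter() - t0

    print(f"CPU: {cpu_time:.4f} s (total={cpu_total:.1f}), "
          f"GPU: {gpu_time:.4f} s (total={float(gpu_total):.1f})")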

Next steps

We will incorporate MPI offloading into the ‘CMSSW’ software framework used at the CMS experiment, in order to be able to run different parts of the reconstruction on different hardware. We will also explore the use of FPGAs for reconstruction at CMS.
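
CMSSW itself is written in C++, so any MPI offloading will be implemented there; purely as a sketch of the offload pattern being considered, the hypothetical mpi4py example below lets one rank run a local reconstruction step while delegating another step to a second rank, much as work could be delegated to a GPU-equipped booster node. The function names and "event" payloads are invented. It can be run with, for example, "mpirun -np 2 python offload_sketch.py".

    # Sketch of MPI-based offloading with mpi4py; not CMSSW code. Rank 0 acts as
    # the driver, rank 1 stands in for a node handling an offloaded step.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    def reconstruct_tracks(event):
        # placeholder for a CPU-friendly reconstruction step
        return {"tracks": len(event)}

    def reconstruct_calorimeter(event):
        # placeholder for a step that would run on accelerated hardware
        return {"energy": sum(event) * 1.02}

    if rank == 0:
        for event_id in range(4):
            event = list(range(event_id, event_id + 10))   # fake event data
            comm.send(event, dest=1, tag=event_id)          # offload one step
            local = reconstruct_tracks(event)               # keep working locally
            remote = comm.recv(source=1, tag=event_id)      # collect the result
            print(f"event {event_id}: {local} {remote}")
        comm.send(None, dest=1, tag=999)                    # tell the worker to stop
    elif rank == 1:
        while True:
            status = MPI.Status()
            event = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if event is None:
                break
            comm.send(reconstruct_calorimeter(event), dest=0, tag=status.Get_tag())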

Furthermore, the prototype will be tested extensively in 2020. This will help us to validate the functionality we have developed, particularly in relation to the large number of accelerators available.

Publications

    The following deliverables were submitted to the European Commission:
    Deliverable 1.2: Application Use Cases and Traces
    Deliverable 1.3: Application Distribution Strategy
    Deliverable 1.4: Initial Application Ports

Presentations

    V. Khristenko, CMS ECAL Reconstruction with GPUs (23 October). Presented at CMS ECAL DPG Meeting, Geneva, 2019. http://cern.ch/go/FC6j
    V. Khristenko, CMS HCAL Reconstruction with GPUs (8 November). Presented at CMS HCAL DPG Meeting, Geneva, 2019. http://cern.ch/go/P6Js
    V. Khristenko, Exploiting Modular HPC in the context of DEEP-EST and ATTRACT projects (22 January). Presented at CERN openlab Technical Workshop, Geneva, 2020. http://cern.ch/go/rjC7

Kubernetes and Google Cloud

Project goal

The aim of this project is to demonstrate the scalability and performance of Kubernetes and Google Cloud, validating this setup for future computing models. As an example, we are using the famous Higgs analysis that led to the 2013 Nobel Prize in Physics, thus also showing that real physics analysis can be performed using CERN Open Data.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Ricardo Manuel Brito da Rocha
Collaborator liaison(s)
Karan Bhatia, Andrea Nardone, Mark Mims, Kevin Kissell

Collaborators

Project background

As we look to improve the computing models we use in high-energy physics (HEP), this project serves to demonstrate the potential of open and well-established APIs, such as Kubernetes. These open up a wide range of new possibilities in terms of how we deploy our workloads.

Based on a challenging and famous use case, we're working to demonstrate that these new tools, together with the virtually unlimited capacity offered by public cloud providers, make it possible to rethink how analysis workloads can be scheduled and distributed. This could lead to further improvements in the efficiency of our systems at CERN.

The project also provides an excellent opportunity to show how, given enough resources, anyone can replicate important physics analysis work using the open data published by CERN and the LHC experiments.

Recent progress

The initial goal of the project has been fulfilled: we have demonstrated that Kubernetes and Google Cloud together form a viable and extremely performant setup for running HEP analysis workloads. The code required, as well as the setup, is fully documented and publicly available (see publications).
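
For readers who want a feel for the mechanics, the hypothetical sketch below uses the official Kubernetes Python client to submit a single analysis chunk as a Job. The real submission machinery is the one documented in the higgs-demo repository; the container image, command, and storage paths here are placeholders.

    # Hypothetical sketch: submit one analysis chunk as a Kubernetes Job using
    # the official Python client. Image, command and paths are placeholders.
    from kubernetes import client, config

    config.load_kube_config()           # use the local kubeconfig (e.g. for a GKE cluster)
    batch = client.BatchV1Api()

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="higgs-chunk-0001"),
        spec=client.V1JobSpec(
            completions=1,
            parallelism=1,
            backoff_limit=2,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="analysis",
                            image="example.org/cms-opendata-analysis:latest",  # placeholder
                            command=[
                                "run-analysis",                                # placeholder
                                "--input", "root://eospublic.cern.ch//eos/opendata/cms/PLACEHOLDER.root",
                                "--output", "gs://example-bucket/results/chunk-0001.json",
                            ],
                        )
                    ],
                )
            ),
        ),
    )

    batch.create_namespaced_job(namespace="default", body=job)
    print("submitted job higgs-chunk-0001")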

The outcome of the project was presented at a number of high-profile conferences, including a keynote presentation at KubeCon Europe 2019, an event attended by over 8000 people. A live demo of the whole setup, using data from the CERN Open Data Portal, was shown on stage.

The setup — as well as the dataset used — has been prepared for publication as a Google Cloud official tutorial. This will enable anyone to trigger a similar execution using their own public cloud resources. This tutorial will be published in early 2020, once the text has been finalised.

Next steps

This project was initially self-contained, with a clear target: the presentation at KubeCon Europe 2019. However, the project has now grown beyond this initial, limited scope. Future steps should include:

  • Further investigating how use of the public cloud can improve physics analysis.
  • Working to provide on-demand bursting to public clouds for our on-premises resources.
  • Seeking to understand how we can best define policies and accounting procedures for using public cloud resources in this manner.


Publications

    R. Rocha, L. Heinrich, higgs-demo. Project published on GitHub. 2019. cern.ch/go/T8QQ

Presentations

    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (21 May). Presented at KubeCon Europe 2019, Barcelona, 2019. cern.ch/go/PlC8
    R. Rocha, L. Heinrich, Higgs Analysis on Kubernetes using GCP (19 September). Presented at Google Cloud Summit, Munich, 2019. cern.ch/go/Dj8f
    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (7 November). Presented at the 24th International Conference on Computing in High Energy and Nuclear Physics (CHEP), Adelaide, 2019. cern.ch/go/6Htg
    R. Rocha, L. Heinrich, Deep Dive into the KubeCon Higgs Analysis Demo (5 July). Presented at CERN IT Technical Forum, Geneva, 2019. cern.ch/go/6zls

Infrastructure monitoring and automation of resource deployment

Project goal

By learning from industry standards and best practices, we're working to further improve and expand CERN's infrastructure monitoring and deployment processes for our Java platform.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Viktor Kozlovszky, Luis Rodríguez Fernández, Artur Wiecek, Scott Hurley
Collaborator liaison(s)
Vincent Leocorbo, Cristobal Pedregal-Martin

Collaborators

Project background

CERN’s IT department is home to a dedicated group responsible for database services, referred to as IT-DB. This group maintains server infrastructures required by departments across the laboratory. In order to successfully maintain these infrastructures, it is important to be constantly aware of our systems’ status. The monitoring infrastructure consists of several components; these collect essential information about the machines and the applications running on them.

The IT-DB group also provides tools and custom Java libraries that play an important role in the processes of teams across the laboratory. It is therefore vital to deliver stable, high-quality applications that build on industry standards and best practices.

Recent progress

In 2018, we began evaluating the commercial Java monitoring tools and features provided by Oracle. We set up a test environment and managed to establish a connection between the commercial monitoring clients and the test application servers through an SSL-secured channel. The outcome of our experimental work has been published on our group's blog (see list of publications/presentations).

At the end of the year, we also started work to update and improve our Java tools by incorporating the latest automated-testing and deployment practices from industry.

Next steps

In terms of monitoring, we will evaluate Oracle's behaviour-tracking (Java "flight recording") feature, comparing it with our existing monitoring solutions, and will use our Kubernetes cluster to evaluate Oracle's commercial monitoring features. In terms of automation, we will update the remaining Java tools to ensure they can make use of the new automated-testing and deployment approaches.


Publications

    S. Hurley, Configuring Technologies to Work with Java Mission Control. Databases at CERN blog. 2018. http://cern.ch/go/Rzs8

Oracle Management Cloud

Project goal

We are testing Oracle Management Cloud and providing feedback to Oracle. We are assessing the merits and suitability of this technology for applications related to databases at CERN, comparing it with our current on-premises infrastructure.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Aimilios Tsouvelekakis, Artur Wiecek
Collaborator liaison(s)
Jeff Barber, Simone Indelicato, Vincent Leocorbo, Cristobal Pedregal-Martin

Collaborators

Project background

The group responsible for database services within CERN's IT department provides specialised monitoring solutions to teams across the laboratory that use its database infrastructure. These solutions cover a range of targets, from servers to applications and databases. Monitoring performance in this manner provides invaluable insights, and is key to helping those responsible for providing services at the laboratory maintain a full picture of what is going on with the infrastructure at all times. To provide this monitoring functionality, we use two different monitoring stacks: Elasticsearch for log management and InfluxDB for metrics management. This project is evaluating a unified monitoring solution provided by Oracle: Oracle Management Cloud.
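
For context, the snippet below sketches the kind of query the current on-premises stack serves, pulling recent error entries for the database logs from Elasticsearch. The endpoint, index name, and field names are placeholders, and the call follows the 7.x elasticsearch Python client; it is illustrative only, not part of the evaluation setup.

    # Illustrative only: fetch recent ERROR log entries from the on-premises
    # Elasticsearch stack. Endpoint, index and field names are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://es.example.cern.ch:9200")

    response = es.search(
        index="db-logs-*",
        body={
            "size": 20,
            "query": {
                "bool": {
                    "must": [
                        {"match": {"severity": "ERROR"}},
                        {"range": {"@timestamp": {"gte": "now-1h"}}},
                    ]
                }
            },
        },
    )

    for hit in response["hits"]["hits"]:
        print(hit["_source"].get("message", ""))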

Recent progress

Last year’s evaluation took place in two distinct phases. The first was performed in February and March; this mainly focused on deploying the required components in our infrastructure, using only a subset of our datasets. The second evaluation phase — for which Oracle granted CERN a significant amount of cloud credits — lasted from June to December. During this time, we evaluated three components of the platform: log analytics, infrastructure monitoring, and application performance monitoring.

We used datasets from CERN’s Oracle REST Data Services (ORDS) and CERN's Engineering Data Management Service (EDMS) — combining development, test, and production environments — to evaluate each of the three aforementioned components. From this, we were able to generate important graphs for logs and metrics, which — based on experience with our current, on-premises infrastructure — could be a significant boon when it comes to dealing with issues that arise. Based on our evaluation, we were able to provide in-depth feedback and propose possible enhancements that could be useful for other environments like ours.

Next steps

Our primary focus will be on continuing to work with Oracle on the evolution of the platform, based on the detailed feedback provided from our rigorous testing.



Presentations

    A. Tsouvelekakis, Oracle Management Cloud: A unified monitoring platform (23 January). Presented at CERN openlab Technical Workshop 2019, Geneva, 2019.

Oracle WebLogic on Kubernetes

Project goal

CERN is in the process of moving its Oracle WebLogic infrastructure to containers and Kubernetes, starting with the development environment. The goal is to achieve a robust, zero-downtime service. Taking advantage of the portability of Kubernetes, we want to evaluate Oracle Cloud as a solution for disaster recovery.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Antonio Nappi, Luis Rodríguez Fernández, Artur Wiecek
Collaborator liaison(s)
Vincent Leocorbo, Cristobal Pedregal-Martin, Monica Riccelli, David Cabelus, Will Lyon

Collaborators

Project background

For over 20 years, CERN has run a production service to host critical Java applications. Many of these applications are central to the administration of the laboratory, while others are important for engineering or IT. We’re working on solutions to help keep these applications running in case of major problems with the CERN data centre.

At CERN's database-applications service, there is ongoing work to migrate from virtual machines to Kubernetes. We're capitalising on this opportunity to evaluate how our services can run on public clouds, in particular on Oracle Cloud. This new architecture will increase the team's productivity, freeing up time to focus more directly on developers' needs.

Recent progress

In 2018, we consolidated the work of the previous year. We worked on two versions of Oracle WebLogic, thus ensuring backward compatibility with legacy applications and giving our users the opportunity to test the newer version. We also integrated a new open-source tool, called Oracle WebLogic Deploy Tooling, into our infrastructure. This is used to easily configure WebLogic domains starting from simple configuration files. Integration of this tool has enabled us to move the configuration of the WebLogic infrastructure outside the Docker images and to increase the speed at which images are generated. In addition, we developed tools to automate the deployment workflow of new applications on Kubernetes.

Another area of work in 2018 was the evaluation of Oracle WebLogic Operator. This is a new open-source tool that provides a WebLogic environment running on Kubernetes. We worked very closely with the Oracle team responsible for this tool, with much of our feedback and many of our suggestions having a direct impact on new releases.

Next steps

In 2019, we will mainly focus on ensuring that our production environment runs on Kubernetes. In addition, we will start to evaluate a disaster-recovery plan running on Oracle Cloud. We will also look into new options for our monitoring infrastructure; in particular, we will evaluate a tool called Prometheus.


Publications

    A. Nappi. HAProxy High Availability Setup. Databases at CERN blog. 2017. cern.ch/go/9vPf
    A. Nappi. HAProxy Canary Deployment. Databases at CERN blog. 2017. cern.ch/go/89ff

Presentations

    A. Nappi, L. Rodríguez Fernández, WebLogic on Kubernetes (17 January). Presented at CERN openlab meeting with Oracle, Geneva, 2017. cern.ch/go/6Z8R
    S. A. Monsalve, Development of WebLogic 12c Management Tools (15 August). Presented at CERN openlab summer students' lightning talks, Geneva, 2017. cern.ch/go/V8pM
    A. Nappi, L. Rodríguez Fernández, WebLogic on Kubernetes (15-17 August). Presented at Oracle Workshop, Bristol, 2017. cern.ch/go/6Z8R
    A. Nappi, WebLogic on Kubernetes (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/6Z8R
    A. Nappi, L. Rodríguez Fernández, Oracle WebLogic on Containers: Beyond the Frontiers of Your Data Centre (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/nrh8
    A. Nappi, L. Gedvilas, L. Rodríguez Fernández, A. Wiecek, B. Aparicio Cotarelo (9-13 July). Presented at the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, Bulgaria, 2018. cern.ch/go/dW8J
    L. Rodríguez Fernández, A. Nappi, WebLogic on Kubernetes (11 January). Presented at CERN openlab Technical Workshop, Geneva, 2018. cern.ch/go/6Z8R
    B. Cotarelo, Oracle WebLogic on Kubernetes (July). Presented at the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, 2018. cern.ch/go/6MVQ
    M. Riccelli, D. Cabelus, A. Nappi, Running a Modern Java EE Server in Containers Inside Kubernetes (23 October). cern.ch/go/b6nl

EOS productisation

Project goal

This project is focused on the evolution of CERN’s EOS large-scale storage system. The goal is to simplify the usage, installation, and maintenance of the system. In addition, the project aims to add native support for new client platforms, expand documentation, and implement new features/integration with other software packages.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Luca Mascetti
Team members
Elvin Sindrilaru
Collaborator liaison(s)
Gregor Molan, Ivan Arizanovic, Branko Blagojevic

Collaborators

Project background

Within the CERN IT department, a dedicated group is responsible for the operation and development of the storage infrastructure. This infrastructure is used to store the physics data generated by the experiments at CERN, as well as the files of all members of personnel.

EOS is a disk-based, low-latency storage service developed at CERN. It is tailored to handle large data rates from the experiments, while also running concurrent, complex production workloads. This high-performance system now provides more than 300 petabytes of raw disk capacity.

EOS is also the key storage component behind CERNBox, CERN’s cloud-synchronisation service. This makes it possible to sync and share files on all major mobile and desktop platforms (Linux, Windows, macOS, Android, iOS), with the aim of providing offline availability to any data stored in the EOS infrastructure.
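
To give a concrete feel for how clients typically interact with the service, the sketch below uses the XRootD Python bindings (assumed to be installed) to list a directory and read a file from an EOS endpoint over the xroot protocol; the endpoint and paths are placeholders.

    # Illustrative only: talk to an EOS instance via the XRootD Python bindings.
    # The endpoint and paths are placeholders.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    fs = client.FileSystem("root://eospublic.cern.ch")

    status, listing = fs.dirlist("/eos/opendata")          # list a directory
    if status.ok:
        for entry in listing:
            print(entry.name)

    f = client.File()
    status, _ = f.open("root://eospublic.cern.ch//eos/opendata/README", OpenFlags.READ)
    if status.ok:
        status, data = f.read()                            # read the whole file
        print(data[:200])
        f.close()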

Recent progress

Comtrade's team continued to acquire further knowledge of EOS, profiting from their visit to CERN and from working side-by-side with members of the development and operations teams. This helped them to improve their work on EOS installation, documentation, and testing.

In particular, a dedicated document describing best practices for operating EOS in large-scale environments was produced, as well as a full-stack virtual environment hosted at Comtrade. The latter demonstrates the potential of EOS when used as a geographically distributed storage system.

Next steps

The project will focus on improving and updating the EOS technical documentation for future administrators and operators. The next main goal is to host dedicated hardware resources at CERN to support prototyping of an EOS-based appliance. This will enable Comtrade to create a first version of a full storage solution and to offer it to potential customers in the future.

In addition, the team will investigate the possibility of developing a native Windows client for EOS.

Publications

    X. Espinal, M. Lamanna, From Physics to industry: EOS outside HEP, Journal of Physics: Conference Series (2017), Vol. 898, https://doi.org/10.1088/1742-6596/898/5/052023. cern.ch/go/7XWH

Presentations

    L. Mascetti, Comtrade EOS Productization (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2019. cern.ch/go/W6SQ
    G. Molan, EOS Documentation and Tesla Data Box (4 February). Presented at CERN EOS workshop, Geneva, 2019. cern.ch/go/9QbM
    L. Mascetti, EOS Comtrade Project (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2020. cern.ch/go/l9gc
    L. Mascetti, CERN Disk Storage Services (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/pF97
    G. Molan, Preparing EOS for Enterprise Users (27 January 2020). Presented at Cloud Storage Services for Synchronization and Sharing (CS3), Copenhagen, 2020. cern.ch/go/tQ7d
    G. Molan, EOS Documentation for Enterprise Users (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/swX8
    G. Molan, EOS Windows Native Client (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/P7DX
    G. Molan, EOS Storage Appliance Prototype (5 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/q8qh