Designing and operating distributed data infrastructures and computing centres poses challenges in areas such as networking, architecture, storage, databases, and cloud. These challenges are amplified, and new ones arise, when operating at the extremely large scales required by major scientific endeavours. CERN is evaluating different models for increasing computing and data-storage capacity, in order to accommodate the growing needs of the LHC experiments over the next decade. Each model presents different technological challenges. In addition to increasing the capacity of the systems used for traditional types of data processing and storage, a number of alternative architectures and specialised capabilities are being explored. These will add heterogeneity and flexibility to the data centres, and should enable advances in resource optimisation.

 

Kubernetes and Google Cloud

Project goal

The aim of this project is to demonstrate the scalability and performance of Kubernetes and Google Cloud, validating this setup for future computing models. As an example, we are using the well-known Higgs boson analysis that led to the 2013 Nobel Prize in Physics, thereby also showing that real physics analyses can be performed using CERN Open Data.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Ricardo Manuel Brito da Rocha
Collaborator liaison(s)
Karan Bhatia, Andrea Nardone, Mark Mims, Kevin Kissell

Collaborators

Project background

As we look to improve the computing models we use in high-energy physics (HEP), this project serves to demonstrate the potential of open, well-established APIs such as Kubernetes, which open up a wide range of new possibilities for how we deploy our workloads.

Based on a challenging and famous use case, we’re working to demonstrate that these new tools — together with the virtually unlimited capacity offered by public cloud providers — make it possible to rethink how analysis workloads can be scheduled and distributed. This could lead to further improvements in the efficiency of our systems at CERN.
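
To illustrate what such scheduling can look like in practice, the sketch below uses the official Kubernetes Python client to fan an analysis out as one Job per open-data input file. This is a minimal sketch only, not the project’s actual higgs-demo code: the namespace, container image and file paths are hypothetical placeholders.

    # Minimal sketch: submit one Kubernetes Job per open-data input file.
    # Assumes a reachable cluster (e.g. on GKE) and a local kubeconfig; the container
    # image and the file list are hypothetical placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # use the current kubeconfig context
    batch = client.BatchV1Api()

    input_files = [
        "root://eospublic.cern.ch//eos/opendata/cms/example/file1.root",  # placeholder
        "root://eospublic.cern.ch//eos/opendata/cms/example/file2.root",  # placeholder
    ]

    for i, path in enumerate(input_files):
        container = client.V1Container(
            name="analysis",
            image="example.org/higgs-analysis:latest",  # hypothetical image
            args=[path],  # the analysis container processes one input file per Job
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name=f"higgs-analysis-{i}"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(containers=[container], restart_policy="Never")
                ),
                backoff_limit=2,
            ),
        )
        batch.create_namespaced_job(namespace="default", body=job)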

The project also provides an excellent opportunity to show how, given enough resources, anyone can replicate important physics analysis work using the open data published by CERN and the LHC experiments.

Recent progress

The initial goal of the project has been fulfilled: we have demonstrated that Kubernetes and Google Cloud provide a viable and extremely performant setup for running HEP analysis workloads. The code required, as well as the setup, is fully documented and publicly available (see publications).

The outcome of the project was presented at a number of high-profile conferences, including a keynote presentation at KubeCon Europe 2019, an event attended by over 8000 people. A live demo of the whole setup, using data from the CERN Open Data Portal, was shown on stage.

The setup — as well as the dataset used — has been prepared for publication as a Google Cloud official tutorial. This will enable anyone to trigger a similar execution using their own public cloud resources. This tutorial will be published in early 2020, once the text has been finalised.

Next steps

This project was initially self-contained, with a clear target for the presentation at KubeCon Europe 2019. However, the project has now grown beyond this initial, limited scope. Future steps should include:

  • Further investigating how use of public clouds can improve physics analysis.
  • Working to provide on-demand bursting to public-cloud capacity for our on-premises resources.
  • Seeking to understand how we can best define policies and accounting procedures for using public cloud resources in this manner.

 

 

Publications

    R. Rocha, L. Heinrich, higgs-demo. Project published on GitHub, 2019. cern.ch/go/T8QQ

Presentations

    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (21 May). Presented at KubeCon Europe 2019, Barcelona, 2019. cern.ch/go/PlC8
    R. Rocha, L. Heinrich, Higgs Analysis on Kubernetes using GCP (19 September). Presented at Google Cloud Summit, Munich, 2019. cern.ch/go/Dj8f
    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (7 November). Presented at the 24th International Conference on Computing in High-Energy and Nuclear Physics (CHEP), Adelaide, 2019. cern.ch/go/6Htg
    R. Rocha, L. Heinrich, Deep Dive into the KubeCon Higgs Analysis Demo (5 July). Presented at CERN IT Technical Forum, Geneva, 2019. cern.ch/go/6zls

Infrastructure monitoring and automation of resource deployment

Project goal

By learning from industry standards and best practices, we’re working to further improve and expand CERN’s infrastructure monitoring and deployment processes for our Java platform.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Viktor Kozlovszky, Luis Rodríguez Fernández, Artur Wiecek, Scott Hurley
Collaborator liaison(s)
Vincent Leocorbo, Cristobal Pedregal-Martin

Collaborators

Project background

CERN’s IT department is home to a dedicated group responsible for database services, referred to as IT-DB. This group maintains server infrastructures required by departments across the laboratory. In order to successfully maintain these infrastructures, it is important to be constantly aware of our systems’ status. The monitoring infrastructure consists of several components; these collect essential information about the machines and the applications running on them.

The IT-DB group also provides tools and custom Java libraries that play an important role in the processes of teams across the laboratory. It is therefore vital to deliver stable, quality applications that build on industry standards and best practices.

Recent progress

In 2018, we began evaluation of the commercial Java monitoring tools and features provided by Oracle. We set up a test environment and managed to establish a connection — through an SSL-secured channel — between the commercial monitoring clients and the test application servers. The outcome of our experimental work has been published on our group’s blog (see list of publications/presentations).

At the end of the year, we also started work to update and improve our Java tools by incorporating the latest automated-testing and deployment practices from industry.

Next steps

In terms of monitoring, we will work to evaluate the behaviour tracking (Java “flight recording”) feature of Oracle, comparing it with our existing monitoring solutions, and will use our Kubernetes cluster to evaluate the commercial monitoring features from Oracle. In terms of automation, we will update the remaining Java tools to ensure they can make use of the new automated testing and deployment approaches.
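
For context on what these features involve, the sketch below shows the kind of JVM options in play: the standard com.sun.management.jmxremote.* properties expose a JMX endpoint over SSL (the channel used by monitoring clients such as Java Mission Control), while -XX:StartFlightRecording captures a flight recording. This is an illustrative sketch only; the port, keystore paths and application jar are hypothetical placeholders and do not reflect our actual configuration.

    # Illustrative sketch: launch a JVM with remote JMX over SSL and a flight recording
    # enabled. The port, keystore paths and application jar are hypothetical placeholders.
    import subprocess

    jvm_options = [
        # Expose a JMX endpoint that monitoring clients (e.g. Java Mission Control) can attach to.
        "-Dcom.sun.management.jmxremote.port=7091",
        "-Dcom.sun.management.jmxremote.authenticate=true",
        "-Dcom.sun.management.jmxremote.ssl=true",
        "-Djavax.net.ssl.keyStore=/path/to/keystore.jks",      # placeholder keystore
        "-Djavax.net.ssl.keyStorePassword=changeit",           # placeholder password
        # Start a 60-second flight recording; older Oracle JDKs (pre-11) also require
        # -XX:+UnlockCommercialFeatures -XX:+FlightRecorder for this option.
        "-XX:StartFlightRecording=duration=60s,filename=startup.jfr",
    ]

    subprocess.run(["java", *jvm_options, "-jar", "app.jar"], check=True)  # app.jar is a placeholder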

 

 

Publications

    S. Hurley, Configuring Technologies to Work with Java Mission Control. Databases at CERN blog. 2018. cern.ch/go/Rzs8

Oracle Management Cloud

Project goal

We are testing Oracle Management Cloud and providing feedback to Oracle. We are assessing the merits and suitability of this technology for applications related to databases at CERN, comparing it with our current on-premises infrastructure.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Aimilios Tsouvelekakis, Artur Wiecek
Collaborator liaison(s)
Jeff Barber, Simone Indelicato, Vincent Leocorbo, Cristobal Pedregal-Martin

Collaborators

Project background

The group responsible for database services within CERN's IT department provides specialised monitoring solutions to teams across the laboratory that use database infrastructure. These solutions cover a range of targets, from servers to applications and databases. Monitoring in this manner provides invaluable insights, and is key in helping those responsible for providing services at the laboratory to maintain a full picture of the infrastructure at all times. To provide this monitoring functionality, we currently use two different monitoring stacks: Elasticsearch for log management and InfluxDB for metrics management. This project is evaluating a unified monitoring solution provided by Oracle: Oracle Management Cloud.
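
To make the current two-stack write path concrete, the sketch below sends one metric point to InfluxDB and indexes one log document in Elasticsearch using their standard Python clients. It is a generic illustration under assumed hostnames, database, measurement and index names, not our actual collectors.

    # Generic sketch of the two-stack write path: metrics to InfluxDB, logs to Elasticsearch.
    # Hostnames, database/measurement/index names and field values are hypothetical placeholders.
    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch   # Elasticsearch client (8.x style API)
    from influxdb import InfluxDBClient       # InfluxDB 1.x client

    now = datetime.now(timezone.utc).isoformat()

    # One metric point in a "db_sessions" measurement.
    metrics = InfluxDBClient(host="influxdb.example.org", port=8086, database="monitoring")
    metrics.write_points([{
        "measurement": "db_sessions",
        "tags": {"host": "dbserver01", "service": "ords"},
        "time": now,
        "fields": {"active_sessions": 42},
    }])

    # One log document in a daily index (older clients take body= instead of document=).
    logs = Elasticsearch("http://elasticsearch.example.org:9200")
    logs.index(
        index="db-logs-2019.01.01",
        document={"@timestamp": now, "host": "dbserver01", "level": "ERROR",
                  "message": "ORA-00060: deadlock detected while waiting for resource"},
    )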

Recent progress

Last year’s evaluation took place in two distinct phases. The first was performed in February and March; this mainly focused on deploying the required components in our infrastructure, using only a subset of our datasets. The second evaluation phase — for which Oracle granted CERN a significant amount of cloud credits — lasted from June to December. During this time, we evaluated three components of the platform: log analytics, infrastructure monitoring, and application performance monitoring.

We used datasets from CERN’s Oracle REST Data Services (ORDS) and CERN's Engineering Data Management Service (EDMS) — combining development, test, and production environments — to evaluate each of the three aforementioned components. From this, we were able to generate important graphs for logs and metrics, which — based on experience with our current, on-premises infrastructure — could be a significant boon when it comes to dealing with issues that arise. Based on our evaluation, we were able to provide in-depth feedback and propose possible enhancements that could be useful for other environments like ours.

Next steps

Our primary focus will be on continuing to work with Oracle on the evolution of the platform, based on the detailed feedback provided from our rigorous testing.

 

 


Presentations

    A. Tsouvelekakis, Oracle Management Cloud: A unified monitoring platform (23 January). Presented at CERN openlab Technical Workshop 2019, Geneva, 2019.

Oracle WebLogic on Kubernetes

Project goal

CERN is in the process of moving its Oracle WebLogic infrastructure to containers and Kubernetes, starting with the development environment. The goal is to achieve a robust, zero-downtime service. Taking advantage of the portability of Kubernetes, we want to evaluate Oracle Cloud as a solution for disaster recovery.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eric Grancher and Eva Dafonte Perez
Team members
Antonio Nappi, Luis Rodríguez Fernández, Artur Wiecek
Collaborator liaison(s)
Vincent Leocorbo, Cristobal Pedregal-Martin, Monica Riccelli, David Cabelus, Will Lyon

Collaborators

Project background

For over 20 years, CERN has run a production service to host critical Java applications. Many of these applications are central to the administration of the laboratory, while others are important for engineering or IT. We’re working on solutions to help keep these applications running in case of major problems with the CERN data centre.

At CERN’s database-applications service, there is ongoing work to migrate from virtual machines to Kubernetes. We’re capitalising on this opportunity to evaluate how our services can run on public clouds — in particular, on Oracle Cloud. This new architecture will increase the team’s productivity, freeing up time to focus more directly on developers’ needs.

Recent progress

In 2018, we consolidated the work of the previous year. We worked on two versions of Oracle WebLogic, thus ensuring backward compatibility with legacy applications and giving our users the opportunity to test the newer version. We also integrated a new open-source tool, called Oracle WebLogic Deploy Tooling, into our infrastructure. This is used to easily configure WebLogic domains starting from simple configuration files. Integration of this tool has enabled us to move the configuration of the WebLogic infrastructure outside the Docker images and to increase the speed at which images are generated. In addition, we developed tools to automate the deployment workflow of new applications on Kubernetes.

Another area of work in 2018 was the evaluation of Oracle WebLogic Operator. This is a new open-source tool that provides a WebLogic environment running on Kubernetes. We worked very closely with the Oracle team responsible for this tool, with much of our feedback and many of our suggestions having a direct impact on new releases.

Next steps

In 2019, we will mainly focus on ensuring that our production environment runs on Kubernetes. In addition, we will start to evaluate a disaster-recovery plan running on Oracle Cloud. We will also look into new options for our monitoring infrastructure; in particular, we will evaluate a tool called Prometheus.
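
Prometheus works on a pull model: each service exposes its metrics over HTTP and the Prometheus server scrapes them periodically. The sketch below, using the prometheus_client Python package, illustrates this exposition model; it is a generic example, not part of our WebLogic setup, and the metric names and port are hypothetical placeholders.

    # Generic sketch of the Prometheus exposition model: the application publishes metrics
    # over HTTP, and the Prometheus server scrapes them at regular intervals.
    # Metric names and the port are hypothetical placeholders.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled by the application")
    ACTIVE_SESSIONS = Gauge("app_active_sessions", "Currently active sessions")

    if __name__ == "__main__":
        start_http_server(8000)  # metrics are served at http://localhost:8000/metrics
        while True:
            REQUESTS.inc()
            ACTIVE_SESSIONS.set(random.randint(0, 50))  # stand-in for a real reading
            time.sleep(5)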

Publications

    A. Nappi. HAProxy High Availability Setup. Databases at CERN blog. 2017. cern.ch/go/9vPf
    A. Nappi. HAProxy Canary Deployment. Databases at CERN blog. 2017. cern.ch/go/89ff

Presentations

    A. Nappi, L. Rodriguez Fernández, WebLogic on Kubernetes (17 January). Presented at CERN openlab meeting with Oracle, Geneva, 2017. cern.ch/go/6Z8R
    S. A. Monsalve, Development of WebLogic 12c Management Tools (15 August). Presented at CERN openlab summer students’ lightning talks, Geneva, 2017. cern.ch/go/V8pM
    A. Nappi, L. Rodriguez Fernández, WebLogic on Kubernetes (15-17 August). Presented at Oracle Workshop, Bristol, 2017. cern.ch/go/6Z8R
    A. Nappi, WebLogic on Kubernetes (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/6Z8R
    A. Nappi, L. Rodriguez Fernández, Oracle WebLogic on Containers: Beyond the Frontiers of your Data Centre (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/nrh8
    A. Nappi, L. Gedvilas, L. Rodríguez Fernández, A. Wiecek, B. Aparicio Cotarelo (9-13 July). Presented at 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, Bulgaria, 2018. cern.ch/go/dW8J
    L. Rodriguez Fernandez, A. Nappi, WebLogic on Kubernetes (11 January). Presented at CERN openlab Technical Workshop, Geneva, 2018. cern.ch/go/6Z8R
    B. Aparicio Cotarelo, Oracle WebLogic on Kubernetes (July). Presented at 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, 2018. cern.ch/go/6MVQ
    M. Riccelli, D. Cabelus, A. Nappi, Running a Modern Java EE Server in Containers Inside Kubernetes (23 October). cern.ch/go/b6nl

EOS productisation

Project goal

This project is focused on the evolution of CERN’s EOS large-scale storage system. The goal is to simplify the usage, installation, and maintenance of the system. In addition, we will add support for new client platforms, expand documentation, and implement new features/integration with other software packages.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Luca Mascetti
Team members
Elvin Sindrilaru
Collaborator liaison(s)
Gregor Molan, Ivan Arizanovic, Branko Blagojevic

Collaborators

Project background

Within the CERN IT department, a dedicated group is responsible for the operation and development of storage infrastructure. This infrastructure is used to store the physics data generated by the experiments at CERN, as well as the files of all members of personnel.

EOS is a disk-based, low-latency storage service developed at CERN. It is tailored to handle large data rates from the experiments, while also running concurrent complex production workloads. This high-performance system now provides more than 300 petabytes of raw disk capacity.

EOS is also the key storage component behind CERNBox, CERN’s cloud-synchronisation service. This makes it possible to sync and share files on all major mobile and desktop platforms (Linux, Windows, macOS, Android, iOS), with the aim of providing offline availability to any data stored in the EOS infrastructure.
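
EOS is accessed natively over the XRootD protocol. The sketch below uses the XRootD Python bindings against CERN’s public open-data endpoint to list a directory and read part of a file; it is a minimal illustration of the access model, not project code, and the directory and file paths are placeholders.

    # Minimal sketch: list a directory and read part of a file from EOS over the XRootD
    # protocol, using the XRootD Python bindings. The paths are hypothetical placeholders;
    # eospublic.cern.ch hosts the CERN Open Data EOS instance.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    fs = client.FileSystem("root://eospublic.cern.ch")
    status, listing = fs.dirlist("/eos/opendata")  # placeholder directory
    if status.ok:
        for entry in listing:
            print(entry.name)

    with client.File() as f:
        f.open("root://eospublic.cern.ch//eos/opendata/example/file.root", OpenFlags.READ)  # placeholder file
        status, data = f.read(offset=0, size=1024)  # read the first kilobyte
        print(status.ok, len(data) if data else 0)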

Recent progress

We are now in the third phase of this project. The team at Comtrade has been working to acquire further knowledge of EOS, with this activity carefully organised into nine separate work packages. Three Comtrade engineers also visited CERN and spent two weeks working side-by-side with members of the development and operations teams (helping to handle hardware failures, reconfigurations, software upgrades, and user support). We were then able to work together to create a set of technical documents describing the main aspects of EOS, for use by future administrators and operators.

In addition, we set up a proof-of-concept system using container technology. This shows the potential of the system to be used as a geographically distributed storage system and will serve as a demonstrator to potential future customers.

Next steps

We will continue our evolving work on EOS installation, documentation, and testing. We will prepare a dedicated document outlining “best practices” for operating EOS in large-scale environments.

An additional goal is to provide future customers with a virtual full-stack environment hosted at Comtrade. This would consist of an EOS instance enabled with the latest-generation namespace, a sync-and-share endpoint (using CERNBox), and an interactive data-analysis service (based on SWAN, the JupyterHub notebook used at CERN).

Publications

    X. Espinal, M. Lamanna, From Physics to industry: EOS outside HEP, Journal of Physics: Conference Series (2017), Vol. 898, https://doi.org/10.1088/1742-6596/898/5/052023. cern.ch/go/7XWH

Presentations

    L. Mascetti, Comtrade EOS productization (23 January). Presented at CERN openlab technical workshop, Geneva, 2019. cern.ch/go/W6SQ
    G. Molan, EOS Documentation and Tesla Data Box (4 February). Presented at CERN EOS workshop, Geneva, 2019. cern.ch/go/9QbM