Facilitate and Automatize Kubernetes Operations

Project Goal

The project aims to produce a tool that helps validate, test and automate Kubernetes cluster upgrades. The current process of validating a new Kubernetes version takes several weeks. We would like to simplify it and run the same process in a shorter time and in an automated way.

During the course of the project, we also explored new avenues and evaluated the potential integration of Oracle Database Multilingual Engine (MLE) with our tools. Additionally, we conducted an in-depth analysis of Oracle REST Data Services (ORDS). Thanks to these efforts and collaborative teamwork, we were able to develop a comprehensive migration strategy and use cases for transitioning from the legacy version to the new system, as well as implementing a modern OIDC authentication standard. This allowed us to move away from the technology currently in use.

Background

The main issue when upgrading a Kubernetes cluster to a newer version is that we cannot statically determine whether our current workloads will break in the newer version because of API changes or deprecations. The only way to find out is to run them against a new Kubernetes cluster. With thousands of pods and resources, this approach does not scale. Testing should not follow an empirical strategy but take advantage of static analysis. This is the first step of a wider idea for static analysis of service mesh dependencies.
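To illustrate what such a static check can look like, here is a minimal Python sketch. It is not the kubernetes-diff implementation; the function names and the lookup table are hypothetical and chosen only for illustration. The table holds a small excerpt of API versions that upstream Kubernetes stopped serving in specific releases, and the check compares already-parsed manifests against a target version without contacting any cluster.

```python
# Minimal sketch of a static API-removal check (hypothetical names, not kubernetes-diff).
# Manifests are assumed to be already parsed into dictionaries.

# Illustrative excerpt: (apiVersion, kind) -> first Kubernetes release that no longer serves it.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def parse_version(text):
    """Turn 'v1.25' or '1.25.3' into a comparable (major, minor) tuple."""
    major, minor = text.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests, target_version):
    """Return one message per manifest whose API is not served in target_version."""
    target = parse_version(target_version)
    findings = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion")
        kind = manifest.get("kind")
        removed = REMOVED_IN.get((api_version, kind))
        if removed is not None and target >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{kind}/{name}: {api_version} is not served from "
                f"v{removed[0]}.{removed[1]} onwards"
            )
    return findings


if __name__ == "__main__":
    workloads = [
        {"apiVersion": "batch/v1beta1", "kind": "CronJob",
         "metadata": {"name": "nightly-report"}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": "frontend"}},
    ]
    for line in find_removed_apis(workloads, "v1.25"):
        print(line)
```

A lookup of this kind only catches removed API versions. It says nothing about changed defaults or about validation rules that live in the Kubernetes codebase rather than in the published schemas.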

In supporting our infrastructure engineers, as well as engineers worldwide working with Kubernetes, we focused on making workload migration between cluster versions less error-prone, faster, more efficient, and less susceptible to human oversight. Meanwhile, the use of MLE enables partial offloading of responsibilities for specific JavaScript applications to the database level, streamlining operations and reducing the burden on dedicated personnel.

As for ORDS, adopting the proposed approach will help us avoid issues arising from our current setup, which will soon be outdated because several of its components will no longer be supported. The current solution is also overly complex: it is a workaround that is difficult to understand and implement, and it relies on injecting a principal into the authentication flow. The guide we developed simplifies and streamlines this process, making it easier, faster, and more comprehensible for everyone involved.

We successfully garnered Oracle’s interest in our case and observations, leading to a collaborative effort to develop a solution that benefits not only the company but also CERN and others worldwide who rely on similar technologies.

Progress in 2024

The project related to static analysis of context (kubernetes-diff) is considered completed. It is not feasible to cover the entire domain of possible cases due to the inherent limitations of Kubernetes schemas, which are often incomplete. Additionally, many behaviors, strategies, and validation rules are dispersed across various parts of the codebase, which frequently undergo dynamic changes or refactoring.

This inability to achieve 100% case coverage makes it impossible to fully realize the project’s objectives.

However, we worked to raise awareness within the community by contributing updates to the Kubernetes project documentation. These changes aim to enhance community understanding, assist future researchers, and explicitly address this limitation in the appropriate context. This is particularly important because such knowledge is typically familiar only to those working directly on the Kubernetes codebase, while average users may lack this insight.

MLE has been thoroughly tested, analyzed, and measured, with feedback and feature requests submitted. Its potential has been assessed, and its performance evaluated. In certain use cases, it can significantly reduce the processing time of specific requests or processes, making them several times faster.

With ORDS, we successfully analyzed, illustrated, tested, and clarified the workaround involving the injection of principals, JNDI, Tomcat Realms, and the associated complexities. Additionally, we developed a scenario to transition away from this solution toward a streamlined and clean adoption of OIDC.

Next Steps

Finish all the topics, from kubernetes-diff to MLE and ORDS - documentation, guides, technical stories, codebase, scripts, infrastructure, containers.

To ensure that anyone interested in adopting a similar setup or migrating to the latest solutions and standards can do so in a straightforward, clear, and comprehensible manner, we finalized comprehensive documentation detailing each step of the process, including best practices and potential pitfalls. We incorporated feedback from initial implementations to refine and enhance the procedures. Additionally, we developed user-friendly manuals to guide users through the setup and migration processes, ensuring clarity at every stage. We also created Docker images and configurations to facilitate easy deployment and replication of the environment.

Furthermore, we provided configuration files and templates to streamline the setup process, reducing the likelihood of errors. These resources are designed to make the adoption of the new setup intuitive and accessible, even for those unfamiliar with the underlying technologies.

Project Coordinator: Antonio Nappi

Technical Team: Antonio Nappi, Adrian Karasinski

Collaboration Liaisons: Eric Grancher (CERN), Cristobal Pedregal-Martin (Oracle), Garret Swart (Oracle), Artur Wiecek (CERN)

In partnership with: Oracle