Facilitate and Automate Kubernetes Operations

Project Goal

The main goal is to produce a tool that helps validate, test, and automate Kubernetes cluster upgrades. The current process of validating a new Kubernetes version takes several weeks. We would like to simplify it so that the same process can run in less time and in an automated way.

Background

The main issue when upgrading to a newer Kubernetes version is that we cannot statically determine whether our current workloads will break in the new version because of API changes and deprecations. The only way to find out is to run them against a new Kubernetes cluster. With thousands of pods and resources, this way of operating does not scale. Testing should not follow an empirical strategy but should take advantage of static analysis. This is the first step of a wider idea: static analysis of service mesh dependencies.
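
To make the idea concrete, below is a minimal Go sketch of the kind of static check this implies: flagging a manifest whose apiVersion/kind pair is removed in a target release. The hard-coded table and the single-document YAML parsing are illustrative assumptions, not the project's actual implementation; in practice the rules must be derived from the per-version schemas discussed under Progress.

    package main

    import (
        "fmt"
        "os"

        "gopkg.in/yaml.v3" // assumption: any YAML library would do
    )

    // removedIn maps an "apiVersion/Kind" pair to the release that removes it.
    // These entries are well-known removals, listed here only for illustration;
    // the real rules must come from the per-version schemas.
    var removedIn = map[string]string{
        "extensions/v1beta1/Ingress":       "v1.22",
        "batch/v1beta1/CronJob":            "v1.25",
        "policy/v1beta1/PodSecurityPolicy": "v1.25",
    }

    // manifest captures only the fields the check needs.
    type manifest struct {
        APIVersion string `yaml:"apiVersion"`
        Kind       string `yaml:"kind"`
    }

    func main() {
        path := os.Args[1] // a single-document workload manifest
        data, err := os.ReadFile(path)
        if err != nil {
            panic(err)
        }
        var m manifest
        if err := yaml.Unmarshal(data, &m); err != nil {
            panic(err)
        }
        if release, gone := removedIn[m.APIVersion+"/"+m.Kind]; gone {
            fmt.Printf("%s: %s/%s is removed in %s\n", path, m.APIVersion, m.Kind, release)
        }
    }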

Progress

We designed and developed the kubernetes-diff application, a working tool that helps migrate workloads between different versions of Kubernetes (K8s). As part of the project we built a tool that automatically extracts the supported OpenAPI K8s schemas, which the application then consumes; minimal sketches of the extraction and diff steps appear at the end of this section. The application detects issues in Kubernetes workloads against a chosen K8s cluster version. Because the tool integrates with both static manifest files and a running K8s cluster, it can be used in any scenario or context, for example as a component of an automated pipeline, a CI/CD system, a script, or a terminal application. Scan results (detected differences and errors) are emitted in popular formats such as JSON and YAML, and are also grouped into human-readable tables.

A major difficulty during the project proved to be the inconsistency of the Kubernetes project with its own schemas: many K8s resources do not comply with the validation rules present in the public OpenAPI document, because the actual rules live in many different, hidden, private places in the code. As a result, time was spent on research into understanding and obtaining these additional, hidden validation rules, which are essential to make the tool genuinely usable. A full description of this research was prepared in the course of the work, along with READMEs for the accompanying scripts: runtime environment setup, debugging metadata setup, and support for DWARF, Delve, and Go, enabling remote debugging of existing kube-apiserver instances (the entry point for Kubernetes resources). This makes the experiment repeatable and allows anyone who reproduces our setup to continue the reverse engineering and debugging work.
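
For readers who want to reproduce the setup, a schema for a given cluster can be obtained from the aggregated OpenAPI v2 document that kube-apiserver serves at /openapi/v2. The Go sketch below is a minimal, hedged illustration rather than the project's extraction tool; it assumes authentication is handled by kubectl proxy listening on its default address.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // Downloads the aggregated OpenAPI v2 document that kube-apiserver serves.
    // Assumption: "kubectl proxy" is running and handling authentication on
    // its default address, 127.0.0.1:8001.
    func main() {
        resp, err := http.Get("http://127.0.0.1:8001/openapi/v2")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        out, err := os.Create("openapi-v2.json")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        n, err := io.Copy(out, resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Printf("wrote %d bytes of schema\n", n)
    }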
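
With two such schema files on disk, one per Kubernetes version, the simplest form of the diff reduces to comparing their definition sets. The sketch below reports definitions that disappear in the target version; it illustrates the approach only, since the real comparison must also account for the hidden validation rules described above.

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // openAPIDoc keeps only the part of an OpenAPI v2 document needed here:
    // the named schema definitions, e.g. "io.k8s.api.batch.v1beta1.CronJob".
    type openAPIDoc struct {
        Definitions map[string]json.RawMessage `json:"definitions"`
    }

    func load(path string) openAPIDoc {
        data, err := os.ReadFile(path)
        if err != nil {
            panic(err)
        }
        var doc openAPIDoc
        if err := json.Unmarshal(data, &doc); err != nil {
            panic(err)
        }
        return doc
    }

    func main() {
        current := load(os.Args[1]) // schema of the running cluster version
        target := load(os.Args[2])  // schema of the upgrade target
        for name := range current.Definitions {
            if _, ok := target.Definitions[name]; !ok {
                fmt.Println("removed in target version:", name)
            }
        }
    }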

Next Steps

We will close the project at the beginning of 2024. We would like to make our work reproducible for everyone and publicize it in order to raise awareness in the wider Kubernetes community so that it can react. A proper fix requires increased effort from several parties, who would have to make many intricate adjustments across the entire Kubernetes codebase; the validation rules are in fact scattered throughout the code. Today the schemas do not match what is in the code, and vice versa. Maintaining such a tool therefore forces us to manually review and diff the entire codebase every time a new Kubernetes version is released.

Project Coordinator: Antonio Nappi

Technical Team: Antonio Nappi, Adrian Karasinski

Collaboration Liaisons: Eric Grancher (CERN), Cristobal Pedregal-Martin (Oracle), Garret Swart (Oracle), Artur Wiecek (CERN)

In partnership with: Oracle