Kubernetes — Double jump upgrades

Sumanth Reddy
Jun 19, 2023
The versions shown in the image only illustrate the double jump; the concepts outlined are not limited to those versions.

Unpopular opinion — The most difficult part of platform software is not the installation but its maintenance.

TLDR version

Kubernetes has a version skew policy that allows the data plane to be up to two minor versions behind the control plane. This lets us skip a version and move the data plane straight to the +2 version as part of an upgrade.

As long as the cluster uses only GA and Beta API versions, double jump upgrades are possible: upgrade the control plane twice (say 1.22 → 1.23 → 1.24), then upgrade the data plane directly to the +2 version (say 1.22 → 1.24).

The concepts detailed below apply to Kubernetes clusters in general.

NOTE: Doing this has its own risks. Please do your due diligence before making the decision.

If the TLDR version looks interesting, please follow along while I take you through some concepts behind the Kubernetes release and deprecation policies.

We all agree that regular software upgrades are important but not necessarily easy, and Kubernetes is no different. The Kubernetes project follows a regular release cadence. Before version 1.22 there were four releases per calendar year; from 1.22 onward this changed to three releases per calendar year, as set out in the KEP (Kubernetes Enhancement Proposal) described here.

For example, below is the AWS EKS end of life calendar.

https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar

As the calendar shows, we need to upgrade to a newer version roughly every 4 months. Not every team finds that pace easy to keep up with.

Fortunately, Kubernetes has a well-defined feature deprecation and removal policy. More on this at https://kubernetes.io/docs/reference/using-api/deprecation-policy/

API versions progression example

Features generally progress from alpha → beta → GA, but may also be deprecated or removed before reaching GA. The deprecation policy linked above has the details; quoted here is Rule #4a, which is what makes upgrades easier.

Rule #4a: API lifetime is determined by the API stability level

GA API versions may be marked as deprecated, but must not be removed within a major version of Kubernetes

Beta API versions are deprecated no more than 9 months or 3 minor releases after introduction (whichever is longer), and are no longer served 9 months or 3 minor releases after deprecation (whichever is longer)

Alpha API versions may be removed in any release without prior deprecation notice

This ensures beta API support covers the maximum supported version skew of 2 releases, and that APIs don’t stagnate on unstable beta versions, accumulating production usage that will be disrupted when support for the beta API ends.
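Since Rule #4a ties an API's lifetime to its stability level, a useful first step is knowing which level each apiVersion in use belongs to. Below is a minimal sketch that classifies an apiVersion string by its name, assuming the usual naming convention (v1, v1beta1, v1alpha1). The function names are hypothetical and not part of any Kubernetes tooling.

```python
import re

# Version names follow the pattern v<N>, v<N>beta<M> or v<N>alpha<M>.
_VERSION_RE = re.compile(r"^v(\d+)(?:(alpha|beta)(\d+))?$")


def stability_level(api_version):
    """Return 'alpha', 'beta' or 'GA' for an apiVersion such as 'batch/v1beta1'."""
    # The version is whatever follows the optional "<group>/" prefix.
    version = api_version.rsplit("/", 1)[-1]
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized API version: {api_version!r}")
    # No alpha/beta marker in the name means the version is GA.
    return match.group(2) or "GA"


def alpha_versions(api_versions):
    """Hypothetical helper: keep only the alpha apiVersions, which Rule #4a
    allows to be removed in any release without a deprecation notice."""
    return [v for v in api_versions if stability_level(v) == "alpha"]
```

Feeding it the apiVersion values collected from your manifests gives a quick list of the alpha APIs that deserve a closer look before a double jump.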

Delving into the version skew policy, you will find the following:

  1. Control plane components can differ from each other by one minor version in an HA cluster.
  2. The data plane can be up to two minor versions behind the control plane, and must not be newer than the control plane.
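The data plane rule above can be sketched as a small check. This is only an illustration: the function names are made up for this post, and the window of two minor versions is the figure quoted here, so it is kept as a parameter. The official skew policy for your release remains the authority.

```python
MAX_NODE_SKEW = 2  # minor versions, as stated in this post (assumption for other releases)


def parse_major_minor(version):
    """Parse '1.24' or 'v1.24.3' into a (major, minor) tuple of ints."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def node_skew_ok(control_plane, node, max_skew=MAX_NODE_SKEW):
    """Return True if a node at `node` may run under a control plane at `control_plane`."""
    cp_major, cp_minor = parse_major_minor(control_plane)
    node_major, node_minor = parse_major_minor(node)
    if cp_major != node_major:
        return False
    # The node may be older by up to max_skew minors, but never newer.
    return 0 <= cp_minor - node_minor <= max_skew
```

For instance, node_skew_ok("1.24", "1.22") holds, while a 1.21 node under a 1.24 control plane, or a 1.24 node under a 1.23 control plane, fails the check.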

The most notable points from Rule #4a quoted above are as follows:

  1. GA API versions may be deprecated but are not removed within a major version, and Beta API versions, once deprecated, continue to be served for 9 months or 3 minor releases (whichever is longer) before removal.

Combining the two sets of points above, one arrives at the following conclusion:

As long as the cluster uses only GA and Beta API versions, double jump upgrades are possible: upgrade the control plane twice (say 1.22 → 1.23 → 1.24), then upgrade the data plane directly to the +2 version (say 1.22 → 1.24).
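To make the ordering concrete, here is a rough sketch that lists the steps of such an upgrade. The function name and the wording of the steps are purely illustrative; it only encodes the sequence described above, with the control plane moving one minor version at a time and the data plane moving once at the end.

```python
def double_jump_plan(current, target):
    """Return the ordered upgrade steps for a jump of one or two minor versions."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    jump = tgt_minor - cur_minor
    if cur_major != tgt_major or not 1 <= jump <= 2:
        raise ValueError("a double jump covers at most two minor versions within one major version")

    steps = []
    # The control plane is upgraded one minor version at a time.
    for minor in range(cur_minor + 1, tgt_minor + 1):
        steps.append(f"control plane: {cur_major}.{minor - 1} -> {cur_major}.{minor}")
    # The data plane then moves straight to the target version.
    steps.append(f"data plane: {cur_major}.{cur_minor} -> {tgt_major}.{tgt_minor}")
    return steps
```

Calling double_jump_plan("1.22", "1.24") yields two control plane steps followed by a single data plane step, while a request such as 1.22 to 1.25 is rejected because it falls outside the skew window discussed in this post.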

This might seem insignificant at first glance, but Kubernetes platform teams know that the most difficult part of an upgrade is data plane rotation, which generally involves multiple node groups that together add up to hundreds of nodes.

In the next blog, I'll explain the general due diligence process for making double jump upgrades safer, and share my experience upgrading clusters from 1.22 -> 1.24.

Expected FAQs

1. Is it safe to do so?

Depends. Most folks run various things in their data plane, and it's always those that break. The control plane upgrade is the easier part; the data plane, not so much. So if your data plane is simple enough, you can do so, at your own risk and after due diligence.

2. Have I actually done this on real-world production clusters?

Yes. I have completed double jump cluster upgrades a couple of times to date. I should add that I did find a few small points I had missed, which I had to fix after the upgrade. For example, the log format changed in 1.24: container logs that were previously written in Docker's JSON format are now written in the CRI format, which isn't JSON. This will break your JSON log parsing pipelines. Some folks say this is just luck; I think it's the beautiful design of the Kubernetes feature deprecation and release process, and the efforts of the Kubernetes community to maintain backward compatibility ♥️.

3. Does this work on all Kubernetes clusters?

Yes, the concept comes from upstream Kubernetes, so it applies equally to self-hosted and managed Kubernetes clusters. I should add that doing this on managed clusters is a bit easier, especially since a control plane upgrade in a managed service is just a button click.
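The log format change mentioned in the second answer is easier to see with an example. The sketch below parses a container log line in either shape, assuming the Docker json-file layout ({"log", "stream", "time"}) and the CRI layout (timestamp, stream, tag, message). The function name is hypothetical, and a real pipeline would normally rely on the parsers shipped with its log collector.

```python
import json


def parse_container_log_line(line):
    """Parse one container log line written in Docker json-file or CRI format."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        # Docker json-file: {"log": "...\n", "stream": "stdout", "time": "..."}
        record = json.loads(line)
        return {
            "time": record["time"],
            "stream": record["stream"],
            "message": record["log"].rstrip("\n"),
        }
    # CRI: "<timestamp> <stream> <tag> <message>", where the tag is F (full line)
    # or P (partial line). This sketch ignores the tag.
    parts = line.split(" ", 3)
    if len(parts) < 3:
        raise ValueError(f"not a CRI log line: {line!r}")
    message = parts[3] if len(parts) > 3 else ""
    return {"time": parts[0], "stream": parts[1], "message": message}
```

Both shapes come out as the same dictionary, so downstream code does not need to care which runtime wrote the line. If the application itself logs JSON, the message field can still be passed to json.loads afterwards.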
