Debugging a namespace deletion issue in Kubernetes

Sumanth Reddy
3 min read · Feb 3, 2022


Image source: https://stacksimplify.com/azure-aks/azure-kubernetes-service-namespaces-imperative/

I got a support ticket from a developer about their job breaking, and when I checked, the issue was a namespace stuck in the Terminating state forever. A namespace stuck in termination is one of the most classic problems faced by every Kubernetes engineer out there. Yet the reasons are not always the same 🙃.

This article is about my experience debugging a namespace deletion issue in Kubernetes. Just one of many fun moments in the day-to-day life of a Kubernetes engineer :)

This article has 3 sections:

  1. Quick workaround
  2. Debugging the underlying issue
  3. Namespace deletion under the hood

Workaround

When I exported the namespace, its status showed that there was an issue with one of the server API groups:

unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Namespace export output (trimmed)
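
For context, the export was just a plain get of the namespace object, and the error above shows up under status.conditions. A rough, trimmed sketch of what that looks like (the namespace name is a placeholder, and the exact condition wording will vary by cluster and version):

~ k get namespace <stuck-namespace> -o yaml
...
status:
  phase: Terminating
  conditions:
  - type: NamespaceDeletionDiscoveryFailure
    status: "True"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request'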

I tried to find out which service uses/creates/manages this API group, using

~ k api-resources

The results did not include any resource from external.metrics.k8s.io/v1beta1, and the output also carried a message that it was unable to retrieve the complete list.

Since this was blocking the developer's pipeline, I had to use a cheap workaround to delete the namespace forcefully:

Script from https://stackoverflow.com/a/53661717
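
The script itself is not reproduced verbatim here, but the idea is along these lines: grab the namespace as JSON, strip spec.finalizers, and push the result straight to the namespace's finalize subresource (the namespace name below is a placeholder, and jq is assumed to be installed):

NAMESPACE=stuck-namespace
kubectl get namespace "$NAMESPACE" -o json \
  | jq 'del(.spec.finalizers)' \
  | kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f -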

I tried patching/editing the spec to remove the finalizers, but that didn't work; the above script finally did. What it essentially does is remove the finalizers from the namespace object. More about finalizers in the Kubernetes docs: https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/

Debugging

Now, moving on to fix the actual issue…

Upon some brief ducking (DuckDuckGo FTW), I found that this API group is installed by an external metrics adapter. Found this: https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/custom-metrics-apiservice.yaml#L37 We were testing external metric sources, and my colleagues had tried out Keda on the cluster as well. This also made me realise why k api-resources didn't yield this in the list: the group is served through an aggregated API server. The error itself was saying unable to retrieve the complete list of server APIs 🤦‍♂️
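
Aggregated API groups like this one are registered with kube-apiserver through APIService objects, so that is the next place to look. Listing them shows which groups are served by an in-cluster service rather than by kube-apiserver itself (any entry whose SERVICE column is not Local); the grep here is just to narrow the output:

~ k get apiservices | grep external.metrics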

~ k get APIService v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                AVAILABLE                  AGE
v1beta1.external.metrics.k8s.io   keda/keda-operator-metrics-apiserver   False (MissingEndpoints)   13h
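
The AVAILABLE column already hints at the root cause: MissingEndpoints means the service the APIService points at (keda/keda-operator-metrics-apiserver here) has no ready endpoints behind it. Roughly the checks to run next (namespace and service names taken from the output above):

~ k -n keda get endpoints keda-operator-metrics-apiserver
~ k -n keda get deploy,pods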

So this made it clear that the issue was with the Keda metrics API server deployed in the keda namespace. Upon checking, there was a problem with its deployment, and once the Keda controller was fixed, the namespace deletion issue vanished.

~ k get APIService v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                AVAILABLE   AGE
v1beta1.external.metrics.k8s.io   keda/keda-operator-metrics-apiserver   True        13h

Peek under the hood

How does namespace deletion work under the hood? 🤔

This took me to the namespace package in the Kubernetes code repo:

➜  kubernetes/pkg/controller/namespace

.
├── OWNERS
├── config
│   ├── OWNERS
│   ├── doc.go
│   ├── types.go
│   ├── v1alpha1
│   │   ├── conversion.go
│   │   ├── defaults.go
│   │   ├── doc.go
│   │   ├── register.go
│   │   ├── zz_generated.conversion.go
│   │   └── zz_generated.deepcopy.go
│   └── zz_generated.deepcopy.go
├── deletion
│   ├── namespaced_resources_deleter.go      # File of our interest :p
│   ├── namespaced_resources_deleter_test.go
│   ├── status_condition_utils.go
│   └── status_condition_utils_test.go
├── doc.go
└── namespace_controller.go                  # Main controller

3 directories, 17 files

namespace_controller.go spins up workers that read namespaces from a work queue. Each worker picks a namespace from the queue and syncs its status (think control loops) until it receives a stop signal on its control channel.

There are multiple namespace condition types defined in the deletion package, among them NamespaceDeletionDiscoveryFailure, NamespaceDeletionContentFailure, NamespaceContentRemaining and NamespaceFinalizersRemaining:

Copied from https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/namespace/deletion/status_condition_utils.go#L47
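
These are the same conditions you can see directly on a namespace stuck in Terminating; a quick jsonpath sketch to pull them out (placeholder namespace name):

~ k get ns <stuck-namespace> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'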

The worker calls the deletion controller to determine the status of the namespace and sets its conditions accordingly on every sync cycle. The entire call stack is roughly as below:

Rough call stack for deletion controller

The issue in my case was that when the deletion controller tried to discover all the resources, the aggregated API server (Keda in this case) didn't respond with its resources, leading to a discovery failure.
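
You can also reproduce the discovery failure directly by asking the apiserver for that group's resource list; with the Keda metrics server down, this request fails the same way the namespace controller's discovery does:

~ k get --raw /apis/external.metrics.k8s.io/v1beta1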

Wow, this code exploration felt really good 🥳 Ending this article with the hope that I'll be writing more of these on various Kubernetes objects.

