Context
Since version 1.18, the Monitoring part of Conduktor Console has been externalized into a separate image called conduktor-platform-cortex.
This image contains 3 components:
- Prometheus, to scrape the metrics from Conduktor Console
- Cortex, to store these metrics in S3 or on volumes
- Alertmanager, to set up alerts
These 3 components are external to Conduktor and not maintained by us, but our Monitoring relies on them.
Issue
If, right after an upgrade or more generally a restart, you see that:
- The Monitoring module doesn't work from within the UI and shows "unexpected errors"
- The logs of the Cortex container contain "SIGKILL", "SIGTERM", or "waiting for prometheus, alertmanager, cortex to die" errors
- The Cortex container goes down within 2 minutes of the restart
There are two likely causes:
- Either the container gets OOM killed
- Or it takes too long to start, and the liveness / readiness probes fail before it is ready
To identify the source of the problem, run the following command:
kubectl describe pod {your pod name}
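As a minimal sketch, the output of this command can be piped through a small shell function that looks for the two signatures covered in the sections below. The function name classify_cortex_failure is hypothetical, and the matched strings are only the ones shown in this article; a real pod may report other reasons.

```shell
# Classify the text output of `kubectl describe pod`, read from stdin.
# Prints one of: oom, liveness, unknown.
# Assumption: the strings matched here are the ones quoted in this article.
classify_cortex_failure() {
  describe_output=$(cat)
  if printf '%s\n' "$describe_output" | grep -q 'OOMKilled'; then
    echo oom
  elif printf '%s\n' "$describe_output" | grep -q 'Liveness probe failed'; then
    echo liveness
  else
    echo unknown
  fi
}
```

It could be used as: kubectl describe pod {your pod name} | classify_cortex_failure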
The container is OOM killed
In the output, you should find a section that looks like this:
Containers:
  conduktor-platform-cortex:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Ready:          False
This indicates that the container was killed because it ran out of memory (Reason: OOMKilled, exit code 137).
To give it enough memory, please refer to this article, which explains how to properly size this pod.
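For reference, exit codes above 128 conventionally mean that the process was ended by a signal, with the signal number being the exit code minus 128. The sketch below decodes 137 that way; the function name signal_for_exit_code is hypothetical. Note that 137 only shows that the container received SIGKILL; it is the OOMKilled reason that ties it to memory.

```shell
# Print the name of the signal behind a container exit code above 128.
# Convention: exit code = 128 + signal number.
signal_for_exit_code() {
  exit_code=$1
  signal_number=$((exit_code - 128))
  # `kill -l N` is a shell builtin that prints the name of signal N
  kill -l "$signal_number"
}

signal_for_exit_code 137   # 137 - 128 = 9, which is KILL (SIGKILL)
```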
The liveness probe is not long enough
In the Events section at the bottom of the describe output, you see a timeout on the liveness probe:
Warning Unhealthy 3s (x2 over 8s) kubelet Liveness probe failed: "probe failed due to timeout "
If Cortex stops unexpectedly, it runs a recovery process on restart that can take some time.
To give it more time to get up and running, you can increase the following properties. This avoids a timeout on the liveness probe after the pod restarts.
platformCortex:
  livenessProbe:
    initialDelaySeconds: 30 # default
    failureThreshold: 3 # default
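To get a feel for how much startup time a given pair of values allows, the rough estimate below may help. It assumes the probe runs every 10 seconds, which is the Kubernetes default for periodSeconds; your deployment may use a different period, so check it before relying on the numbers. The function name liveness_budget_seconds and the raised values 120 and 6 are purely illustrative, not recommendations.

```shell
# Rough upper bound, in seconds, on how long the container can stay unhealthy
# after starting before the kubelet restarts it.
# Arguments: initialDelaySeconds, failureThreshold, periodSeconds (optional).
liveness_budget_seconds() {
  initial_delay=$1
  failure_threshold=$2
  period=${3:-10}   # assumption: Kubernetes default periodSeconds of 10
  echo $((initial_delay + failure_threshold * period))
}

liveness_budget_seconds 30 3    # default values from this article: 60
liveness_budget_seconds 120 6   # hypothetical raised values: 180
```

With the defaults, a recovery process that takes longer than about a minute would therefore be interrupted by a restart, under the assumed probe period.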