On Monday April 16th and Wednesday April 18th, Manifold experienced its first real outages. In this blog post we’ll talk about what exactly happened, why it happened, and the steps we’re taking to ensure it doesn’t happen again.
By doing so, we also aim to help others understand how they can debug their Kubernetes cluster in case of a failure and what they should look out for.
First and foremost, we’d like to apologise to our affected customers.
For both outages, the impact was global. Every registered user of our platform was impacted and could neither access the dashboard nor use our CLI tool.
Our integrations such as our Terraform Provider and Kubernetes Controller were also affected and could not get the desired credentials.
Before we dive in any deeper, I’d like to set the scene. In January we migrated to Kubernetes. To configure the cluster on AWS, we used kops. This set up Auto Scaling Groups for us, which help us recover when a node goes bad. We also deploy our applications across multiple nodes and zones with Anti-Affinity rules.
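As a rough sketch (the labels and weights here are illustrative, not our actual manifests), a preferred pod anti-affinity rule that spreads replicas across hosts and zones looks like this:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: identity            # illustrative label
        topologyKey: kubernetes.io/hostname
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: identity
        topologyKey: failure-domain.beta.kubernetes.io/zone
```

Because these are preferences rather than hard requirements, the scheduler is still allowed to co-locate replicas when resources are tight, a detail that matters later in this story.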
On Sunday April 15th, at around 1am UTC, we suffered a node outage. This in itself wasn’t an issue: our nginx retried other upstream servers, and we had multiple replicas of our applications running. Our scheduler rescheduled everything onto the remaining nodes, and when the replacement node became available, workloads were scheduled onto it as well. Everything seemed fine.
Monday April 16th
Our first outage was caused by a memory leak. Or at least, that’s what we thought. This is partially true: due to a memory leak, Docker crashed on one of our nodes, which caused everything running there to be rescheduled onto another node. Since we run multiple replicas of each application, the memory leak was affecting other nodes as well. This caused our Identity service, which has high internal usage, to end up scheduled entirely on the single node that still had enough memory. We do have Anti-Affinity rules in place to prevent this, but they’re lower priority than the resource constraints that were being hit.
The quick fix was to restart the Identity service. Because we now had 3 nodes to choose from instead of 2 (due to the Anti-Affinity rules), the replicas got scheduled onto different nodes and the service was reachable again. This allowed our nginx to talk to the Identity service again and serve traffic.
After this, we started digging and came up with a temporary solution to make sure the memory leak wouldn’t cause a crash anymore (whilst we worked on actually fixing the leak).
Wednesday April 18th
After testing a temporary solution for the memory issue on staging (ensuring the heavy offenders weren’t running on the same node), one of our engineers deployed it to production at 11am UTC.
This in turn started a whole chain of events. The first thing we noticed was that the service this rule was being added to didn’t deploy: it got stuck in a CrashLoopBackOff. We looked into why this was the case, using kubectl describe on the pod, and found out that all our credentials had been wiped.
This raised a red flag, as we rely on Manifold itself being available to populate our credentials. Yes, we dogfood: we use our own Kubernetes Credentials Controller to pull in secrets and store them accordingly. This meant that our services couldn’t boot anymore because the correct credentials weren’t in place.
We immediately jumped ship and started pulling in our secrets from Terraform backups. This took us longer than we would have liked, but by 12pm UTC we managed to get everything back in place.
With everything back to normal and a whole set of new symptoms, it was time to dig deeper. What stood out was that we were still seeing a bunch of “Connection Refused” errors, but because we were running things on multiple nodes, the nginx retries managed to serve most traffic regardless. Below we’ll dig deeper into this issue.
Why did this happen
We came to the conclusion that we suffered 2 main issues which together led to this outage. These 2 issues are listed below, together with some additional information on how we figured out what was going on and how we fixed them.
Detecting a memory leak is pretty straightforward if you have the right tools in place. We use DataDog to track everything and use the stats package to measure our application stats. A few weeks ago, we noticed a memory leak and put some time into setting up more instrumentation (enabling pprof so we could get memory profiles out of our application). To our regret, we hadn’t yet put the effort into actually fixing the leak, but we had a rough idea of what it could be.
By looking at the heap profile, we noticed a lot of memory stayed around in our context. We’re using go-swagger (v0.11.0), which in turn uses gorilla/context. Its maintainers are aware of the potential for memory leaks and have provided a solution, which we had implemented. To our surprise, it didn’t work in our case and we were still seeing a lot of memory held by our context.
We’ve since found another solution, also provided by the context package, and this is doing the job for us. To our regret, we didn’t implement this soon enough.
As mentioned before, we noticed a lot of “Connection Refused” errors. Sporadic network failures are expected, but the volume we were seeing was intolerable. This led us down the path of digging deeper. The first thing we checked was whether our pods were actually healthy, and, according to our health checks, they were. This was confirmed by tailing the logs (kubectl logs -f deploy/identity) and seeing our health checks being reported.
We then listed all available pods with their associated internal IP and which node they were hosted on (kubectl get pods -o wide) at that time. We then cross referenced all the connection failures in our logs and found out that most of these failures occurred when trying to reach applications on a specific node.
We also noticed that these applications were actually still serving some traffic, albeit very small amounts. The first thing we did was deploy an Ubuntu pod into our cluster (kubectl run …) and test whether we could reach applications on other nodes by using their IPs directly, instead of going through nginx or a Service.
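For reference, that debugging session looked roughly like this (it requires a running cluster; the pod IP and port are illustrative):

```shell
# List pods with their IPs and the nodes they run on
kubectl get pods -o wide

# Start a throwaway Ubuntu pod with an interactive shell
kubectl run debug -it --image=ubuntu --restart=Never -- bash

# From inside the pod: hit another pod's IP directly,
# bypassing both nginx and the Service
apt-get update && apt-get install -y curl
curl -v http://100.96.1.5:3000/   # illustrative pod IP and port
```

Testing pod-to-pod with raw IPs rules out the Service and ingress layers, which is exactly what let us narrow the problem down.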
To our surprise, we could. None of the pods gave us back an error (some did occasionally, but network errors happen and the amount was tolerable).
At this point, it’s worth noting that we run our marketplace in a different namespace than our nginx. The first test we ran was in the marketplace namespace. Seeing that our internal traffic was fine, but the traffic from nginx to our applications gave us a lot of errors, we re-did the test in the nginx namespace. Lo and behold, the exact same behaviour!
This immediately led us to the conclusion that something was going on with the networking between namespaces. As mentioned in our migration blog post, we set up some rules specifically for this with Network Policies. Because of that, we immediately suspected something was wrong in this area. We turned our Network Policies off and all errors disappeared.
This isn’t a solid solution, so we had to dig deeper. What we found out is that with kops 1.8, our CNI image (kube-router, in our case) was tagged “latest”. Because of the node restart, this specific node pulled down a newer version and thus had a different rule set than the other nodes. This is why talking to applications on this node did not work.
Each of the issues above on its own wouldn’t have caused an outage, but combined they put our cluster in a bad state.
Our Anti-Affinity rules state that we prefer not to deploy an application on a host or in a zone where it is already deployed, so the scheduler tries to place it on the 2 other nodes. In our case, this puts us down to two available nodes. When we deploy, applications will therefore always swap hosts initially; they might get rescheduled at a later point.
However, Kubernetes also looks at resource consumption. Because of the memory leak, one of our nodes was at maximum capacity and couldn’t take on any more. The other node, meanwhile, appeared to be doing nothing: all traffic to it was rejected due to the Network Policy issue. So both replicas got scheduled on that same node.
With the application now unavailable, we got alerted. When redeploying it, we again had 3 nodes available instead of 2. At least one of these did not have memory issues, and thus at least one replica got scheduled there. This is why redeploying worked in our case, even though we did not know it at the time.
As mentioned in the second outage, at one point we lost all our credentials in the cluster. This happened because Manifold was unavailable.
Manifold being unavailable should, however, not cause secrets to be deleted; we had put a failsafe in place for exactly this scenario. It did not work in this case.
We’ve pinpointed the issue, where we didn’t check for the correct status codes, and are currently in the process of patching this.
Steps we’ve taken and are taking
After all of this, we learned a lot. We’ve come up with a list of steps we’re going to take (or have taken) to prevent this from happening in the future.
First and foremost, we fixed the memory leak. Once we deployed this change, we noticed a steady memory consumption instead of a rising one.
Linked with this, we’ve also started implementing Resource Limits. These ensure that 1) pods get restarted when they consume too much memory, and 2) pods don’t get scheduled on nodes that don’t have enough capacity for their limits. This means we won’t run into Docker crashing from an out-of-memory node again.
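A minimal sketch of what this looks like on a container spec (the numbers are illustrative, not our production values):

```yaml
resources:
  requests:
    memory: "256Mi"   # used by the scheduler when placing the pod
    cpu: "250m"
  limits:
    memory: "512Mi"   # the container is OOM-killed and restarted past this
```

The request feeds the scheduling decision; the limit caps the damage a leaking process can do to the node it lands on.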
We’re going over all the manifests that are generated by other tools to ensure none of them run an image without a specific tag. This will ensure we’re not running into surprises anymore when a node failure happens.
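Concretely, that means every container spec pins an explicit version (the exact image tag here is illustrative):

```yaml
containers:
- name: kube-router
  image: cloudnativelabs/kube-router:v0.1.0   # pinned; never :latest
```

With a pinned tag, a node replacement pulls the exact same version the rest of the cluster is running.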
We currently use the preferredDuringSchedulingIgnoredDuringExecution Anti-Affinity setup. This is great, but it’s also what allowed our applications to eventually run on the same node without us knowing.
To mitigate this in the future, we’re looking at moving over to the requiredDuringSchedulingIgnoredDuringExecution strategy. This in turn will make sure a pod is never scheduled twice on the same node. By setting up the correct alerts in DataDog, we’ll also know when a pod isn’t available when it should be.
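The required variant is the same rule expressed as a hard constraint (labels again illustrative); the scheduler will leave a replica Pending rather than double it up on a node, which is exactly why the alerting matters:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: identity
      topologyKey: kubernetes.io/hostname
```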
Although we haven’t figured out the root cause of what made our Controller delete all the values in our secrets, we do believe we need a better backup strategy. It took us about an hour to recover from this, which is intolerable. We also want to assure our customers that when they use this controller, they won’t be affected even when the Manifold platform is experiencing issues.
Because of this, we’re looking into several possibilities for backup strategies which work well with our controller and which we can implement as a failover.