It’s common to treat DevOps as a job title meaning an infrastructure operator who writes code to automate things. That’s part of the story, but real DevOps is a culture that blends development and operations responsibilities for everyone involved. This means that in addition to operators adopting development practices, developers must also take on, or at least be considerate of, operations practices.
At Manifold, and in the development community at large, many of us haven’t had to deal with operations tasks before. We haven’t worried about where our code is deployed, or how much RAM it needs. We expect someone with a different job title to care for our code once it’s shipped. When something explodes at 2 am, we read about it over breakfast. Yet as the application authors, we have a lot of context to offer at every one of these stages. If deploys and smoke tests are automated, we can deploy ourselves, saving time and reducing the number of people involved. And if we’re the ones woken up at 2 am, we can identify and fix problems faster, and put measures in place to prevent those 2 am wake-ups in the first place.
This is where GameDays come in. GameDays are a controlled (and hopefully fun!) way to introduce a problem into your system, see what happens, roll it back, and learn from the experience. If you’ve heard of Netflix’s Chaos Monkey, you can think of GameDays as something in the same family, but easier to sell to your managers. GameDays were first introduced at Amazon by Jesse Robbins.
We’ve identified a few objectives for our use of GameDays:
- Get used to incident response and triage
- Practice post incident reviews
- Find blind spots in our monitoring / alerting
- Increase our application’s tolerance to partial failure
- Get a soft introduction to chaos engineering
How to play
We’ve found it helpful to have one person be the GameMaster. Their role for the first few GameDays is purely administrative. The GameMaster ensures that GameDays run smoothly, on time, and without surprise disruptions to the rest of the team. They ensure the GameDay participants come away with action items for improving monitoring and the system as a whole. The GameMaster participates in all GameDays, to help establish the flow. After a handful of GameDays, other people may take on the role.
The Monday before a GameDay, the GameMaster solicits volunteers (around 5 people total) to participate. Everyone then brainstorms on and jointly selects one target for the GameDay. The target doesn’t have to be fine grained. Examples include:
- The database
- The Kubernetes control plane
- The login service
- The cloud provider’s object store (e.g., S3)
With a target in mind, participants should spend a bit of time refreshing their memory on the target, and coming up with some ideas for testing it. Don’t worry too much about this though; all players will do it together on the GameDay!
We try to keep most GameDays to 2 hours. This gives us enough time to test a single aspect of our system, and still allows for regular tasks to get done during the day.
Our GameDays are run against our staging environment to minimize possible impacts on users.
GameDays are a bit like a long meeting, so it helps to have some structure. We follow this agenda (described in detail below):
- 15 mins 🕒 Architectural overview of target
- 10 mins 🕑 Pick an aspect to test
- 10 mins 🕑 Document assumptions
- 10 mins 🕑 Communicate with the rest of the team. Grab a drink
- 15 mins 🕒 Execution and rollback plan
- 40 mins 🕗 Execute and observe
- 20 mins 🕓 Cleanup
Manifold is a distributed team, so we take extra care to record any whiteboard drawings or notes, and stay on a video call for the duration of the GameDay. Writing down as much as possible during the GameDay is a good idea, as recorded documents will be useful for subsequent steps, like post incident reviews, issue creation, or general technical documentation!
You might want to elect a note keeper for each part of the GameDay.
Architectural overview of target
At the start of the GameDay, spend some time describing the target. What does it do? What does it depend on? What depends on it? This will help level up everyone’s knowledge on the target, and help with coming up with an aspect to test.
For example, if your target is S3, you might say:
S3 is Amazon’s Simple Storage Service. We use it in our product to host user avatars and screenshot assets. It is used as a backup location for various logs.
With this in mind, you might assume that problems with S3 might affect your user interface.
Pick an aspect to test
Now that everyone has a fresh idea of the target in mind, you can work together and pick a single aspect of it to test. Some GameDays will try to fit in multiple aspects; we keep it quick and simple and just do one. You want to select an aspect of the system that might break. It could be as large as “The CDN is down and timing out on connection requests” or as small as “The product service is returning a 500 for all requests for the list of products”. The important thing is to keep it simple, so it is easy to trigger, and the cause and effect are clear.
When testing, what do you think will happen? What should happen?
- What other parts of the system will break? In what ways?
- Will anything degrade gracefully? Can it?
- Is monitoring capturing the relevant metrics?
- Will an alert trigger? Should one?
- Are we capturing enough data, such as exception stack traces, to debug the issue and determine affected customers?
- What will the impact on the customer be?
Communicate with the rest of the team
Now that you know what you’re testing, and have an idea of how things may break, let the rest of the team know. Post a brief description in #development. Grab yourself a drink; give everyone time to read the message.
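If your team lives in Slack, the announcement can even be scripted. A minimal sketch, assuming a Slack incoming webhook (the webhook URL and channel name are placeholders, and the message format is just one idea):

```python
import json
import urllib.request


def gameday_announcement(target, aspect, start, end):
    """Build the heads-up message for the team channel."""
    text = (
        f":video_game: GameDay starting at {start} (until {end}).\n"
        f"Target: {target}. We'll be testing: {aspect}.\n"
        "Expect noise in staging; ping us before deploying there."
    )
    return {"channel": "#development", "text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

A scripted announcement has a nice side effect: the GameDay’s target and schedule end up written down in a consistent format, which helps when you collect notes later.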
Double-check that the pager is assigned to someone in the GameDay. Sending pages to someone else wouldn’t be nice :)
Execution and rollback plan
Come up with a way to create the behaviour you wish to observe.
Prefer real over simulated behaviour. For example, if testing what happens when your user service returns 500 errors for requests for user profiles, actually return 500 errors from the user service (or a layer immediately in front of it), rather than coding a 500 response into a client library. Using real behaviour will help you find dependencies you weren’t aware of.
Prefer configuration over code. It’s better to find or create runtime controls to inject bad behaviour than to modify and deploy a bad version of a service. These controls will let you easily retest or automate in the future, and reduce the chances of accidentally deploying bad code to production. For example, if testing what happens when response latency from one of your services is high, prefer writing code that will allow you to adjust latency in the service via an environment variable (or use a service mesh that can inject this fault) over temporarily hard coding the slow responses.
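As a concrete sketch of the environment-variable approach (assuming a Python service; the variable name `INJECT_LATENCY_MS` and the handler are hypothetical), the knob might look like this:

```python
import os
import time


def injected_latency_seconds():
    """Read the fault-injection knob from the environment.

    INJECT_LATENCY_MS is a hypothetical runtime control: unset, zero, or
    malformed means no injected latency, so normal deploys are unaffected.
    """
    try:
        return int(os.environ.get("INJECT_LATENCY_MS", "0")) / 1000.0
    except ValueError:
        return 0.0


def handle_request(payload):
    """A stand-in request handler that honours the latency knob."""
    delay = injected_latency_seconds()
    if delay > 0:
        time.sleep(delay)
    return {"ok": True, "payload": payload}
```

During the GameDay you would set `INJECT_LATENCY_MS=2000` on the service and watch what happens downstream; rolling back is just unsetting the variable.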
Determine how you’ll end the test. This is probably simple; if you modified an environment variable, change it back.
Determine how to abort your test if things go wrong. Decide what your comfort level is, and come up with a plan to abort if the test goes past that. For example, imagine that you are testing what happens when the database runs out of connections, and you’ve written a program that consumes all free connections to do this. During your test, all of staging stops working, and you don’t like this! Your abort plan might be to shut down the program you’ve written, connect to PostgreSQL and forcibly terminate its connections (using a session you kept open from before the test began; clever you!), and finally, to restart all services that connect to the database.
Execute and observe
Run the test you’ve planned. See what happens. Compare this to your assumptions. Document everything.
If you get uncomfortable, abort the test.
Pretend it is real. This is your chance to practice incident response in a controlled manner, in the company of friends. If an alarm goes off, see if you can trace it back to the source. Likewise if a teammate reports a problem. If you didn’t cause the problem, how would you fix it?
After running the test for 30–40 minutes, perform your rollback. Stick around for a bit and make sure you’ve documented everything. Do some exploration to make sure there are no unintended side effects to your test.
Stay vigilant for the rest of the day, just in case some side effects are still lurking.
The post incident review
On the Friday following each GameDay, we hold a post incident review with all participants of the GameDay.
This meeting should be less than an hour, and should answer the following:
- What metrics are missing to capture the bad behaviour?
- What logs / error reporting are missing to help debug or find affected customers?
- What code level changes can we make to minimize the effects? For example, should we add retries to requests?
- What code level changes can we make to degrade gracefully? For example, if your frontend can’t retrieve a user’s twitter followers, does the rest of the page still display?
- Were any failures critical? If so, were alerts triggered? Did these alerts contain enough information?
- Were any alerts triggered for non-critical failures? If so, can they be disabled?
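For the retry question above, the code-level change is often small. A minimal sketch of jittered exponential backoff (the helper name and parameters are our own; which exceptions count as transient depends on your client library):

```python
import random
import time


def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying failures with jittered exponential backoff.

    Raises the last error if every attempt fails. In real code you
    would catch only the exceptions you know to be transient.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Sleep base_delay * 2^attempt, scaled by random jitter,
            # so retrying clients don't stampede the recovering service.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Whether retries are appropriate at all is exactly the kind of judgment the review should capture in the resulting issue.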
Create issues for everything that needs to be done. Ideally during the meeting, so they’re not lost! Hand them off to the appropriate squad for prioritization.
If you already hold post incident reviews / post-mortems, follow the normal process. Holding one for a GameDay lets you get practice!
Beyond the first GameDay
Our plan at Manifold is to run GameDays every 2 weeks against staging, for the rest of the year. This should give every developer a chance to participate in a GameDay, and plenty of practice performing incident management. In the new year, we’ll begin running GameDays against our production environment, then switch over to chaos days, having an automated system cause breakages for us during scheduled times.