Ever get a page from your system alerting tools on the ferry home from work? Wouldn’t it be nice to simply text back “restart the pod” and keep on boating home? In this post, we’ll explain why you may want to plug in something like Till in order to go one step further and let your ops team take action from a simple text message.
Why ChatOps at all?
Because integrations are so easy to add to our chat platforms, the DevOps community has seized the opportunity to interact with and automate ops activities from chat. Placing the actions, the conversation, and the people who need to respond to incidents all in one room, or channel, makes the feedback loop tight, succinct, and transparent.
By posting your alerts into chat rooms, you now have a single pane of focus for your team to manage. The actions themselves, however, may be manual or out of band (say, in another tool), making them less transparent at best and mysterious and undocumented at worst. To solve this, ChatOps practice introduces the idea of adding corrective actions (usually via bots) into chat. Instead of popping out to go do a thing, you just do it via the bot, inside chat. You have now closed the loop: alarm, discussion, and action, all in one place. Not only has your team been notified in the place they already discuss operations, but they can also perform corrective measures from within the same space.
Reaching out from inside the ether
Of course, this model is not perfect. What if the alert is missed? What if there is too much noise in our channels (giphy, anyone?) All of these problems have best practices and processes that can help. But what if the alert fires out of hours, when no one is watching the single pane you have crafted to perfection? That pane works if you are paying attention, but what if you are on a boat? Any alerting provider worth its salt can reach you via SMS or phone, but triggering actions in response to the alert, now this is where the fun starts!
The ability to interact with your ChatOps bots, outside of chat, is a key factor in making ChatOps successful. If you can’t always use that bot to do the corrective action, then it won’t always be used, negating the “get it all in one place” philosophy. To do this, you need to either log into the chat remotely (doable, but mobile isn’t always fun) or enable some interaction via SMS or voice.
E.T. Phone Home
Here at Manifold, we started thinking about automating most of our early triage actions for outages, first via scripts we all had access to, then via bots in Slack. While we aren’t there yet, we are already thinking beyond just chat. This is where we started thinking about responding to outages via SMS.
The genesis of the idea is this: if we can make the initial triage steps as simple as asking a bot in chat to run a thing, we could skip logging into chat altogether and just SMS back “yup, run the triage steps”. So we started hammering out a small proof of concept; you can find a version of it on GitHub. In that repository you will find the minimal things we wanted to test: primarily, how easy is it to interact with a chatbot via SMS, and can it ask questions? Since we are Manifold, we went and got a Till Mobile API key via the Manifold CLI. Take a look here to see how to get your own.
from pytill import pytill
Then we made this small function to prove to ourselves it was going to be easy to send out messages:
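As a rough illustration of the idea (not the pytill-based function from the repository), here is a minimal sketch that sends a message using only the Python standard library. It assumes Till accepts a JSON POST with a phone list and a text field, and that the endpoint URL lives in a TILL_URL environment variable; both are assumptions to check against Till's documentation, and the function names are our own.

```python
import json
import os
import urllib.request


def build_message(phone_numbers, text):
    """Build the JSON body for a plain outbound SMS (assumed Till format)."""
    return {"phone": list(phone_numbers), "text": text}


def send_to_till(payload, url=None, opener=urllib.request.urlopen):
    """POST a payload to the Till endpoint and return the HTTP status code.

    `url` falls back to the TILL_URL environment variable (an assumption).
    `opener` is injectable so the function can be exercised without a network.
    """
    url = url or os.environ["TILL_URL"]
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes it easy to check what would be sent before pointing the bot at a real phone number.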
Similarly, asking a question, while waiting for a response:
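One way to sketch the "ask, then wait" pattern is to pair a question payload with a small registry that blocks until an answer arrives. The payload shape (questions, tag, webhook, conclusion) is an assumption about Till's format, and PendingAnswers is a hypothetical helper of our own, not part of pytill.

```python
import threading


def build_question(phone_numbers, text, tag, webhook_url, conclusion="Thanks!"):
    """Build the JSON body for a question (assumed Till format)."""
    return {
        "phone": list(phone_numbers),
        "questions": [{"text": text, "tag": tag, "webhook": webhook_url}],
        "conclusion": conclusion,
    }


class PendingAnswers:
    """Track questions that are waiting for an SMS reply, keyed by tag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}
        self._answers = {}

    def expect(self, tag):
        """Register a question before it is sent."""
        with self._lock:
            self._events[tag] = threading.Event()

    def resolve(self, tag, answer):
        """Record an answer (called by the webhook); False if tag is unknown."""
        with self._lock:
            event = self._events.get(tag)
            if event is None:
                return False
            self._answers[tag] = answer
        event.set()
        return True

    def wait(self, tag, timeout=None):
        """Block until the answer arrives; return None on timeout."""
        with self._lock:
            event = self._events.get(tag)
        if event is None:
            raise KeyError(tag)
        if not event.wait(timeout):
            return None
        with self._lock:
            self._events.pop(tag, None)
            return self._answers.pop(tag)
```

The caller registers the tag with expect, sends the question, and then calls wait; whatever receives the reply calls resolve with the same tag.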
Combine these two functions, add a bit of Flask to our Python code so that Till has a webhook to talk back to, and we have a bot that can send SMS messages out and wait for the response. Which can lead to some fun…
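To give a feel for the webhook side, here is a dependency-free sketch. Our proof of concept used Flask; this version swaps in a plain WSGI callable from the standard library so it runs anywhere. The "answer" field and the reply-to-action mapping are placeholders of our own, not Till's documented callback schema.

```python
import json

# Hypothetical mapping from a normalized SMS reply to a triage action name.
ACTIONS = {"yes": "run_triage", "y": "run_triage", "no": "skip", "n": "skip"}


def decide_action(reply_text):
    """Map a free-form SMS reply to an action name, or 'unknown'."""
    return ACTIONS.get(reply_text.strip().lower(), "unknown")


def webhook_app(environ, start_response):
    """Minimal WSGI app that accepts a JSON callback and picks an action."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        # "answer" is a placeholder field name, not a documented schema.
        action = decide_action(str(payload.get("answer", "")))
        status, body = "200 OK", {"action": action}
    except (ValueError, AttributeError):
        # Malformed JSON, a bad length header, or a non-object payload.
        status, body = "400 Bad Request", {"error": "invalid JSON"}
    data = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

To try it locally, wsgiref.simple_server.make_server("", 8000, webhook_app).serve_forever() serves it on port 8000. Anything other than a recognized reply maps to "unknown", so the bot can ask again rather than guess.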
If you are interested in digging deeper, please check out the README, which walks through how we built the bot in more detail. Suffice it to say, we were impressed with how quickly we could prototype our SMS chatbot, and with one simple manifold run -p till-demo python command, we had it all up and running.
Of course, this MVP bot does not run scripts on command, nor is it triggered automatically by alerts, but that work is actively happening now, and we are excited to get it all plugged in.
Liberating the operator
ChatOps, the practice of doing operational activities within chat clients, makes a lot of sense and has hence seen widespread adoption. As our toolsets become more and more extensible, we can bring our chat, our alerting, and our reactions together in one place. In doing so, however, we must remember that we may not always have a chat client at hand, so exploring other ways to automate and simplify actions is key.
Being able to SMS back to a bot, “Yes, please redeploy the Kubernetes deployment that is giving us trouble,” while sailing a boat across the seven seas takes ChatOps beyond the chat room while keeping the philosophy of “all in one place” alive and well.
While we strive to automate all of our reactions to alerts, we remain bullish on the need to sometimes have a human come in and say “yes, please run that script”. Doing so from a text message, well, that just seems liberating to us.