A while ago, Kevin Matheny wrote a blog post about how to increase productive time with the Sheriff pattern. I tweeted a quick reply describing how we handle interrupts in the SRE squad at glomex.
— JohannesBrandstetter (@jobrandstetter) April 11, 2017
I don’t want to dive too deep into the theoretical background on why having a designated person to handle interrupts for a team is a very good thing to do. Kevin himself as well as Dave O’Connor in the Google SRE book have done a very good job at doing this.
To give a little background: I joined the team that would eventually become glomex at the beginning of last year and was quite surprised that the Operations team members suffered from constant interruptions by every means possible. People were coming into the team room, phones were ringing, emails were sent and tickets were assigned wildly to random people. When I became manager of that team I did an initial retrospective to collect everyone’s thoughts. As it turned out, one of the team’s main concerns was that they were dealing with too many distractions and found it hard to focus on one task at a time.
To remedy that I made use of a technique that I learned when I was working at 1und1. Back in the day we used to have a “Sys-Admin of the Day (SoD)” that took care of all incoming tickets and deployments for software development teams. They actually had lots of techniques and methods in place that other companies are only now slowly embracing with what’s called DevOps.
So the first thing we did was to find a new name for the role. The first ideas were still a bit formal like “Operations Engineer on Duty”, “System Engineer of The Day” but we finally introduced a few funnier ones and in the end it became Captain Crunch after John Draper.
After that, the next step was to make people aware that there is now always exactly one designated person they can talk to and that they should avoid interrupting anyone else with standard requests or questions. We of course sent an initial email explaining the concept, but we still had people coming into the room not knowing who to talk to. So we bought a pirate hat and put pictures of our team wearing the hat next to our door:
That worked quite well but still was a little bit too analog. So we built a little display that shows the name of the current captain:
It’s based on a Wemos ESP-8266 which pulls a JSON file from S3 and displays the content. That file gets written by an AWS Lambda function that polls our OpsGenie schedule to get the current captain’s name.
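The publishing step can be sketched roughly as follows. This is only an illustration, not our actual Lambda code: the function names, the `captain.json` key and the JSON field names are made up for the example, and the OpsGenie lookup and the S3 write are passed in as plain callables so that no specific API call is assumed.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def build_captain_payload(name: str, now: datetime) -> str:
    """Serialize the current captain into a small JSON document.

    The field names ("captain", "updated") are assumptions for this sketch.
    """
    return json.dumps({"captain": name, "updated": now.isoformat()})


def publish_current_captain(
    fetch_on_call: Callable[[], str],
    put_object: Callable[[str, str], None],
    key: str = "captain.json",  # hypothetical object key
) -> str:
    """Look up who is on call and write the display file.

    fetch_on_call stands in for the OpsGenie schedule lookup and
    put_object(key, body) stands in for the S3 upload.
    """
    name = fetch_on_call().strip()
    if not name:
        # Fail loudly instead of overwriting the display with an empty name.
        raise ValueError("schedule returned no on-call name")
    body = build_captain_payload(name, datetime.now(timezone.utc))
    put_object(key, body)
    return body
```

Keeping the two external calls behind callables means the logic can be tried out locally with a dictionary instead of a bucket, for example `publish_current_captain(lambda: "Anna", {}.__setitem__)`.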
So all of our on-call schedule handling is done via OpsGenie. After initially having two four-hour shifts each day, we decided it’s better to switch to one eight-hour shift, as this makes sure that the current Captain can really focus on incoming tasks and fix them for good. The time spent on Captain Crunch duties is written off from project work, so we use that time to improve processes, documentation or general flaws in our system.
At glomex we use Slack as our main means of communication, so we have a designated channel where people can join and ask questions or ask for general help. As it was unclear to them who would actually answer, we created a Slack group “@captaincrunch” which acts as a proxy for the person currently on Captain Crunch duty. An AWS Lambda function sets the member of the group according to the OpsGenie schedule and announces the change every morning so that everyone is aware who today’s Captain Crunch is. We also recently introduced a first version of a Slackbot. Currently it only sends out-of-office messages: it replies to direct messages and mentions of the @captaincrunch handle after 5pm until 9am as well as on weekends. This way we make sure that people know when they can expect an answer.
Out of office reply:
In the future the idea is to have a “real” bot that points them at documentation or provides other help for standard tasks.
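The out-of-office window mentioned above (after 5pm until 9am, plus weekends) boils down to a small time check. Here is a minimal sketch of it, assuming the bot works with local office time, that 5pm sharp already counts as out of office, and with a reply text invented for the example:

```python
from datetime import datetime, time
from typing import Optional

# Office hours as described in the post: 9am to 5pm, Monday to Friday.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)


def is_out_of_office(now: datetime) -> bool:
    """Return True outside office hours and on weekends.

    `now` is expected to be local office time (an assumption of this sketch).
    """
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (OFFICE_START <= now.time() < OFFICE_END)


def auto_reply(now: datetime) -> Optional[str]:
    """Return an out-of-office text, or None during office hours.

    The wording is hypothetical, not the bot's real message.
    """
    if not is_out_of_office(now):
        return None
    return "Captain Crunch is off duty right now and will get back to you after 9am on the next working day."
```

Public holidays are not covered here; they would need an extra lookup.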
For the people in the Munich office we also have an Echo Dot in our room that people can ask who the current captain is.
After one Captain’s shift has ended we also send a report to the other team members with notes of everything that has been done. That includes tickets, questions in Slack and other noteworthy events. Currently this is just a manual email template but we are working on a simple webapp that pulls in information automatically from Jira and Slack because we hate repetitive work 😉
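To give an idea of what such a report could look like once it is generated instead of typed, here is a small sketch. The structure and all names are assumptions made for the example; it only formats data it is handed and leaves out the part that would pull anything from Jira or Slack.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShiftReport:
    """Notes collected during one Captain Crunch shift (hypothetical structure)."""
    captain: str
    date: str
    tickets: List[str] = field(default_factory=list)
    slack_questions: List[str] = field(default_factory=list)
    other_events: List[str] = field(default_factory=list)


def _section(title: str, items: List[str]) -> List[str]:
    """Render one titled bullet list, or a placeholder if it is empty."""
    lines = [f"{title}:"]
    if items:
        lines.extend(f"  - {item}" for item in items)
    else:
        lines.append("  (none)")
    return lines


def render_report(report: ShiftReport) -> str:
    """Build the plain-text body of the end-of-shift email."""
    lines = [f"Captain Crunch report for {report.date} ({report.captain})", ""]
    lines += _section("Tickets", report.tickets)
    lines += _section("Questions in Slack", report.slack_questions)
    lines += _section("Other noteworthy events", report.other_events)
    return "\n".join(lines)
```

A call such as `render_report(ShiftReport("Anna", "2017-04-11", tickets=["disk full on build server"]))` returns the text for one shift, with empty sections marked as "(none)".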
The adoption of Captain Crunch at glomex has been great. People like the informal approach and the promise that they will always find someone that listens to them. Also the team likes it because they also know that on non-duty days they can focus on their project work and have more of a feeling of “getting things done”.
I’m curious to hear how other teams are doing this. Please tweet me at @jobrandstetter.
And if you want to be a pirate, too – we’re hiring