Site Reliability Engineering Is Made of People
So far, 2021 has been pretty slow news-wise, at least in the cloud native computing community. This has given us the chance to go back and review some of the recently posted talks from USENIX SRECon20, a virtual conference held in December.
At first glance, site reliability engineering (SRE) appears to be about removing all manual processes from operations to ensure the robustness of a system. In fact, we’ve learned over the past month how much it depends on good old-fashioned human interactions.
As TNS London correspondent Jennifer Riggins explained in 2017, the priority of the SRE team is to make sure the systems stay strong and stable by spending at least half their time on development. They “think about the whole life cycle of software objects from their inception to their deployment to operation, refinement, and eventual, peaceful decommissioning,” Google researcher Chris Jones told her at the time.
The SRECon talks we heard, at least, stressed looking beyond the numbers to the users themselves, in order to get a true understanding of how a system could serve them. In one presentation, AppDynamics Technology Evangelist Marco Coulter noted that “whenever a measure becomes a target, it ceases to be a good measure.” The adage is commonly attributed to British economist Charles Goodhart, who, writing about managing U.K. monetary policy, explained that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
The lesson Coulter shared was to look beyond the metrics. Coulter recalled working for a hospital where the development team, responding to complaints from staff, set a strict Service Level Objective (SLO) requiring a message queuing system to respond to queries within 10 seconds. The bottleneck, however, was not in the queuing system, so this SLO targeted the wrong thing. Only when the development team met with staff and understood their “instinctive expectation” of when they wanted this data available were they able to craft a Service Level Agreement (SLA) that was satisfactory to both parties.
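To make the hospital anecdote concrete, here is a minimal Python sketch of how an SLO on one component can look healthy while users are still waiting. Only the 10-second threshold comes from the talk as described; the Request type, its field names, and the sample latencies are all invented for illustration and do not describe the system Coulter worked on.

```python
from dataclasses import dataclass

SLO_SECONDS = 10.0  # the 10-second target mentioned in the talk


@dataclass
class Request:
    # Both fields are hypothetical measurements, in seconds.
    queue_seconds: float       # time spent in the message queue
    downstream_seconds: float  # time spent everywhere else

    @property
    def end_to_end_seconds(self) -> float:
        # What the person waiting for the data actually experiences.
        return self.queue_seconds + self.downstream_seconds


def compliance(latencies, threshold=SLO_SECONDS):
    """Return the fraction of latencies at or below the threshold."""
    latencies = list(latencies)
    if not latencies:
        return 1.0  # no traffic means nothing violated the objective
    met = sum(1 for seconds in latencies if seconds <= threshold)
    return met / len(latencies)


# Invented sample data: the queue is fast, the rest of the path is not.
requests = [
    Request(queue_seconds=2.0, downstream_seconds=30.0),
    Request(queue_seconds=3.5, downstream_seconds=45.0),
    Request(queue_seconds=1.2, downstream_seconds=5.0),
    Request(queue_seconds=4.0, downstream_seconds=60.0),
]

queue_compliance = compliance(r.queue_seconds for r in requests)
user_compliance = compliance(r.end_to_end_seconds for r in requests)

print(f"Queue SLO compliance:      {queue_compliance:.0%}")
print(f"End-to-end SLO compliance: {user_compliance:.0%}")
```

With this made-up data the queue meets its objective on every request, while only one request in four finishes within 10 seconds from the user's point of view. Which number matters, and whether 10 seconds is even the right threshold, is the kind of question the team could only settle by talking to staff.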
Other SRE talks had similar themes of keeping humans in the loop. Moshe Zadka, senior site reliability engineer at Twisted Matrix Laboratories, talked about how SREs could use Jupyter Notebooks, a tool created for the scientific community, to document their findings around incidents and share them with others. And Stanford University researcher Deepti Raghavan shared information about POSH, a “data-aware” Linux shell she is helping build that could help SREs more easily process data from the command line, without the effort otherwise needed to write code against an API to access data resources. Again, the theme for this talk was making SRE work easier for people by listening and responding to their needs.