What is Reliability Engineering?

June 21, 2019

Site reliability engineers engineer site reliability, obviously. To understand what it means to engineer reliability, we should first understand what it means to be reliable. Google defines reliability as “the quality of being trustworthy or of performing consistently well” and also as “the degree to which the result of a measurement, calculation, or specification can be depended on to be accurate.”

To put it another way, something that is reliable performs in a way that is unsurprising and meets expectations. When we call a particular computer service reliable, we mean that it has these characteristics:

Does the same thing each time (consistent behavior)
Takes the same amount of time (low response variance, consistent latency)
Does it autonomously (does not require intervention, trustworthy to complete)
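
As a rough illustration of the second characteristic, the sketch below compares two sets of response times that share the same average but differ widely in variance. The function name `latency_summary` and the sample numbers are hypothetical, chosen only to make the point; this is one simple way to quantify "consistent latency", not a standard.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize response times (in milliseconds) for one service."""
    mean = statistics.fmean(samples_ms)
    stdev = statistics.pstdev(samples_ms)  # population standard deviation
    return {
        "mean_ms": mean,
        "stdev_ms": stdev,
        # Coefficient of variation: spread relative to the mean.
        # Lower means more consistent latency.
        "cv": stdev / mean,
        "max_ms": max(samples_ms),
    }


# Hypothetical samples: both services average 100 ms per request.
steady = [100, 102, 98, 101, 99]
spiky = [20, 25, 22, 400, 33]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, latency_summary(samples))
```

Both services have the same mean latency, but only the steady one is unsurprising. By the definition above, the spiky service is the less reliable of the two even though it is usually faster.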

However, since site reliability engineers are not typically responsible for writing or maintaining services, how do they engineer reliability into them? Conway’s observation tells us that:

the software interface structure of a system will reflect the social boundaries of the organization(s) that produced it, across which communication is more difficult

The sources of unreliability typically manifest in these software interface structures. So if we want to engineer reliability, we should start with teams of people, not software. To produce reliable software, we want to ensure each team is:

Able to respond consistently to outside demands
Consistent in its response times (low variance)
Autonomous and able to stay focused on its mission

Teams operate most efficiently and reliably when they are autonomous within their area of competency. Therefore, it is the job of SREs to ensure that teams are able to operate autonomously. This means an SRE:

Ensures their team knows what it is responsible for, and that other teams know what they are responsible for as well.
Ensures their team is autonomous.
Ensures their team has the ability to make decisions within the scope of its duties.
Ensures their team is held accountable for the decisions it makes that affect its services.
Creates cross-cutting standards and tools for recurring problems so teams can keep a mission-focused mindset rather than reinventing the wheel. (PFE vs NIH, that is, "proudly found elsewhere" versus "not invented here")
Holds their team accountable for having proper monitoring, alerting, and incident response procedures.
Ensures that individuals who are on call understand the need for:
1. Integrity
2. Formality
3. Procedural Compliance
4. Level of Knowledge
5. Questioning Attitude
6. Forceful Backup

By ensuring the above for the team they are embedded in, SREs help that team produce more reliable services. As such, you will find SREs doing a variety of tasks for their team to meet these goals. The specific tasks depend on what the SRE sees as the most urgent area, but you can find them:

Demonstrating best practices for logging, monitoring, and alerting.
Teaching best practices for deploying and managing software services.
Ensuring that data about services is being surfaced to the correct people (e.g. downstream consumers).
Making sure the whole company is able to tell who is responsible for which services.
Teaching appropriate formality when dealing with incident response.
Serving as a resource for advice on best practices, architecture decisions, and code reviews.
Holding their team accountable for proper monitoring and alerting.
Creating cross-cutting standards and tools for recurring problems so individual teams can keep a mission-focused mindset rather than reinventing the wheel. (PFE vs NIH)

Things SRE is NOT responsible for:

Instrumenting services with logging, monitoring, and alerting – although they can help where necessary.
Being on call for services they did not write – unless they are helping train with the process.
Telling people what to do – software engineers own their code, and the responsibility for their bugs.
