As DevOps teams face increased pressure to deliver positive customer experiences, keeping track of service levels is becoming increasingly important.
Of course, this means different things depending on the service the product provides. But for most site reliability engineers (SREs), the priorities include providing maximum uptime and fast response times.
That said, the business and IT teams will often have competing priorities – one holding the finances to account, the other managing limited tech resources. The glue that holds them together is the set of service level agreements (SLAs), service level objectives (SLOs), and service level indicators (SLIs).
What are SLAs, SLOs, and SLIs?
Simply put, SLAs are what’s promised. SLOs are the specific goals. And SLIs measure success.
Service Level Agreements are where the provider/client journey begins. Drawn up by the respective legal teams, SLAs detail the level of service customers can expect when they use your product and the consequences of failing to meet that level.
However, a consistent issue with SLAs is how difficult they are to measure and report on – often because they’re written up by people who lack sufficient technical insight. Increasingly, IT and DevOps are collaborating with legal and business development functions to develop realistic SLAs. While there may be friction between the departments, this change should be welcomed, as it will help vendors avoid potentially serious financial penalties and provide more realistic expectations for customers.
Service Level Objectives exist within SLAs and outline specific metrics like uptime or response time. While SLAs are the entire agreements, SLOs are most easily defined as the individual points they contain – they set customer expectations and tell the IT and DevOps teams what goals they need to be aware of.
SLOs need to be tight, ensuring that everyone understands exactly what is required; they should also be clear, concise, and kept to an absolute minimum. However, what counts as essential may vary wildly from SLA to SLA, and delays on both sides should be considered when the objectives are written.
Service Level Indicators are the final level – how both parties measure whether SLOs are being achieved. To stay compliant, the SLI will need to meet, or preferably exceed, the levels set within the SLA. SLIs should also be made as simple as possible to avoid confusion on both ends. This means choosing practical metrics to track, both in terms of volume and complexity.
Altogether, this might sound a little confusing, but it’s actually not too complex. For example, an SLA might dictate the level of uptime required, such as 99.995%. The SLO then sets the goal of achieving at least 99.995%, while the SLI measures actual uptime.
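In code, the relationship between the three terms can be sketched like this – a minimal, hypothetical example in which the SLO target and the downtime figures are illustrative, not taken from any real agreement:

```python
# The SLO target: the uptime percentage promised in the SLA (illustrative value).
SLO_TARGET = 99.995

def uptime_sli(total_seconds: float, downtime_seconds: float) -> float:
    """The SLI: measured uptime as a percentage of the period."""
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds

# A 30-day month is 2,592,000 seconds; 99.995% leaves a budget of ~130 seconds
# of downtime. Here we assume 120 seconds of downtime were actually recorded.
measured = uptime_sli(total_seconds=30 * 24 * 3600, downtime_seconds=120)
print(f"SLI: {measured:.4f}%  SLO met: {measured >= SLO_TARGET}")
```

The SLA is the contract, `SLO_TARGET` is the goal it contains, and `uptime_sli` is the measurement that tells both parties whether the goal is being met.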
Why are SLIs and SLOs essential for observability?
Observability is all about understanding how infrastructure is operating. SLIs and SLOs are essential to observability because they provide an objective method for measuring service performance. Tracked over time, they can also reveal trends in service performance – in response times, error rates, or throughput, for example. Understanding these trends helps teams accelerate the speed at which they can identify and solve future issues.
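Trend-spotting on an SLI can be as simple as comparing the recent average against the longer-term baseline. A hypothetical sketch, using an error-rate SLI and made-up daily request counts:

```python
from statistics import mean

def success_sli(total: int, errors: int) -> float:
    """A common SLI: the fraction of requests served successfully in a window."""
    return (total - errors) / total

# One (total_requests, failed_requests) pair per day, oldest first.
# These numbers are purely illustrative.
daily = [(10_000, 4), (12_000, 6), (11_000, 30), (9_000, 45)]
slis = [success_sli(total, errors) for total, errors in daily]

# If the average of the most recent windows is below the overall average,
# the SLI is trending downwards and deserves attention before the SLO is breached.
if mean(slis[-2:]) < mean(slis):
    print("SLI trending down - investigate before the SLO is breached")
```

The same comparison works for latency or throughput SLIs; only the measurement function changes.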
SLIs and SLOs can also help set the actual targets for service performance. DevOps teams can collect and analyse data from the indicators and objectives to understand what is feasible going forward. This includes striking the correct balance between performance and availability – such as understanding how adding servers to boost availability may also have a negative impact on response times.
However, the data required to understand whether products are meeting their SLAs and SLOs has traditionally been found separately from the Integrated Development Environment (IDE). Developers would have to rely on operations teams or wait for customers to report issues before they knew anything was going wrong. But with observability shifting left, developers are now expected to take full ownership of reliability – and to do so they need frictionless access to performance data to help them write optimal code throughout the software lifecycle.
Products like New Relic’s CodeStream can deliver insights into software performance directly to the IDE. With just one click, developers can now see how their performance is stacking up against SLIs and SLOs in real-time. This is already helping developers identify issues before they hit production, accelerating engineering velocity.
With these kinds of innovations, developers can monitor, debug, and improve their applications, regardless of which core language they use to code. This always-on visibility into all metrics lets developers work towards significantly reducing mean time to detection (MTTD) and mean time to resolution (MTTR) – increasing overall uptime – while also shortening development cycles.
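MTTD and MTTR themselves are straightforward averages over incident timestamps. A minimal sketch with invented incident data – the timestamps and the two-incident sample are assumptions for illustration:

```python
from datetime import datetime

# (started, detected, resolved) timestamps per incident - illustrative data.
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 12), datetime(2023, 1, 5, 11, 0)),
    (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 14, 4), datetime(2023, 1, 9, 14, 30)),
]

def mean_minutes(deltas) -> float:
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTD: average time from an issue starting to it being detected.
mttd = mean_minutes([detected - started for started, detected, _ in incidents])
# MTTR: average time from an issue starting to it being resolved.
mttr = mean_minutes([resolved - started for started, _, resolved in incidents])
print(f"MTTD: {mttd:.0f} min  MTTR: {mttr:.0f} min")
```

Driving both numbers down is exactly where always-on visibility into metrics pays off: the earlier an issue is detected, the shorter both intervals become.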
Poor software quality is currently costing businesses trillions of dollars a year. Streamlining the processes that help businesses understand exactly how their software is failing will help optimise performance throughout the lifecycle.
About the Author
Aiden Cuffe is Senior Product Manager at New Relic. New Relic helps engineers and developers do their best work every day — using data, not opinions — at every stage of the software lifecycle. The world’s best engineering teams rely on New Relic to visualize, analyze and troubleshoot their software. New Relic One is the most powerful cloud-based observability platform built to help companies create more perfect software. Learn why customers trust New Relic for improved uptime and performance, greater scale and efficiency, and accelerated time to market at newrelic.com.
Featured image: ©vchalup