SLA, SLI, SLO

Once again, a cool podcast on SED: Scalable Architecture with Lee Atchison. Lee is the author of Architecting for Scale: High Availability for Your Growing Applications. The topic seems to be very new. The book has only 4 reviews (and one of them is by Lee himself 🙂)

These topics seem especially interesting:

1. Different types of “Service Level”:
  • Agreement
  • Indicator
  • Objective

Usually one thinks only about the SLA, but the other two are worth using as well. Quoting Wikipedia (https://en.wikipedia.org/wiki/Service_level_objective):

“A service level objective (SLO) is a key element of a service level agreement (SLA) between a service provider and a customer. SLOs are agreed as a means of measuring the performance of the Service Provider and are outlined as a way of avoiding disputes between the two parties based on misunderstanding.

There is often confusion in the use of SLA and SLO. The SLA is the entire agreement that specifies what service is to be provided, how it is supported, times, locations, costs, performance, and responsibilities of the parties involved. SLOs are specific measurable characteristics of the SLA such as availability, throughput, frequency, response time, or quality.

The SLO may be composed of one or more quality-of-service measurements (service level indicators, SLIs) that are combined to produce the SLO achievement value. As an example, an availability SLO may depend on multiple components, each of which may have a QOS availability measurement. The combination of Quality of Service (QOS) measures into an SLO achievement value will depend on the nature and architecture of the service.”
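To make the last paragraph of the quote concrete, here is a minimal sketch of how two availability SLIs could be combined into one SLO achievement value. Everything in it is an assumption for illustration: the `AvailabilitySLI` class, the component names, the request counts, and above all the serial architecture, where every request must pass through every component and failures are independent. A different architecture (for example, redundant replicas) would need a different formula.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLI:
    """Request counts for one component over a measurement window (hypothetical shape)."""
    name: str
    good_requests: int
    total_requests: int

    @property
    def availability(self) -> float:
        # With no traffic there is nothing to count against the component.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def serial_availability(slis: list[AvailabilitySLI]) -> float:
    """Combine SLIs assuming each request passes through all components in turn."""
    combined = 1.0
    for sli in slis:
        combined *= sli.availability
    return combined


def slo_met(slis: list[AvailabilitySLI], target: float) -> bool:
    """Compare the combined achievement value with the SLO target."""
    return serial_availability(slis) >= target


# Made-up numbers: 99.9% and 99.95% available components.
slis = [
    AvailabilitySLI("frontend", good_requests=9_990, total_requests=10_000),
    AvailabilitySLI("database", good_requests=9_995, total_requests=10_000),
]

print(f"combined availability: {serial_availability(slis):.4%}")
print("99.8% SLO met:", slo_met(slis, 0.998))
print("99.9% SLO met:", slo_met(slis, 0.999))
```

Note that the combined value is lower than either component's own availability, so a 99.9% SLO can be missed even though each part looks fine on its own.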

2. One size doesn’t fit all. Amazon is an example:

  • Amazon Retail is focused on customer experience
  • Amazon Web Services’ main worry is scalability

Amazon would probably have failed if it had tried to be the best at both customer experience and scalability at the same time, in either AWS or Retail.

3. Availability & Reliability

Availability is pretty simple: the service either is or is not available.

Reliability is a bit trickier, though. Let’s say you have an “adding” service. The request is 2 + 3. Your service returns 4.9999.

Is this a reliable answer? 🙂

Short answer: it depends.

Factors to consider:

  • the needs of the requester. Maybe she/he only needs accuracy to within 0.5? Then anything from 4.5 to 5.5 is a “correct” answer
  • promises made to the user. Maybe there is an error bar for the response. If the difference between the returned and the correct value is lower than the maximum error, then we still have a reliable answer.
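Both factors boil down to the same check: is the answer inside the agreed error bar? A minimal sketch, where the function name `is_reliable` and the tolerance values are my own assumptions, not something from the podcast:

```python
def is_reliable(returned: float, expected: float, max_error: float) -> bool:
    """Return True when the answer falls within the promised error bar."""
    return abs(returned - expected) <= max_error


# The "adding" service answered 4.9999 for 2 + 3.
print(is_reliable(4.9999, 5.0, max_error=0.5))      # True: 0.5 accuracy is enough
print(is_reliable(4.9999, 5.0, max_error=0.00001))  # False: the promise was stricter
```

The same answer is reliable for one requester and unreliable for another; only the promised `max_error` changes.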

4. Health checks

It’s a good idea to monitor your application constantly. You can “inject” a signal as an input and continuously check the output for the correct response. Deviations from the norm can then be detected, so you know in near real time: “something is going wrong, I should check it and, if necessary, let my clients know”. Without this health check you’ll probably end up with a client calling you and complaining that “your app is broken and you should do something to fix it!”. This is not great publicity for your business…
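A rough sketch of such a probe, reusing the “adding” service from the previous point. All names here (`add_service`, `health_check`, `monitor`) are hypothetical, and the service is a local function standing in for what would normally be a network call:

```python
import time
from typing import Callable


def add_service(a: float, b: float) -> float:
    """Stand-in for the real service; in practice this would be a remote call."""
    return a + b


def health_check(service: Callable[[float, float], float],
                 max_error: float = 1e-9) -> bool:
    """Inject a known input (2 + 3) and compare the output with the known answer."""
    try:
        result = service(2, 3)
    except Exception:
        # A service that cannot answer at all is unhealthy too.
        return False
    return abs(result - 5) <= max_error


def monitor(service: Callable[[float, float], float],
            checks: int,
            interval_seconds: float,
            alert: Callable[[str], None]) -> int:
    """Run the probe repeatedly, call alert() on every failure, return the failure count."""
    failures = 0
    for _ in range(checks):
        if not health_check(service):
            failures += 1
            alert("health check failed: something is going wrong")
        time.sleep(interval_seconds)
    return failures


if __name__ == "__main__":
    failures = monitor(add_service, checks=3, interval_seconds=0.1, alert=print)
    print("failures:", failures)
```

In a real setup the `alert` callback would page someone or post to a status page instead of printing, so that you hear about the problem before your clients do.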

5. Chaos Monkey

Test your framework under “unpredictable” circumstances. A “monkey” comes along and kills random parts of your infrastructure. This way you’ll be prepared when production crashes.
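The idea can be tried out in miniature. The sketch below is a toy simulation, not the API of the real Chaos Monkey tool: the `Instance` class, the fleet of five `web-*` nodes, and the “at least 3 alive” rule are all made up for illustration.

```python
import random


class Instance:
    """Hypothetical stand-in for a server, container, or process."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def kill(self) -> None:
        self.alive = False


def unleash_monkey(instances: list, kill_count: int, rng: random.Random) -> list:
    """Kill up to kill_count randomly chosen live instances and return the victims."""
    alive = [instance for instance in instances if instance.alive]
    victims = rng.sample(alive, min(kill_count, len(alive)))
    for victim in victims:
        victim.kill()
    return victims


def service_is_up(instances: list, min_alive: int) -> bool:
    """Assumed rule: the service survives while at least min_alive instances run."""
    return sum(instance.alive for instance in instances) >= min_alive


fleet = [Instance(f"web-{n}") for n in range(5)]
victims = unleash_monkey(fleet, kill_count=2, rng=random.Random(42))

print("killed:", [victim.name for victim in victims])
print("service still up:", service_is_up(fleet, min_alive=3))
```

Passing in a seeded `random.Random` keeps a failed experiment reproducible, which helps when you want to replay exactly the outage that broke things.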
