A λ's Journey To π Exploring finite axioms for the infinite world.

An init post for SRE

This year, I have a goal. I want to understand the different aspects of building, maintaining, and running large-scale systems. There are a bunch of things that I should be able to reason about:

  • The trade-offs of the system design choices that were made, and ways of re-evaluating them periodically with tools and existing practices like benchmarking and load testing.
  • Defining the process by which updates and new features in a product make it to end users or customers.
  • Understanding how to handle unexpected workloads and failures, such as excess traffic and downtime.

Fortunately, there is a discipline called Site Reliability Engineering (SRE) that deals with the things I mentioned above, and I’ve decided to pursue my career in this direction. An SRE typically strives to fulfill the most basic needs of a system first, and then gradually climbs, one rung at a time, up a Maslow-style hierarchy of needs for systems.

Dickerson's Hierarchy of reliability

So, what is SRE?

SRE is famously an engineering role defined at Google, and a more concrete definition can be drawn from Google’s well-known SRE book. But basically, an SRE is a person who can reason about most of these questions:

  • Can you make your systems fault tolerant?
  • Can you diagnose and recover from cascading failures?
  • Have you thought about capacity planning for your system components?
  • What strategies and workflows would you employ to manage or recover from disasters?
  • Every design choice has its trade-offs. A proper understanding of these trade-offs will help you reduce the uncertain aspects of your system.
  • Can you write maintainable, production-ready, and flexible code for your systems?

So, what are good practices for an SRE?

Well, there are different approaches to becoming a good engineer, and you should follow the path that fits your style. These are some of the things that I’ve been trying to follow:

  • Brevity of thought and a clear picture of your systems can help you reason about the trade-offs of your choices without getting frustrated.
  • Learn to ask the right questions. Questions help cut through the clutter of a problem, and may lead you to the root cause.
  • Prepare a checklist, or thoroughly organize your thoughts, while troubleshooting. Keep noting things down in a diary.
  • Multiple problems exist in a system at any point in time. What matters is prioritizing one over another.
  • Read systems blogs, and write some if you can.
  • Conferences like SREcon and LISA are a good source for learning about different aspects of operations and SRE. Maybe strive to share that kind of knowledge with the world yourself.

Some excellent articles that I would recommend to every wannabe SRE:

  1. Log: What every software engineer should know about real-time data
  2. How the Linux networking stack receives data
  3. How the Linux networking stack sends data
  4. The secret to the C10M problem
  5. 10 Things I learned making the fastest web site
  6. Julia Evans writes great short blog posts about her discoveries in systems. They are fun and insightful to read.
  7. Brendan Gregg is one person you should follow if you want to learn about the performance of your system components.
  8. High Scalability is a great blog on systems. Make it a must on your reading list.