An init post for SRE

This year, I have a goal: I want to understand the different aspects of building, maintaining, and running large-scale systems. There are a bunch of things that I should be able to reason about:

The trade-offs of the system design choices that were made, and ways of evaluating them periodically with tools and established practices such as benchmarking and load testing.
Defining the process by which updates and new features in a product make it to end users or customers.
Understanding how to handle arbitrary workloads and failure conditions such as excess traffic and downtime.

Fortunately, there is a discipline called Site Reliability Engineering (SRE) that deals with the things I mentioned above. I’ve decided to pursue my career in this direction. An SRE typically strives to meet a system’s most basic needs first, and then gradually climbs, one rung at a time, a Maslow-style hierarchy of needs for systems.

Figure: Dickerson's Hierarchy of Reliability

So, what is SRE?

SRE is famously an engineering role defined at Google, and a more concrete definition can be found in Google’s well-known SRE book. But basically, an SRE is a person who can reason about most of these questions:

Can you make your systems fault tolerant?
Can you diagnose and recover from cascading failures?
Have you thought about capacity planning for your system components?
What strategies and workflows would you employ to manage or recover from disasters?
Every design choice has its trade-offs. Do you understand them well enough to reduce the uncertainty in your system?
Can you write maintainable, production-ready, and flexible code for your systems?

So what are good practices for an SRE engineer?

Well, there are different approaches to becoming a good engineer, and you should follow the path that fits your style. These are some of the things that I’ve been trying to follow:

Brevity of thought and a clear picture of your systems help you reason about the trade-offs of your choices without getting frustrated.
Learn to ask the right questions. Questions help you cut through the clutter of a problem and may lead you to the root cause.
Prepare a checklist, or thoroughly organize your thoughts, while troubleshooting. Keep notes in a diary.
Multiple problems exist in a system at any point in time. What matters is prioritizing one over another.
Read systems blogs, and write some if you can.
Conferences like SREcon or LISA are a good source for learning about different aspects of operations and SRE. Maybe strive to share that kind of knowledge with the world yourself.

Some excellent articles that I would recommend every wannabe SRE read:

The Log: What every software engineer should know about real-time data
How the Linux networking stack receives data
How the Linux networking stack sends data
The secret to the C10M problem
10 things I learned making the fastest web site
Julia Evans writes great short blog posts about her discoveries in systems. They are fun and insightful to read.
Brendan Gregg is the person to follow if you want to learn about the performance of your system components.
High Scalability is a great blog on systems. Make it a must on your reading list.

A λ's Journey To π

A λ's Journey To π Exploring finite axioms for the infinite world.
