Evan Harmon - Memex

Site Reliability Engineering

Site Reliability Engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations. SRE claims to create highly reliable and scalable software systems. Although they are closely related, SRE is slightly different from DevOps.
wikipedia:: Site reliability engineering
Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.
wikipedia:: Reliability engineering

Also, reliability engineering, in general

Largely focuses on metrics, logging, monitoring, alerting, and recovery, and engineering for resilience like the chaos monkey, etc.

Site Reliability Engineering
Interactive graph
On this page
Site Reliability Engineering
Load balancing, high availability, fault tolerance, & redundancy
Software Measurement
Disaster Recovery