Monitoring is a critical part of SRE practice for managing and ensuring the reliability of systems and services. The Four Golden Signals are key metrics used to monitor the health of your service and its underlying systems effectively.
Monitoring
Monitoring is the process of collecting, processing, aggregating, and displaying real-time quantitative data about a system.
This allows engineers to understand system behavior, detect anomalies, and make informed decisions based on metrics.
Here are a few benefits of monitoring:
1. Analyzing Long-Term Trends 📈📉
Monitoring helps track the growth and usage patterns of applications over time. You can observe metrics like database size or daily active user count.
This historical data supports better technical and business decision-making.
2. Alerting 🚨‼️
Monitoring enables the system to notify you when something is broken or about to break.
When the system can't self-heal, alerts help engineers investigate the issue, determine the root cause, and take corrective action immediately.
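As a rough illustration of the alerting idea, the sketch below fires only when a metric stays above a threshold for several consecutive samples, so a single spike does not page anyone. The function name `should_alert`, the threshold of 90, and the three-sample rule are assumptions made for this example, not part of any particular monitoring tool.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if len(samples) < min_consecutive:
        # Not enough data yet to call it a sustained breach.
        return False
    recent = samples[-min_consecutive:]
    return all(value > threshold for value in recent)


# Hypothetical CPU utilization readings (percent), oldest first.
cpu_percent = [62, 71, 93, 95, 97]
print(should_alert(cpu_percent, threshold=90))  # prints True
```

Requiring several consecutive breaches is one simple way to trade a little detection delay for fewer noisy alerts.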
3. Conducting Ad Hoc Retrospective Analysis ⏰📊
Monitoring provides a trail of metrics that can be analyzed after an incident.
For example, if your system experiences a spike in latency, you can correlate this with other metrics collected at that time to debug and find the root cause.
The Four Golden Signals - ⭐️⭐️⭐️⭐️
The four golden signals are the foundational building blocks of an effective monitoring strategy. They cover the most essential aspects of system health and performance. Focusing on these core signals helps minimize noise and reduce maintenance overhead.
1. Latency — Request Service Time ⏱️⏳
The time it takes for a system to respond to a request. High latency directly affects user experience and can hurt the business. It’s critical to monitor the latency of both successful and failed requests.
With monitoring on these metrics, you can be alerted to high latency and act before users complain.
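To make the point about successful versus failed requests concrete, here is a minimal sketch that summarizes latency percentiles separately for each outcome. The helper names (`percentile`, `latency_summary`), the `(latency_ms, succeeded)` tuple layout, and the choice of p50 and p99 are assumptions for illustration; the nearest-rank method is one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(requests):
    """Summarize p50/p99 latency separately for successful and failed requests.

    `requests` is a list of (latency_ms, succeeded) tuples.
    """
    summary = {}
    for label, outcome in (("success", True), ("failure", False)):
        latencies = [ms for ms, succeeded in requests if succeeded == outcome]
        if latencies:
            summary[label] = {
                "p50": percentile(latencies, 50),
                "p99": percentile(latencies, 99),
            }
    return summary


# Hypothetical request log: (latency in ms, whether the request succeeded).
requests = [(100, True), (120, True), (110, True), (900, True), (30, False), (5000, False)]
print(latency_summary(requests))
```

Keeping the two groups apart matters because a batch of fast failures can pull a combined average down and hide a real problem.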
2. Traffic — User Demand 📈📉
Represents the volume of demand on your system, typically measured in requests per second (RPS), transactions per second (TPS), or similar.
Monitoring traffic helps anticipate scaling needs and detect abnormal usage patterns.
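A simple way to picture a traffic metric is to count request timestamps inside a time window and divide by the window length. The sketch below does that; the function name `requests_per_second` and the timestamp format (seconds as floats) are assumptions for this example.

```python
def requests_per_second(timestamps, window_start, window_seconds):
    """Average request rate over the half-open window [window_start, window_start + window_seconds)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_end = window_start + window_seconds
    count = sum(1 for t in timestamps if window_start <= t < window_end)
    return count / window_seconds


# Hypothetical request arrival times, in seconds since some starting point.
arrivals = [0.1, 0.5, 1.2, 1.9, 2.0, 9.9, 10.0]
print(requests_per_second(arrivals, window_start=0, window_seconds=10))  # prints 0.6
```

Comparing this rate across windows (hour over hour, or week over week) is what surfaces growth and unusual spikes.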
3. Errors — Rate of Failed Requests ⚠️‼️
Measures the number of requests that fail either explicitly (e.g., HTTP 500 errors) or implicitly (e.g., timeouts or incorrect responses).
A high error rate can indicate abnormalities or faults in the system.
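The distinction between explicit and implicit failures can be sketched as follows. Here an HTTP status of 500 or above counts as an explicit failure, and a response slower than a timeout counts as an implicit one. The function name `error_rate`, the `(status_code, latency_ms)` layout, and the 1000 ms timeout are assumptions for illustration; detecting incorrect responses would need application-specific checks that are not shown.

```python
def error_rate(responses, timeout_ms=1000):
    """Fraction of requests that failed explicitly (5xx) or implicitly (slower than timeout_ms).

    `responses` is a list of (status_code, latency_ms) tuples.
    """
    if not responses:
        return 0.0
    failed = sum(
        1
        for status, latency_ms in responses
        if status >= 500 or latency_ms > timeout_ms
    )
    return failed / len(responses)


# Hypothetical responses: one slow 200 (implicit failure) and one fast 500 (explicit failure).
responses = [(200, 50), (200, 1500), (500, 20), (200, 80)]
print(error_rate(responses))  # prints 0.5
```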
4. Saturation — System Capacity 💽💾
Indicates how "full" your system is. This can refer to CPU, memory, I/O usage, or any resource that might become a bottleneck.
Saturation metrics help predict and prevent outages caused by overutilization.
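One way saturation metrics can help predict an outage is by extrapolating resource usage forward. The sketch below assumes usage grows linearly between the first and last sample, which is a simplification; the function name `hours_until_full` and the sample format are hypothetical.

```python
def hours_until_full(usage_samples, capacity):
    """Estimate hours until `capacity` is reached, assuming linear growth.

    `usage_samples` is a list of (hour, used) pairs ordered by time.
    Returns None when usage is flat or shrinking.
    """
    if len(usage_samples) < 2:
        raise ValueError("need at least two samples to estimate growth")
    (t0, u0), (t1, u1) = usage_samples[0], usage_samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    growth_per_hour = (u1 - u0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    # Clamp at zero in case usage is already at or above capacity.
    return max(0.0, (capacity - u1) / growth_per_hour)


# Hypothetical disk usage in GB: 400 GB at hour 0, 500 GB at hour 10, on a 1000 GB disk.
print(hours_until_full([(0, 400), (10, 500)], capacity=1000))  # prints 50.0
```

An estimate like this can feed an alert such as "disk will be full within 24 hours", which gives engineers time to act before the resource is exhausted.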
By focusing on these golden signals, you gain critical visibility into your system's health and performance.