Quick Recap: In past posts, we talked about the basics of monitoring and about tools such as dashboards and alerts that help us monitor the health of our applications.
No discussion of monitoring is complete without the golden signals. Monitoring an application is a complex task, and different developers and architects can have different visions of how it should be done. It is challenging to develop a standard set of monitoring practices because every application has different needs. For some applications performance matters most; for others it might be accuracy or efficient use of infrastructure.
Although an exact set of monitoring best practices is hard to pin down, the golden signals are widely accepted as a general guideline for judging the overall health of a system. The four golden signals are:
Latency: In simple words, the time it takes to service a request. This gives a measure of performance at a high level, for example per API. Latency should be measured in percentiles (for example p50, p95 and p99) rather than as an average, because an average can hide a slow tail.
Traffic: Requests per second, both in total and broken down by request type. It is an important indicator of usage.
Errors: Requests that result in errors, i.e. 5XXs or 4XXs. The ratio of errored requests to successful ones (200s) is a good indicator of the application's success/error ratio. For example, a sudden increase in error rate right after a new deployment is a cue for a timely rollback, which can save embarrassment.
Saturation: An indicator of how “full” the service is. A load test can give you a capacity threshold, and from that threshold we can calculate whether the current infrastructure can handle X% more traffic.
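To make the four signals concrete, here is a minimal Python sketch that computes them for one time window of requests. It is an illustration under stated assumptions, not a production implementation: the `Request` record, the `golden_signals` function and the `max_rps` capacity figure (imagined as coming from a load test) are all hypothetical names made up for this example. Saturation is simplified to "current requests per second divided by load-tested capacity", and any 4XX or 5XX status is counted as an error, as described above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    status: int         # HTTP status code of the response


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, max_rps):
    """Summarise one window of requests.

    max_rps is an assumed capacity threshold, e.g. taken from a load test.
    """
    if not requests:
        raise ValueError("no requests in this window")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if 400 <= r.status <= 599)
    rps = len(requests) / window_seconds
    return {
        "latency_p50_ms": percentile(durations, 50),   # Latency (typical)
        "latency_p99_ms": percentile(durations, 99),   # Latency (slow tail)
        "traffic_rps": rps,                            # Traffic
        "error_rate": errors / len(requests),          # Errors
        "saturation": rps / max_rps,                   # Saturation (0.0 to 1.0)
        "headroom_pct": (max_rps / rps - 1) * 100,     # "X% more traffic"
    }


# Four made-up requests observed over a 2-second window,
# against an assumed load-tested capacity of 10 requests per second.
sample = [
    Request(120, 200),
    Request(80, 200),
    Request(100, 500),
    Request(400, 200),
]
signals = golden_signals(sample, window_seconds=2, max_rps=10)
for name, value in signals.items():
    print(f"{name}: {value}")
```

With this sample, the p50 latency is 100 ms while the p99 is 400 ms, which is the slow tail an average would blur. The error rate is 0.25, and the service is at 20% of its assumed capacity. In a real setup these numbers would usually come from your metrics system rather than from hand-rolled code.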
https://www.ibm.com/garage/method/practices/manage/golden-signals/
https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals