Monitoring Microservices: Part 4

Quick Recap: In previous posts, we covered the basics of monitoring and the Golden Signals. We also looked at tools such as logs and metrics, which help us create dashboards and alerts for monitoring our applications.

In this final post on monitoring microservices, I will cover the basic areas that one should consider monitoring when working with a microservices-based application. You will need to create dashboards and alerts around these areas based on your application requirements and SLAs.

Service Health: The overall health of the service, which is a combination of the health of its individual microservices. Ideally, you will put together a single dashboard that shows the health of all critical areas and microservices, so that you can confirm that end-to-end functionality is working. For example, service health can combine aspects such as CPU load for critical microservices, the number of failing requests (compared with successful ones), and response time (measured against the SLA).
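To make the idea of a combined health view concrete, here is a minimal sketch in Python. The `ServiceSnapshot` type, the threshold values, and the function names are all assumptions made for illustration; real thresholds should come from your own SLAs.

```python
from dataclasses import dataclass


@dataclass
class ServiceSnapshot:
    """One point-in-time reading for a single microservice (hypothetical shape)."""
    name: str
    cpu_load: float        # fraction of CPU in use, 0.0 to 1.0
    error_rate: float      # failed requests / total requests
    p95_latency_ms: float  # 95th percentile response time in milliseconds


# Illustrative thresholds only; derive real values from your SLAs.
MAX_CPU_LOAD = 0.85
MAX_ERROR_RATE = 0.01
MAX_P95_LATENCY_MS = 500.0


def failing_checks(snapshot):
    """Return the names of the signals that breach their threshold."""
    problems = []
    if snapshot.cpu_load > MAX_CPU_LOAD:
        problems.append("cpu_load")
    if snapshot.error_rate > MAX_ERROR_RATE:
        problems.append("error_rate")
    if snapshot.p95_latency_ms > MAX_P95_LATENCY_MS:
        problems.append("p95_latency_ms")
    return problems


def overall_health(snapshots):
    """Combine per-service checks into one (healthy, report) result."""
    report = {s.name: failing_checks(s) for s in snapshots}
    healthy = all(not problems for problems in report.values())
    return healthy, report
```

A dashboard could show the boolean as a single status light and use the per-service report to point at the microservice that needs attention.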

API Health/Feature Health: Health at the API level, that is, how individual APIs are performing. You can create separate dashboards for monitoring individual APIs or microservices.

Infrastructure Health: Another important aspect of monitoring is the infrastructure, i.e. CPU, memory, network, I/O, etc.

Cost: Cloud computing gives us many advantages, but it also brings additional complexities, and managing cost is one of them. Fortunately, most cloud providers offer tools to manage and monitor costs.

Errors and Exceptions: The number of errors and exceptions thrown by your application is an important indicator of overall service health. For example, if you see an increase in errors or exceptions right after a fresh deployment, the new release has most likely introduced an issue.
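The post-deployment check described above can be sketched as a comparison of error rates before and after the release. The function name, the `tolerance` multiplier, and the `floor` value are assumptions chosen for illustration, not recommended settings.

```python
def error_rate_regressed(before_errors, before_requests,
                         after_errors, after_requests,
                         tolerance=1.5, floor=0.001):
    """Return True if the error rate after a deployment looks clearly worse.

    tolerance: how many times the baseline rate is still acceptable.
    floor: minimum baseline rate, so a near-zero baseline does not
           turn a single stray error into an alarm.
    """
    if before_requests == 0 or after_requests == 0:
        # Not enough traffic on one side to compare anything.
        return False
    before_rate = before_errors / before_requests
    after_rate = after_errors / after_requests
    baseline = max(before_rate, floor)
    return after_rate > baseline * tolerance
```

For example, going from 10 errors to 30 errors per 1000 requests would be flagged, while going from 10 to 12 would not.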

Performance: Another important aspect is the performance of critical features. You will want to monitor how your services and APIs are performing and what response times your clients actually experience.
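Response times are usually tracked as percentiles rather than averages, because a few slow requests can hide behind a healthy mean. The sketch below uses the nearest-rank method; the function names and the choice of the 95th percentile as the default are assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms, pct=95):
    """Return True if the chosen percentile of latencies is within the SLA."""
    return percentile(latencies_ms, pct) <= sla_ms
```

With the latencies 120, 80, 200, 95 and 3000 ms, the median is 120 ms but the 95th percentile is 3000 ms, which is exactly the kind of outlier an average-based dashboard tends to understate.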

Traffic: An increase or decrease in traffic affects many aspects of the application, including scalability, infrastructure, performance, error rate, and security (unexpected traffic might come from a bot). It is therefore important to track traffic details.
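One simple way to notice an unexpected increase or decrease in traffic is to compare the current request count with recent history. The sketch below flags values that fall more than a few standard deviations from the mean; the function name, the labels it returns, and the default threshold of 3 are assumptions, and real traffic with daily or weekly patterns usually needs a more careful baseline.

```python
import statistics


def classify_traffic(history, current, threshold=3.0):
    """Label the current request count as 'spike', 'drop', or 'normal'.

    history: request counts from recent comparable time windows.
    threshold: number of standard deviations treated as unusual.
    Returns None when there is too little history to judge.
    """
    if len(history) < 2:
        return None
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    allowed = threshold * spread
    if current > mean + allowed:
        return "spike"
    if current < mean - allowed:
        return "drop"
    return "normal"
```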

Success Rate: The success (or error) rate helps you understand the overall health of the system. A simple measure is the ratio of successful (2XX) requests to errored (4XX + 5XX) requests.
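The measure above can be computed directly from a list of HTTP status codes. In this sketch the result is expressed as the fraction of counted requests that succeeded, rather than a raw success-to-error ratio, so that it stays defined when there are no errors. That choice and the function name are mine, and 1XX and 3XX responses are simply left out of the count.

```python
from collections import Counter


def success_rate(status_codes):
    """Return successful / (successful + errored) requests, or None if nothing was counted."""
    # Group codes by their first digit: 2 for 2XX, 4 for 4XX, 5 for 5XX.
    classes = Counter(code // 100 for code in status_codes)
    succeeded = classes[2]
    errored = classes[4] + classes[5]
    total = succeeded + errored
    if total == 0:
        return None
    return succeeded / total
```

An alert could then fire when this value drops below the level agreed in the SLA.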

Dependencies – Upstream/Downstream: In a microservices-based architecture, a service cannot live in isolation. Its behavior depends on the services it calls and the services that call it, so it is important to track the health and performance of these dependencies.
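A dependency check can be as simple as running one small probe per dependency and recording whether it succeeded and how long it took. In the sketch below, the probes are plain callables supplied by you (for example, a function that pings a database or calls another service's health endpoint); the function name and the result format are assumptions for illustration.

```python
import time


def check_dependencies(probes):
    """Run each probe and record its health and latency.

    probes: dict mapping a dependency name to a zero-argument callable
            that returns a truthy value when the dependency is healthy.
    """
    results = {}
    for name, probe in probes.items():
        started = time.perf_counter()
        try:
            healthy = bool(probe())
        except Exception:
            # A probe that raises is treated as an unhealthy dependency.
            healthy = False
        latency_ms = (time.perf_counter() - started) * 1000
        results[name] = {"healthy": healthy, "latency_ms": latency_ms}
    return results
```

The latency figure matters as much as the boolean: a dependency that still answers, but slowly, often shows up as a performance problem in your own service first.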

Request Tracing: In a microservices-based architecture, where a service calls another service, which in turn calls another, and so on, it is difficult to trace errors and performance issues. Proper request traceability helps us monitor and debug such issues.
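A common way to achieve this traceability is to attach one identifier to a request when it first enters the system and pass it along on every downstream call, so that log lines from different services can be tied together. The sketch below shows that idea with plain dictionaries standing in for HTTP headers. The header name `X-Request-ID` is a widely used convention but an assumption here, as are the function names and the log format.

```python
import uuid

# Assumed header name; use whatever your services have agreed on.
TRACE_HEADER = "X-Request-ID"


def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace.

    Note: this is a plain dict lookup, so the header name is matched
    case-sensitively in this sketch.
    """
    trace_id = incoming_headers.get(TRACE_HEADER)
    if not trace_id:
        trace_id = uuid.uuid4().hex
    return trace_id


def outgoing_headers(trace_id, extra=None):
    """Build headers for a downstream call that carry the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def log_line(trace_id, service, message):
    """Format a log line that can be searched by trace ID across services."""
    return f"trace_id={trace_id} service={service} {message}"
```

Searching the logs for a single trace ID then shows the path a request took and where it failed or slowed down.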