Monitoring Microservices: Part 4

Quick Recap: In past posts, we have talked about the basics of monitoring and Golden Signals. We also talked about tools like logs and Metrics which help us create Dashboards and Alerts for monitoring our applications.

In this final post on Monitoring services, I will cover the basic areas that one should consider monitoring when working with a Microservices-based application. One needs to create Dashboards and Alerts around these areas based on application requirements and SLAs.

Service Health: Overall health of the service. Overall health will be a combination of the health of individual microservices. Ideally, you will put together a single dashboard that provides health details of all critical areas/microservices, so that you can confirm if end-to-end functionality is working fine. For example, the health of the service can be a combination of aspects like CPU load for critical microservices, the number of requests failing (compared to success rate), the response time (based on SLA), etc.

API Health/ Features Health: API Level health, how various APIs are performing. You can create separate dashboards for monitoring APIs/ Microservices.

Infrastructure Health: Another important aspect for monitoring is infrastructure, i.e. CPU, Memory, Network, IO, etc.

Cost: Cloud computing gives us many advantages, but at the same time also brings in additional complexities; managing cost is one of them. Fortunately, most cloud providers give easy ways to manage and monitor costs.

Error and Exceptions: The number of errors and exceptions thrown by your application and code is an important parameter to manage overall service health. For example, after a fresh deployment, if you see an increase in errors or exceptions, you know there is some issue.

Performance: Another important aspect is the performance of critical features. You would like to monitor how your services and APIs are performing and what response time is provided to clients.

Traffic: An increase or decrease in traffic impacts various aspects of the application, including scalability, infrastructure, performance, error rate, security (unexpected traffic might be due to a bot), etc. So it is important to track traffic details.

Success Rate: Success or Error rate helps understand the overall health of the system. A simple ratio would be successful (2XX) requests to errored (4XX + 5XX) requests.

Dependencies – Upstream/ Downstream: In a microservice-based architecture, the service cannot live in isolation. So it is important to track the health and performance of dependencies.

Request Tracing: In microservices-based architecture, where a service calls another service which in turn calls another, and so on, it is difficult to trace error and performance issues. So proper request traceability helps us monitor and debug issues.

Monitoring Microservices: Part 3

Quick Recap: In past posts, we have talked about the basics of monitoring and tools like dashboards, alerts, etc that help us monitor the health of our applications.

Any discussion about monitoring will remain incomplete without talking about Golden signals. We understand that Monitoring an application is a complex task and most importantly different developers and architects can have a different vision when it comes to Monitoring an application. It is challenging to develop a standard set of monitoring practices as every application will have a different need. For some applications, performance would be more important and for others, it might be accuracy or optimized use of infrastructure.

Though coming up with an exact set of monitoring best practices is difficult, Golden signals are accepted as a general guideline that helps provide a view of the overall health of the system. The four golden signals that are generally accepted are:

Latency: In simple words, the time it takes to service a request. This gives a measure of performance at a high-level or API level. Performance should be measured in percentiles.

Traffic: Requests per second (in total and divided based on request type). An important indicator giving an idea of usage.

Errors: Requests resulting in errors i.e. 5XXs or 4XXs. A ratio of errored requests vs 200s is a good indicator of the success/ error ratio for the application. For example, a new deployment showing a sudden increase in error rate would help in timely rollback and save embarrassment.

Saturation: An indicator of “How full the service is?”. A load test can give you a threshold, and based on that threshold, we can calculate if the current infrastructure can handle X% more traffic.
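
As a rough illustration, the four signals above could be derived from one time window of request records along these lines. This is only a sketch: the Request record, the nearest-rank percentile helper, the choice of errors divided by total requests, and treating saturation as current traffic divided by a load-tested capacity are all assumptions made for the example, not a standard definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise one time window of requests into the four signals.

    capacity_rps is assumed to come from a load test.
    """
    traffic_rps = len(requests) / window_seconds
    errors = sum(1 for r in requests if r.status >= 400)  # 4XX and 5XX
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": traffic_rps,
        "error_ratio": errors / len(requests),
        "saturation": traffic_rps / capacity_rps,
    }
```

A saturation of 0.5 here would mean the service is running at half of the load-tested threshold, so it has room for roughly twice the current traffic.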

Monitoring Microservices: Part 2

Quick Recap: In the last post, I discussed the importance of Monitoring and the core tools available for monitoring in the form of logs, metrics, and traceability.

In this post, I would like to talk more about how we can use the tools we discussed to get more insights into our application.

Dashboards: Well, you have logs, metrics, and traceability in place, but the important next step is to organize the data in a form that makes sense. The answer is dashboards.

A good dashboard is easy to understand and provides you with to-the-point information. You will usually have multiple dashboards, providing information at different levels of detail. For example, a high-level dashboard shows the overall health of the system in terms of CPU usage, Memory usage, Load on the system, overall latency, Error rate, etc. When you see that average latency has gone up, you should be able to drill down to the next level dashboard which can give details based on APIs or services, letting you know which service or services are pushing the global average up. Similarly, if the overall system error rate is increasing, we should be able to drill down to details and find the root cause.

An important thing to highlight here is how many dashboards we need to create. We will try to cover various dashboards in this series, but an important point to note is that if you have too many dashboards, you will end up not monitoring them all, and if you have too few, you will not get all relevant information.

Alerts: An important aspect of monitoring is alerting when something out of place has occurred. For example, suddenly your application is seeing too many 5XX errors, or CPU usage has increased, or traffic has increased. Alerts can again have severity or priority. For example, a 10% increase in error rate is a low priority but if the error rate doubles, it is a different issue. Based on priority one should set channels for alerting, for example, a low-priority alert is sent via email whereas a high-priority alert might mean an automated call to a support person.

Again an important question will be how much alerting is good. If you have too many alerts coming to your inbox, you will start ignoring them. A rule of thumb here is to ask the question: is this alert actionable? Every alert should be associated with an action. Say an alert says CPU usage is increasing; the action is to add more compute power (VMs or Pods). The action should be automated as much as possible.
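
The priority and channel idea can be sketched as a small function. The thresholds follow the example in the text (a 10% increase is low priority, a doubling is high priority); the function name and the channel names are hypothetical and only serve the illustration.

```python
# Hypothetical mapping from alert priority to notification channel.
CHANNELS = {"low": "email", "high": "phone_call"}


def error_rate_alert(baseline_rate, current_rate):
    """Return (priority, channel) for an error-rate change, or None if no alert is needed."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    increase = (current_rate - baseline_rate) / baseline_rate
    if increase >= 1.0:       # error rate has at least doubled
        priority = "high"
    elif increase >= 0.10:    # at least a 10% increase over the baseline
        priority = "low"
    else:
        return None           # not actionable, so no alert is raised
    return priority, CHANNELS[priority]
```

Returning None for small changes is a deliberate choice in this sketch: it keeps non-actionable noise out of the inbox.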

Analytics on Monitoring data: Another important aspect is to add analytics to the data captured. It can help one find trends and anomalies. For example, if the traffic data shows that the load on the website increases every Friday night, we can plan to add additional compute capacity to handle the traffic.

KPI and SLAs: Finally you will define KPIs or Key Performance Indicators and SLAs or Service Level Agreements on top of your monitoring data. For example, one of the key indicators for the overall health of the system can be latency or response time. Your clients might need an SLA stating that 95% of the requests to an API (the 95th percentile; you monitor performance data mostly in percentiles) will respond in less than 100 milliseconds. Based on the nature of your application these parameters will change.
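
A minimal sketch of checking such an SLA against a batch of measured latencies. The function name and its default values mirror the example above (95% of requests under 100 milliseconds) and are assumptions for illustration.

```python
def sla_met(latencies_ms, threshold_ms=100.0, target_fraction=0.95):
    """True if at least target_fraction of requests finished under threshold_ms."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    within = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return within / len(latencies_ms) >= target_fraction
```

For example, 95 requests at 50 ms and 5 requests at 200 ms would just meet the SLA, while 94 and 6 would breach it.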

Monitoring Microservices: Part 1

Microservice-based architecture comes with many advantages over monoliths, especially in areas of scalability, enhance-ability, and maintainability of the application, as instead of a big application we are dealing with smaller pieces that are easier to manage and update.

But every good thing comes with some challenges, and in the case of microservice-based architecture, monitoring of the application is one such challenge.

Earlier you were looking at one place for logs, server health, etc for any issues or status. But with a distributed system where we have tens or hundreds of microservices, it is difficult to monitor the status of each service individually. For example, say we have a scenario where a service is calling another service which in turn might be calling another service to fulfill a user’s request. Now if a request is failing or responding very slowly, which of the services is the culprit? Which logs are to be analyzed?

To solve this issue, we have a set of practices that can help us to build a robust and effective Monitoring Strategy.

Before getting into the Strategy to monitor microservices, let’s take a look at a few core concepts that one needs to be aware of, which are Logs, Metrics, and Traceability.

Logs: Logs are the first place you will look if you see your application is not behaving in an expected manner. Your application emits logs to publish the current state. Logs are mostly categorized into Debug, Info, Warning, and Error.

Metrics: Metrics are Time series data published by applications to provide a quick view of an aspect that changes with time depending on external conditions like request traffic. For example, Latency Metrics can show data such as whether 95% of all calls respond in under 300ms.

Traceability: Traceability is very important when it comes to distributed systems with multiple microservices. Say Service A calls Service B which calls C and so on. If you see requests failing or responding slowly, you need to track which services are facing issues. Traceability helps track the journey of a request and monitor it at every step.
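
One common way to get this kind of traceability is to attach an identifier to the request at the edge and pass it along on every downstream call, so that log lines from all services can be joined on it. The sketch below simulates Service A, B and C as plain functions; the header name X-Request-ID and all function names are assumptions for illustration, and a real setup would usually rely on a tracing library rather than hand-rolled code.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # assumed header name, a common convention


def ensure_trace_id(headers):
    """Reuse the incoming trace ID, or create one if this is the first hop."""
    out = dict(headers)
    if TRACE_HEADER not in out:
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def service_c(headers, log):
    log.append((headers[TRACE_HEADER], "C", "handled request"))


def service_b(headers, log):
    log.append((headers[TRACE_HEADER], "B", "calling C"))
    service_c(headers, log)  # the same headers travel downstream


def service_a(headers, log):
    headers = ensure_trace_id(headers)
    log.append((headers[TRACE_HEADER], "A", "calling B"))
    service_b(headers, log)
    return headers[TRACE_HEADER]
```

Filtering the combined log on a single trace ID then shows the journey of one request through A, B and C.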


Listening is a very important skill for any leader.

The article talks about how the importance of this skill increases manifold when it comes to a crisis situation. The importance of listening lies in the fact that one cannot have all the information or perspectives. So it is important to have discussions with team members and get different perspectives.

The article focuses on the fact that information available at the top level might be different from ground reality. This is another important reason to “listen” to diversified groups at different levels to not make biased decisions.
“Then there’s the echo chamber. Whether we know it or not, most of us gravitate to people (and information) that confirm things we already think and believe. We’re drawn to individuals and ideas that concur with, and even end up shaping, our worldview.”

Someone living in a silo is bound to be away from ground reality and might convince himself that “this is not going to happen to me”, and take incorrect decisions. It might be too late in the game when they realise their mistake, and it is difficult to make amendments at that time.

We tend to downplay or dismiss threats along the lines of “it’ll never happen to me, and even if it does, it won’t be that bad.” And when the chips finally do fall, we can become anchored to one particular plan or solution, even as the crisis shifts or changes direction. We may continue down one path long after it makes sense to do so, because of sunk costs: “we’ve come this far; it’s too late to change course.”

Design vs Code – The curse of Agile

Found some old notes of mine about agile, almost a decade old (note I am referring to Agile as new 🙂 ), but surprisingly the core question I was pondering upon a decade back, still holds true. The question development teams still struggle to solve is – how much time is sufficient for the design phase when one is following Agile practices.

Agile is new, bold, and sexy. Everyone wants to be Agile. Every resume I have seen of late has “Agile Development” mentioned in one form or another.

But the question is, what is Agile? Now some would say Scrum is Agile, having status meetings every day is Agile, and development in Sprints is agile. Well, if you look at the dictionary, agile is “able to move quickly and easily”. And Scrum, Sprints, TDD, etc are just some tools to get things done faster. I have shared my thoughts on agile development here.

With Agile development, I have seen too often teams trying to jump on development from day one. Design and architecture sound old-fashioned. Why spend time on something that will not show up on the screen as the end product? Why not instead spend time developing something which one can demo to clients, and get appreciated for fast work?

When you are driving, speed is attractive, but it needs better clarity, stability, balance, and most importantly synchronization if we have a team. By pushing the design to the back burner, we are saying “let’s start driving, we will figure out the route map on the way”, and before we know it, every member of the team comes up with a different route map and is sitting miles away from each other.

I would not stop people from developing until the design is complete, but definitely some ground rules should be set beforehand.

Rate Limiting- Basics

Rate limiting is a potent tool to avoid cascading failures in a distributed system. It helps applications avoid resource starvation and handle attacks like Denial of Service (DoS).

Before getting into rate limit implementation in a distributed system, let’s take a scenario for basic rate-limiting.

Let’s say the provider service has the capacity to process 100 requests per second. In this case, a simple rate limit can be implemented on the server-side where it will accept 100 requests in a second and drop any additional requests. The rate limit can be in the form of requests accepted per second, per minute, or per day. Additionally, a rate limit can be applied on a per-API basis, say the “/product” service can accept 100 requests per minute, but the “/user” service can accept 200 requests per minute. Also, the rate limit can be implemented based on the client or user. For example, client_id is sent as part of the header and we have a rate limit of 10 requests per minute for each unique client.
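
The per-client case can be sketched with a simple fixed-window counter: each client gets a counter that starts again at zero whenever a new window begins. The class name and the injectable clock parameter are assumptions for this example (the clock only makes the code easy to test), and the sketch is for a single process, not for a distributed setup.

```python
import time


class PerClientRateLimiter:
    """Fixed-window limiter: at most `limit` requests per client per window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._state = {}  # client_id -> (window index, requests seen in that window)

    def allow(self, client_id):
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._state.get(client_id, (window, 0))
        if seen_window != window:
            count = 0  # a new window has started, so the counter starts over
        if count >= self.limit:
            return False  # over the limit: drop or throttle the request
        self._state[client_id] = (window, count + 1)
        return True
```

With limit=10 and window_seconds=60 this matches the "10 requests per minute for each unique client" example; the same class keyed by API path instead of client_id would cover the per-API case.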

The above scenario talks about a very simple client-server implementation. But in the real world, the setup will be more complex with multiple clients and scaled-up services deployed behind a load balancer, usually being accessed through a proxy or an API gateway.

Before getting into the complexities of distributed systems and implementing Rate limiting, let’s take a look at some of the basic algorithms used to implement rate limiting.

Token Bucket: To start with you can think of a bucket full of tokens. Every time a request comes, it will take a token from a bucket. New tokens get added to the bucket after the given interval. When the bucket has no more tokens, requests will be throttled. Implementation wise, it is a simple approach where a counter is set to the max limit allowed, and each incoming request checks for the counter value and decreases it by one unless the counter reaches zero (at this point requests are throttled). The counter is updated/ reset based on the rate-limiting condition.
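
A minimal single-process sketch of a token bucket. Instead of resetting the counter at fixed intervals, this variant tops the bucket up continuously in proportion to the elapsed time, which is one common way to add tokens back; the class name, parameter names, and the injectable clock are assumptions for illustration.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # maximum number of tokens in the bucket
        self.refill_rate = refill_rate    # tokens added per second
        self._clock = clock
        self._tokens = float(capacity)    # start with a full bucket
        self._last = clock()

    def allow(self):
        """Take one token if available; return False when the request should be throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Because the bucket can hold up to `capacity` tokens, short bursts are allowed as long as the average rate stays at or below `refill_rate`.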

Leaky Bucket: A modification of token bucket algorithm, where requests are processed at a fixed rate. Think of implementation as a fixed-length queue, to which requests are added, and processed at a constant rate.
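
The queue-based description can be sketched as follows. Requests are added with offer() and rejected when the queue is full; drain() hands back the requests that are due at the constant rate and is assumed to be called regularly by a worker. The class and method names, and the injectable clock, are assumptions for this example.

```python
import time
from collections import deque


class LeakyBucket:
    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity      # maximum queued requests
        self.leak_rate = leak_rate    # requests processed per second
        self._clock = clock
        self._queue = deque()
        self._last_drain = clock()

    def offer(self, request):
        """Queue a request; return False if the bucket is full."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(request)
        return True

    def drain(self):
        """Return the requests due for processing since the last drain."""
        now = self._clock()
        due = int((now - self._last_drain) * self.leak_rate)
        if due == 0:
            return []
        # Advance by whole processing slots so fractional time is not lost.
        self._last_drain += due / self.leak_rate
        return [self._queue.popleft() for _ in range(min(due, len(self._queue)))]
```

Unlike the token bucket, a burst here is smoothed out: extra requests wait in the queue and leave it at the fixed rate.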

In a distributed system, it makes sense to implement a rate-limiting algorithm at the proxy or gateway layer, as it helps us fail fast and avoid unnecessary traffic on the backend. Secondly implementing it in a centralized place takes away the complexity from backend services as they need not implement rate-limiting now and can focus on their core business logic.

Implementing rate-limiting in the above-mentioned distributed design has its own challenges. For example, as there are N machines with the proxy deployed, how do we implement a limit of 100 requests per second for a service? Do we say each machine has a 100/N quota? But load balancers do not guarantee that kind of distribution of incoming requests.

Capital Budgeting- Investment Decisions

At times corporations need to make investment decisions. These decisions are important as they help firms build up their assets and future cash flows, but at the same time they need considerable investment.

An important aspect of these investments is the time value of money, i.e. cash received earlier has more value than cash received at a later time.

Stages in Capital Budgeting

  • Stage 1: Investment screening and selection
  • Stage 2: Capital budget proposal
  • Stage 3: Budgeting approval and authorization
  • Stage 4: Project tracking
  • Stage 5: Post-completion audit

Financial Appraisal tools for Capital Budgeting

Payback Method: The payback period is the number of years it takes to recover the project cost. The payback method helps understand projects’ risk and liquidity and is easy to understand. On the downside, it does not consider the time value of money (TVM) and does not consider cash flows after the payback period.

An alternative to the payback method is discounted payback, where instead of the exact Cash Flow (CF), a discounted CF is considered to take care of TVM.
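
Both variants can be sketched with one function: with a discount rate of zero it gives the plain payback period, and with a positive rate it gives the discounted payback. The function name, the year-end cash flow assumption, and the linear interpolation within the final year are choices made for this illustration.

```python
def payback_period(initial_investment, cash_flows, discount_rate=0.0):
    """Years needed to recover the investment, or None if it is never recovered.

    cash_flows[0] is assumed to arrive at the end of year 1, and so on.
    """
    if initial_investment <= 0:
        return 0.0
    remaining = initial_investment
    for year, cash_flow in enumerate(cash_flows, start=1):
        discounted = cash_flow / (1 + discount_rate) ** year
        if discounted >= remaining:
            # Interpolate within the year in which the cost is recovered.
            return year - 1 + remaining / discounted
        remaining -= discounted
    return None
```

For a cost of 1000 and three yearly inflows of 400, the plain payback is 2.5 years, but at a 10% discount rate the same three inflows no longer recover the cost, which shows how ignoring TVM can flatter a project.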

NPV: An important tool to evaluate projects is NPV or Net Present Value. In simple terms, NPV is the difference between the present value of cash inflows and the present value of cash outflows over a period of time. If NPV>0, the project can be considered for acceptance.


Any project with NPV>0 is profitable. The higher the NPV, the more profitable the project is. So in the case of mutually exclusive projects (where only one can be chosen), the one with the higher NPV is preferred.
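
As a sketch, NPV can be computed by discounting each cash flow back to today and adding them up, i.e. the sum of CF_t / (1 + r)^t over all periods t. The function name and the convention that the first entry is the time-0 cash flow (usually the negative initial investment) are assumptions for this example.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today, cash_flows[t] after t periods."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def pick_by_npv(rate, projects):
    """Among mutually exclusive projects (name -> cash flows), pick the highest NPV."""
    return max(projects, key=lambda name: npv(rate, projects[name]))
```

For instance, investing 1000 today to receive 1100 a year later has an NPV of about zero at a 10% discount rate, so the project only breaks even at that rate.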

IRR or Internal Rate of Return: “The internal rate of return (IRR) is a metric used in financial analysis to estimate the profitability of potential investments. IRR is a discount rate that makes the net present value (NPV) of all cash flows equal to zero in a discounted cash flow analysis.” A higher IRR makes the project more desirable.

To calculate IRR, set NPV=0 in the NPV formula and solve for the discount rate.

If IRR>WACC (Weighted Average Cost of Capital), the project is profitable.
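
Since the equation NPV=0 usually has no closed-form solution, IRR is typically found numerically. Below is a minimal sketch using bisection; the function name, the search bounds, and the tolerance are assumptions, and it expects a conventional project (an outflow followed by inflows) so that NPV changes sign exactly once within the bounds.

```python
def irr(cash_flows, low=-0.99, high=10.0, tolerance=1e-9):
    """Discount rate at which the NPV of cash_flows is zero, found by bisection."""

    def npv(rate):
        return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

    if npv(low) * npv(high) > 0:
        raise ValueError("NPV does not change sign between low and high")
    while high - low > tolerance:
        mid = (low + high) / 2
        if npv(low) * npv(mid) <= 0:
            high = mid  # the root lies in the lower half
        else:
            low = mid   # the root lies in the upper half
    return (low + high) / 2
```

The result can then be compared with the WACC: paying 1000 today for 1100 in a year gives an IRR of about 10%, so the project clears an 8% WACC but not a 12% one.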

Sunk Cost: Sunk cost is a cost that has already been incurred and, as such, exists irrespective of whether the project is undertaken or not. For example, the salary of the employees. This cost should not be considered as part of project cash flows.

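Opportunity Cost: Opportunity cost is the value of the best alternative that is given up by committing a resource to the project. For example, if the company has land which is to be used to set up a factory for the current project, the value that land could otherwise have earned (say, by selling or leasing it) will be added to the project cost.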

Profitability Index: When comparing multiple projects of different sizes, directly comparing NPV might not make sense as one project might be worth 10000 and another might be 1000000. Profitability index or PI is calculated as the present value of future cash flows / Initial investment (which is the same as 1 + NPV / Initial investment) and helps us calculate the value generated per dollar invested. A PI>1 means the project is profitable.
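
A short sketch of the PI calculation, using the common definition in which PI is the present value of the future cash flows divided by the initial investment (equivalently 1 + NPV / initial investment, which is why the break-even point is 1). The function name and the year-end cash flow convention are assumptions for this example.

```python
def profitability_index(rate, initial_investment, future_cash_flows):
    """PV of future cash flows per unit of initial investment.

    future_cash_flows[0] is assumed to arrive at the end of year 1.
    """
    present_value = sum(
        cf / (1 + rate) ** t for t, cf in enumerate(future_cash_flows, start=1)
    )
    return present_value / initial_investment
```

Two projects of very different sizes can then be compared on the same scale: a PI of 1.2 means each dollar invested returns 1.20 in present-value terms, whatever the absolute size of the project.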


An old video but still relevant

Key Takeaways

  • Complex SQL queries/ Joins: SQL
  • Transaction management/ ACID properties: SQL (commit and rollback are available by default, whereas you need to handle them yourself in NoSQL)
  • Huge quantity of data/ fast scalability: NoSQL
  • Write heavy (e.g. logging system): NoSQL
  • Read heavy (queries/ indexes): SQL
  • Fixed schema: SQL; flexible schema: NoSQL (Alter table statements are costly and have restrictions)
  • JPA/ Hibernate/ Django: support SQL by default
  • Archiving and managing huge data: NoSQL