Tag Archives: Microservices

Monitoring Microservices: Part 4

Quick Recap: In past posts, we have talked about the basics of monitoring and Golden Signals. We also talked about tools like logs and Metrics which help us create Dashboards and Alerts for monitoring our applications.

In this final post for Monitoring services, I will cover basic areas, that one should consider monitoring when working with a Microservices-based application. One needs to create Dashboards and Alerts around these areas based on application requirements and SLAs.

Service Health: Overall health of the service. Overall health will be a combination of microservices. Ideally, you will put together a single dashboard that provides health details of all critical areas/microservices, so that you can confirm if end-to-end functionality is working fine. For example, the Health of service can be a combination of aspects like CPU load for critical microservices, the number of requests failing (compared to success rate), the response time (based on SLA), etc.

API Health/ Features Health: API Level health, how various APIs are performing. You can create separate dashboards for monitoring APIs/ Microservices.

Infrastructure Health: Another important aspect for monitoring is infrastructure, i.e. CPU, Memory, Network, IO, etc.

Cost: Cloud computing gives us many advantages, but at the same time also brings in additional complexities, managing cost is one of them. Fortunately, most cloud providers give easy ways to manage and monitor costs.

Error and Exceptions: The number of errors and exceptions thrown by your application and code is an important parameter to manage overall service health. For example, after a fresh deployment, if you see an increase in errors or exceptions, you know there is some issue.

Performance: Another important aspect is the performance of critical features. You would like to monitor how your services and APIs are performing and what response time is provided to clients.

Traffic: An increase or decrease in traffic impacts various aspects of the application, including scalability, infrastructure, performance, error rate, security (unexpected traffic might be due to a bot), etc. So it is important to track traffic details.

Success Rate: Success or Error rate helps understand the overall health of the system. A simple ratio would be 2XX to Errored (4XX+5XX) request.

Dependencies – Upstream/ Downstream: In a microservice-based architecture, the service cannot live in isolation. So it is important to track the health and performance of dependencies.

Request Tracing: In microservices-based architecture, where a service calls another service which in turn calls another, and so on, it is difficult to trace error and performance issues. So proper request traceability helps us monitor and debug issues.

Monitoring Microservices: Part 3

Quick Recap: In past posts, we have talked about the basics of monitoring and tools like dashboards, alerts, etc that help us monitor the health of our applications.

Any discussion about monitoring will remain incomplete without talking about Golden signals. We understand that Monitoring an application is a complex task and most importantly different developers and architects can have a different vision when it comes to Monitoring an application. It is challenging to develop a standard set of monitoring practices as every application will have a different need. For some applications, performance would be more important and for others, it might be accuracy or optimized use of infrastructure.

Though coming up with an exact set of monitoring best practices will be difficult. Still, Golden signals are accepted as a general guideline that will help provide overall health of the system. The four golden signals that are generally accepted are

Latency: In simple words, the time it takes to service a request. This gives a measure of performance at a high-level or API level.

Traffic: Request Per second (In total and divided based on request type). An important indicator giving an idea of usage. Performance should be measured in percentile.

Errors: Requests resulting in errors i.e. 5XXs or 4XXs. A ratio of errored requests vs 200s is a good indicator of the success/ error ratio for the application. For example, a new deployment showing a sudden increase in error rate would help in timely rollback and save embarrassment.

Saturation: An indicator of “How full the service is?”. A load test can give you a threshold, and based on that threshold, we can calculate if the current infrastructure can handle X% more traffic.



Monitoring Microservices: Part 2

Quick Recap: In the last post, I discussed the importance of Monitoring and core tools available or monitoring in form of logs, metrics, and traceability.

In this post, I would like to talk more about how we can use the tools we discussed to get more insights into our application.

Dashboards: Well, you have logs, metrics, and traceability in place, but the important next step is to organize the data in a form that makes sense. The answer is dashboards.


A good dashboard is easy to understand and provides you with the point information. You will usually have multiple dashboards, providing information at different levels of detail. For example, a high-level dashboard shows the overall health of the system in terms of CPU usage, Memory usage, Load on the system, overall latency, Error rate, etc. When you see that average latency has gone up, you should be able to drill down to the next level dashboard which can give details based on APIs or services, letting you know which service or services are bringing the global average down. Similarly, if the overall system error rate is increasing, we should be able to drill down to details and find the root cause.

An important thing to highlight here is how many dashboards we need to create. We will try to cover various dashboards in this series, but an important factor is to note that if you have too many dashboards, you will end up not monitoring all, and if you have too few, you will not get all relevant information.

Alerts: An important aspect for monitoring alerts in the event that something out of place has occurred. For example, suddenly your application is seeing too many 5XX errors or CPU usage is increased or traffic has increased. Alerts can again have severity or priority. For example, a 10% increase in error rate is a low priority but if the error rate doubles, it is a different issue. Based on priority one should set channels for alerting, for example, a low-priority alert is sent via email whereas a high-priority alert might mean an automated call to a support person.

Again an important question will be how much alerting is good. If you have too many alerts coming to your inbox, you will start ignoring those. A rule of thumb here is to ask the question- is this alert actionable? Every alert should be associated with action, say an alert says CPU percentage is increasing, add more compute power (VMs or Pods). The action should be automated as much as possible.

Analytics on Monitoring data: Another important aspect is to add analytics to the data captured. It can help one find trends and anomalies. For example, if the traffic data shows every Friday night the load increased on the website, we can plan to add additional compute capacity to handle the traffic.

KPI and SLAs: Finally you will define KPIs or Key Performance Indicators and SLAs or Service Level Agreements on top of your monitoring data. For example, one of the key indicators for the overall health of the system can be latency or response time. Your clients might need an SLA that is 95 percentile (you monitor performance data mostly in percentile) of the requests to an API that will respond in less than 100 milliseconds. Based on the nature of your application these parameters will change.

Monitoring Microservices: Part 1

Microservice-based architecture comes with many advantages over monoliths, especially in areas of scalability, enhance-ability, and maintainability of the application, as instead of a big application we are dealing will smaller pieces that are easier to manage and update.

But every good thing comes with some challenges, and in the case of microservice-based architecture, monitoring of application is one such challenge.

Earlier you were looking at one place for logs, server health, etc for any issues or status. But with such a s distributed system where we have tens or hundreds of microservices, it is difficult to monitor the status of each service individually. For example, say we have a scenario where a service is calling another service which in turn might be calling another service to fulfill a user’s request. Now if a request is failing or responding very slow, which of the service is the culprit? Which logs are to be analyzed?

To solve this issue, we have a set of practices that can help us to build a robust and effective Monitoring Strategy.

Before getting into the Strategy to monitor microservices, let’s take a look at a few core concepts that one needs to be aware of, which are Logs, Metrics, and Traceability.

Logs: Logs are the first place you will look at if you see your application is not behaving in an expected manner. Your application emits logs to publish the current state. Logs are mostly categorized into, Debug, Info, Warning, and Error.

Metrics: Metrics are Time series data published by applications to provide a quick view of an aspect that changes with time depending on external conditions like request traffic. For example, Latency Metrics can show data like if 95% of all calls respond under 300ms.

Traceability: Traceability is very important when it comes to distributed systems with multiple microservices. Say Service A calls Service B which calls C and so on. If you see requests failing or responding slowly, you need to track which services are facing issues. Traceability helps track the journey of a request and monitor it at every step.

Branching Strategies

When one starts a project, one question that needs to be answered immediately is how to manage the code? The branching strategy needs to be finalized, in the agile world, we want to make sure code gets merged and deployed asap. Continuous Integration and Continuous delivery want us to make sure our code is always in a state that is ready to be deployed on production.

There are two most common branching strategies which are used in Industry. I have written about these in my book on microservices. Here is an overview.

Feature-Based Branching

The idea is to separate branches for each feature. The feature gets merged back to the production branch once it is implemented and tested completely. An important advantage this kind of approach gives us is that at any point in time there is no unused code in the production branch.

Single Development Branch 

In this approach, we maintain a single branch in which we keep on merging even if the feature is incomplete. We need to make sure the half-built feature code is behind some kind of flag mechanism so that it does not get executed unless the feature is complete. 


HTTP/2 has become a first-class citizen in the past couple of years where development teams have started realizing its true potential. Some of the core features of HTTP/2 are

  • Data compression of HTTP headers
  • Server Push
  • Multiplexing multiple requests over a single TCP connection

Here is a very good illustration of HTTP/2

While we talk about HTTP/2, it makes sense to touch base upon HTTP/3. HTTP/3 suggests using QUIC protocol instead of TLS for faster communication. Here is a good article on comparing HTTP with HTTP/2 and HTTP/3.

REST vs GraphQL vs gRPC

Some days back I wrote about SOAP vs REST. It will not be a stretch to say that with the popularity of REST, SOAP had a very limited business usage. REST seems like a natural extension of HTTP based communication where we used HTTP verbs like GET, POST, PUT, DELETE patch to update a resource state. REST as the name Representational State Transfer suggests an intuitive set of APIs

GET /employees – One can conclude that this will return a list of employees
GET /employees/1 – returns employee with ID 1

Though REST is good, at times it feels like a restriction, say for a simple use case where I want to tell the API that for a given employee just give me a name. To handle these cases, GraphQL has gained a lot of popularity. GraphQL is a query language or way to get data from APIs.

emplyee(id: “111”) {

more on GrpahQL queries https://graphql.org/learn/queries/

GraphQL is a specification and not an implementation in itself. So you will find multiple client and server implementations available to choose from. For example, Facebook providers with Relay library. Another popular implementation is Apollo. In a typical REST APIs availability, you have a separate endpoint for each call. In the case of GraphQL, you have a single endpoint where you send your query.

gRPC is an open-source Remote Procedure Call system initially developed by Google. It uses HTTP/2 for transport, and protocol buffers as its underlying message interchange format. Remote procedure calls give users a feel of making calls to remote code as if they are available locally. The streaming approach behind the scenes makes sure the calling method receives a response as soon as it is ready on the remote machine.

A comparative view of the three approaches i.e REST vs GraphQL vs gRPC


An interesting comparison between gRPC vs REST

Mutual TLS

Sometime back, I talked about HTTPS, and discussed how SSL or Secured Socket Layer can help us establish a secured connection between client and server. TLS or Transport Layer Security can be thought of as an updated version of SSL protocol.

The image below shows the standard handshake process in TLS based communication. When a client tries to communicate with the server, the server sends back its certificate to the client along with its public key. Client after authenticating the server certificate sends the symmetric key encrypted by the public key shared by the server earlier. The server decrypts the key and finally, their communication is done using the symmetric key.

image source https://blog.cloudflare.com/protecting-the-origin-with-tls-authenticated-origin-pulls/

In the case mentioned above, we can see how the client, mostly the browser, validates the server certificate to trust it. An extension to the above communication style is mutual TLS or mTLS. As the name suggests, we are referring to a scenario where both server and client need to trust each other, mostly in this case client is another service or IOT based device. In this case, both the client and server will share their certificates and build mutual trust by authenticating the certificates.

image source https://developers.cloudflare.com/access/service-auth/mtls

API Proxy vs API Gateway

Off late I have been working on and exploring some API proxy tools. I have seen the terms API proxy and API gateway being used interchangeably most of the time, and probably somewhere both try to solve similar problems, with a slight change.

When we talk about API proxy, we are talking about reverse proxy functionality where we can expose our existing APIs to the outside world through a proxy server. So the user applications are not directly talking to your APIs but interacting through the proxy. The idea here is simple, that you need not let everyone know internal details about how and where your servers are deployed. API proxy can also work as a load balancer for API deployed on multiple servers. It can provide additional features like monitoring, quota implementation, rate limiting for your APIs.

An API gateway does all that and provides add-on features like orchestration, for example, when the client hits “/products” API, the gateway might talk to multiple services and collaborate the data before sending back. Additionally, it can provide Mediation like converting request or response formats. It can also help expose a SOAP service as a set of REST endpoints and vice versa. It can provide sophisticated features like handling DDOS attacks.

There is no hard boundary line between API Proxy and API Gateway, and we will find the terms being used interchangeably. Features too, at a certain level, are overlapping. So at the end of the day just look for the tool that fulfils your requirements.

Apigee – Getting started

With the popularity of microservices-based design, we are ending up with tens and hundreds of APIs being created for our applications. Apigee is a platform for developing and managing APIs. It provides a proxy layer for our backend APIs. We can reply on this proxy layer to provide us with a lot of features like security, rate limiting, quotas implementation, analytics, monitoring, and so on.

The core features provided by Apigee are

Design: One can create APIs, and configure or code policies as steps in API flows. Policies can help implement features like rate limiting, quota implementation, payload transformation, and so on.

Security: One can enforce security best practices and governance policies. Implement features like authentication, encryption, SSL offloading, etc at the proxy layer.

Publish: We can provide reference documentation and manage the audience for the API from the portal.

Analyze: Drill down usage data, investigate the spikes and trace calls in real time.

Monitor: Ensure API availability and provide a seamless experience to the APIs.

Hands on example

Getting started with Apigee is easy, go to https://apigee.com and create a free 60-day evaluation account for testing. Once logged in, choose the API proxies option to get selected.

Create a new proxy, and we will select reverse proxy to start with.

To start with, we will use the sample API provided by apigee- http://mocktarget.apigee.net (just prints string “Hello Guest”), and provide name, base path, and some description. You can see we have several other options like SOAP service, which is very useful if you have an existing SOAP service and want to expose it as REST services. Also, there is a No Target option which is helpful when we are creating Mock APIs to test our client applications. But as mentioned, for now, we are creating a reverse proxy for this example.

For the first iteration, we will not add any policy and leave all other options as default (no policy selected).

Once created, we will deploy the API to the test environment. For a free evaluation version, we are provided two environments only, test and prod.

After successful deployment, a proxy URL will be provided by apigee, something like https://kamalmeet-eval-test.apigee.net/first, which we can hit and get the output (Hello Guest). We can see that we are hitting proxy URL which in turn hits the actual URL, in this case, http://mocktarget.apigee.net/.

Go get an understanding of what is happening behind the scenes, we will look into the “Develop” tab.

We can see two sets of endpoint settings, a proxy endpoint, and a target endpoint. The proxy endpoint is the one exposed to the calling service. The target endpoint is the one that is communicating with the provider service.

Currently, we have not done much with the endpoints so we see very basic XML settings. On the proxy side, all we see is connection /first and the fact that we are exposing a secured URL. Similarly, on the target side, all we have is the target URL.

To make the API more meaningful, let’s modify the target endpoint to https://mocktarget.apigee.net/json, which returns a sample JSON like

"city":"San Jose",

We can update the same Proxy that we created earlier or create a new one. To update the existing one, just go to the develop tab and update the target endpoint URL.

When you click on save, it will deploy automatically (else you can undeploy and deploy). Once done, you will see the existing proxy URL that we created now returns the JSON response https://kamalmeet-eval-test.apigee.net/first. So far we have not added any policy to our APIs. Well, that is easy, in the same develop tab, just above the XML definitions we see a flow diagram.

We can add a step to request flow or response flow. Let’s add a policy to request flow by clicking on the +Step button. To start, let’s add a quota policy.

By default, the quota policy gives settings for 2000 requests per month. For sake of this test, let’s modify it to 3 calls per minute.

To test the quota implementation, we can go to the trace tab and hit the URL multiple times. We can see response code 429 (too many requests), caused by the quota limit.

We have implemented the quota policy, which can be very useful to stop a bot attack on our API. Similarly, Apigee provides us with a set of policies categorized under traffic management, security, and mediation. For traffic management, we have a quota, spike arrest, caching, etc. For Security, we have policies like Basic Auth, JSON threat protection, XML threat protection, OAuth, and so on.

For mediation, we have policy likes, JSON to XML, XML to JSON, XSL transformation, etc.

In the example we created, we added a policy to the incoming request, similarly, we can add a policy to response, for example, if we apply JSON to XML policy to the response, we will start receiving XML response instead of JSON.