Tag Archives: Cloud computing

Monitoring Microservices: Part 4

Quick Recap: In past posts, we have talked about the basics of monitoring and Golden Signals. We also talked about tools like logs and Metrics which help us create Dashboards and Alerts for monitoring our applications.

In this final post on monitoring microservices, I will cover the basic areas that one should consider monitoring when working with a microservices-based application. One needs to create dashboards and alerts around these areas based on application requirements and SLAs.

Service Health: Overall health of the service. Overall health will be a combination of microservices. Ideally, you will put together a single dashboard that provides health details of all critical areas/microservices, so that you can confirm if end-to-end functionality is working fine. For example, the Health of service can be a combination of aspects like CPU load for critical microservices, the number of requests failing (compared to success rate), the response time (based on SLA), etc.

API Health/ Features Health: API Level health, how various APIs are performing. You can create separate dashboards for monitoring APIs/ Microservices.

Infrastructure Health: Another important aspect for monitoring is infrastructure, i.e. CPU, Memory, Network, IO, etc.

Cost: Cloud computing gives us many advantages, but it also brings in additional complexities, and managing cost is one of them. Fortunately, most cloud providers give easy ways to manage and monitor costs.

Error and Exceptions: The number of errors and exceptions thrown by your application and code is an important parameter to manage overall service health. For example, after a fresh deployment, if you see an increase in errors or exceptions, you know there is some issue.

Performance: Another important aspect is the performance of critical features. You would like to monitor how your services and APIs are performing and what response time is provided to clients.

Traffic: An increase or decrease in traffic impacts various aspects of the application, including scalability, infrastructure, performance, error rate, security (unexpected traffic might be due to a bot), etc. So it is important to track traffic details.

Success Rate: The success or error rate helps us understand the overall health of the system. A simple measure would be the ratio of 2XX requests to errored (4XX + 5XX) requests.
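As a rough illustration of the ratio above, here is a minimal Python sketch. It assumes you already have a list of HTTP status codes for a time window (the sample list is made up), and it reports successes as a share of successes plus errors rather than a raw 2XX-to-error ratio, which is just one reasonable way to express it.

```python
from collections import Counter

def success_rate(status_codes):
    """Return 2XX requests as a fraction of all 2XX, 4XX and 5XX requests."""
    classes = Counter(code // 100 for code in status_codes)
    succeeded = classes[2]
    errored = classes[4] + classes[5]
    total = succeeded + errored
    # Other classes (1XX, 3XX) are ignored; with no relevant traffic
    # there is nothing to measure, so return None.
    return succeeded / total if total else None

# Hypothetical sample of status codes collected over one window.
codes = [200, 201, 204, 404, 500, 200, 503, 200]
print(f"success rate: {success_rate(codes):.1%}")
```

Whether 4XX responses should count against the service (they are often client mistakes) depends on the application, so treat the grouping here as an assumption.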

Dependencies – Upstream/ Downstream: In a microservice-based architecture, the service cannot live in isolation. So it is important to track the health and performance of dependencies.

Request Tracing: In microservices-based architecture, where a service calls another service which in turn calls another, and so on, it is difficult to trace error and performance issues. So proper request traceability helps us monitor and debug issues.

Monitoring Microservices: Part 3

Quick Recap: In past posts, we have talked about the basics of monitoring and tools like dashboards, alerts, etc that help us monitor the health of our applications.

Any discussion about monitoring will remain incomplete without talking about Golden signals. We understand that Monitoring an application is a complex task and most importantly different developers and architects can have a different vision when it comes to Monitoring an application. It is challenging to develop a standard set of monitoring practices as every application will have a different need. For some applications, performance would be more important and for others, it might be accuracy or optimized use of infrastructure.

Coming up with an exact set of monitoring best practices is difficult, but the Golden signals are accepted as a general guideline that helps show the overall health of the system. The four golden signals that are generally accepted are:

Latency: In simple words, the time it takes to service a request. This gives a measure of performance at a high level or at the API level. Performance should be measured in percentiles.

Traffic: Requests per second (in total and divided by request type). An important indicator giving an idea of usage.

Errors: Requests resulting in errors i.e. 5XXs or 4XXs. A ratio of errored requests vs 200s is a good indicator of the success/ error ratio for the application. For example, a new deployment showing a sudden increase in error rate would help in timely rollback and save embarrassment.

Saturation: An indicator of “How full the service is?”. A load test can give you a threshold, and based on that threshold, we can calculate if the current infrastructure can handle X% more traffic.
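The saturation calculation can be sketched in a few lines of Python. This assumes the load test produced a single number, the maximum requests per second the current infrastructure sustained; the function names and sample figures are made up for illustration.

```python
def saturation_percent(current_rps, max_rps_from_load_test):
    """How full the service is, as a percentage of the load-tested limit."""
    if max_rps_from_load_test <= 0:
        raise ValueError("max_rps_from_load_test must be positive")
    return current_rps * 100 / max_rps_from_load_test

def headroom_percent(current_rps, max_rps_from_load_test):
    """How much more traffic (X%) can be absorbed before reaching the limit."""
    if current_rps <= 0:
        raise ValueError("current_rps must be positive")
    return (max_rps_from_load_test - current_rps) * 100 / current_rps

# Hypothetical numbers: the load test topped out at 500 requests per second
# and the service currently handles 400.
print(saturation_percent(400, 500))  # percentage of capacity in use
print(headroom_percent(400, 500))    # percentage of extra traffic we can take
```

A real service usually saturates on several resources at once (CPU, memory, connections), so a single requests-per-second threshold is a simplification.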



Monitoring Microservices: Part 2

Quick Recap: In the last post, I discussed the importance of monitoring and the core tools available for monitoring in the form of logs, metrics, and traceability.

In this post, I would like to talk more about how we can use the tools we discussed to get more insights into our application.

Dashboards: Well, you have logs, metrics, and traceability in place, but the important next step is to organize the data in a form that makes sense. The answer is dashboards.


A good dashboard is easy to understand and gives you to-the-point information. You will usually have multiple dashboards, providing information at different levels of detail. For example, a high-level dashboard shows the overall health of the system in terms of CPU usage, memory usage, load on the system, overall latency, error rate, etc. When you see that average latency has gone up, you should be able to drill down to the next-level dashboard, which gives details per API or service and lets you know which service or services are pushing the global average up. Similarly, if the overall system error rate is increasing, we should be able to drill down to details and find the root cause.

An important thing to highlight here is how many dashboards we need to create. We will try to cover various dashboards in this series, but keep in mind that if you have too many dashboards, you will end up not monitoring all of them, and if you have too few, you will not get all the relevant information.

Alerts: An important aspect of monitoring is alerting when something out of place has occurred. For example, your application suddenly sees too many 5XX errors, CPU usage has spiked, or traffic has increased. Alerts can also have a severity or priority. For example, a 10% increase in error rate is a low priority, but if the error rate doubles, it is a different issue. Based on priority, one should set channels for alerting; for example, a low-priority alert is sent via email, whereas a high-priority alert might mean an automated call to a support person.

Again an important question will be how much alerting is good. If you have too many alerts coming to your inbox, you will start ignoring those. A rule of thumb here is to ask the question- is this alert actionable? Every alert should be associated with action, say an alert says CPU percentage is increasing, add more compute power (VMs or Pods). The action should be automated as much as possible.
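To make the priority and channel idea concrete, here is a small Python sketch. The thresholds follow the example above (a 10% increase is low priority, a doubling is high priority), while the function name, the channel names, and the use of errors per 1,000 requests as the unit are assumptions made only for this illustration.

```python
# Hypothetical mapping from alert priority to notification channel.
CHANNELS = {"low": "email", "high": "automated phone call"}

def classify_error_rate_alert(baseline_errors, current_errors):
    """Return 'high', 'low', or None by comparing the current error rate
    with a baseline (both expressed as errors per 1,000 requests)."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    ratio = current_errors / baseline_errors
    if ratio >= 2.0:   # the error rate has at least doubled
        return "high"
    if ratio >= 1.1:   # the error rate rose by roughly 10% or more
        return "low"
    return None        # not worth anyone's attention

priority = classify_error_rate_alert(baseline_errors=20, current_errors=23)
if priority:
    print(f"{priority}-priority alert, notify via {CHANNELS[priority]}")
```

Returning None for small changes is one way to keep alerts actionable: nothing is sent unless someone is expected to do something about it.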

Analytics on Monitoring data: Another important aspect is to add analytics to the data captured. It can help one find trends and anomalies. For example, if the traffic data shows that the load on the website increases every Friday night, we can plan to add additional compute capacity to handle the traffic.

KPI and SLAs: Finally, you will define KPIs (Key Performance Indicators) and SLAs (Service Level Agreements) on top of your monitoring data. For example, one of the key indicators for the overall health of the system can be latency or response time. Your clients might need an SLA stating that the 95th percentile of requests to an API (performance data is mostly monitored in percentiles) will respond in less than 100 milliseconds. These parameters will change based on the nature of your application.
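As an illustration of the SLA example, the following Python sketch checks a batch of latency samples against a "95th percentile under 100 milliseconds" target. It uses the nearest-rank percentile definition, which is only one of several common definitions (monitoring tools often interpolate or use histograms), and the sample data is invented.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_sla(latencies_ms, pct=95, limit_ms=100):
    """True when the pct-th percentile latency is below limit_ms."""
    return percentile(latencies_ms, pct) < limit_ms

# Hypothetical samples: most requests are fast, a few are slow.
fast_enough = [50] * 19 + [250]      # one slow request out of 20
too_slow = [50] * 18 + [250] * 2     # two slow requests out of 20
print(meets_sla(fast_enough), meets_sla(too_slow))
```

The two sample sets show why percentiles are preferred over averages here: a single outlier does not break the SLA, but a slow tail of 10% does.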

Monitoring Microservices: Part 1

Microservice-based architecture comes with many advantages over monoliths, especially in the areas of scalability, extensibility, and maintainability of the application, because instead of one big application we are dealing with smaller pieces that are easier to manage and update.

But every good thing comes with some challenges, and in the case of microservice-based architecture, monitoring the application is one such challenge.

Earlier you were looking at one place for logs, server health, etc. for any issues or status. But with a distributed system where we have tens or hundreds of microservices, it is difficult to monitor the status of each service individually. For example, say we have a scenario where a service is calling another service, which in turn might be calling another service to fulfill a user’s request. Now if a request is failing or responding very slowly, which of the services is the culprit? Which logs are to be analyzed?

To solve this issue, we have a set of practices that can help us to build a robust and effective Monitoring Strategy.

Before getting into the Strategy to monitor microservices, let’s take a look at a few core concepts that one needs to be aware of, which are Logs, Metrics, and Traceability.

Logs: Logs are the first place you will look if you see your application is not behaving in an expected manner. Your application emits logs to publish its current state. Logs are mostly categorized into Debug, Info, Warning, and Error.

Metrics: Metrics are time-series data published by applications to provide a quick view of an aspect that changes over time depending on external conditions like request traffic. For example, latency metrics can show whether 95% of all calls respond in under 300ms.

Traceability: Traceability is very important when it comes to distributed systems with multiple microservices. Say Service A calls Service B which calls C and so on. If you see requests failing or responding slowly, you need to track which services are facing issues. Traceability helps track the journey of a request and monitor it at every step.
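One common way to get this kind of traceability is to attach an ID to each incoming request and pass it along on every downstream call. The Python sketch below simulates the Service A to B to C chain with plain function calls. The header name X-Request-ID, the service functions, and the in-memory trace log are all assumptions for illustration; real systems would typically use a tracing library rather than hand-rolled code.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # assumed header name, not mandated by anything

def service_c(headers, trace_log):
    trace_log.append((headers[TRACE_HEADER], "service-c"))
    return "ok"

def service_b(headers, trace_log):
    trace_log.append((headers[TRACE_HEADER], "service-b"))
    # Forward the same headers so the ID travels with the request.
    return service_c(headers, trace_log)

def service_a(headers, trace_log):
    headers = dict(headers)
    # The first service creates the ID if the caller did not send one.
    headers.setdefault(TRACE_HEADER, str(uuid.uuid4()))
    trace_log.append((headers[TRACE_HEADER], "service-a"))
    return service_b(headers, trace_log)

trace_log = []
service_a({}, trace_log)
for request_id, service in trace_log:
    print(request_id, service)
```

Because every log entry carries the same ID, searching the logs for that ID reconstructs the journey of one request across all services.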


An old video but still relevant

Key Takeaways

  • Complex SQL queries, joins: SQL
  • Transaction management / ACID properties: SQL (commit and rollback are available by default, whereas you need to handle them yourself in NoSQL)
  • Huge quantity of data / fast scalability: NoSQL
  • Write heavy, e.g. a logging system: NoSQL
  • Read heavy, queries / indexes: SQL
  • Fixed schema: SQL; flexible schema: NoSQL (ALTER TABLE statements are costly and have restrictions)
  • JPA / Hibernate / Django: support SQL by default
  • Archiving and managing huge data: NoSQL

Azure Networking – 2

User-defined routes
You can use a user-defined route to override the default system routes so traffic can be routed through firewalls or NVAs.

For example, you might have a network with two subnets and want to add a virtual machine in the perimeter network to be used as a firewall. You can create a user-defined route so that traffic passes through the firewall and doesn’t go directly between the subnets.

When creating user-defined routes, you can specify these next hop types:

Virtual appliance: A virtual appliance is typically a firewall device used to analyze or filter traffic that is entering or leaving your network. You can specify the private IP address of a NIC attached to a virtual machine so that IP forwarding can be enabled. Or you can provide the private IP address of an internal load balancer.
Virtual network gateway: Use to indicate when you want routes for a specific address to be routed to a virtual network gateway. The virtual network gateway is specified as a VPN for the next hop type.
Virtual network: Use to override the default system route within a virtual network.
Internet: Use to route traffic to a specified address prefix that is routed to the internet.
None: Use to drop traffic sent to a specified address prefix.

If there are multiple routes with the same address prefix, Azure selects the route based on the type in the following order of priority:

  • User-defined routes
  • BGP routes
  • System routes
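The tie-breaking order above can be sketched in Python as follows. This only models the case described here, several routes with the same address prefix; the dictionary layout, the source labels, and the sample next hops are assumptions for this sketch and not Azure's actual data model.

```python
# Priority used when several routes share the same address prefix
# (lower number wins). The source labels are chosen for this sketch.
SOURCE_PRIORITY = {"user-defined": 0, "bgp": 1, "system": 2}

def pick_route(routes_with_same_prefix):
    """Pick the winning route among routes that share one address prefix."""
    prefixes = {route["prefix"] for route in routes_with_same_prefix}
    if len(prefixes) != 1:
        raise ValueError("all routes must share exactly one address prefix")
    return min(routes_with_same_prefix,
               key=lambda route: SOURCE_PRIORITY[route["source"]])

# Hypothetical routes for the same prefix from three different sources.
routes = [
    {"prefix": "10.0.0.0/16", "source": "system", "next_hop": "Virtual network"},
    {"prefix": "10.0.0.0/16", "source": "bgp", "next_hop": "Virtual network gateway"},
    {"prefix": "10.0.0.0/16", "source": "user-defined", "next_hop": "Virtual appliance"},
]
print(pick_route(routes)["next_hop"])
```

With these sample routes the user-defined route wins, so traffic for the prefix goes to the virtual appliance.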

A network virtual appliance (NVA) is a virtual appliance that consists of various layers like:

  • a firewall
  • a WAN optimizer
  • application-delivery controllers
  • routers
  • load balancers
  • proxies

Azure Networking

VNet Peering: Virtual network peering enables you to seamlessly connect two Azure virtual networks. Once peered, the virtual networks appear as one, for connectivity purposes. There are two types of VNet peering.

Regional VNet peering connects Azure virtual networks in the same region.
Global VNet peering connects Azure virtual networks in different regions.

A VPN gateway is a specific type of virtual network gateway that is used to send encrypted traffic between an Azure virtual network and an on-premises location over the public Internet. You also use a VPN gateway to send encrypted traffic between Azure virtual networks over the Microsoft network.

Site-to-site connections connect on-premises datacenters to Azure virtual networks
VNet-to-VNet connections connect Azure virtual networks (custom)
Point-to-site (User VPN) connections connect individual devices to Azure virtual networks

There are two types of load balancers: public and internal.

A public load balancer maps the public IP address and port number of incoming traffic to the private IP address and port number of the VM. Mapping is also provided for the response traffic from the VM. By applying load-balancing rules, you can distribute specific types of traffic across multiple VMs or services. For example, you can spread the load of incoming web request traffic across multiple web servers.

An internal load balancer directs traffic to resources that are inside a virtual network or that use a VPN to access Azure infrastructure.

Application gateway: There are two primary methods of routing traffic: path-based routing and multiple-site routing.

path: /images, /videos
site: kamalmeet.com, bizt.com
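The difference between the two routing methods can be shown with a small Python sketch. It only mimics the decision logic and is not how Application Gateway is configured; the backend pool names are hypothetical, while the paths and sites are the examples from above.

```python
def route_by_path(path, path_rules, default_pool):
    """Path-based routing: choose a backend pool from the URL path."""
    for prefix, pool in path_rules.items():
        if path == prefix or path.startswith(prefix + "/"):
            return pool
    return default_pool

def route_by_site(host, site_rules, default_pool):
    """Multiple-site routing: choose a backend pool from the host name."""
    return site_rules.get(host.lower(), default_pool)

# Hypothetical backend pool names.
path_rules = {"/images": "image-pool", "/videos": "video-pool"}
site_rules = {"kamalmeet.com": "blog-pool", "bizt.com": "biz-pool"}

print(route_by_path("/images/logo.png", path_rules, "default-pool"))
print(route_by_site("bizt.com", site_rules, "default-pool"))
```

Path-based routing splits one site across pools by URL, whereas multiple-site routing lets one gateway front several sites.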

Gateway transit
You can connect to your on-premises network from a peered virtual network if you enable gateway transit from a virtual network that has a VPN gateway. Using gateway transit, you can enable on-premises connectivity without deploying virtual network gateways to all your virtual networks.

Overlapping address spaces
IP address spaces of connected networks, whether within Azure or between Azure and your on-premises network, can’t overlap. This is also true for peered virtual networks.
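Before connecting or peering networks, you can check a planned set of address spaces for overlaps with Python's standard ipaddress module. The network names and CIDR ranges below are made up for illustration.

```python
import ipaddress
from itertools import combinations

def find_overlaps(address_spaces):
    """Return pairs of network names whose address spaces overlap."""
    networks = {name: ipaddress.ip_network(cidr)
                for name, cidr in address_spaces.items()}
    return [(a, b)
            for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
            if net_a.overlaps(net_b)]

# Hypothetical address plan: the on-premises range sits inside vnet-a.
plan = {
    "vnet-a": "10.0.0.0/16",
    "vnet-b": "10.1.0.0/16",
    "on-prem": "10.0.128.0/17",
}
print(find_overlaps(plan))
```

An empty list means the plan is safe to connect; any pair returned needs to be re-addressed first.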

A is the host record and is the most common type of DNS record. It maps the domain or hostname to the IP address.
CNAME is a Canonical Name record that’s used to create an alias from one domain name to another domain name.
MX is the mail exchange record. It maps mail requests to your mail server, whether hosted on-premises or in the cloud.
TXT is the text record. It’s used to associate text strings with a domain name. Azure and Microsoft 365 use TXT records to verify domain ownership.

Azure Application Insights

Application Insights is aimed at the development team, to help you understand how your app is performing and how it’s being used. It monitors:

Request rates, response times, and failure rates – Find out which pages are most popular, at what times of day, and where your users are. See which pages perform best. If your response times and failure rates go high when there are more requests, then perhaps you have a resourcing problem.
Dependency rates, response times, and failure rates – Find out whether external services are slowing you down.
Exceptions – Analyze the aggregated statistics, or pick specific instances and drill into the stack trace and related requests. Both server and browser exceptions are reported.
Performance counters from your Windows or Linux server machines, such as CPU, memory, and network usage.
Diagnostic trace logs from your app – so that you can correlate trace events with requests.
Custom events and metrics that you write yourself in the client or server code, to track business events such as items sold or games won.

Azure Role Based Access Control

Role-based access control (RBAC) helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to.

Security principal (who). An object that represents something that is requesting access to resources. Examples: user, group, service principal, managed identity
Role definition (what). Collection of permissions that lists the operations that can be performed. Examples: Reader, Contributor, Owner, User Access Administrator
Scope (where). Boundary for the level of access that is requested. Examples: management group, subscription, resource group, resource
Assignment. Attaching a role definition to a security principal at a particular scope. Users can grant access described in a role definition by creating an assignment. Deny assignments are currently read-only and can only be set by Azure.
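The who/what/where model can be sketched as a small Python data structure. This is a simplified illustration and not the Azure SDK: the scope strings are shortened, the principal names are invented, and the only behaviour modelled is that an assignment made at a parent scope also applies to the scopes beneath it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # who
    role: str       # what
    scope: str      # where

def covers(assigned_scope, target_scope):
    """True when target_scope is assigned_scope itself or lies beneath it."""
    assigned_scope = assigned_scope.rstrip("/")
    return (target_scope == assigned_scope
            or target_scope.startswith(assigned_scope + "/"))

def roles_at(assignments, principal, scope):
    """Roles a principal holds at a scope, including inherited ones."""
    return {a.role for a in assignments
            if a.principal == principal and covers(a.scope, scope)}

# Hypothetical assignments with simplified scope paths.
assignments = [
    RoleAssignment("alice", "Reader", "/subscriptions/sub1"),
    RoleAssignment("alice", "Contributor",
                   "/subscriptions/sub1/resourceGroups/rg-app"),
]
print(roles_at(assignments, "alice",
               "/subscriptions/sub1/resourceGroups/rg-app/vm1"))
```

Assigning roles at the narrowest scope that works keeps the inherited access small.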

You want the external team to collaborate with the internal developer team in a process that’s easy and secure. With Azure Active Directory (Azure AD) business-to-business (B2B), you can add people from other companies to your Azure AD tenant as guest users.

Why use Azure AD B2B instead of federation?
With Azure AD B2B, you don’t take on the responsibility of managing and authenticating the credentials and identities of partners. Giving access to external users is much easier than with federation. You don’t need an AD administrator to create and manage external user accounts.