
Cloud Native Application Design – Compute

Cloud-native application design rests on three core decisions: Compute, Database, and Storage. Most cloud providers offer several options for each.

When starting with application design, the first decision the team needs to make is how and where the application will be deployed.

Virtual Machine: The age-old classic option is a virtual machine, where the development team takes complete control over setting up servers, managing the machines, monitoring their health, handling scalability, etc. Examples are AWS EC2, Azure Virtual Machines, and Google Compute Engine.

  • Pros: Easiest to get started, gives more control over setup.
  • Cons: The dev team is responsible for managing the machines, and it is usually the least cost-effective option.

Containers: Lightweight containers are a natural fit for a microservices-based design. A container orchestrator such as Kubernetes provides off-the-shelf solutions for managing the health and scalability of container-based deployments. Most cloud providers offer managed services such as Azure Kubernetes Service or Amazon Elastic Kubernetes Service.

  • Pros: Best suited for microservice-based solutions, lightweight and hence cost-effective, and easy to manage.
  • Cons: There is a learning curve to using tools like Docker and Kubernetes correctly.

Functions: The next option is to deploy using Functions as a Service. It is again a very good fit for microservices and gives you on-demand execution of code. Providers offer services such as Azure Functions or AWS Lambda to build our code as functions.

  • Pros: Can be cost-effective thanks to the pay-per-execution model, as you only pay for execution time.
  • Cons: Does not fit all use cases (most vendors limit how long a single execution may run), and it brings vendor lock-in.

Specialized options: Apart from the options mentioned above, most providers offer specialized services, such as Azure App Service and AWS Elastic Beanstalk, that let you deploy applications written in popular technologies like Java, Python, etc. directly.

  • Pros: Being managed services, they free the dev team from managing the underlying infrastructure so it can focus on development. They also provide features like monitoring and scaling off the shelf.
  • Cons: There is a learning curve to understand the framework, and vendor lock-in, as the deployable is specific to the vendor.

Cloud Native Application Design – Load Balancing

Load Balancing is an important technique in cloud-native application design to achieve scalability, reliability, and availability. The load can be distributed among nodes (physical or containers), based on rules like round robin, weighted, performance-based, geographical distribution, etc.
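
To make the distribution rules concrete, here is a minimal sketch of round robin and weighted selection. The node names and weights are made up for illustration, and a real load balancer would also track node health.

```python
import itertools


def round_robin(backends):
    """Yield backends in a fixed rotating order."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Yield backends in proportion to their positive integer weights.

    `weights` maps a (hypothetical) backend name to its weight.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


rr = round_robin(["node-a", "node-b", "node-c"])
first_four = [next(rr) for _ in range(4)]
# node-a, node-b, node-c, then back to node-a

wrr = weighted_round_robin({"big-node": 3, "small-node": 1})
first_eight = [next(wrr) for _ in range(8)]
# big-node is picked three times for every pick of small-node
```

This naive weighted version sends bursts to the heavier node; production balancers usually interleave the picks more smoothly.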

Load Balancing can be achieved at the following levels:

DNS Level: DNS-level load balancing distributes incoming network traffic across multiple servers or IP addresses by using DNS (Domain Name System) servers to resolve domain names to IP addresses. You can choose distribution rules based on need; for example, you might want traffic originating from Europe to hit European servers and traffic from North America to hit North American servers. While resolving the DNS name, the traffic manager chooses the backend endpoint based on the rules set.

Layer 7 or Application Layer: In Layer 7 load balancing, the load balancer analyzes the content of the incoming requests, including the HTTP headers, URLs, and other application-specific data, to determine how to distribute the traffic. For example, we can set rules so that requests matching the /images pattern are routed to one backend, whereas requests matching /videos go to another. Additionally, features like SSL termination and a WAF (Web Application Firewall, which protects against threats like SQL injection and Cross-Site Scripting, or XSS) can be implemented at this layer.
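
A rough sketch of the /images and /videos rule, assuming made-up backend names. Real Layer 7 load balancers express this in configuration rather than code, so treat it only as an illustration of the decision being made.

```python
# Hypothetical routing table: path prefix -> backend pool name.
ROUTES = [
    ("/images", "image-backend"),
    ("/videos", "video-backend"),
]
DEFAULT_BACKEND = "web-backend"


def choose_backend(path):
    """Pick a backend by inspecting the request path (application-layer data)."""
    for prefix, backend in ROUTES:
        # Match the prefix itself or anything below it, but not "/imagesfoo".
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return DEFAULT_BACKEND
```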

Layer 4 or Transport Layer: Layer 4 load balancers route traffic based on basic criteria such as source IP address, destination IP address, source port, destination port, and protocol type. At the transport layer, the load balancer does not have access to request data, hence decisions can only be taken at the IP or port level. On the other hand, because no parsing is involved, overall performance is better.
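
One common way to use only these transport-level fields is to hash them and pick a backend from the result, so that packets of the same connection always reach the same server. The sketch below assumes a made-up backend list and is not how any particular product implements it.

```python
import hashlib

# Hypothetical backend addresses.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol):
    """Choose a backend from connection metadata only; no payload is parsed."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]
```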


Edge Computing

Edge Computing takes distributed computing close to the information source, rather than relying on centralized data centers. This approach is popular for systems that involve IoT devices.


The idea is to keep computation near the source to reduce latency. Decisions can be made faster because data does not need to travel long distances. This also reduces the amount of data sent to central servers, as some filtering and analysis is done at edge locations before the data is forwarded.
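
As a small illustration of pre-processing at the edge, the sketch below reduces a batch of raw sensor readings to a summary plus the readings that cross a threshold. The function name, field names, and threshold are assumptions for this example.

```python
from statistics import mean


def summarize_at_edge(readings, threshold):
    """Reduce a non-empty batch of raw readings to a small payload for the central server."""
    alerts = [value for value in readings if value > threshold]
    return {
        "count": len(readings),
        "average": mean(readings),
        "alerts": alerts,  # only unusual readings are forwarded in full
    }


readings = [21.0, 21.5, 22.0, 35.0, 21.5]
payload = summarize_at_edge(readings, threshold=30.0)
```

Only `payload` would travel to the central servers, instead of every individual reading.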

An Edge Computing architecture usually contains the following components:

  • Edge devices: Devices that collect data from sensors, cameras, and other sources. Examples include IoT devices, cameras, and industrial equipment.
  • Edge gateway: An edge gateway acts as a bridge between the edge devices and the back-end systems.
  • Edge server: This is a server located at the edge of the network that is responsible for processing and analyzing data. It can run applications and services that are optimized for low-latency and high-performance requirements.
  • Fog nodes: These are intermediate devices that sit between the edge devices and the cloud or data center. They are responsible for processing and analyzing data, similar to edge servers, but they are typically more powerful and capable of running more complex applications and services.
  • Cloud/Data center: The data that is processed at the edge is then sent to a cloud or data center for further analysis, storage and sharing.
  • Management and orchestration platform: This is a platform that manages and monitors the edge devices, gateways and servers, and allows for the deployment, configuration, and management of edge applications and services.

Clean Code: Error Handling, Testing, and Classes

In the clean code series, I will cover Error Handling, Testing, and Clean Classes in this post.

Error Handling

  • Catch only exceptions that are meant to be caught, for example, checked exceptions in Java
  • Log as much information as possible when an error or exception occurs, to help with analysis
  • Do not return null objects; return empty objects instead
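
A minimal sketch of these three points in Python (the post talks about Java, so this is only an analogy; `load_tags` is a made-up function): catch the one exception we expect, log the context, and return an empty list rather than None.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_tags(raw_json):
    """Parse a JSON array of tags; return an empty list if the input is invalid."""
    try:
        tags = json.loads(raw_json)
    except json.JSONDecodeError:
        # Catch only the error we expect, and log enough context to debug it.
        logger.exception("Could not parse tags from input %r", raw_json)
        return []
    # Callers can always iterate over the result without a None check.
    return tags if isinstance(tags, list) else []
```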


Testing

  • Code Coverage targets should be set and achieved
  • TDD helps write testable code and reduce the number of issues
  • Use the F.I.R.S.T rule for testing:
    • The test is fast-running.
    • The tests are independent of others.
    • The test is repeatable in various environments.
    • The test is self-validating.
    • The test is timely.
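
Here is a small sketch of what tests following these rules might look like, written as plain pytest-style functions. `apply_discount` is a hypothetical function under test, not something from the book.

```python
def apply_discount(price, percent):
    """Hypothetical function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)


def test_discount_reduces_price():
    # Fast and repeatable: pure computation, no network, clock, or files.
    # Self-validating: the assert decides pass or fail, nobody reads output.
    assert apply_discount(200.0, 25) == 150.0


def test_invalid_percent_is_rejected():
    # Independent: builds its own inputs and shares no state with other tests.
    try:
        apply_discount(200.0, 120)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Timeliness is the one rule code cannot show: it means writing these tests alongside (or before) the production code.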

Clean Class

  • Single Responsibility
  • Open for extension and closed for modification
  • Readability: self-documenting class names, function names, and variable names

Clean Code: Comments, Formatting, and Objects

In continuation with the clean code series, here I am covering Comments, Formatting, and Objects and data structures.


Comments

  • Code should be self-explanatory; code is written for humans to understand, not merely for the computer to execute
  • Comment only where the logic is complex
  • Private API should not have comments
  • Use comments when you want to caution other developers, for example, to explain why a List was chosen instead of a Queue and should not be changed
  • Comments should answer why (a decision was made) and not how the code works


Formatting

  • Formatting is a way to communicate with fellow developers
  • Readability = Maintainability = Extensibility
  • Vertical Alignment: Keep connected functions together for better readability
  • Horizontal Alignment: one should never need to scroll right
  • Team Formatting Rules: everyone should follow the same rules for braces, indent size, and spaces/tabs

Objects and Data Structures

  • Follow OOP principles, e.g. data and behavior should be encapsulated
  • Law of Demeter: a method M of an object O should only call methods of the following objects:
    • The object itself, O.
    • The M parameters.
    • Any object created or instantiated by M.
    • Direct components of O.
  • Avoid Static methods wherever possible
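
The Law of Demeter is easier to see in code. In this sketch (all class names are made up), the cashier only talks to the customer it was handed, and the customer only talks to its own wallet.

```python
class Wallet:
    def __init__(self, balance):
        self.balance = balance

    def debit(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


class Customer:
    def __init__(self, wallet):
        self._wallet = wallet  # a direct component of Customer

    def pay(self, amount):
        # Allowed: Customer uses its own component.
        self._wallet.debit(amount)


class Cashier:
    def charge(self, customer, amount):
        # Allowed: `customer` is a parameter of this method.
        customer.pay(amount)
        # Discouraged: customer._wallet.debit(amount) would reach through
        # the customer into an object the cashier was never given.
```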

Clean Code: Naming and Functions

Inspired by Clean Code by Robert C. Martin, I am trying to summarize coding best practices, starting with naming and functions in this post.


Naming

  • Names should encode the intent, for example, studentBirthYear
  • Use good distinctions: do not use list1, list2, etc.
  • Use pronounceable names: dobmmyy vs dateOfBirthInMonthAndYear
  • Use searchable names: if you name variables int i, j, a search for them will return a lot of unrelated matches in the code
  • Do not add the type: instead of phoneString or nameString, name and phone should be sufficient
  • Avoid unclear prefixes: m_name vs manager_name
  • Use nouns for classes and verbs for functions: employee for a class and paySalary for a function
  • Use consistent concepts: controller vs manager
  • Don’t use the same name twice to mean two different things, e.g. paymentInfo returning bank details in one place and the user’s payment in another
  • Use domain-specific names
  • Avoid names that are too long or too short: long is fine if it conveys better information, but not so long that it becomes difficult to pronounce


Functions

  • Write small functions; functions larger than 20 lines should be avoided
  • Make sure the function does only one thing
  • Use minimum arguments (max 2); if the function takes too many arguments, it is doing too much
  • DRY, Do not Repeat Yourself: if you are doing the same thing in multiple functions, move it to a common place
  • Don’t use flag arguments; a Boolean parameter already clearly states that the function does more than one thing
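
A short sketch of the last few points, using made-up invoice functions. Instead of one `render_invoice(items, as_html)` with a Boolean flag, there are two small functions that each do one thing, and the shared formatting lives in one place.

```python
def format_price(amount):
    """Shared helper, so the price format is not repeated (DRY)."""
    return f"{amount:.2f}"


def render_invoice_text(items):
    lines = [f"{name}: {format_price(price)}" for name, price in items]
    return "\n".join(lines)


def render_invoice_html(items):
    rows = [f"<li>{name}: {format_price(price)}</li>" for name, price in items]
    return "<ul>" + "".join(rows) + "</ul>"
```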

Design: Twitter

In the series of exploring designs for popular systems, I will try to come up with Twitter’s system design today.

Functional Requirements

  • User should be able to Tweet
  • User should be able to follow and view tweets of others on their timeline
  • User should be able to search for tweets
  • User should be able to view current trends

Non Functional requirements

  • Availability
  • Performance
  • Eventual Consistency
  • Scale: 150 million users with 500 million tweets per day, or ~5,800 tweets/second (500 million / 86,400 seconds)

30,000-Foot View of Architecture

[Figure: 30,000-foot view of Twitter architecture]

There are a couple of aspects of the above design that are very interesting. The first is the user timeline. This can be a complicated piece: whenever a user logs in to the app, they should see their timeline, which shows all the tweets from the people they follow. The user might be following hundreds of accounts, so it is not feasible to gather tweets from all these accounts at read time and build the timeline on the fly. A precomputed approach makes sense here, where we keep the user timeline data in a cache beforehand.

Say user A is following user B, and user B publishes a new tweet. At that moment, user A’s timeline is updated, with the new tweet added to user A’s timeline data. Similarly, if 100 users are following user B, all 100 timelines get updated (fan out the tweet and update all timelines).

It can get tricky if user B has millions of followers. A different approach can be used in this case. Assuming there are not many such popular users, we can create a separate bucket for handling them. Say user A is following user C (a celebrity): instead of updating A’s timeline beforehand, tweets from all such celebrity users can be pulled in real time when A reads the timeline.
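
The hybrid idea can be sketched in memory as below. Everything here is an assumption for illustration: the dictionaries stand in for the cache and tweet store, and the celebrity cutoff is set absurdly low so the example is easy to follow.

```python
import itertools
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # assumed cutoff; a real system would use a far larger number

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
tweets_by_author = defaultdict(list)
timeline_cache = defaultdict(list)   # user -> precomputed timeline entries
_tweet_ids = itertools.count(1)


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) > CELEBRITY_THRESHOLD


def post_tweet(author, text):
    tweet = (next(_tweet_ids), author, text)
    tweets_by_author[author].append(tweet)
    if not is_celebrity(author):
        # Fan out on write: push the tweet into every follower's cached timeline.
        for user in followers[author]:
            timeline_cache[user].append(tweet)
    return tweet


def read_timeline(user):
    # Start from the precomputed entries, then pull celebrity tweets on read.
    merged = set(timeline_cache[user])
    for author in following[user]:
        if is_celebrity(author):
            merged.update(tweets_by_author[author])
    return sorted(merged, reverse=True)  # newest (highest id) first
```

The set also removes duplicates for an author whose earlier tweets were fanned out before they crossed the threshold.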

Another important aspect is hashtagging and trends exploration. For all the incoming tweets, the text can be tokenized and the tokens analyzed to find the most frequently used terms. For example, when a cricket match is going on in India, many people might tweet with the term match or cricket. These trends might also be geo-location-based, as this particular trend is a country-specific one.
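
A toy version of the tokenize-and-count step is sketched below. The stop-word list and sample tweets are made up, and a real pipeline would work over a sliding time window and group by location.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "to", "of", "and", "what"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and keep words and hashtags that are not stop words."""
    return [t for t in re.findall(r"#?\w+", text.lower()) if t not in STOP_WORDS]


def top_trends(tweets, n=3):
    counts = Counter()
    for text in tweets:
        counts.update(set(tokenize(text)))  # count each token once per tweet
    return counts.most_common(n)


tweets = [
    "What a match in Mumbai!",
    "Cricket match is on #cricket",
    "Loving the cricket match",
]
```

With these three sample tweets, `top_trends(tweets, 2)` puts match first and cricket second.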

Design: WhatsApp

In this series of trying to understand designs for popular systems, I am taking up WhatsApp in this post. Please remember that all the designs I am sharing in this series are my personal view, for educational purposes, and might not be the same as the actual implementation.

To get started let us take a look at the requirements

Functional Requirements

  • User should be able to create and manage an account (Covered already)
  • User should be able to send a message to contact
  • User should be able to send a message to a group
  • User should be able to send a media message (image/ video)
  • Message Received and Message Read receipts to sender
  • Voice Calling (Not covering here)

Non Functional Requirements

  • Encryption
  • Scalability
  • Availability

At a very high level, the design looks very straightforward

The first important thing that we see here is that communication is not one-way, as in a normal web application where the client sends a request and receives a response. Here the mobile app (client) should be able to receive live messages from the messaging server as well. This kind of communication is called a duplex connection, where both parties can send messages. To achieve this, one can use long polling or WebSockets (preferred).

Communication Management: When a user sends a message, it will be sent to the queue of messages received, from where it will be processed and sent to the queue for messages to be sent to users.

Media Management: Before sending the message for processing, media is uploaded and stored in a storage bucket, and the link is shared with users, which can be used to fetch the actual media file.

Single/Double/Blue Tick: When a message is received and processed by the server, an acknowledgement is sent back to the sender and the message is marked with a single tick. Similarly, when the message is delivered successfully to the receiver, it is marked with a double tick, and finally, when the receiver opens the message, it is blue ticked for the sender.
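
The tick logic is essentially a small state machine on the sender side. The sketch below uses made-up event names and assumes receipts can arrive out of order, so the status is never allowed to move backwards.

```python
from enum import Enum


class Status(Enum):
    PENDING = 0    # not yet acknowledged by the server
    SENT = 1       # single tick
    DELIVERED = 2  # double tick
    READ = 3       # blue tick


# Hypothetical receipt events mapped to the status they imply.
EVENT_TO_STATUS = {
    "server_ack": Status.SENT,
    "delivered": Status.DELIVERED,
    "read": Status.READ,
}


def advance(current, event):
    """Return the new status; a late receipt never downgrades the message."""
    target = EVENT_TO_STATUS[event]
    return target if target.value > current.value else current
```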

Additional Perspectives
WhatsApp system architecture design

Design: URL Shortener

The problem we are trying to solve is to create a service that takes a long URL as input and returns a shorter version that is easy to share.

The application looks simple, but it does provide a few interesting aspects.

Database: The main database will store the long URL, short URL, created date, created by, last used, etc. As we can see, this will be a read-heavy database that must handle large datasets, so a NoSQL document-based database should be a good choice for scalability.

Data Scale:

  • Long URL – 2 KB (2048 chars)
  • Short URL – 7 bytes (7 chars)
  • Created at – 7 bytes (7 chars for epoch time)
  • last used – 7 bytes
  • created by – 16 bytes (userid)
  • Total: ~2KB

2KB * 30 million URLs per month = ~60 GB per month or 7.2 TB in 10 years
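
The estimate can be double-checked with a few lines. This only repeats the arithmetic above, assuming decimal units (1 GB = 10^9 bytes) and roughly 2 KB per record.

```python
RECORD_BYTES = 2_000          # ~2 KB per stored URL record
URLS_PER_MONTH = 30_000_000   # 30 million new URLs per month

monthly_gb = RECORD_BYTES * URLS_PER_MONTH / 1e9   # bytes -> GB
ten_year_tb = monthly_gb * 12 * 10 / 1e3           # GB over 120 months -> TB

print(f"{monthly_gb:.0f} GB per month, {ten_year_tb:.1f} TB in 10 years")
```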

Format: The next challenge is to decide the format of the tiny URL. The decision is an easy one: a Base 10 format gives 10^7 or 10 million combinations for a 7-character string, whereas a Base 62 format gives 62^7 or about 3.5 trillion combinations for a 7-character string.

Short URL Generator: Another challenge to solve is how to choose a random 7 Base 62 string for each URL.

Solution 1: Use MD5, which returns a 128-bit digest (32 hexadecimal characters), and take the first 7 characters of the encoded digest. The problem here is that truncating to 7 characters might lead to collisions, where multiple URLs have MD5 digests with the same first 7 characters.

Solution 2: Use a counter-based approach. A counter service generates a counter value which gets converted to Base 62, making sure every request gets a unique Base 62 string. To scale it better, we will have a distributed counter generator.
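
Here is a minimal sketch of the counter-based approach. The in-process counter and the dictionary are stand-ins (assumptions) for the distributed counter service and the database, and the alphabet order is just one possible choice.

```python
import itertools
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def encode_base62(number):
    """Convert a non-negative integer to its Base 62 representation."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


_counter = itertools.count(1)  # stand-in for the counter service


def shorten(long_url, store):
    """Assign the next counter value to the URL and return its 7-character code."""
    code = encode_base62(next(_counter)).rjust(7, ALPHABET[0])
    store[code] = long_url
    return code
```

Because every counter value is used only once, two different requests can never receive the same code, which is exactly what the truncated MD5 approach cannot guarantee.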

Design: Dropbox

In the series of exploring designs for popular systems, I will look at a file-sharing system like Dropbox.

Functional Requirements

  • User is able to upload or download files via a client application or web application
  • User is able to sync and share files
  • User is able to view the history of updates

Non Functional Requirements

  • Performance: Low latency while uploading the files
  • Availability
  • Concurrency: Multiple users are able to update the same file

Scaling Assumptions

  • Average file size – say 200 MB
  • Total user base – 500 million
  • Daily active users – 100 million
  • Daily file creations – 10 per user
  • Total files per user – 100
  • Average ingress per day – 10 * 100 million * 200 MB = 200 petabytes per day

Services Needed

  • User management Service
  • File Handler Service
  • Notification Service
  • Synchronization Service

File Sync

When syncing files, we break each file into smaller chunks so that only the chunks that have changed are sent to the server, rather than sending the whole file for every update. Say a 40 MB file gets broken into 20 chunks of 2 MB each.

This architecture helps solve problems like

  • Concurrency: If two users are updating different chunks, there is no conflict
  • Latency: Faster and parallel upload
  • Bandwidth: Only the updated chunks are sent
  • History Storage: A new version only needs the changed chunks of data rather than the full file space
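
One simple way to find the chunks that changed is to hash every chunk and compare the hashes with the previous version. This is a sketch with fixed-size chunks and SHA-256, both of which are my assumptions and not necessarily what Dropbox does.

```python
import hashlib

CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB, matching the chunk size used in the text


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Split the bytes into fixed-size chunks and return one hash per chunk."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_hashes, new_hashes):
    """Indexes of chunks that are new or differ from the previous version."""
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]
```

Only the chunks at the returned indexes need to be uploaded. Note that with fixed-size chunks, inserting bytes near the start of a file shifts every later chunk, which is why some systems use content-defined chunk boundaries instead.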

The most important part of this design is the client component.

  • Watcher: This component keeps an eye on a local folder for any changes. It informs Chunker and Indexer about changes.
  • Chunker: As discussed above, the chunker is responsible for breaking a file into manageable chunks
  • Indexer: On receiving an update from the Watcher, the Indexer updates the internal database with metadata details. It also communicates with the Synchronization service, sending or receiving information on updates happening to files and syncing the latest version.
  • Internal DB: To maintain file metadata locally on the client.

Cloud Storage finally stores the files and updates. The metadata server maintains metadata and helps inform clients about any updates through the synchronization service. The synchronization service adds data to a queue, which is then picked up by various clients based on availability (if a client is offline, it can read the messages later and sync up the data). The edge store helps serve clients from the nearest location.