Cloud Native Application Design – Capacity Planning

An important aspect of architecting a system is capacity planning: you need to estimate the resources that your software is going to consume. Based on that estimate, you can calculate the overall cost or budget needed for the project. Most cloud service providers offer pricing estimation tools, where you enter your requirements and get an estimated price for a year or another specific time period.

When estimating the price one needs to come up with high-level infrastructural requirements.

The most important areas are:

Storage
Database
Compute

There are other areas as well, but these three constitute the major portion and if we are able to estimate these, others should be easy.

Database

For estimating your database needs, you need to understand which entities you will be storing. Estimate the amount of data stored for each entity over a specific time period, for example, a year. The core process is the same for NoSQL and RDBMS databases; for example, in a document-based database you will store a document instead of a row.

Taking RDBMS as the base case, we will calculate capacity requirements for a sample table. In practice, you will identify a few important tables that help you estimate the complete requirements.

So let's say you have a product table. First, we will check how much storage is needed for a single row.

Column	Type	Size
Name	Varchar (512)	512 bytes
Description	Varchar (2048)	2048 bytes
Price	Float	4 bytes
Quantity	Number	4 bytes
..	..	..
..	..	..
..	..	..

Say the columns add up to 10,000 bytes (10 KB) per row, or ~0.01 MB.

If we anticipate 1 million records in a year, that translates to
0.01 MB * 1,000,000 = 10,000 MB, or ~10 GB.

Say we have 10 tables of similar size; the total storage need would then be about 100 GB. Add a buffer on top of that for indexes, metadata, etc.
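The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `other_columns` entry is a made-up placeholder that pads the row to the assumed 10,000 bytes, the 30% overhead ratio is an arbitrary example of a buffer, and the function name is hypothetical.

```python
# Column sizes in bytes. The first four come from the sample table;
# "other_columns" is a placeholder (assumption) that pads the row to 10,000 bytes.
COLUMN_BYTES = {
    "name": 512,
    "description": 2048,
    "price": 4,
    "quantity": 4,
    "other_columns": 7432,
}


def estimate_db_storage_gb(row_bytes, rows_per_year, table_count, overhead_ratio=0.0):
    """Estimate yearly storage in GB (decimal units, 1 GB = 10**9 bytes)."""
    raw_bytes = row_bytes * rows_per_year * table_count
    # overhead_ratio is the buffer for indexes, metadata, etc.
    return raw_bytes * (1 + overhead_ratio) / 10**9


row_bytes = sum(COLUMN_BYTES.values())                           # 10,000 bytes
per_table_gb = estimate_db_storage_gb(row_bytes, 1_000_000, 1)   # 10.0
all_tables_gb = estimate_db_storage_gb(row_bytes, 1_000_000, 10)  # 100.0
# Assumed 30% buffer, purely as an example value
with_buffer_gb = estimate_db_storage_gb(row_bytes, 1_000_000, 10, overhead_ratio=0.3)
```

Keeping the buffer as a separate parameter makes it easy to revisit once you know how many indexes the real tables carry.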

The second part is memory usage, CPU usage, and network bandwidth.
For memory, say I will never run a query that joins more than 2 tables, each holding at most 10 GB of data; then I know I need enough RAM to support at least 20 GB.

Storage

Storage is easier to calculate.

If each file is x MB on average and you expect n files in a year, you need

x * n MB
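As a quick illustration of the same formula, here is a minimal sketch with made-up numbers (5 MB average file size, 200,000 files per year). The function name and the inputs are assumptions for the example only.

```python
def estimate_file_storage_gb(avg_file_mb, files_per_year):
    """x * n in MB, converted to GB with decimal units (1 GB = 1000 MB)."""
    return avg_file_mb * files_per_year / 1000


# Hypothetical inputs: 5 MB per file, 200,000 files in a year
yearly_gb = estimate_file_storage_gb(avg_file_mb=5, files_per_year=200_000)  # 1000.0
```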

Compute

Compute is the most important and most complex area to estimate. The battle-tested method for calculating compute requirements is a load test.

A load test gives you an idea of how much load can be handled by a node (VM or Pod)

The usual steps for load testing are:

Identify APIs that will be used most frequently
Identify APIs that are most heavy in terms of resource usage
Come up with the right mix of load (historical data helps) and load test the system
Let the load run for longer durations (a few days) to surface issues such as memory leaks
Check response times at different percentiles: 50, 90, 95, 99, 99.9
Check the TPS handled under load
Check for error rate and dropped requests
Monitor infrastructure performance – CPU, Memory, Request queues, Garbage collection, etc.
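Most load testing tools report these numbers for you, but it helps to see how they are derived. The sketch below assumes the raw results are available as a list of (latency in ms, succeeded) pairs; that shape, the function names, and the sample data are all hypothetical. It uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    # round() guards against float noise such as 999.0000000000001
    rank = max(1, math.ceil(round(pct / 100 * len(sorted_values), 6)))
    return sorted_values[rank - 1]


def summarize(results, duration_seconds):
    """results: list of (latency_ms, succeeded) tuples collected during the test."""
    latencies = sorted(latency for latency, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    return {
        "tps": len(results) / duration_seconds,
        "error_rate": failures / len(results),
        "percentiles_ms": {p: percentile(latencies, p) for p in (50, 90, 95, 99, 99.9)},
    }


# Made-up sample: 1,000 requests in 10 seconds, latencies 1..1000 ms,
# with every 100th request marked as failed
sample = [(i, i % 100 != 0) for i in range(1, 1001)]
report = summarize(sample, duration_seconds=10)
```

Comparing the high percentiles (99 and 99.9) against your SLA is usually more telling than the median, since tail latency degrades first as a node approaches saturation.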

Once you have all the data, you can figure out, based on your SLAs, the TPS (transactions per second) a single node can handle. Once you have that number, it is easy to calculate the overall requirement.

For example, if your load test confirms that one node can handle 100 TPS and your overall requirement is 1000 TPS, you need at least 1000 / 100 = 10 nodes.

TPS calculation: number of transactions / time in seconds. Say your load test reveals 10,000 requests were processed in 1 minute: TPS = 10,000 / 60, or ~166.7
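The two calculations in this section, TPS from a load test run and node count from a TPS target, can be sketched as follows. The function names are hypothetical, and the node count is rounded up because a fractional node cannot be provisioned.

```python
import math


def tps(request_count, duration_seconds):
    """Transactions per second observed during a load test run."""
    return request_count / duration_seconds


def nodes_needed(required_tps, tps_per_node):
    """Minimum number of nodes, rounded up to a whole node."""
    return math.ceil(required_tps / tps_per_node)


measured_tps = tps(10_000, 60)            # about 166.67
nodes = nodes_needed(1000, 100)           # 10
nodes_rounded_up = nodes_needed(1000, 150)  # 7, since 6.67 nodes is not an option
```

This is the bare minimum for steady load; the considerations below (failures, noisy tenants) typically add to it.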

Additional considerations: conditions one needs to take into account when finalizing capacity requirements

Disaster recovery

If one or more nodes are down in one cloud
If a complete cloud region is down
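One way to translate these two failure scenarios into capacity numbers is to size for the load that must still be served after the failure. The sketch below is an assumption about how you might do that, not a prescribed method: it takes the node count from the load test as `base_nodes`, assumes traffic is spread evenly across regions, and uses hypothetical function names.

```python
import math


def nodes_with_node_failures(base_nodes, tolerated_node_failures):
    """Add spare nodes so the base capacity survives individual node failures."""
    return base_nodes + tolerated_node_failures


def nodes_per_region(base_nodes, region_count, tolerated_region_failures=1):
    """Nodes each region needs so the surviving regions can carry the full load."""
    surviving_regions = region_count - tolerated_region_failures
    if surviving_regions < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(base_nodes / surviving_regions)


base_nodes = 10  # e.g. 1000 TPS required at 100 TPS per node
spare_sized = nodes_with_node_failures(base_nodes, 2)     # 12
per_region = nodes_per_region(base_nodes, region_count=3)  # 5
total_across_regions = per_region * 3                      # 15
```

Note how the cost of region-level resilience depends on the number of regions: with two regions each one must carry the full load alone, while with three regions each carries half.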

Noisy Neighbor

Especially in a SaaS-based system, there is a phenomenon of a noisy neighbor where one tenant can eat up threads causing other tenants to wait.

The bulkhead pattern and rate limiting are common ways to handle noisy neighbors. Even so, one needs to account for the threads/infrastructure that noisy neighbors can realistically block.
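To make the bulkhead idea concrete, here is a minimal in-process sketch that caps the number of in-flight requests per tenant with a semaphore. The class name, method names, and tenant IDs are hypothetical, and a production system would more likely rely on a resilience library or gateway-level limits than on hand-rolled code like this.

```python
import threading


class TenantBulkhead:
    """Caps how many requests each tenant may have in flight at once."""

    def __init__(self, max_concurrent_per_tenant):
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, tenant_id):
        # Create each tenant's semaphore lazily, under a lock to stay thread-safe
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[tenant_id]

    def try_acquire(self, tenant_id):
        """Return True if the tenant has a free slot, False if it should be rejected."""
        return self._semaphore_for(tenant_id).acquire(blocking=False)

    def release(self, tenant_id):
        """Free a slot once the tenant's request has finished."""
        self._semaphore_for(tenant_id).release()


bulkhead = TenantBulkhead(max_concurrent_per_tenant=2)
tenant_a = [bulkhead.try_acquire("tenant-a") for _ in range(3)]  # [True, True, False]
tenant_b = bulkhead.try_acquire("tenant-b")                      # True, unaffected by tenant-a
```

With a cap like this, the worst case a single tenant can block is known up front (here, 2 slots), which is the number to feed back into the capacity estimate.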

Performance Tuning

An important aspect of capacity optimization is to make sure we are using our resources in the best possible manner. For example, most applications or web servers have configurations that one can set to their requirements in order to get the best possible performance. Some examples are

Number of concurrent threads
Memory setting e.g. Java Heap Memory (Set to max)
Compression (true vs false)
Request queue (throttling)