An important aspect of architecting a system is capacity planning. You need to estimate the resources that your software is going to consume. Based on that estimate, you can calculate the overall cost and budget needed for the project. Most cloud service providers offer pricing estimation tools where you enter your requirements and get an estimated price for a year or another specific time period.
When estimating the price, you first need to come up with high-level infrastructure requirements.
The most important areas are:
- Storage
- Database
- Compute
There are other areas as well, but these three make up the major portion of the requirements. Once we can estimate these, the others should be easier.
Database
To estimate your database needs, you need to understand which entities you will be storing. Estimate the amount of data stored for each entity over a specific time period, for example, a year. The core process is the same for NoSQL and RDBMS databases; for example, in a document-based database you will store a document instead of a row.
Taking an RDBMS as the base case, we will calculate capacity requirements for a sample table. In practice, you will identify a few important tables that help you estimate the complete requirements.
So let's say you have a product table. First, we will check how much storage is needed for a single row.
Column | Type | Size |
---|---|---|
Name | Varchar (512) | 512 bytes |
Description | Varchar (2048) | 2048 bytes |
Price | Float | 4 bytes |
Quantity | Number | 4 bytes |
.. | .. | .. |
.. | .. | .. |
.. | .. | .. |
Say the total across all columns comes to 10,000 bytes (10 KB), or ~0.01 MB per row.
Say we are anticipating 1 million records in a year. That would translate to
0.01 MB * 1,000,000 = 10,000 MB, or ~10 GB
Say we have 10 tables of similar size. We would then need a total of about 100 GB of storage (add a buffer for indexes, metadata, etc.).
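The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the 30% buffer are assumptions made for the example, not figures from any database vendor, and the elided columns of the sample table are not modelled.

```python
# Sizes in bytes of the columns that are listed in the sample product table.
KNOWN_COLUMN_BYTES = {"name": 512, "description": 2048, "price": 4, "quantity": 4}


def estimate_table_storage_gb(row_bytes, rows_per_year, buffer_ratio=0.0):
    """Estimate yearly storage for one table in GB (decimal, 1 GB = 10**9 bytes).

    buffer_ratio is extra room for indexes and metadata, e.g. 0.3 for 30%.
    """
    raw_bytes = row_bytes * rows_per_year
    return raw_bytes * (1 + buffer_ratio) / 10**9


# The four listed columns account for only part of the assumed 10,000-byte row.
known_bytes = sum(KNOWN_COLUMN_BYTES.values())

# 10,000 bytes per row, 1 million rows per year, 10 similar tables.
per_table_gb = estimate_table_storage_gb(10_000, 1_000_000)
total_gb = 10 * per_table_gb

# Same estimate with an assumed 30% buffer for indexes and metadata.
total_with_buffer_gb = 10 * estimate_table_storage_gb(
    10_000, 1_000_000, buffer_ratio=0.3
)

print(f"per table: {per_table_gb:.1f} GB, total: {total_gb:.1f} GB, "
      f"with buffer: {total_with_buffer_gb:.1f} GB")
```

In a real estimate you would replace the assumed row size with the sum of your actual column sizes and pick a buffer based on how heavily the table is indexed.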
The second part is memory usage, CPU usage, and network bandwidth.
Say I will never run queries that join more than 2 tables, each with at most 10 GB of data. Then I know I need RAM to support at least 20 GB.
Storage
Storage is easier to calculate.
If you are storing files of x MB each and you expect n files in 1 year, you need
x * n MB
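As a quick sketch of the same formula, the snippet below uses made-up numbers (2 MB average file size, 500,000 files per year); both values and the function name are assumptions for illustration only.

```python
def estimate_file_storage_gb(avg_file_mb, files_per_year, years=1):
    """Return x * n (in MB) converted to GB, using 1 GB = 1000 MB."""
    return avg_file_mb * files_per_year * years / 1000


# Hypothetical workload: 2 MB per file, 500,000 files in a year.
yearly_gb = estimate_file_storage_gb(2, 500_000)
print(f"{yearly_gb:.0f} GB per year")
```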
Compute
Compute is the most important and most complex area to estimate. The battle-tested method for calculating compute requirements is load testing.
A load test gives you an idea of how much load a single node (VM or Pod) can handle.
The usual steps for load testing are:
- Identify APIs that will be used most frequently
- Identify APIs that are most heavy in terms of resource usage
- Come up with a realistic mix of load (historical data helps) and load test the system
- Let the load run for longer durations (a few days) to get better results and expose issues such as memory leaks
- Check response times at different percentiles: 50, 90, 95, 99, 99.9
- Check the TPS handled under load
- Check for error rate and dropped requests
- Monitor infrastructure performance – CPU, Memory, Request queues, Garbage collection, etc.
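The percentile, TPS, and error-rate checks from the steps above can be sketched as follows. This assumes your load-testing tool can export one latency value per successful request plus a count of failed requests; the function names and the input shape are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever your tool reports.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, errors, duration_s):
    """Summarize one load-test run.

    latencies_ms: latency of each successful request, in milliseconds
    errors:       number of failed or dropped requests
    duration_s:   length of the run in seconds
    """
    ordered = sorted(latencies_ms)
    total = len(ordered) + errors
    return {
        "tps": len(ordered) / duration_s,
        "error_rate": errors / total if total else 0.0,
        "percentiles": {p: percentile(ordered, p) for p in (50, 90, 95, 99, 99.9)},
    }


# Synthetic data for illustration: 1,000 requests with latencies 1..1000 ms.
report = summarize(list(range(1, 1001)), errors=10, duration_s=10)
print(report)
```

Comparing the high percentiles (99, 99.9) against your SLA is usually more telling than the median, because a node can look healthy at the 50th percentile while a small share of requests is already very slow.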
Once you have all the data, you can use your SLAs to figure out the TPS (transactions per second) a single node can handle. Once you have the TPS number, it is easy to calculate the overall requirement.
For example, if your load test confirms that one node can handle 100 TPS and your overall requirement is 1000 TPS, you can easily see that you need at least 10 nodes (1000 / 100).
TPS calculation: TPS = number of transactions / time in seconds. Say your load test reveals that 10,000 requests were processed in 1 minute; then TPS = 10,000 / 60, or ~166.7.
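A minimal sketch of the TPS and node-count arithmetic, using the numbers from the examples above. The function names and the optional headroom percentage are assumptions added for illustration; the node count is rounded up because you cannot provision a fraction of a node.

```python
import math


def tps(requests, seconds):
    """Transactions per second over a measured interval."""
    return requests / seconds


def nodes_needed(target_tps, tps_per_node, headroom_pct=0):
    """Nodes required to serve target_tps, rounded up.

    headroom_pct is optional spare capacity as a whole-number percentage
    (for example 20 for 20%); it is an assumption, not part of the basic formula.
    """
    return math.ceil(target_tps * (100 + headroom_pct) / (100 * tps_per_node))


measured_tps = tps(10_000, 60)           # 10,000 requests in 1 minute
base_nodes = nodes_needed(1000, 100)     # 1000 TPS target, 100 TPS per node
padded_nodes = nodes_needed(1000, 100, headroom_pct=20)

print(f"measured TPS: {measured_tps:.1f}")
print(f"nodes: {base_nodes}, with 20% headroom: {padded_nodes}")
```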
Additional considerations: conditions one needs to take into account when finalizing capacity requirements
Disaster recovery
- If one or more nodes are down in one cloud
- If a complete cloud region is down
Noisy Neighbor
Especially in SaaS-based systems, there is a phenomenon known as the noisy neighbor, where one tenant eats up threads and causes other tenants to wait.
The bulkhead pattern and/or rate limiting are common options for handling noisy neighbors. But one also needs to account for the threads/infrastructure that can realistically be blocked by noisy neighbors.
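To make the bulkhead idea concrete, here is a minimal sketch that caps the number of in-flight requests per tenant with a semaphore. The class name, the `try_run` method, and the choice to reject immediately instead of queueing are all assumptions made for this example; production systems usually rely on a resilience library or gateway-level limits rather than hand-rolled code.

```python
import threading


class TenantBulkhead:
    """Caps the number of concurrent requests each tenant may have in flight."""

    def __init__(self, max_concurrent_per_tenant):
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, tenant_id):
        # Create the tenant's semaphore lazily, guarded by a lock so that two
        # threads cannot create separate semaphores for the same tenant.
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[tenant_id]

    def try_run(self, tenant_id, func, *args, **kwargs):
        """Run func if the tenant has a free slot; otherwise reject immediately.

        Returns (accepted, result). A rejected call returns (False, None),
        which a web layer could translate into an HTTP 429 response.
        """
        semaphore = self._semaphore_for(tenant_id)
        if not semaphore.acquire(blocking=False):
            return False, None
        try:
            return True, func(*args, **kwargs)
        finally:
            semaphore.release()


# Usage: with a limit of 1, a second concurrent call from the same tenant is
# rejected while another tenant is unaffected.
bulkhead = TenantBulkhead(max_concurrent_per_tenant=1)


def handler():
    return "ok"


def busy_handler():
    # While tenant "a" holds its only slot, try one more call for "a" and one for "b".
    return bulkhead.try_run("a", handler), bulkhead.try_run("b", handler)


accepted, (same_tenant, other_tenant) = bulkhead.try_run("a", busy_handler)
print(accepted, same_tenant, other_tenant)
```

For capacity planning, the per-tenant limit multiplied by the number of tenants that can be noisy at the same time gives a rough upper bound on the threads that can be tied up.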
Performance Tuning
An important aspect of capacity optimization is making sure we use our resources in the best possible manner. For example, most application or web servers expose configuration settings that you can tune to your requirements to get the best possible performance. Some examples are:
- Number of concurrent threads
- Memory setting e.g. Java Heap Memory (Set to max)
- Compression (true vs false)
- Request queue (throttling)