Cloud Native Application Design – Capacity Planning

An important aspect of architecting a system is capacity planning: estimating the resources your software is going to consume. Based on this estimate, you can calculate the overall cost/budget needed for the project. Most cloud service providers have pricing estimation tools where you can enter your requirements and get an estimated price for a year or another specific time period.

When estimating the price, you need to come up with high-level infrastructure requirements.

The most important areas are

  • Storage
  • Database
  • Compute

There are other areas as well, but these three constitute the major portion; if we can estimate these, the others should be easy.

Database

For estimating your database needs, you need to understand what all entities you will be storing. Estimate the amount of data stored for each entity over a specific time period, for example, a year. The core process is the same for NoSQL and RDBMS databases; for example, in a document-based database you estimate per document instead of per row.

Taking RDBMS as the base case, we will calculate the capacity requirements for a sample table. In practice, you will identify a few important tables and use them to extrapolate the complete requirements.

So let’s say you have a product table. First, we will check how much storage is needed for a single row.

Column        Type             Size
Name          Varchar(512)     512 bytes
Description   Varchar(2048)    2,048 bytes
Price         Float            4 bytes
Quantity      Number           4 bytes
…             …                …

Say the total comes to about 10,000 bytes, or ~0.01 MB, per row.

If we anticipate 1 million records in a year, that translates to
0.01 MB × 1,000,000 = 10,000 MB, or ~10 GB.

Say we have 10 such tables; that gives a total storage need of about 100 GB (add a buffer for indexes, metadata, etc.).

The second part is memory usage + CPU usage + network bandwidth.
Say I know I will never run queries joining more than 2 tables with a max of 10 GB of data each; then I know I need RAM to support at least 20 GB.
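
To make the arithmetic repeatable, here is a back-of-the-envelope sizing sketch in Java; the row size, record count, table count, and the ~30% index/metadata buffer are the illustrative assumptions from the example above.

public class DbCapacityEstimate {
    public static void main(String[] args) {
        long rowSizeBytes = 10_000;      // estimated size of one row
        long rowsPerYear = 1_000_000;    // anticipated records per year
        int tables = 10;                 // similar-sized important tables
        double indexBuffer = 1.3;        // assumed ~30% buffer for indexes/metadata

        double tableGb = rowSizeBytes * rowsPerYear / 1e9;  // ~10 GB per table
        double totalGb = tableGb * tables * indexBuffer;    // ~130 GB overall
        System.out.printf("Per table: ~%.0f GB, total with buffer: ~%.0f GB%n", tableGb, totalGb);
    }
}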


Storage

Storage is easier to calculate: you are storing files of x MB each, and in one year you expect n files, so you need roughly

x × n MB

For example, at 2 MB per file and 500,000 files a year, that works out to about 1 TB.


Compute

Now, compute is the most important and most complex area to estimate. The battle-tested method for calculating compute requirements is load testing.

A load test gives you an idea of how much load a single node (VM or pod) can handle.

Following are the usual steps for load testing:

  • Identify APIs that will be used most frequently
  • Identify APIs that are most heavy in terms of resource usage
  • Come up with a realistic mix of load (historical data helps) and load test the system
  • Let the load run for longer durations (a few days) to get better results and catch issues like memory leaks
  • Check performance at different percentiles: 50, 90, 95, 99, 99.9
  • Check the TPS handled under load
  • Check for error rate and dropped requests
  • Monitor infrastructure performance – CPU, Memory, Request queues, Garbage collection, etc.

Once you have all the data, you can use your SLAs to figure out the TPS (transactions per second) your node can handle. Once you have the TPS number, it is easy to calculate the overall requirement.

For example, if your load test confirms that one node can handle 100 TPS and your overall requirement is 1,000 TPS, you can easily see that you need at least 10 nodes, plus a buffer for failover and traffic spikes.

TPS calculation: number of transactions / time in seconds. Say your load test reveals that 10,000 requests were processed in 1 minute: TPS = 10,000/60 ≈ 166.7.
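
The same arithmetic as a small Java sketch; the 1,000 TPS target, 100 TPS per node, and 20% headroom factor are illustrative assumptions.

// Nodes needed for a target load, with headroom for failover and spikes.
static int nodesNeeded(double requiredTps, double tpsPerNode, double headroom) {
    return (int) Math.ceil(requiredTps * headroom / tpsPerNode);
}

// nodesNeeded(1000, 100, 1.0) == 10; with 20% headroom, nodesNeeded(1000, 100, 1.2) == 12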

Additional considerations: conditions one needs to take into account when finalizing capacity requirements.

Disaster recovery

  • If one or more nodes are down in one cloud
  • If a complete cloud region is down

Noisy Neighbour

Especially in SaaS-based systems, there is the phenomenon of the noisy neighbor, where one tenant can eat up threads, causing other tenants to wait.

The bulkhead pattern and/or rate limiting are common options for handling noisy neighbors. But one needs to consider how much of the threads/infrastructure a noisy neighbor can realistically block; see the sketch below.
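
As an illustration, here is a minimal bulkhead sketch in plain Java that caps the threads any single tenant can occupy with one semaphore per tenant; the limit of 5 permits is an assumption for the example.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class TenantBulkhead {
    private static final int PERMITS_PER_TENANT = 5; // assumed per-tenant cap
    private final Map<String, Semaphore> limits = new ConcurrentHashMap<>();

    public boolean tryExecute(String tenantId, Runnable work) {
        Semaphore slot = limits.computeIfAbsent(tenantId, t -> new Semaphore(PERMITS_PER_TENANT));
        if (!slot.tryAcquire()) {
            return false; // reject fast instead of letting one tenant starve the rest
        }
        try {
            work.run();
            return true;
        } finally {
            slot.release();
        }
    }
}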

Performance Tuning

An important aspect of capacity optimization is making sure we use our resources in the best possible manner. For example, most applications and web servers have configuration settings that can be tuned to get the best possible performance. Some examples are listed below, followed by a short thread-pool sketch.

  • Number of concurrent threads
  • Memory settings, e.g. Java heap memory (set to the maximum available)
  • Compression (enabled vs. disabled)
  • Request queue (throttling)
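
A sketch of the thread and request-queue knobs using a plain JDK thread pool; the pool sizes and queue capacity are illustrative and should come from your load tests.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ExecutorService pool = new ThreadPoolExecutor(
        50,                                   // core threads
        200,                                  // max concurrent threads
        60, TimeUnit.SECONDS,                 // idle keep-alive for extra threads
        new ArrayBlockingQueue<>(1_000),      // bounded request queue (throttling)
        new ThreadPoolExecutor.AbortPolicy()  // reject when the queue is full
);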

Clean Code: Java

Summarizing clean code practices for Java

Naming Conventions: Use meaningful names conveying the intent of the object.

Constants: Use constants to manage static values (constants can help memory usage, as they are cached by the JVM). For values that are reused across multiple places, create a constants file that holds the static values, and use enums to group related constants.
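
A minimal sketch of both ideas; the names and values are illustrative.

public final class OrderConstants {
    private OrderConstants() {}                          // utility holder, no instances
    public static final int MAX_RETRIES = 3;             // reused across classes
    public static final String DATE_FORMAT = "yyyy-MM-dd";

    public enum Status { NEW, PAID, SHIPPED, DELIVERED } // related constants grouped as an enum
}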

Clean Code: Remove console print statements and unnecessary comments.

Deprecate Methods: Use the @Deprecated annotation on methods/variables that aren’t meant for future use.

Strings: If you need to perform a lot of operations on a String, use StringBuilder or StringBuffer.
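
For example, concatenating inside a loop creates a new String on every iteration, while StringBuilder appends in place (StringBuffer is the synchronized variant):

StringBuilder sb = new StringBuilder();
for (int i = 0; i < 1000; i++) {
    sb.append(i).append(',');   // no intermediate String objects
}
String csv = sb.toString();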

Switch statement: Rather than using multiple if-else conditions, use the cleaner and more readable switch-case.
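
A small sketch of the same dispatch both ways; the status values are illustrative:

String label;
if ("NEW".equals(status)) {
    label = "Order received";
} else if ("SHIPPED".equals(status)) {
    label = "On its way";
} else {
    label = "Unknown";
}

// The switch equivalent reads more cleanly:
switch (status) {
    case "NEW":     label = "Order received"; break;
    case "SHIPPED": label = "On its way";     break;
    default:        label = "Unknown";        break;
}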

Exception Handling: https://kamalmeet.com/java/exception-handling-basic-principles/

Code Structure: Follow the Separation of Concerns strategy – controller, service, model, utility

Memory Leaks: Unclosed resources, e.g. unclosed URL connections can cause memory leaks. https://rollbar.com/blog/how-to-detect-memory-leaks-in-java-causes-types-tools/

Concurrent code: Avoid unnecessary synchronization, and at the same time identify areas to be synchronized where multiple threads can cause problems.

Lambdas and Streams: If you’re using Java 8+, replacing loops and extremely verbose methods with streams and lambdas makes the code look cleaner. Lambdas and streams allow you to write functional code in Java. The following snippet filters odd numbers in the traditional imperative way:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<Integer> oddNumbers = new ArrayList<>();
for (Integer number : Arrays.asList(1, 2, 3, 4, 5, 6)) {
    if (number % 2 != 0) {
        oddNumbers.add(number);
    }
}

This is the functional way of filtering odd numbers:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

List<Integer> oddNumbers = Stream.of(1, 2, 3, 4, 5, 6)
        .filter(number -> number % 2 != 0)
        .collect(Collectors.toList());

NullPointerException: When writing new methods, avoid returning null where possible; returned nulls are a common source of NullPointerExceptions.
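
One common alternative on Java 8+ is to return Optional, which makes the "no result" case explicit; Product, repository, and findByName are hypothetical names for this sketch.

import java.util.Optional;

Optional<Product> findByName(String name) {
    Product p = repository.lookup(name);   // hypothetical call that may return null
    return Optional.ofNullable(p);         // the method itself never returns null
}

// The caller handles absence explicitly instead of risking a NullPointerException:
findByName("widget").ifPresent(p -> System.out.println(p));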

Use final: Mark a method final when you want to make sure it cannot be overridden.

Avoid static: Statics can cause issues if not used properly, since static state is shared at the class level.

Data Structures: Java collections provide ArrayList, LinkedList, Vector, Stack, HashSet, HashMap, and Hashtable. It’s important to understand the pros and cons of each to use them in the correct context. A few hints to help you make the right choice:
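
A few common rules of thumb, summarized as comments:

// ArrayList    - fast random access; inserts/removals in the middle are costly
// LinkedList   - cheap inserts/removals at the ends; random access is slow
// Vector/Stack - legacy synchronized classes; prefer ArrayList/ArrayDeque today
// HashSet      - de-duplication and fast membership tests
// HashMap      - general-purpose key-value lookup, O(1) average access
// Hashtable    - legacy synchronized map; prefer ConcurrentHashMap with threads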

Least visibility: Use public, private, and protected modifiers to give each member the least visibility it needs.

Stay SOLID: https://kamalmeet.com/design/solid-principles-for-object-oriented-design/

DRY: Don’t Repeat Yourself; common code should be part of utilities and libraries.

YAGNI: You Aren’t Gonna Need It; code only what is needed.

Static Code Review: SonarQube in Eclipse

Size of Class and Functions: Classes and functions should be small, e.g. at most ~400 lines per class and ~40 lines per function.

Input checks: Inputs into methods should be checked for valid data size and range

Database Access: Use best practices like Connection Pool, JPA, Prepared statements, etc.
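
For instance, a minimal JDBC sketch combining a pooled DataSource with a prepared statement; the product table and the dataSource variable are assumptions for the example.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

try (Connection con = dataSource.getConnection();   // dataSource: a pooled javax.sql.DataSource
     PreparedStatement ps = con.prepareStatement(
             "SELECT name, price FROM product WHERE name = ?")) {
    ps.setString(1, "widget");                      // parameters are bound, not concatenated
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("name") + ": " + rs.getDouble("price"));
        }
    }
}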

Cloud Native Application Design – Pillars of Cloud Architecture

Operational Excellence

Efficiently deploy, operate, monitor, and manage your cloud workload

  • automate deployments
  • monitoring, alerting, and logging
  • manage capacity and quota
  • plan for scale – peak traffic
  • automate whenever possible

Reliability

Design for a resilient and highly available system

  • automatically recover from failure
  • test recovery procedures
  • scale horizontally to manage the workload

Security

Secure your data and workload, and align with regulatory requirements

  • data security
  • data encryption – at rest and in transit
  • apply security at all layers
  • enable traceability to investigate and take actions automatically
  • no direct interaction with data

Performance Efficiency

Design and tune your resources for the best performance

  • monitor and analyze the performance
  • compute performance optimization
  • optimize storage performance
  • anticipate load and scale
  • best practices like caching, CQRS, sharding, throttling, etc to be used

Cost optimization

Maximize the business value of the infrastructure used

  • Monitor and control cost
  • optimize cost- compute, database, storage, etc
  • identify and free unused and underused resources

Tech Trends to Watch- 2023

2023 Gartner Emerging Technologies and Trends Impact Radar
https://www.gartner.com/en/articles/4-emerging-technologies-you-need-to-know-about

My personal favorites

AI and ML: Calling this a trend to watch is not quite correct, as AI is already happening, right from recommending the next product to buy to self-driving cars.
IoT: The world is more connected, with home appliances and vehicles publishing data that gets analyzed in real time.
Edge Computing: Computation done near the source of the data, enabling quick real-time decisions.
Quantum Computing: Will become more practical and accessible, helping research and performance in different fields.
Digital Twins: Digital replicas of physical systems that help analyze and predict the impact of different parameters.
Cybersecurity: With tech becoming part of day-to-day life, security is a major concern.
Blockchain: The decentralized ledger is going beyond cryptocurrencies to supply chain management and digital identity.
Robotics and Drones: Automating how work is done; their use in delivery and surveillance will increase.
Virtual and Augmented Reality (VR/AR): Providing new ways for people to interact with and experience the world.
Metaverse: A virtual-world experience enabled by VR, AR, IoT, NFTs, and blockchain.
Web 3.0: Decentralization of the Internet, empowered by blockchain and AI.

Kafka Basics

Apache Kafka is an open-source, distributed, publish-subscribe messaging system designed to handle large amounts of data.

Important terms

Topic: Messages or data are published to and read from topics.

Partition: Topics can be split into multiple partitions, allowing for parallel processing of data streams. Each partition is an ordered, immutable sequence of records. Partitions provide a way to horizontally scale data processing within a Kafka cluster.

Broker: A server in the Kafka cluster that stores data and serves producers and consumers; together, the brokers provide the pub-sub backbone.

Producer: Publishes data to a topic.

Consumer: Subscribes to a topic and reads data from it.

Offset: Unique identifiers for messages. Each record in a partition is assigned a unique, sequential offset, and the order of the records within a partition is maintained. This means that data is guaranteed to be processed in the order it was written to the partition.

ZooKeeper: Apache ZooKeeper is a distributed coordination service for managing distributed systems. If a node fails, another node takes over its responsibilities, ensuring high availability. ZooKeeper uses a consensus algorithm to ensure that all nodes in the system have a consistent view of the data. It helps Kafka to manage coordination between brokers and to maintain configuration information.
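
Putting the terms together, here is a minimal producer sketch with the official Kafka Java client; the broker address and the "orders" topic are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // broker(s) to connect to
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    // the key ("order-42") determines the partition, preserving per-key ordering
    producer.send(new ProducerRecord<>("orders", "order-42", "order created"));
}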

BigData Ecosystem

Apache Hadoop: An open-source software framework for storing and processing large volumes of data in a distributed computing environment.

Major components: HDFS (Hadoop Distributed File System) for storing data, and MapReduce for processing data.

Apache Pig: Works on top of Hadoop, making it easier to write data analysis programs.

Apache Hive: SQL-like query language for Hadoop.

Apache Storm: A distributed, real-time processing system for big data. Its ability to process data in real time makes it well suited for use cases such as real-time stock analytics, fraud detection, and event-driven applications.

Apache Spark: It provides an in-memory data processing engine, which makes it faster and more flexible than Hadoop’s MapReduce for many use cases.

4 Vs of Big Data:

  • Volume: Data at a scale that normal RDBMS databases cannot store or are not meant to process.
  • Velocity: The speed at which data is being added.
  • Variety: Structured, semi-structured, and unstructured. Traditional systems are meant only for structured data; reviews, comments, images, etc. are unstructured data.
  • Veracity: Unverified data; data that may or may not be useful, or inconsistent data that cannot be used straight away.

Java Updates since JDK-8

Found this interesting article on updates that have happened in Java since version 8- https://ondro.inginea.eu/index.php/new-features-in-java-versions-since-java-8/

Java 8 is still the most popular version of Java, though Java 17 is a recent long-term support (LTS) version. Java 8 was an instant hit with the release of features like functional interfaces, lambda expressions, streams, the Optional class, etc. The post mentioned above talks about updates since Java 8; here are some of the important features introduced, as per my understanding.

ChatGPT by ChatGPT

ChatGPT: The Advancements in Natural Language Processing

Artificial intelligence (AI) has been revolutionizing various fields, and one of the areas where it has made the most impact is natural language processing (NLP). NLP is the field of computer science and AI that focuses on developing algorithms that can understand and process human language. With the development of powerful language models such as ChatGPT, NLP has taken a significant step forward in recent years.

What is ChatGPT?

ChatGPT is a language model developed by OpenAI, which is one of the largest AI research organizations in the world. ChatGPT is a transformer-based language model that uses deep learning to generate human-like text. It is trained on a massive amount of text data, which allows it to generate coherent and contextually appropriate responses to questions and prompts.

The name ChatGPT is a combination of “chat” and “GPT,” which stands for “Generative Pre-trained Transformer.” The “GPT” part of the name refers to the transformer architecture used in the model, which is a type of neural network that has been very successful in NLP tasks such as language generation and translation.

How Does ChatGPT Work?

ChatGPT is a pre-trained language model, which means that it is trained on a massive amount of text data before it is released to the public. During training, the model is presented with pairs of prompts and text, and it learns to generate a continuation of the text given the prompt. The model uses this training data to learn patterns and relationships in the data, which allows it to generate coherent and contextually appropriate responses.

Once the model is trained, it can be fine-tuned for specific tasks or used as is. For example, it can be fine-tuned for tasks such as question-answering, conversation generation, and summarization. The pre-training allows the model to learn a large amount of general information about the world, which makes it well-suited for a wide range of NLP tasks.

Applications of ChatGPT

ChatGPT has a wide range of applications, from customer service and chatbots to content generation and text summarization. One of the most popular applications of ChatGPT is in the field of customer service, where it can be used to provide fast and accurate answers to customer questions. ChatGPT can also be used in chatbots, where it can generate coherent and contextually appropriate responses to user queries.

Another application of ChatGPT is in the field of content generation, where it can be used to generate articles, summaries, and other types of text. For example, it can be used to generate summaries of long articles, which can save users time and effort.

Finally, ChatGPT can also be used in the field of machine translation, where it can be used to translate text from one language to another. This can be useful for organizations that need to translate large amounts of text quickly and accurately.

Conclusion

ChatGPT is a powerful language model developed by OpenAI, which has taken NLP to new heights. With its pre-training and fine-tuning capabilities, it is well-suited for a wide range of NLP tasks, from customer service and chatbots to content generation and machine translation. With its ability to generate coherent and contextually appropriate responses, it has the potential to change the way we interact with computers and the way we process information.

(The above article is generated by ChatGPT)

Choosing the right database

Choosing the right database is never easy. I have already discussed types of NoSQL databases and choosing between NoSQL and SQL.

I will try to cover some common use cases here

Use Case → Choice

  • Temporary fast access as key-value → Redis cache
  • Data stored in a time-series fashion → OpenTSDB
  • Object/file data → Blob storage
  • Text search → Elasticsearch
  • Structured data, with relations between objects and a need for transactional properties/ACID compliance → RDBMS
  • Semi-structured data (XML/JSON documents where the structure is not fixed), flexible queries → Document-based: MongoDB
  • Data that increases with time, with a limited set of queries → Columnar database: Cassandra
  • Graph relations between objects → GraphDB: Neo4j

Some useful resources from the Internet.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Which-Database_v07-10-21_1.max-2000x2000.jpeg
https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
https://aws.amazon.com/startups/start-building/how-to-choose-a-database/

Microservice Best Practices

Development 

  • Single responsibility: one task per microservice
  • Strangler Fig Pattern: https://martinfowler.com/bliki/StranglerFigApplication.html 
  • API Gateway: The API gateway should provide routing, aggregation, and SSL offloading.
  • Offload non-core responsibilities: Non-core responsibilities, including security, logging, tracking, etc., should be offloaded to a sidecar or libraries.

Design for Failure 

  • Fail fast: Patterns like the circuit breaker, timeout, and rate limit patterns help applications fail fast (see the timeout sketch after this list).
  • Isolate failure: A failure should not propagate and impact other services; the bulkhead pattern helps maintain such isolation.
  • Self-healing system: Health checks and scalability settings help ensure the system can handle a server or pod failure.
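
As one concrete example, a timeout sketch using the plain JDK CompletableFuture API (Java 9+); callRemoteService and the 2-second limit are illustrative.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

CompletableFuture<String> response = CompletableFuture
        .supplyAsync(() -> callRemoteService())      // hypothetical downstream call
        .orTimeout(2, TimeUnit.SECONDS)              // fail fast instead of holding threads
        .exceptionally(ex -> "fallback-response");   // degrade gracefully on timeout/error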

Monitoring  

  • Health Monitoring: Applications should expose a health endpoint (e.g. Spring Boot Actuator in Java), which load balancers use to keep a check.
  • Golden Signals: Every application should monitor latency, traffic, errors, and saturation https://sre.google/sre-book/monitoring-distributed-systems/
  • Distributed Tracing: Distributed tracing to check downstream and upstream dependencies.
  • Infrastructure Monitoring: Monitor CPU and Memory usage.

Performance 

  • Stateless: Keep APIs stateless
  • Asynchronous: Asynchronous communication wherever possible.
  • Caching: Cache data for better performance wherever possible.
  • Connection Pool: Database and HTTP connection pools should be enabled wherever possible.  

Good to have  

  • Separate Datastores: This is a double-edged sword; we need to be careful when separating data stores.
  • SAGA for Transaction Management: Commonly used pattern for transaction management in microservices. 
  • 12 Factor App: Generic best practices for developing a web application https://12factor.net/