Rate limiting is a potent tool for avoiding cascading failures in a distributed system. It helps applications avoid resource starvation and handle attacks like Denial of Service (DoS).
Before getting into rate-limit implementation in a distributed system, let’s take a scenario for basic rate limiting.
Let’s say the provider service has the capacity to process 100 requests per second. In this case, a simple rate limit can be implemented on the server side: it accepts 100 requests in a second and drops any additional requests. The rate limit can be expressed as requests accepted per second, per minute, or per day. Additionally, a rate limit can be applied on a per-API basis, say the “/product” service can accept 100 requests per minute, while the “/user” service can accept 200 requests per minute. The rate limit can also be implemented per client or user. For example, a client_id is sent as part of the request header, and each unique client is allowed 10 requests per minute.
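As a rough illustration of the per-client case, here is a minimal sketch in Python of a fixed-window counter keyed by client_id. The in-memory dictionary, the 10-requests-per-minute default, and the single-process/single-threaded setup are illustrative assumptions, not a production design.

```python
import time
from collections import defaultdict

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each `window_seconds` window (in-memory sketch)."""

    def __init__(self, limit=10, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        # client_id -> [request count, window start time]
        self.counters = defaultdict(lambda: [0, 0.0])

    def allow(self, client_id):
        now = time.time()
        count, window_start = self.counters[client_id]
        if now - window_start >= self.window_seconds:
            # A new window has started for this client: reset the counter.
            self.counters[client_id] = [1, now]
            return True
        if count < self.limit:
            self.counters[client_id][0] += 1
            return True
        return False  # over the limit for this window: drop/throttle the request
```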
The above scenario describes a very simple client-server implementation. But in the real world, the setup will be more complex, with multiple clients and scaled-up services deployed behind a load balancer, usually accessed through a proxy or an API gateway.
Before getting into the complexities of implementing rate limiting in a distributed system, let’s take a look at some of the basic algorithms used to implement it.
Token Bucket: To start with, think of a bucket full of tokens. Every time a request comes in, it takes a token from the bucket. New tokens are added to the bucket at a given interval. When the bucket has no more tokens, requests are throttled. Implementation-wise, it is a simple approach: a counter is set to the maximum limit allowed, and each incoming request checks the counter value and decreases it by one, until the counter reaches zero (at which point requests are throttled). The counter is updated/reset based on the rate-limiting condition.
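Below is a small sketch of the token bucket idea in Python. The continuous refill based on elapsed time (rather than a periodic reset job) and the use of a lock for thread safety are implementation choices of this sketch, not requirements of the algorithm.

```python
import time
import threading

class TokenBucket:
    """Token bucket: starts with `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Add the tokens accrued since the last check, capped at the bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1   # consume one token for this request
                return True
            return False           # bucket is empty: throttle the request
```

A bucket created as `TokenBucket(capacity=100, refill_rate=100)` would roughly correspond to the “100 requests per second” example above, while still allowing short bursts up to the bucket capacity.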
Leaky Bucket: A modification of the token bucket algorithm, where requests are processed at a fixed rate. Think of the implementation as a fixed-length queue to which requests are added and from which they are processed at a constant rate.
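A minimal sketch of the leaky bucket as a bounded queue drained at a constant rate might look like the following. The background drain thread and the `handler` callback are assumptions made for the sake of a self-contained example.

```python
import queue
import threading
import time

class LeakyBucket:
    """Leaky bucket: requests enter a bounded queue and are drained at a fixed rate."""

    def __init__(self, capacity, drain_per_second, handler):
        self.queue = queue.Queue(maxsize=capacity)
        self.interval = 1.0 / drain_per_second
        self.handler = handler
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request):
        try:
            self.queue.put_nowait(request)   # accept the request if the bucket has room
            return True
        except queue.Full:
            return False                     # bucket overflowed: drop the request

    def _drain(self):
        while True:
            request = self.queue.get()       # block until a request is available
            self.handler(request)            # process exactly one request...
            time.sleep(self.interval)        # ...then wait, keeping the outflow constant
```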
In a distributed system, it makes sense to implement the rate-limiting algorithm at the proxy or gateway layer, as it helps us fail fast and avoids unnecessary traffic to the backend. Secondly, implementing it in a centralized place takes the complexity away from the backend services: they no longer need to implement rate limiting themselves and can focus on their core business logic.
Implementing rate limiting in the above-mentioned distributed design has its own challenges. For example, with N machines running the proxy, how do we enforce 100 requests per second for a service? Do we give each machine a quota of 100/N? But load balancers do not guarantee that kind of even distribution of incoming requests.
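One common way to sidestep the per-machine split is to keep the counter in a store shared by all proxy instances, for example Redis. The sketch below is only an assumption about how that could look: the key naming, the fixed-window approach, and the use of the redis-py client are illustrative, not the only way to do it.

```python
import time
import redis  # assumes the redis-py client and a Redis instance reachable from every proxy node

r = redis.Redis(host="localhost", port=6379)

def allow(service, limit=100, window_seconds=1):
    """Fixed-window counter shared across all proxy instances for a given service."""
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{service}:{window}"     # hypothetical key scheme, e.g. ratelimit:/product:1700000000
    count = r.incr(key)                       # atomic increment, seen by every proxy node
    if count == 1:
        r.expire(key, window_seconds * 2)     # let counters from old windows expire on their own
    return count <= limit
```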