Building a Scalable and Reliable Notification System: A Complete Guide

Notifications are central to social media: you don’t want to miss who liked your photo or who mentioned you in a comment, and you don’t want to miss new content from your favourite YouTubers.

At the scale at which social media and other apps operate today, delivering close to real-time notifications to users is a significant challenge. In this article, we will discuss in detail how to develop a robust, scalable and reliable notification system.

Core Idea Implementation

The above diagram represents a high-level system design architecture for the notification service. The system comprises the components listed below.

1. Event Bus: This receives all user events without any filtering. It can be considered a data stream, such as AWS Kinesis or Kafka.
2. Consumer Listeners: This component listens to all the events published to the event bus. Here we can implement logic that filters out the events which do not require a notification. The consumer listener then passes each remaining event to an SQS queue. The reason for using a queue here is to achieve non-blocking behaviour: we don’t want the consumer to process the event entirely, but to delegate the work to downstream services so that the system can operate efficiently under a high volume of load.
3. Fan out workers: These workers consume the event payload, fetch the required details from the database, and send the result to another SQS queue, from which the notification is delivered to all the required subscribers. For example, let’s assume we need to send a notification to all the followers of Elon Musk when he posts a tweet. In that case, the fan out service will fetch the required details, such as the list of Elon Musk’s followers and any metadata needed. With that information, we can fan out notifications to all the respective subscribers.
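
The flow through these three components can be sketched roughly as follows. This is a minimal in-memory illustration, not a production implementation: the `deque` objects stand in for the SQS queues, the `FOLLOWERS` dictionary stands in for the database, and the names `NOTIFIABLE_TYPES`, `consumer_listener` and `fan_out_worker` are hypothetical, chosen only for this example.

```python
from collections import deque

# Hypothetical set of event types that warrant a notification (an assumption).
NOTIFIABLE_TYPES = {"post", "like", "mention"}

# In-memory stand-ins for the two SQS queues described above.
fanout_queue = deque()
delivery_queue = deque()

# Hypothetical follower table standing in for the database.
FOLLOWERS = {"elon": ["alice", "bob", "carol"]}


def consumer_listener(event):
    """Filter an event from the bus; enqueue it for fan-out only if it needs a notification."""
    if event["type"] not in NOTIFIABLE_TYPES:
        return False
    fanout_queue.append(event)
    return True


def fan_out_worker():
    """Take one event, look up the actor's followers, and enqueue one notification per follower."""
    event = fanout_queue.popleft()
    followers = FOLLOWERS.get(event["actor"], [])
    for follower in followers:
        delivery_queue.append(
            {"to": follower, "actor": event["actor"], "type": event["type"]}
        )
    return len(followers)
```

The listener does only the cheap filtering step and returns immediately, while the follower lookup and the per-subscriber work happen in the worker. That split is what lets each stage scale independently.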

Making the System Reliable

There are a few aspects and best practices which we can implement so that our system is more reliable and fault-tolerant.

Addition of Dead Letter Queues (DLQ): We are using SQS queues in the system, and there may be cases where a message in the queue cannot be processed and the request keeps failing. After multiple retries, we don’t want that message to hold up the queue. Moving it to a DLQ ensures problematic messages don’t block the messages behind them.
Notification Prioritization: Add multiple priority queues, with the priority set based on the type of notification. For example, a direct message can be a high-priority notification compared to a notification for a like. Each priority queue can also be attached to workers with a fixed SLA (Service Level Agreement) so that the notification is sent out on time.
Auto Scaling: Enable auto scaling for the fan out workers, so that a surge of requests can be handled effectively.
Cache Layer: Data that the fan out service requests frequently can be stored in a cache layer (Redis) to reduce the number of database calls.
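
To make the first two practices concrete, here is a small sketch that combines a priority queue with a dead letter queue. It is an in-memory approximation under stated assumptions: a `heapq` heap stands in for the separate priority queues, a plain list stands in for the DLQ, and the priority values, the `MAX_ATTEMPTS` limit and the function names are all hypothetical. In SQS itself, the move to a DLQ is configured on the queue rather than written by hand like this.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is delivered first.
PRIORITY = {"direct_message": 0, "mention": 1, "like": 2}
MAX_ATTEMPTS = 3  # assumed retry limit before a message is dead-lettered

_sequence = itertools.count()  # tie-breaker so equal priorities stay in arrival order
main_queue = []                # heap of (priority, sequence, attempts, notification)
dead_letter_queue = []         # notifications that exhausted their retries


def enqueue(notification, attempts=0):
    """Add a notification to the heap; unknown types get the lowest priority."""
    priority = PRIORITY.get(notification["type"], max(PRIORITY.values()))
    heapq.heappush(main_queue, (priority, next(_sequence), attempts, notification))


def process_one(send):
    """Pop the highest-priority notification and try to deliver it with send()."""
    _, _, attempts, notification = heapq.heappop(main_queue)
    try:
        send(notification)
        return True
    except Exception:
        attempts += 1
        if attempts >= MAX_ATTEMPTS:
            # Give up: park the message so it no longer blocks the queue.
            dead_letter_queue.append(notification)
        else:
            # Retry later, keeping track of how many attempts were made.
            enqueue(notification, attempts)
        return False
```

With this shape, a direct message enqueued after a like is still delivered first, and a notification whose delivery keeps failing ends up in `dead_letter_queue` after three attempts instead of being retried forever.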

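Conclusion

In conclusion, a well-designed notification service is essential for reliable and consistent delivery in large-scale systems. By following this approach, with an event bus, filtering consumers, queues and fan out workers, backed by dead letter queues, priority queues, auto scaling and a cache layer, we can build a robust notification service capable of operating efficiently at scale.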
