Notification System — Interview Preparation Guide

Master your notification system project for SDE-2 interviews. This guide covers architecture, design decisions, and 31 practice Q&A.

How to Use This Guide

Time Available	What to Read	Sections
10 minutes	Quick scan before walking in	Part 1: Quick Reference Card + Elevator Pitch
1 hour	Core understanding	Part 1 + Part 2 (Architecture & Components)
3 hours	Full preparation	Parts 1-4 (everything except AWS)
Full day	Complete mastery	All 5 parts including AWS deployment

Part 1 — Quick Review (read 10 min before your interview)

Quick Reference Card
60-Second Elevator Pitch
How to Whiteboard This System
Keywords to Use in Interviews

Part 2 — Understand the System (build your mental model)

Architecture & Request Flow
Core Components — Rate Limiter, Caching, Kafka, Retry, Channel Handlers
Database Design — Schema, Indexes, Design Rationale
Design Decisions & Trade-offs

Part 3 — Performance & Operations (show you think about production)

Performance Tuning (10k req/sec)
Monitoring & Stress Testing
End-to-End Data Lifecycle Reference

Part 4 — Interview Q&A (31 questions with model answers)

System Design (Q1-Q5)
Architecture & Patterns (Q6-Q8)
Database (Q9-Q11)
Kafka & Messaging (Q12-Q14)
Redis & Caching (Q15-Q20)
API Design (Q21-Q22)
Error Handling (Q23-Q24)
Scaling & Performance (Q25-Q26)
Code Quality & Testing (Q27-Q28)
Behavioral (Q29-Q31)

Part 5 — AWS Production Deployment (advanced — for principal/staff-level depth)

Global Architecture & Service Mapping
Multi-Region, Compliance, Cost, DR

Part 1 — Quick Review

Read this section 10 minutes before your interview. It gives you the cheat sheet, your opening pitch, and a whiteboard plan.

Quick Reference Card

Topic	Your Answer
Why Kafka?	Decoupling, reliability, handles spikes
Why Redis?	Rate limiting + caching, fast, atomic, TTL
What do you cache?	User lookups (by ID, email, phone), device tokens, templates
Cache strategy?	TTL 1hr, eviction on changes, Jackson serialization
Why PostgreSQL?	ACID, reliable, good enough for scale
Rate limiting algo?	Token Bucket
Retry strategy?	Exponential backoff (5^n), max 3 retries
Design pattern?	Strategy (handlers), Repository (data)
Delivery guarantee?	At-least-once
Handle duplicates?	Idempotent processing (check status)

60-Second Elevator Pitch

Use this when asked: "Tell me about a project you've worked on"

I built a multi-channel notification system following system design principles. It sends notifications via Email, SMS, Push, and In-App channels.

The key technical highlights are:

Asynchronous processing using Kafka for decoupling and reliability
Rate limiting with Redis using the Token Bucket algorithm
Redis caching for user lookups and notification templates to reduce database load
Template system for reusable message content
Retry mechanism with exponential backoff for failed deliveries
Clean layered architecture following SOLID principles

The system handles the full lifecycle: API receives request → validates → saves to PostgreSQL → publishes to Kafka → worker consumes and delivers via channel handlers → updates status with retry on failure."

How to Whiteboard This System

When asked to design on a whiteboard:

Step 1: Clarify Requirements (2 min)

"What channels? Email, SMS, Push, In-App?"
"Expected volume? 1000/sec?"
"Delivery guarantee needed? At-least-once?"

Step 2: High-Level Design (5 min)

Client → API → DB → Queue → Worker → Provider
                ↓
              Redis (rate limit)

Step 3: Deep Dive (8 min)

Pick ONE area based on interviewer interest:

Rate limiting algorithm
Retry mechanism
Database schema
Kafka configuration

Step 4: Trade-offs (2 min)

"I chose X over Y because..."
"If we needed Z, I would add..."

Keywords to Use in Interviews

Architecture:

Microservices, Layered Architecture, Clean Architecture
Dependency Injection, Inversion of Control
Event-Driven, Async Processing

Patterns:

Strategy Pattern, Repository Pattern, Builder Pattern
Token Bucket Algorithm, Cache-Aside Pattern
Exponential Backoff

Caching:

Spring Cache Abstraction, @Cacheable, @CacheEvict
TTL (Time To Live), Cache Invalidation
Serialization, Jackson ObjectMapper

Reliability:

At-least-once delivery, Idempotency
Circuit Breaker, Retry with Backoff
Dead Letter Queue

Scalability:

Horizontal Scaling, Stateless Services
Partitioning, Sharding
Consumer Groups

Data:

ACID, Eventually Consistent
Indexing, Composite Index
Connection Pooling

Part 2 — Understand the System

Build your mental model of how the system works. Read this section when you have 1+ hours to prepare.

Project Architecture Overview

High-Level Flow

┌─────────────┐     ┌─────────────────────────────────────────────┐     ┌─────────────┐
│   Client    │────▶│         NOTIFICATION SERVICE                │────▶│  PostgreSQL │
│  (REST API) │     │                                             │     │  (Storage)  │
└─────────────┘     │  ┌───────────┐   ┌───────────┐   ┌───────┐  │     └─────────────┘
                    │  │Controller │──▶│  Service  │──▶│  Repo │  │
                    │  └───────────┘   └─────┬─────┘   └───────┘  │     ┌─────────────┐
                    │                        │                    │────▶│    Redis    │
                    │                        ▼                    │     │(Cache + Rate│
                    │                  ┌───────────┐              │     │   Limit)    │
                    │                  │   Kafka   │              │     └─────────────┘
                    │                  │   Kafka   │              │
                    │                  │ (Publish) │              │     ┌─────────────┐
                    │                  └─────┬─────┘              │────▶│    Kafka    │
                    │                        │                    │     │   (Queue)   │
                    └────────────────────────┼────────────────────┘     └──────┬──────┘
                                             │                                 │
                    ┌────────────────────────┼─────────────────────────────────┘
                    │                        ▼
                    │  ┌─────────────────────────────────────────────────────┐
                    │  │              KAFKA CONSUMER (Worker)                │
                    │  │                                                     │
                    │  │   ┌──────────────────┐    ┌───────────────────────┐ │
                    └──│──▶│ NotificationConsumer │──▶│  ChannelDispatcher │ │
                       │   └──────────────────┘    └───────────┬───────────┘ │
                       │                                       │             │
                       │              ┌────────────────────────┼─────┐       │
                       │              ▼            ▼           ▼     ▼       │
                       │         ┌───────┐   ┌───────┐   ┌───────┐ ┌──────┐  │
                       │         │ Email │   │  SMS  │   │ Push  │ │In-App│  │
                       │         │Handler│   │Handler│   │Handler│ │Handler│ |
                       │         └───────┘   └───────┘   └───────┘ └──────┘  │
                       └─────────────────────────────────────────────────────┘

Request Lifecycle (Step by Step)

Step	Component	Action	Data Flow Details
1	`NotificationController`	Receives POST request, validates input	Input: JSON request body `json<br>{<br> "userId": "550e8400-e29b-41d4-a716-446655440001",<br> "channel": "EMAIL",<br> "templateName": "welcome-email",<br> "templateVariables": {"userName": "John"}<br>}<br>` Validation: Bean Validation on `SendNotificationRequest` DTO
2	`NotificationService`	Validates user (cached), checks dedup + rate limit via Redis	User Lookup: `userService.findById(userId)` (cached in Redis with key `"users::id:{id}"`) Dedup Check: `deduplicationService.isDuplicate(eventId)` (Redis key `"event:{eventId}"`) Rate Limit Check: `rateLimiterService.checkAndIncrement(userId, channel)` Redis Key: `"ratelimit:{userId}:{channel}"` Throws: `RateLimitExceededException` if limit exceeded
3	`TemplateService`	Processes template (cached lookup)	Input: `templateName`, `templateVariables` Template Lookup: `templateService.getTemplateByName(name)` (cached with key `"templates::name:{name}"`) Processing: Variable substitution in template content Output: `subject`, `content` strings Example: Template `"Welcome {{userName}}!"` → `"Welcome John!"`
4	`NotificationRepository`	Saves notification with PENDING status	Entity Creation: `java<br>Notification notification = Notification.builder()<br> .user(user)<br> .channel(request.getChannel())<br> .priority(request.getPriority())<br> .subject(subject)<br> .content(content)<br> .status(NotificationStatus.PENDING)<br> .build();<br>` Database: ACID transaction ensures durability
5	`KafkaTemplate`	Publishes notification ID to Kafka topic	Message Key: `notification.getId().toString()` Message Value: `notification.getId().toString()` Topic: Channel-specific (e.g., `notifications.email`) Purpose: Only ID sent to avoid large messages
6	API Response	Returns 201 Created with notification ID	Response: `json<br>{<br> "success": true,<br> "message": "Notification queued successfully",<br> "data": {<br> "id": "550e8400-e29b-41d4-a716-446655440002",<br> "status": "PENDING"<br> }<br>}<br>` HTTP Status: 201 (Created) - notification record created, delivery happens async
7	`NotificationConsumer`	Picks up message from Kafka	Consumer Record: `ConsumerRecord<String, String>` Value: Notification ID string Processing: Parse UUID, fetch from database Status Update: `PENDING` → `PROCESSING`
8	`ChannelDispatcher`	Routes to correct handler (Email/SMS/Push/In-App)	Routing Logic: `java<br>ChannelHandler handler = handlers.get(notification.getChannel());<br>return handler.send(notification);<br>` Strategy Pattern: O(1) lookup via HashMap
9	`EmailChannelHandler` (etc.)	Sends via external provider (SendGrid/Twilio)	Handler Selection: Based on `notification.getChannel()` External API Call: SendGrid/Twilio/Firebase/etc. Data Passed: `user.email`, `notification.subject`, `notification.content` Return: `true` (success) or `false` (failure)
10	`NotificationRepository`	Updates status to SENT or schedules retry	Success: `status = SENT` Failure: `status = PENDING`, `retry_count++`, `next_retry_at` set with exponential backoff

Detailed Data Flow Example

Client Request:

POST /api/v1/notifications
{
  "userId": "550e8400-e29b-41d4-a716-446655440001",
  "channel": "EMAIL",
  "templateName": "welcome-email",
  "templateVariables": {
    "userName": "John",
    "activationLink": "https://example.com/activate/abc123"
  }
}

1. Controller → Service:

SendNotificationRequest object passed to notificationService.sendNotification(request)
Contains: userId, channel, templateName, templateVariables

2. Service Processing:

User lookup: userService.findById(request.getUserId()) (Redis-cached, avoids DB hit per request)
Deduplication check: deduplicationService.isDuplicate(eventId) (Redis key event:{eventId})
Rate limit check: Redis counter increment
Template processing: Variables substituted in template content
Entity creation: Notification object built with processed content

3. Database Persistence:

INSERT INTO notifications (id, user_id, channel, subject, content, status, created_at)
VALUES ('uuid', 'user-uuid', 'EMAIL', 'Welcome to Our Platform', 'Hi John, ...', 'PENDING', NOW());

4. Kafka Publishing:

Topic: notifications.email
Key: "550e8400-e29b-41d4-a716-446655440002"
Value: "550e8400-e29b-41d4-a716-446655440002"
Purpose: Decouple API response from slow email sending

5. Consumer Processing:

Kafka message received: "550e8400-e29b-41d4-a716-446655440002"
Database fetch: notificationRepository.findById(uuid)
Status update: PENDING → PROCESSING

6. Channel Dispatch:

Handler lookup: handlers.get(ChannelType.EMAIL) → EmailChannelHandler
Method call: emailHandler.send(notification)

7. Email Handler Execution:

User data: notification.getUser().getEmail()
Content data: notification.getSubject(), notification.getContent()
External API: SendGrid/Mailgun/SES integration
Result: Success/failure boolean

8. Final Status Update:

Success: notification.markAsSent() → status = SENT
Failure: notification.scheduleRetry("Delivery failed") → status = PENDING, retry scheduled

Component Deep Dives

1. Rate Limiter (Token Bucket Algorithm)

Location: RateLimiterService.java

How it works:
┌────────────────────────────────────────────┐
│  User requests to send EMAIL notification  │
└─────────────────┬──────────────────────────┘
                  ▼
┌────────────────────────────────────────────┐
│  Build Redis key: "ratelimit:user123:EMAIL" │
└─────────────────┬──────────────────────────┘
                  ▼
┌────────────────────────────────────────────┐
│  Get current count from Redis              │
│  Example: 7 (user sent 7 emails this hour) │
└─────────────────┬──────────────────────────┘
                  ▼
┌────────────────────────────────────────────┐
│  Compare: 7 < 10 (limit)?                  │
│  YES → Increment counter, allow request    │
│  NO  → Throw RateLimitExceededException    │
└────────────────────────────────────────────┘

Key Code:

// Redis key format: "ratelimit:{userId}:{channel}"
String key = String.format("ratelimit:%s:%s", userId, channel);

// Atomic increment with TTL
Long newCount = redisTemplate.opsForValue().increment(key);
if (newCount == 1) {
    redisTemplate.expire(key, Duration.ofSeconds(3600)); // 1 hour window
}

Why Token Bucket?

Simple to implement and explain
Allows bursts (up to bucket capacity)
Self-resets via Redis TTL (no cleanup job needed)

2. Redis Caching (Spring Cache Abstraction)

How it works:

// Service layer caching with Spring @Cacheable
@Service
public class UserService {
    
    @Cacheable(value = "users", key = "'email:' + #email")
    public User findByEmail(String email) {
        // First call hits database, result cached
        // Subsequent calls serve from Redis
        return userRepository.findByEmail(email).orElseThrow();
    }
    
    @CacheEvict(value = "users", key = "'email:' + #oldEmail")
    public void evictUserCacheByEmail(String oldEmail) {
        // Removes stale cache entry when email changes
    }
}

Cache Configuration:

@Bean
public CacheManager cacheManager(RedisConnectionFactory connectionFactory) {
    ObjectMapper objectMapper = new ObjectMapper();
    objectMapper.registerModule(new JavaTimeModule());
    objectMapper.enableDefaultTyping(ObjectMapper.DefaultTyping.NON_FINAL, 
                                   JsonTypeInfo.As.PROPERTY);
    
    GenericJackson2JsonRedisSerializer serializer = 
        new GenericJackson2JsonRedisSerializer(objectMapper);
    
    RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
        .serializeKeysWith(new StringRedisSerializer())
        .serializeValuesWith(serializer)
        .entryTtl(Duration.ofHours(1))  // 1 hour TTL
        .disableCachingNullValues();
        
    return RedisCacheManager.builder(connectionFactory)
        .cacheDefaults(config)
        .build();
}

Serialization Challenges Solved:

ClassCastException: Fixed with Jackson default typing
Lazy Loading: Prevented with @JsonIgnore on JPA relationships
Complex Objects: Proper type information in JSON

Cache Keys:

users::id:{id} - User by ID lookup (used in notification hot path)
users::email:{email} - User by email lookup
users::phone:{phone} - User by phone lookup
users::deviceTokens - List of users with push tokens
templates::name:{name} - Template by name lookup

Why This Approach:

Performance: Reduces database load by 60-80% for cached data
Consistency: Cache eviction ensures data accuracy
Scalability: Redis cluster-ready for horizontal scaling
Simplicity: Spring Cache abstraction hides complexity

3. Message Queue (Kafka)

Why Kafka?

Without Kafka	With Kafka
API waits for email to send (slow)	API returns immediately (fast)
If email fails, request fails	Worker retries independently
Can't handle traffic spikes	Queue absorbs spikes
Single point of failure	Decoupled, fault-tolerant

Key Design Decisions (Alex Xu's Pattern):

Channel-Specific Topics: notifications.email, notifications.sms, notifications.push, notifications.in-app
- Each channel scales independently
- Email issues don't affect push delivery
- Different partition counts per channel based on volume
Key = Notification ID: Ensures ordering per notification
Manual Acknowledgment: Only commit offset after successful processing
Separate Consumer Groups: Each channel has its own consumer group for independent offset tracking

Configuration (KafkaConfig.java):

// Don't auto-commit - we'll manually acknowledge
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

// Manual acknowledgment mode
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);

3. Retry Mechanism

Retry Flow:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   PENDING   │────▶│  PROCESSING  │────▶│    SENT     │ (Success!)
└─────────────┘     └──────┬───────┘     └─────────────┘
                           │
                           │ (Failure)
                           ▼
                    ┌──────────────┐
                    │ Schedule     │
                    │ Retry with   │
                    │ Backoff      │
                    └──────┬───────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
        retry_count < max?        retry_count >= max?
              │                         │
              ▼                         ▼
        Back to PENDING           Status = FAILED
        (with next_retry_at)

Exponential Backoff:

// Calculate next retry time with exponential backoff
// Retry 1: 5 min, Retry 2: 25 min, Retry 3: exceeds max → FAILED
long delayMinutes = (long) Math.pow(5, retryCount);
this.nextRetryAt = OffsetDateTime.now().plusMinutes(delayMinutes);

Two Retry Mechanisms:

Kafka Consumer: Immediate retry on failure
RetryScheduler: Picks up PENDING notifications due for retry (runs every 60s)

4. Channel Handler Pattern (Strategy Pattern)

Interface:

public interface ChannelHandler {
    ChannelType getChannelType();
    boolean send(Notification notification);
    default boolean canHandle(Notification notification) { ... }
}

Implementations:

EmailChannelHandler → SendGrid/SES
SmsChannelHandler → Twilio
PushChannelHandler → Firebase FCM
InAppChannelHandler → Database storage

Dispatcher (O(1) Routing):

// Build handler map at startup
this.handlers = handlerList.stream()
    .collect(Collectors.toMap(
        ChannelHandler::getChannelType,
        Function.identity()
    ));

// Dispatch: O(1) lookup
ChannelHandler handler = handlers.get(notification.getChannel());
return handler.send(notification);

Why Strategy Pattern?

Open/Closed Principle: Add new channels without modifying existing code
Single Responsibility: Each handler knows only its channel
Easy to test: Mock individual handlers

Database Schema Design

Entity Relationship Diagram

┌─────────────────┐       ┌─────────────────────┐
│     users       │       │  user_preferences   │
├─────────────────┤       ├─────────────────────┤
│ id (UUID)       │◄──────┤ id (UUID)           │
│ email (VARCHAR) │       │ user_id (FK)        │
│ phone (VARCHAR) │       │ channel (VARCHAR)   │
│ device_token    │       │ enabled (BOOLEAN)   │
│ created_at      │       │ quiet_hours_start   │
│ updated_at      │       │ quiet_hours_end     │
└─────────────────┘       └─────────────────────┘
         │                           │
         │                           │
         ▼                           │
┌─────────────────┐       ┌─────────────────────┐
│ notifications   │       │ notification_       │
├─────────────────┤       │    templates        │
│ id (UUID)       │       ├─────────────────────┤
│ user_id (FK)    │──────►│ id (UUID)           │
│ template_id (FK)│◄──────┤ name (VARCHAR)      │
│ channel         │       │ channel (VARCHAR)   │
│ priority        │       │ subject_template    │
│ subject         │       │ body_template (TEXT)│
│ content (TEXT)  │       │ is_active (BOOLEAN) │
│ status          │       └─────────────────────┘
│ retry_count     │
│ max_retries     │
│ next_retry_at   │
│ error_message   │
│ sent_at         │
│ delivered_at    │
│ read_at         │
│ created_at      │
└─────────────────┘

Table Schemas

Users Table

CREATE TABLE users (
    id UUID PRIMARY KEY,
    email VARCHAR(255) UNIQUE,
    phone VARCHAR(20),
    device_token VARCHAR(500),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

User Preferences Table

CREATE TABLE user_preferences (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    channel VARCHAR(20) NOT NULL,
    enabled BOOLEAN DEFAULT true,
    quiet_hours_start TIME,
    quiet_hours_end TIME,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, channel)
);

Notification Templates Table

CREATE TABLE notification_templates (
    id UUID PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    channel VARCHAR(20) NOT NULL,
    subject_template VARCHAR(500),
    body_template TEXT NOT NULL,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

Notifications Table (Main Table)

CREATE TABLE notifications (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    template_id UUID REFERENCES notification_templates(id),
    channel VARCHAR(20) NOT NULL,
    priority VARCHAR(10) NOT NULL DEFAULT 'MEDIUM',
    subject VARCHAR(500),
    content TEXT NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    next_retry_at TIMESTAMP WITH TIME ZONE,
    error_message TEXT,
    sent_at TIMESTAMP WITH TIME ZONE,
    delivered_at TIMESTAMP WITH TIME ZONE,
    read_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

Design Decisions & Intuition

Current implementation note: IDs are generated in application code using UUIDv7 in @PrePersist (time-ordered), not by database defaults.

Why UUID Primary Keys?

Globally Unique: No collisions across distributed systems
Time-Ordered (UUIDv7): Better index locality and reduced page splits vs random UUIDv4
No Sequence Gaps: Don't reveal how many records exist (security)
Client Generation: Can generate IDs without database round-trip
Microservices Ready: Perfect for distributed architectures

Why Separate User Preferences?

Flexibility: Each user can have different preferences per channel
Scalability: Preferences change less frequently than notifications
Compliance: Easy to implement "Do Not Disturb" features
Analytics: Track preference patterns across user base

Why Template System?

Consistency: Standardized messaging across the platform
Maintainability: Change message content in one place
Localization: Easy to support multiple languages
Personalization: Template variables for dynamic content
Testing: Templates can be tested independently

Why Comprehensive Notification Tracking?

Audit Trail: Complete history of all notifications sent
Analytics: Track delivery rates, user engagement, failures
Debugging: Detailed error messages and retry information
Compliance: Prove notifications were sent (legal requirements)
Business Intelligence: Understand user behavior patterns

Status Flow Design

PENDING → PROCESSING → SENT → DELIVERED
    ↓           ↓
  FAILED     READ (in-app only)

PENDING: Queued for processing
PROCESSING: Currently being sent (prevents duplicate processing)
SENT: Successfully delivered to provider (SMS gateway, email service)
DELIVERED: Confirmed received by user (webhook/callback)
FAILED: All retries exhausted
READ: User opened/acknowledged (in-app notifications only)

Retry Strategy

Exponential Backoff: next_retry_at = now + 5^retryCount minutes (5min, 25min)
Maximum Retries: Default 3 attempts before marking as FAILED
Configurable: Different retry counts for different priority levels
Smart Scheduling: Use database indexes for efficient retry queries

Indexing Strategy

-- User inbox (most common query)
CREATE INDEX idx_notifications_user_created ON notifications(user_id, created_at DESC);

-- Processing queue (find pending notifications)
CREATE INDEX idx_notifications_status ON notifications(status);

-- Retry scheduling (find notifications ready for retry)
CREATE INDEX idx_notifications_retry ON notifications(next_retry_at) 
    WHERE status = 'PENDING' AND next_retry_at IS NOT NULL;

-- Analytics (filter by channel/time)
CREATE INDEX idx_notifications_channel ON notifications(channel);

Why These Indexes?

User Inbox: user_id + created_at DESC - Fast pagination for user's notification history
Queue Processing: status - Quickly find PENDING notifications to process
Retry Logic: Partial index on next_retry_at - Only index rows that need retry
Analytics: channel - Group notifications by delivery method

Data Types Rationale

UUID: Globally unique identifiers (vs auto-increment integers)
TIMESTAMP WITH TIME ZONE: Proper timezone handling across regions
TEXT: Unlimited message content (vs VARCHAR limits)
VARCHAR(500): Reasonable limits for subjects/tokens
BOOLEAN: Simple true/false flags (vs CHAR(1) 'Y'/'N')

Constraints & Validation

UNIQUE(email): Prevent duplicate user accounts
UNIQUE(user_id, channel): One preference per user per channel
UNIQUE(name): Template names must be unique
NOT NULL: Critical fields that must always have values
FOREIGN KEY: Maintain referential integrity
ON DELETE CASCADE: Clean up related data when users are deleted

Database Performance Considerations

Read-Heavy Workload

Notification System: 90% reads (checking preferences, templates) vs 10% writes
User Inbox: Frequent queries for user's notification history
Analytics: Aggregate queries on notification data

Write Patterns

Notifications: High-volume inserts (millions per day)
Status Updates: Frequent updates during processing
Retry Logic: Updates to retry_count and next_retry_at

Optimization Strategies

Connection Pooling: Reuse database connections (HikariCP)
Read Replicas: Separate read traffic from writes
Partitioning: Partition notifications table by month/year
Archiving: Move old notifications to cold storage
UUIDv7 IDs: Time-ordered inserts improve B-tree write behavior compared to UUIDv4

Interview Questions: Database Design

Question	Answer
Why UUIDv7?	Global uniqueness + time ordering for better write locality and index performance
Why separate preferences?	User control, compliance, different settings per channel
Why templates?	Consistency, maintainability, localization support
Why track everything?	Audit trail, analytics, debugging, compliance
Status flow?	PENDING→PROCESSING→SENT→DELIVERED, with FAILED/READ
Retry strategy?	Exponential backoff, max 3 retries, configurable
Indexing strategy?	User inbox, queue processing, retry scheduling, analytics
Performance?	Read-heavy, connection pooling, read replicas, partitioning
Data integrity?	Foreign keys, constraints, cascading deletes
Scalability?	Partitioning, archiving, read replicas

Design Decisions & Trade-offs

Decision	Why	Trade-off
Kafka over RabbitMQ	Better durability, replay capability, high throughput	More complex to operate
Redis for rate limiting	Fast (in-memory), TTL support, atomic operations	Additional infrastructure
PostgreSQL	ACID compliance, JSON support, reliability	Harder to scale horizontally
UUID primary keys	Globally unique, can generate client-side	Larger than integers, slower indexes
Save-then-publish	Notification survives Kafka failure	Possible duplicate sends
Channel-specific Kafka topics	Independent scaling, channel isolation, follows Alex Xu's design	More topics to manage
Polling retry scheduler	Simple, reliable	Less immediate than push-based

Part 3 — Performance & Operations

Show interviewers you think about production. This covers tuning, monitoring, and detailed data flow.

Performance Tuning & Throughput (10k req/sec)

Current Configuration (Tuned for 10k req/sec)

Layer	Setting	Value	Why
Tomcat	max-threads	400	Handle concurrent requests (up from default 200)
Tomcat	max-connections	10,000	Accept 10k simultaneous connections
Tomcat	connection-timeout	5,000ms	Fail fast instead of holding resources
HikariCP	maximum-pool-size	50	At 5ms/query, 50 conns = ~10,000 queries/sec
HikariCP	connection-timeout	5,000ms	Fail fast if pool exhausted
Hibernate	batch_size	50	Batch inserts reduce DB round-trips
Hibernate	provider_disables_autocommit	true	Avoids extra DB call per transaction
Kafka Producer	acks	1	Leader-only ack (5x faster than acks=all)
Kafka Producer	linger.ms	5	Wait 5ms to fill batches before sending
Kafka Producer	batch-size	64KB	Larger batches = fewer network calls
Kafka Producer	compression	lz4	Compress messages to reduce network I/O
Kafka Consumer	concurrency	10	10 parallel consumer threads
Kafka Consumer	max-poll-records	500	Process 500 records per poll cycle
Kafka Topics	total partitions	26	email:8, push:8, in-app:6, sms:4

Back-of-Envelope Capacity

Component	Per-Request Cost	Capacity
Tomcat accept	~0ms	10,000 conns
Redis (dedup + rate limit)	~1ms (2 calls)	~50,000/s
User lookup (Redis cache hit)	~0.5ms	~50,000/s
DB save (notification INSERT)	~5ms	50 conns / 5ms = 10,000/s
Kafka send (batched, acks=1)	~0.1ms amortized	50,000+/s

Estimated single-instance throughput: ~8,000-12,000 req/sec

Interview Answer: "How did you tune for 10k req/sec?"

"I identified bottlenecks layer by layer:

User lookup: Each request was hitting PostgreSQL. I added @Cacheable to UserService.findById() so user lookups go to Redis (~0.5ms vs ~5ms DB).
DB connection pool: Increased HikariCP from 10 to 50. At 5ms per query, 50 connections supports ~10k queries/sec.
Kafka producer: Changed from acks=all to acks=1 (leader-only). Added batching with linger.ms=5 and batch-size=64KB with lz4 compression. This makes sends effectively fire-and-forget at ~0.1ms amortized.
Tomcat threads: Increased to 400 threads with 10k max connections and 5s timeout.
Logging: Moved hot-path logs from INFO to DEBUG to avoid I/O overhead at 10k/s.

The bottleneck shifts to PostgreSQL INSERTs at ~10k/sec. Beyond that, horizontal scaling (multiple instances behind a load balancer) is needed."

Monitoring & Stress Testing

Monitoring Stack (Implemented)

The project includes a complete Prometheus + Grafana monitoring setup:

Application (Micrometer) → Prometheus (scrape) → Grafana (dashboards)

Prometheus Metrics Exposed:

HTTP request rate, latency percentiles (p95, p99)
JVM heap usage, GC frequency
HikariCP connection pool utilization
Kafka consumer lag per topic
Custom notification counters per channel

Grafana Dashboard Panels:

HTTP throughput by endpoint
Response time distribution
JVM heap and GC pressure
Database connection pool health
Kafka consumer lag

Configuration: monitoring/prometheus/prometheus.yml scrapes the /actuator/prometheus endpoint.

Stress Testing (k6)

The project includes three k6 load test profiles in stress-test/k6/:

Script	Purpose	Peak VUs
`stress-test.js`	Steady-state throughput and latency baseline	10,000
`spike-test.js`	Burst handling and recovery behavior	1,000
`heap-test.js`	JVM heap growth and GC under pressure	—

Thresholds defined:

http_req_failed: < 5% error rate
http_req_duration: p95 < 750ms, p99 < 1200ms

How to run:

k6 run stress-test/k6/stress-test.js

# POST endpoint:
BASE_URL=http://localhost:8080 TARGET_PATH=/api/v1/notifications METHOD=POST k6 run stress-test/k6/stress-test.js

End-to-End Data Lifecycle Deep Dive

Complete Data Transformation Flow

1. Request DTO → Entity Transformation:

// Input: SendNotificationRequest (JSON)
{
  "userId": "550e8400-e29b-41d4-a716-446655440001",
  "channel": "EMAIL",
  "templateName": "welcome-email",
  "templateVariables": {"userName": "John"},
  "priority": "HIGH"
}

// Transformation in NotificationService.sendNotification():
User user = userService.findById(request.getUserId()); // Redis-cached lookup

String subject = templateService.processTemplate(request.getTemplateName(),
    request.getTemplateVariables()).getSubject(); // Template processing

// Output: Notification Entity
Notification notification = Notification.builder()
    .id(UUID.randomUUID()) // Generated
    .user(user) // Foreign key relationship
    .channel(ChannelType.EMAIL) // Enum conversion
    .priority(Priority.HIGH) // Enum conversion
    .subject(subject) // Processed from template
    .content(content) // Processed from template
    .status(NotificationStatus.PENDING) // Initial status
    .retryCount(0) // Initial retry count
    .maxRetries(3) // Configurable
    .createdAt(OffsetDateTime.now()) // Timestamp
    .build();

2. Entity → Database Record:

-- PostgreSQL INSERT (via JPA/Hibernate)
INSERT INTO notifications (
    id, user_id, channel, priority, subject, content,
    status, retry_count, max_retries, created_at
) VALUES (
    '550e8400-e29b-41d4-a716-446655440002',
    '550e8400-e29b-41d4-a716-446655440001',
    'EMAIL', 'HIGH',
    'Welcome to Our Platform, John!',
    'Hi John,\n\nThank you for joining...',
    'PENDING', 0, 3, '2024-01-27T10:30:00Z'
);

3. Entity → Kafka Message:

// Only ID sent to Kafka (small message, decoupling)
String messageKey = notification.getId().toString();
String messageValue = notification.getId().toString();
String topic = "notifications.email"; // Channel-specific

kafkaTemplate.send(topic, messageKey, messageValue);

4. Kafka Message → Consumer Processing:

// Consumer receives minimal message
ConsumerRecord<String, String> record = ...;
String notificationIdStr = record.value(); // "550e8400-e29b-41d4-a716-446655440002"

// Fetch full entity from database
UUID notificationId = UUID.fromString(notificationIdStr);
Notification notification = notificationRepository.findById(notificationId)
    .orElseThrow(() -> new RuntimeException("Notification not found"));

5. Entity → Channel Handler Data:

// EmailChannelHandler.send()
String recipientEmail = notification.getUser().getEmail(); // "john@example.com"
String subject = notification.getSubject(); // "Welcome to Our Platform, John!"
String body = notification.getContent(); // Full processed content

// External API call (SendGrid example)
SendGrid sg = new SendGrid(apiKey);
Email from = new Email("noreply@company.com");
Email to = new Email(recipientEmail);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);

6. Status Updates → Database:

// Success path
notification.setStatus(NotificationStatus.SENT);
notification.setSentAt(OffsetDateTime.now());
notificationRepository.save(notification);

// Failure path
notification.setRetryCount(notification.getRetryCount() + 1);
notification.setNextRetryAt(OffsetDateTime.now().plusMinutes(
    (long) Math.pow(5, notification.getRetryCount()))); // 5^n backoff (5min, 25min)
notification.setErrorMessage("SMTP connection failed");
notificationRepository.save(notification);

Error Handling & Failure Scenarios

Scenario 1: Rate Limit Exceeded

Client Request → Controller → Service.checkRateLimit()
    ↓
RateLimitExceededException thrown
    ↓
GlobalExceptionHandler catches → 429 Too Many Requests
    ↓
Response: {"error": "Rate limit exceeded", "retryAfter": "3600"}

Data State: No notification record created, no Kafka message sent

Scenario 2: Template Not Found

Service.processTemplate() → TemplateNotFoundException
    ↓
Wrapped in NotificationException
    ↓
GlobalExceptionHandler → 400 Bad Request

Data State: No notification record created

Scenario 3: Kafka Publishing Failure

kafkaTemplate.send() throws exception
    ↓
Logged as error, but transaction continues
    ↓
Notification saved to DB with PENDING status
    ↓
RetryScheduler will pick it up later

Data State: Notification exists in DB, will be retried

Scenario 4: Channel Handler Failure

EmailChannelHandler.send() returns false
    ↓
Consumer calls notification.scheduleRetry()
    ↓
Status: PENDING, retry_count++, next_retry_at set
    ↓
Kafka message acknowledged (prevents infinite retry)

Data State: Notification scheduled for retry with exponential backoff

Scenario 5: Database Connection Failure

notificationRepository.save() throws exception
    ↓
@Transactional rolls back entire operation
    ↓
No notification record created
    ↓
API returns 500 Internal Server Error

Data State: Consistent state maintained via ACID transactions

Retry Mechanism Deep Dive

Exponential Backoff Schedule:

// RetryScheduler.processRetries()
List<Notification> readyNotifications = notificationRepository
    .findReadyForProcessing(OffsetDateTime.now());

// For each failed notification:
int retryCount = notification.getRetryCount();
Duration backoff = Duration.ofMinutes((long) Math.pow(5, retryCount));

// retryCount=1: 5 minutes, retryCount=2: 25 minutes, retryCount=3: FAILED (max reached)
notification.setNextRetryAt(OffsetDateTime.now().plus(backoff));

Retry State Machine:

PENDING → PROCESSING → SENT (success)
    ↓           ↓
   FAIL       FAIL → PENDING (retry scheduled)
                      ↓
                   Max retries exceeded → FAILED

Dead Letter Queue Pattern:

// After maxRetries exceeded
if (notification.getRetryCount() >= notification.getMaxRetries()) {
    notification.setStatus(NotificationStatus.FAILED);
    notification.setErrorMessage("Max retries exceeded");
    // Could publish to dead letter topic for manual review
}

Data Consistency & Concurrency

Optimistic Locking (Version Fields):

// Notification entity
@Version
private Long version; // Hibernate increments on update

// Prevents concurrent modifications
notificationRepository.save(notification); // Throws exception if version mismatch

Idempotency for Duplicate Prevention:

// DeduplicationService
public boolean isDuplicate(String eventId) {
    String key = "event:" + eventId;
    Boolean exists = redisTemplate.hasKey(key);
    if (Boolean.TRUE.equals(exists)) {
        return true; // Duplicate!
    }
    // Mark as seen with 24-hour TTL
    redisTemplate.opsForValue().set(key, "1",
        Duration.ofSeconds(eventTtlSeconds));
    return false;
}

Transactional Boundaries:

@Transactional
public NotificationResponse sendNotification(SendNotificationRequest request) {
    // User lookup, rate limiting, template processing
    // ALL succeed or ALL fail together
    Notification saved = notificationRepository.save(notification);
    
    // Kafka publish (non-transactional)
    sendToKafka(saved);
    
    return NotificationResponse.from(saved);
}

Monitoring & Observability Data Flow

Metrics Collection:

// NotificationService
@Timed(value = "notification.send", histogram = true)
public NotificationResponse sendNotification(SendNotificationRequest request) {
    // Implementation
}

// Custom metrics
meterRegistry.counter("notification.sent", "channel", channel.name()).increment();
meterRegistry.timer("notification.delivery.time", "channel", channel.name())
    .record(() -> channelHandler.send(notification));

Logging Data Points:

// Structured logging with MDC
MDC.put("notificationId", notification.getId().toString());
MDC.put("userId", notification.getUser().getId().toString());
MDC.put("channel", notification.getChannel().name());

log.info("Processing notification for delivery");

Health Check Data:

// Actuator endpoint data
@Component
public class NotificationHealthIndicator implements HealthIndicator {
    public Health health() {
        long pendingCount = notificationRepository.countByStatus(PENDING);
        if (pendingCount > 10000) { // Threshold
            return Health.down()
                .withDetail("pendingNotifications", pendingCount)
                .build();
        }
        return Health.up().build();
    }
}

Data Retention & Cleanup

Notification Archival:

// Scheduled job for old notifications
@Scheduled(cron = "0 0 2 * * ?") // Daily at 2 AM
@Transactional
public void archiveOldNotifications() {
    OffsetDateTime cutoff = OffsetDateTime.now().minusMonths(6);
    List<Notification> oldNotifications = notificationRepository
        .findByCreatedAtBeforeAndStatusIn(cutoff, 
            List.of(SENT, FAILED));
    
    // Move to archive table or delete
    notificationRepository.deleteAll(oldNotifications);
}

Redis Key Expiration:

// Rate limiting keys auto-expire
redisTemplate.expire(rateLimitKey, Duration.ofHours(1));

// Cache keys have TTL
@Cacheable(value = "users", key = "'email:' + #email", 
           unless = "#result == null")
public User findByEmail(String email) {
    return userRepository.findByEmail(email);
}

Edge Cases & Data Validation

Invalid User Scenarios:

User deleted between API call and consumer processing
User email/phone changed during processing
User preferences disable channel during processing

Template Processing Edge Cases:

Missing template variables
Malformed template syntax
Template channel mismatch

Concurrent Processing:

Multiple consumers processing same notification
Rate limit counters updated simultaneously
Template cache invalidation during updates

Network & External Service Failures:

Kafka broker unavailable
Email provider rate limited
Database connection pool exhausted
Redis cluster failover

Part 4 — Interview Questions & Answers

31 questions with model answers. Don't memorize word-for-word — understand the key points and explain in your own words.

System Design Questions

Q1: Walk me through how a notification is sent in your system.

Answer: "When a client sends a POST request to /api/v1/notifications:

The Controller validates the request body using Bean Validation
The Service checks for duplicate events via Redis, then checks rate limits - if exceeded, returns 429
If using a template, the TemplateService processes variable substitution
The notification is saved to PostgreSQL with PENDING status
The notification ID is published to Kafka (channel-specific topic)
We return 201 Created immediately (delivery happens async)
A Kafka Consumer picks up the message
The ChannelDispatcher routes to the right handler (Email/SMS/Push/In-App)
The handler sends via the external provider
Status is updated to SENT, or retry is scheduled if failed"

Q2: Why did you use a message queue? Why Kafka specifically?

Answer: "I used a message queue for three main reasons:

Decoupling: The API returns immediately instead of waiting for email to send. Users get a fast response.
Reliability: If SendGrid is down, the notification is safe in Kafka. The worker will retry later.
Handling spikes: If we get 10,000 notifications at once, Kafka absorbs the load. Workers process at their own pace.

I chose Kafka over RabbitMQ because:

Durability: Messages are persisted to disk
Replay: If something goes wrong, we can replay messages
High throughput: Kafka handles millions of messages per second
Consumer groups: Easy horizontal scaling of workers"

Q3: How does your rate limiting work?

Answer: "I implemented the Token Bucket algorithm using Redis:

Each user+channel combination has a Redis key like ratelimit:user123:EMAIL
The value is a counter of notifications sent in the current hour
When a request comes in, I check: is counter < limit?
If yes: increment counter and proceed
If no: throw RateLimitExceededException (HTTP 429)

The counter auto-expires after 1 hour via Redis TTL, so it self-resets.

Why Redis?

Fast: In-memory, sub-millisecond lookups
Atomic: INCR operation is thread-safe
TTL: Built-in expiration, no cleanup job needed
Shared: Multiple app instances see the same counters"

Q4: How do you handle failures? What if an email fails to send?

Answer: "I have a two-layer retry mechanism:

Layer 1 - Immediate (Kafka Consumer): When delivery fails, I update the notification with:

retry_count incremented
next_retry_at set with exponential backoff (1min, 2min, 4min...)
status back to PENDING
last_error with the failure reason

Layer 2 - Scheduled (RetryScheduler): A cron job runs every 60 seconds, finds PENDING notifications where next_retry_at < now, and reprocesses them.

Safeguards:

Max 3 retries with 5^n backoff (5min, 25min), then status = FAILED
Stuck notification reset (if PROCESSING > 10 minutes, reset to PENDING)
All failures are logged for monitoring"

Q5: What happens if Kafka is down when you try to publish?

Answer: "The notification is safe because of my save-then-publish pattern:

I save to PostgreSQL first (transactional, durable)
Then publish to Kafka (fire-and-forget with acks=1)

If Kafka publish fails:

I catch the exception and log it
The transaction still commits (notification is in DB)
The RetryScheduler will find PENDING notifications and reprocess them

This ensures at-least-once delivery - we might send duplicates, but we'll never lose a notification."

Architecture & Design Pattern Questions

Q6: What design patterns did you use?

Answer: "Several patterns:

Strategy Pattern (Channel Handlers):
- ChannelHandler interface with send() method
- Implementations: EmailChannelHandler, SmsChannelHandler, etc.
- ChannelDispatcher routes to the right handler
- Benefit: Add new channels without changing existing code (Open/Closed Principle)
Repository Pattern:
- Spring Data JPA repositories abstract database access
- Service layer doesn't know SQL or JPA details
- Easy to switch databases or add caching
Builder Pattern:
- Used for constructing Notification and ApiResponse objects
- Makes object creation readable: Notification.builder().user(user).channel(EMAIL).build()
Template Method Pattern (implicit):
- ChannelHandler.canHandle() has a default implementation
- Subclasses can override for custom validation"

Q7: How did you ensure loose coupling between components?

Answer: "Three main techniques:

Dependency Injection: Constructor injection everywhere. Components receive their dependencies, they don't create them. Makes testing easy - just pass mocks.
Interface-based design: ChannelHandler interface means the dispatcher doesn't know about specific handlers. I can add WhatsAppChannelHandler without touching dispatcher code.
Event-driven architecture: The API and worker are decoupled via Kafka. They don't call each other directly. The API publishes events, workers consume them."

Q8: How would you add a new notification channel (e.g., WhatsApp)?

Answer: "It's a 2-step process thanks to the Strategy Pattern:

Create the handler:

@Component
public class WhatsAppChannelHandler implements ChannelHandler {
    @Override
    public ChannelType getChannelType() {
        return ChannelType.WHATSAPP;
    }
    
    @Override
    public boolean send(Notification notification) {
        // Call WhatsApp Business API
    }
}

Add the enum value:

public enum ChannelType {
    EMAIL, SMS, PUSH, IN_APP, WHATSAPP
}

That's it. Spring auto-discovers the new handler, the dispatcher registers it automatically."

Database & Data Modeling Questions

Q9: Explain your database schema design.

Answer: "I have 4 main tables:

users: Basic user info (email, phone, device_token)
user_preferences: Channel preferences per user (enabled/disabled, quiet hours)
notification_templates: Reusable message templates with variable placeholders
notifications: The core table - one row per notification sent

Key design decisions:

UUIDs as primary keys: Globally unique, can generate client-side
Timestamps with timezone: All times stored in UTC
Indexes on common queries: user_id, status, created_at
Composite index: (user_id, created_at DESC) for inbox pagination

The notifications table is the busiest, so I indexed heavily:

CREATE INDEX idx_notifications_user_id ON notifications(user_id);
CREATE INDEX idx_notifications_status ON notifications(status);
CREATE INDEX idx_notifications_user_created ON notifications(user_id, created_at DESC);
```"

---

**Q10: How do you handle the notifications table growing very large?**

**Answer:**
"For a production system at scale, I would consider:

1. **Time-based partitioning:**
   Partition by month. Old partitions can be archived or dropped.
   ```sql
   CREATE TABLE notifications (...) PARTITION BY RANGE (created_at);

Archival strategy: Move notifications older than 90 days to an archive table or cold storage (S3).
Read replicas: Route inbox queries to read replicas to reduce load on primary.
Caching hot data: Cache recent notifications in Redis for fast inbox loading.

However, for the current scope, the indexes I have are sufficient for millions of records."

Q11: Why UUIDs instead of auto-increment IDs?

Answer: "I use UUIDv7 (time-ordered UUIDs) instead of auto-increment IDs.

Core reasons:

Globally unique: No collisions across distributed systems or merging databases
Generate client-side: Don't need a database round-trip to get an ID
Write performance: UUIDv7 is time-ordered, so inserts are much more index-friendly than random UUIDv4
Security: Doesn't reveal how many records exist (ID enumeration attack prevention)

Trade-offs I accept:

Larger (16 bytes vs 4 bytes for int)
Still larger than BIGINT, but better locality than UUIDv4
Not human-readable

For this use case, the benefits outweigh the costs."

Kafka & Messaging Questions

Q12: How do you ensure messages aren't lost in Kafka?

Answer: "Multiple safeguards:

Producer side:
- acks=1 (leader acknowledges before confirming)
- Idempotent producer enabled (enable.idempotence=true) to prevent duplicates
- Producer batching with retries=3 and retry.backoff.ms=1000
- Kafka persists to disk before acknowledging
Consumer side:
- Manual acknowledgment (AckMode.MANUAL)
- Only commit offset after successful processing
- If consumer crashes, message will be redelivered
Application level:
- Save to PostgreSQL before publishing to Kafka
- If Kafka fails, RetryScheduler picks up PENDING notifications

This gives at-least-once delivery. We might process duplicates, but never lose messages."

Q13: How do you handle duplicate messages?

Answer: "I implemented a two-layer deduplication strategy:

1. Event-Level Deduplication (Prevention): Clients can provide an eventId in notification requests. Before creating any notification, I check:

if (request.getEventId() != null && deduplicationService.isDuplicate(request.getEventId())) {
    return NotificationResponse.builder()
        .status(FAILED)
        .errorMessage("Duplicate event: notification already processed")
        .build();
}

Redis Storage: Event IDs stored with 24-hour TTL using keys like event:{eventId}
Atomic Check: Uses Redis SET NX semantics to prevent race conditions
Distributed Safe: Works across multiple application instances

2. Processing-Level Deduplication (Recovery): For Kafka message duplicates, I check status before processing:

if (notification.getStatus() != PENDING) {
    log.info("Notification {} already processed", notificationId);
    return; // Skip duplicate processing
}

3. Database-Level Deduplication: PostgreSQL constraints prevent duplicate data insertion.

This gives me exactly-once delivery at the event level and at-least-once with deduplication at the processing level."

Q14: Why did you use channel-specific Kafka topics?

Answer: "I implemented channel-specific topics following Alex Xu's system design pattern:

notifications.email - 8 partitions (high volume, can tolerate delay)
notifications.sms - 4 partitions (rate-limited by providers like Twilio)
notifications.push - 8 partitions (needs to be fast for UX)
notifications.in-app - 6 partitions (moderate volume)

Benefits of this approach:

Independent Scaling: I can have 10 email consumers but only 4 SMS consumers based on volume
Fault Isolation: If the email provider is down, push notifications still work
Different SLAs: Push notifications can have higher processing priority
Better Monitoring: Track lag and throughput per channel separately

Implementation:

// Route to correct topic based on channel
private String getTopicForChannel(ChannelType channel) {
    return switch (channel) {
        case EMAIL -> emailTopic;
        case SMS -> smsTopic;
        case PUSH -> pushTopic;
        case IN_APP -> inAppTopic;
    };
}

Each channel also has its own consumer group for independent offset tracking."

Redis & Caching Questions

Q15: Why Redis for rate limiting instead of in-memory?

Answer: "In a distributed system with multiple application instances:

In-memory (HashMap):

Instance A counts: 5 requests
Instance B counts: 5 requests
User actually sent 10, but each instance thinks only 5

Redis (shared):

Both instances read/write to same counter
Accurate count across all instances

Also, Redis gives me:

Atomic operations: INCR is thread-safe
TTL: Counter auto-expires, no cleanup needed
Persistence: Survives app restart"

Q16: What if Redis goes down?

Answer: "I have a few options:

Fail open (current): If Redis is unavailable, allow the request. Better user experience, but rate limiting is bypassed.
Fail closed: Reject all requests if Redis is down. Safer, but poor UX.
Local fallback: Maintain an in-memory rate limiter as backup. Less accurate but functional.

For production, I'd recommend:

Redis cluster/sentinel for high availability
Health check endpoint to monitor Redis
Alerting when Redis becomes unavailable"

Q17: Tell me about your caching implementation. What do you cache and why?

Answer: "I implemented Redis caching for frequently accessed data to reduce database load:

What I cache:

User lookups by ID: users::id:{id} → User entity (hot path for notifications)
User lookups by email: users::email:{email} → User entity
User lookups by phone: users::phone:{phone} → User entity
Users with device tokens: users::deviceTokens → List
Notification templates: templates::name:{name} → TemplateResponse

Why these specifically:

User lookups by ID: Called on every notification send; Redis cache eliminates a DB round-trip per request
User lookups by email/phone: Called frequently during user operations, users don't change often
Device tokens: Push notifications need to find all users with tokens, expensive query
Templates: Reusable content, read-heavy, write-rare

Cache strategy:

TTL: 1 hour for all cached data
Serialization: Jackson with default typing for complex objects
Eviction: @CacheEvict when data changes (user email update, template modification)
Cache misses: Only successful lookups are cached, exceptions are not"

Q18: How do you handle cache consistency when data changes?

Answer: "I use cache eviction strategies:

For user data:

@CacheEvict(value = "users", key = "'email:' + #oldEmail")
public void evictUserCacheByEmail(String oldEmail) {
    // Removes stale cache entry
}

For templates:

@CacheEvict(value = "templates", allEntries = true)
public TemplateResponse updateTemplate(...) {
    // Clears all template cache on any change
}

Why this approach:

Immediate consistency: Cache is invalidated when data changes
Simple: No complex cache update logic
Safe: Next request will hit database and refresh cache
Performance: Better than cache invalidation storms"

Q19: What serialization challenges did you face with Redis caching?

Answer: "Two main issues:

1. ClassCastException with LinkedHashMap:

Problem: Jackson deserialized User objects as LinkedHashMap
Root cause: Missing type information in JSON
Solution: Enabled default typing in ObjectMapper:

objectMapper.enableDefaultTyping(ObjectMapper.DefaultTyping.NON_FINAL, JsonTypeInfo.As.PROPERTY);

2. Lazy loading exceptions:

Problem: @OneToMany relationships caused lazy loading in cached objects
Root cause: JPA proxy objects can't be serialized
Solution: Added @JsonIgnore to lazy relationships:

@OneToMany(fetch = FetchType.LAZY)
@JsonIgnore
private List<Notification> notifications;

Result: Clean JSON serialization without database dependencies"

Q20: How do you test your caching implementation?

Answer: "I test both cache hits and cache misses:

Cache Miss Testing:

# Clear cache
docker exec notification-redis redis-cli FLUSHALL

# First request - hits database
http :8080/api/v1/users/email/john@example.com
# Logs show: Hibernate SQL query executed

# Check Redis - data is now cached
docker exec notification-redis redis-cli keys "*"
# Shows: users::email:john@example.com

# Second request - serves from cache
http :8080/api/v1/users/email/john@example.com  
# No database query in logs

Failed Lookup Testing:

# Non-existent user
http :8080/api/v1/users/email/nonexistent@example.com
# Returns 404, no cache entry created
# Redis keys still shows only successful lookups

Cache Eviction Testing:

# Update user email (would trigger @CacheEvict)
# Verify old cache key is removed, new one is created
```"

---

### API Design Questions


---

**Q21: Why return 201 Created instead of 200 OK?**

**Answer:**
"HTTP 201 means 'a new resource was created as a result of the request.'

This is semantically correct because:
- A notification record is created in the database
- The notification ID is returned so clients can track it
- Actual delivery happens asynchronously via Kafka

200 OK would also be acceptable, but 201 better communicates that a new resource (the notification) was created.

I also return the notification ID so clients can check status later:
```json
{
  "success": true,
  "data": {
    "id": "abc-123",
    "status": "PENDING"
  }
}
```"

---

**Q22: How do you handle validation errors?**

**Answer:**
"I use Bean Validation annotations and a global exception handler:

**Request DTO:**
```java
@NotNull(message = "User ID is required")
private UUID userId;

@NotNull(message = "Channel is required")
private ChannelType channel;

Global Exception Handler:

@ExceptionHandler(MethodArgumentNotValidException.class)
public ResponseEntity<ApiResponse<Map<String, String>>> handleValidation(...) {
    // Extract field errors
    Map<String, String> errors = ex.getBindingResult()
        .getFieldErrors()
        .stream()
        .collect(toMap(FieldError::getField, FieldError::getDefaultMessage));
    
    return ResponseEntity.badRequest().body(ApiResponse.error(errors));
}

Response:

{
  "success": false,
  "message": "Validation failed",
  "errors": {
    "userId": "User ID is required",
    "channel": "Channel is required"
  }
}
```"

---

### Error Handling & Reliability Questions


---

**Q23: How do you handle partial failures in bulk notifications?**

**Answer:**
"For bulk notifications, I use a **best-effort** approach:

1. Process each user independently in a loop
2. Catch exceptions per-user, don't fail the whole batch
3. Track successes and failures separately
4. Return a detailed report

```java
for (UUID userId : request.getUserIds()) {
    try {
        // Process notification
        response.addSuccess(notificationId);
    } catch (Exception e) {
        response.addFailure(userId, e.getMessage());
    }
}

Response:

{
  "totalRequested": 100,
  "successCount": 95,
  "failedCount": 5,
  "successIds": ["abc", "def", ...],
  "failures": [
    {"userId": "xyz", "reason": "Rate limit exceeded"}
  ]
}
```"

---

**Q24: What monitoring did you implement for production readiness?**

**Answer:**
"I built a full observability stack into the project:

1. **Metrics (Prometheus + Micrometer) — implemented:**
   - Spring Boot Actuator exposes `/actuator/prometheus`
   - Prometheus scrapes metrics every 15s (configured in `monitoring/prometheus/prometheus.yml`)
   - Grafana dashboards visualize HTTP throughput, JVM heap, DB pool health, and Kafka consumer lag
   - Custom counters for notifications sent per channel, rate limit rejections, retry counts

2. **Stress Testing (k6) — implemented:**
   - `stress-test.js`: 10,000 VU ramp-up to validate throughput target
   - `spike-test.js`: Sudden burst to test circuit-breaker behavior
   - `heap-test.js`: Long-running load to detect JVM memory leaks
   - Thresholds: p95 < 750ms, error rate < 5%

3. **Health endpoints (Spring Boot Actuator) — implemented:**
   - `/health` - Overall health with DB, Redis, Kafka checks
   - `/actuator/prometheus` - Prometheus-format metrics

4. **Logging:**
   - Hot-path logs set to DEBUG to avoid I/O overhead at scale
   - WARN level for errors and anomalies
   - Structured logging ready for ELK stack integration

5. **Alerting (Grafana — ready to configure):**
   - Failed notification rate > 5%
   - Kafka consumer lag > 1000
   - HikariCP pool utilization > 80%
   - Redis unavailable"

---

### Scaling & Performance Questions


---

**Q25: How would you scale this system to handle 10x traffic?**

**Answer:**
"I would scale horizontally at multiple layers:

1. **API Layer:**
   - Run multiple instances behind a load balancer
   - They're stateless, so easy to scale
   - Tomcat tuned: 400 threads, 10k max connections

2. **Kafka:**
   - Already configured with high partition counts (email:8, push:8, in-app:6, sms:4)
   - More partitions = more parallel consumers
   - Producer batching: linger.ms=5, batch-size=64KB, lz4 compression

3. **Workers:**
   - Add more consumer instances (same group ID)
   - Currently 10 concurrent consumer threads
   - Kafka distributes partitions among them

4. **Database:**
   - HikariCP pool at 50 connections with batch_size=50
   - Read replicas for inbox queries
   - Consider sharding by user_id for extreme scale

5. **Redis:**
   - User lookups cached by ID (avoids DB hit per notification)
   - Redis Cluster for distributed rate limiting
   - Consistent hashing for key distribution"

---

**Q26: What's the bottleneck in your current design?**

**Answer:**
"The most likely bottlenecks:

1. **Database writes:** Every notification = one INSERT. At high volume, this could slow down.
   **Current mitigation:** HikariCP pool=50, Hibernate batch_size=50, provider_disables_autocommit.
   **Further solution:** Batch inserts, or use a write-optimized DB like Cassandra.

2. **External providers:** SendGrid/Twilio have their own rate limits.
   **Solution:** Respect their limits, queue excess, retry with backoff.

3. **Kafka producer throughput:** Sync sends would bottleneck at high volume.
   **Current mitigation:** Fire-and-forget sends with producer batching (linger.ms=5, batch-size=64KB, lz4 compression, acks=1).

With current tuning (50 DB conns, 400 Tomcat threads, 10 Kafka consumer threads), a single instance can sustain ~8,000-12,000 req/sec. Beyond that, horizontal scaling is needed."

---

### Code Quality & Testing Questions


---

**Q27: How do you test this system?**

**Answer:**
"Multiple test levels:

1. **Unit Tests:**
   - Test services with mocked dependencies
   - `RateLimiterServiceTest` - mock Redis
   - `NotificationServiceTest` - mock repos, Kafka

2. **Integration Tests:**
   - Use real PostgreSQL (via Docker)
   - Test full request flow
   - Verify database state

3. **Manual/E2E Testing:**
   - Swagger UI for API testing
   - Postman collection included
   - Docker Compose for local environment

**Example unit test:**
```java
@Test
void shouldRejectWhenRateLimitExceeded() {
    when(redisTemplate.opsForValue().get(anyString())).thenReturn("10");
    
    assertThrows(RateLimitExceededException.class, () -> 
        rateLimiterService.checkAndIncrement(userId, EMAIL)
    );
}
```"

---

**Q28: How do you handle configuration across environments?**

**Answer:**
"Spring profiles and externalized configuration:

1. **application.yml:** Default configuration
2. **application-test.yml:** Test overrides (different DB, disabled schedulers)
3. **Environment variables:** Override for production (DB_URL, KAFKA_SERVERS)

**Example:**
```yaml
# application.yml
notification:
  rate-limit:
    email: ${RATE_LIMIT_EMAIL:10}  # Default 10, override via env var

This follows 12-factor app principles - config in environment, not code."

Behavioral/Situational Questions

Q29: What was the most challenging part of building this?

Answer: "The retry mechanism with exactly-once semantics was tricky.

Challenge: Ensuring a failed notification gets retried without:

Losing it (if worker crashes)
Sending duplicates (if processed twice)

Solution:

Save to DB before publishing to Kafka (notification survives Kafka failure)
Check status before processing (skip if already SENT)
Manual Kafka acknowledgment (only commit after success)
Scheduled job as backup (catches anything missed)

This gives at-least-once delivery with idempotent processing."

Q30: What would you do differently if you rebuilt this?

Answer: "A few things:

Event Sourcing: Store notification events instead of just current state. Enables audit trails and replay.
Dead Letter Queue: Instead of max retries → FAILED, move to a DLQ for manual investigation.
Circuit Breaker: If SendGrid is down, stop trying and fail fast. Prevent cascade failures.
Outbox Pattern: Instead of publish after save, use a transactional outbox for guaranteed delivery.

These are production-grade improvements I'd add in a real system."

Q31: How did you decide what to include vs. exclude?

Answer: "I followed Alex Xu's approach: solve the core problem well, explicitly scope out complexity.

Included (core features):

Multi-channel delivery ✓
Rate limiting ✓
Async processing ✓
Retry mechanism ✓
Template system ✓

Excluded (complexity traps):

Analytics dashboard ✗
A/B testing ✗
Complex scheduling (cron) ✗
Multi-tenancy ✗

This keeps the project explainable in an interview without inviting questions I can't answer well."

Part 5 — AWS Production Deployment

Advanced topic for principal/staff-level interviews. Shows you can plan a multi-region production deployment.

AWS Production Deployment (Multi-Country Scenario)

Global Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    GLOBAL SERVICES (Single Region)              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │   Route 53      │  │   CloudFront    │  │   WAF & Shield  │   │
│  │   (DNS)         │  │   (CDN)         │  │   (Security)     │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                    REGIONAL SERVICES (Per Region)               │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │   Application   │  │   ElastiCache   │  │   Aurora Global  │   │
│  │   Load Balancer │  │   (Redis)       │  │   Database       │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │   ECS/EKS       │  │   MSK (Kafka)   │  │   S3             │   │
│  │   (Containers)  │  │   (Messaging)   │  │   (Storage)      │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

AWS Service Mapping

Component	AWS Service	Configuration	Multi-Region
Application	ECS Fargate / EKS	Auto-scaling groups, health checks	Regional deployment
Database	Aurora Global Database	PostgreSQL 15, multi-master	Global write/read endpoints
Cache	ElastiCache Global Datastore	Redis 7, cluster mode	Global replication
Message Queue	Amazon MSK	Kafka 3.x, multi-AZ	Regional with cross-region replication
Load Balancer	Application Load Balancer	SSL termination, WAF integration	Regional
CDN	CloudFront	Edge locations, caching rules	Global
DNS	Route 53	Geo-routing, health checks	Global
Storage	S3	Versioning, cross-region replication	Multi-region replication
Security	AWS WAF + Shield	Rate limiting, DDoS protection	Global
Monitoring	CloudWatch + X-Ray	Metrics, logs, tracing	Cross-region aggregation

Multi-Region Deployment Strategy

Primary-Secondary Model

US-East-1 (Primary)     EU-West-1 (Secondary)     AP-Southeast-1 (Secondary)
├── Aurora Writer       ├── Aurora Reader         ├── Aurora Reader
├── Redis Primary       ├── Redis Replica         ├── Redis Replica
├── MSK Primary         ├── MSK Replica           ├── MSK Replica
└── ECS Tasks           └── ECS Tasks             └── ECS Tasks

Active-Active Model (Advanced)

US-East-1               EU-West-1                AP-Southeast-1
├── Aurora Multi-Master ├── Aurora Multi-Master  ├── Aurora Multi-Master
├── Redis Global        ├── Redis Global         ├── Redis Global
├── MSK Global          ├── MSK Global           ├── MSK Global
└── ECS Global          └── ECS Global           └── ECS Global

Compliance & Data Residency

GDPR (Europe)

Data Location: EU-West-1 (Ireland) primary
Data Processing: Consent management, right to erasure
Cross-Border: Explicit user consent required
Retention: Configurable per regulation requirements

CCPA (California)

Data Location: US-West-2 (Oregon) primary
Data Subject Rights: Access, delete, opt-out
Data Mapping: Track all personal data flows
Breach Notification: <72 hours requirement

PDPA (Singapore/Asia)

Data Location: AP-Southeast-1 (Singapore) primary
Data Protection: Consent-based collection
Cross-Border: PDPC approval for transfers
Retention: Business necessity principle

Cost Optimization

Compute (ECS/EKS)

Spot Instances: 70% savings for non-critical workloads
Auto-scaling: Scale to zero during low traffic
Graviton Processors: 20% cost reduction vs x86

Database (Aurora)

Serverless v2: Pay per usage, auto-scaling
Aurora Optimized Reads: Up to 8x faster queries
Storage Auto-scaling: No over-provisioning

Cache (ElastiCache)

Reserved Nodes: 50% savings for predictable workloads
Data Tiering: Automatic cost optimization
Cluster Mode: Better memory utilization

Network (CloudFront)

Edge Locations: Reduced latency, lower data transfer costs
Compression: Gzip/Brotli for smaller payloads
Caching: Reduce origin requests by 80%

Monitoring & Observability

Application Metrics

CloudWatch Metrics:
├── Application: Response time, error rates, throughput
├── Database: Connection count, query latency, deadlocks
├── Cache: Hit rate, memory usage, eviction count
└── Queue: Message count, consumer lag, throughput

Distributed Tracing

X-Ray Integration:
├── Service mesh tracing across regions
├── Performance bottleneck identification
├── Error root cause analysis
└── User journey tracking

Alerting Strategy

Critical Alerts (Page immediately):
├── Service unavailable (>5 minutes)
├── Database connection failures
├── Queue backlog > 1M messages
└── Security incidents

Warning Alerts (Monitor trends):
├── High latency (>500ms p95)
├── Error rate > 1%
├── Cache hit rate < 80%
└── Storage utilization > 80%

Disaster Recovery

RTO/RPO Targets

Critical Services: RTO < 1 hour, RPO < 5 minutes
Standard Services: RTO < 4 hours, RPO < 1 hour
Data Services: RTO < 2 hours, RPO < 15 minutes

Failover Strategy

Automatic Failover:
├── Route 53 health checks trigger DNS failover
├── Aurora Global Database promotes secondary region
├── ElastiCache Global Datastore switches primary
└── ECS services scale up in secondary region

Security Considerations

Network Security

VPC Design:
├── Public subnets: Load balancers only
├── Private subnets: Application and data layers
├── Isolated subnets: Database and cache
└── Transit Gateway: Cross-region connectivity

Data Protection

Encryption:
├── Data at rest: AWS KMS with customer keys
├── Data in transit: TLS 1.3 minimum
├── Database: Transparent Data Encryption (TDE)
└── Cache: Redis encryption in transit/at rest

Access Control

IAM Strategy:
├── Least privilege principle
├── Service roles for ECS tasks
├── Cross-account access via IAM roles
└── Multi-factor authentication required

Performance Optimization

Global Latency Reduction

CloudFront: 200+ edge locations worldwide
Route 53: Geo-based routing to nearest region
Global Accelerator: TCP/UDP optimization
Aurora Global Database: Sub-second cross-region replication

Scaling Strategies

Horizontal Scaling:
├── ECS: Auto-scaling based on CPU/memory
├── Aurora: Auto-scaling read replicas
├── ElastiCache: Cluster scaling
└── MSK: Broker scaling

Vertical Scaling:
├── Instance types: Graviton3 for cost/performance
├── Storage: Aurora I/O optimization
├── Network: Enhanced networking
└── Memory: Redis cluster mode

Cost Estimation (Monthly)

Base Infrastructure (3 Regions)

Compute (ECS Fargate):     $12,000 - $25,000
Database (Aurora Global):   $8,000 - $15,000
Cache (ElastiCache):        $3,000 - $6,000
Message Queue (MSK):        $2,000 - $4,000
Storage (S3):               $500 - $1,000
CDN (CloudFront):           $2,000 - $4,000
Monitoring (CloudWatch):    $800 - $1,500
Security (WAF/Shield):      $1,000 - $2,000
─────────────────────────────────────────────
Total Estimate:            $29,300 - $58,500

Traffic-Based Scaling

1M daily notifications: +20% infrastructure cost
10M daily notifications: +100% infrastructure cost
100M daily notifications: +300% infrastructure cost

Interview Questions: AWS Production Deployment

Question	Answer
Why multi-region?	Compliance, latency, disaster recovery
Database choice?	Aurora Global - cross-region replication, PostgreSQL compatibility
Cache strategy?	ElastiCache Global - worldwide replication, Redis 7
CDN choice?	CloudFront - 200+ locations, WAF integration
DNS routing?	Route 53 geo-routing, health-based failover
Security layers?	WAF + Shield (edge), Security Groups (VPC), IAM (access)
Monitoring stack?	CloudWatch + X-Ray, centralized logging
Cost optimization?	Reserved instances, spot instances, auto-scaling
Disaster recovery?	RTO < 1hr, RPO < 5min for critical services
Compliance handling?	Regional data residency, consent management

Good luck with your interviews!

FilesExpand file tree

NOTIFICATION_SYSTEM_INTERVIEW_PREPARATION.md

Latest commit

History