Master your notification system project for SDE-2 interviews. This guide covers architecture, design decisions, and 31 practice Q&A.
| Time Available | What to Read | Sections |
|---|---|---|
| 10 minutes | Quick scan before walking in | Part 1: Quick Reference Card + Elevator Pitch |
| 1 hour | Core understanding | Part 1 + Part 2 (Architecture & Components) |
| 3 hours | Full preparation | Parts 1-4 (everything except AWS) |
| Full day | Complete mastery | All 5 parts including AWS deployment |
Part 1 — Quick Review (read 10 min before your interview)
- Quick Reference Card
- 60-Second Elevator Pitch
- How to Whiteboard This System
- Keywords to Use in Interviews
Part 2 — Understand the System (build your mental model)
- Architecture & Request Flow
- Core Components — Rate Limiter, Caching, Kafka, Retry, Channel Handlers
- Database Design — Schema, Indexes, Design Rationale
- Design Decisions & Trade-offs
Part 3 — Performance & Operations (show you think about production)
Part 4 — Interview Q&A (31 questions with model answers)
- System Design (Q1-Q5)
- Architecture & Patterns (Q6-Q8)
- Database (Q9-Q11)
- Kafka & Messaging (Q12-Q14)
- Redis & Caching (Q15-Q20)
- API Design (Q21-Q22)
- Error Handling (Q23-Q24)
- Scaling & Performance (Q25-Q26)
- Code Quality & Testing (Q27-Q28)
- Behavioral (Q29-Q31)
Part 5 — AWS Production Deployment (advanced — for principal/staff-level depth)
Read this section 10 minutes before your interview. It gives you the cheat sheet, your opening pitch, and a whiteboard plan.
| Topic | Your Answer |
|---|---|
| Why Kafka? | Decoupling, reliability, handles spikes |
| Why Redis? | Rate limiting + caching, fast, atomic, TTL |
| What do you cache? | User lookups (by ID, email, phone), device tokens, templates |
| Cache strategy? | TTL 1hr, eviction on changes, Jackson serialization |
| Why PostgreSQL? | ACID, reliable, good enough for scale |
| Rate limiting algo? | Token Bucket |
| Retry strategy? | Exponential backoff (5^n), max 3 retries |
| Design pattern? | Strategy (handlers), Repository (data) |
| Delivery guarantee? | At-least-once |
| Handle duplicates? | Idempotent processing (check status) |
Use this when asked: "Tell me about a project you've worked on"
I built a multi-channel notification system following system design principles. It sends notifications via Email, SMS, Push, and In-App channels.
The key technical highlights are:
- Asynchronous processing using Kafka for decoupling and reliability
- Rate limiting with Redis using the Token Bucket algorithm
- Redis caching for user lookups and notification templates to reduce database load
- Template system for reusable message content
- Retry mechanism with exponential backoff for failed deliveries
- Clean layered architecture following SOLID principles
The system handles the full lifecycle: API receives request → validates → saves to PostgreSQL → publishes to Kafka → worker consumes and delivers via channel handlers → updates status with retry on failure."
When asked to design on a whiteboard:
- "What channels? Email, SMS, Push, In-App?"
- "Expected volume? 1000/sec?"
- "Delivery guarantee needed? At-least-once?"
Client → API → DB → Queue → Worker → Provider
↓
Redis (rate limit)
Pick ONE area based on interviewer interest:
- Rate limiting algorithm
- Retry mechanism
- Database schema
- Kafka configuration
- "I chose X over Y because..."
- "If we needed Z, I would add..."
Architecture:
- Microservices, Layered Architecture, Clean Architecture
- Dependency Injection, Inversion of Control
- Event-Driven, Async Processing
Patterns:
- Strategy Pattern, Repository Pattern, Builder Pattern
- Token Bucket Algorithm, Cache-Aside Pattern
- Exponential Backoff
Caching:
- Spring Cache Abstraction, @Cacheable, @CacheEvict
- TTL (Time To Live), Cache Invalidation
- Serialization, Jackson ObjectMapper
Reliability:
- At-least-once delivery, Idempotency
- Circuit Breaker, Retry with Backoff
- Dead Letter Queue
Scalability:
- Horizontal Scaling, Stateless Services
- Partitioning, Sharding
- Consumer Groups
Data:
- ACID, Eventually Consistent
- Indexing, Composite Index
- Connection Pooling
Build your mental model of how the system works. Read this section when you have 1+ hours to prepare.
┌─────────────┐ ┌─────────────────────────────────────────────┐ ┌─────────────┐
│ Client │────▶│ NOTIFICATION SERVICE │────▶│ PostgreSQL │
│ (REST API) │ │ │ │ (Storage) │
└─────────────┘ │ ┌───────────┐ ┌───────────┐ ┌───────┐ │ └─────────────┘
│ │Controller │──▶│ Service │──▶│ Repo │ │
│ └───────────┘ └─────┬─────┘ └───────┘ │ ┌─────────────┐
│ │ │────▶│ Redis │
│ ▼ │ │(Cache + Rate│
│ ┌───────────┐ │ │ Limit) │
│ │ Kafka │ │ └─────────────┘
│ │ Kafka │ │
│ │ (Publish) │ │ ┌─────────────┐
│ └─────┬─────┘ │────▶│ Kafka │
│ │ │ │ (Queue) │
└────────────────────────┼────────────────────┘ └──────┬──────┘
│ │
┌────────────────────────┼─────────────────────────────────┘
│ ▼
│ ┌─────────────────────────────────────────────────────┐
│ │ KAFKA CONSUMER (Worker) │
│ │ │
│ │ ┌──────────────────┐ ┌───────────────────────┐ │
└──│──▶│ NotificationConsumer │──▶│ ChannelDispatcher │ │
│ └──────────────────┘ └───────────┬───────────┘ │
│ │ │
│ ┌────────────────────────┼─────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌──────┐ │
│ │ Email │ │ SMS │ │ Push │ │In-App│ │
│ │Handler│ │Handler│ │Handler│ │Handler│ |
│ └───────┘ └───────┘ └───────┘ └──────┘ │
└─────────────────────────────────────────────────────┘
| Step | Component | Action | Data Flow Details |
|---|---|---|---|
| 1 | NotificationController |
Receives POST request, validates input | Input: JSON request bodyjson<br>{<br> "userId": "550e8400-e29b-41d4-a716-446655440001",<br> "channel": "EMAIL",<br> "templateName": "welcome-email",<br> "templateVariables": {"userName": "John"}<br>}<br>Validation: Bean Validation on SendNotificationRequest DTO |
| 2 | NotificationService |
Validates user (cached), checks dedup + rate limit via Redis | User Lookup: userService.findById(userId) (cached in Redis with key "users::id:{id}")Dedup Check: deduplicationService.isDuplicate(eventId) (Redis key "event:{eventId}")Rate Limit Check: rateLimiterService.checkAndIncrement(userId, channel)Redis Key: "ratelimit:{userId}:{channel}"Throws: RateLimitExceededException if limit exceeded |
| 3 | TemplateService |
Processes template (cached lookup) | Input: templateName, templateVariablesTemplate Lookup: templateService.getTemplateByName(name) (cached with key "templates::name:{name}")Processing: Variable substitution in template content Output: subject, content stringsExample: Template "Welcome {{userName}}!" → "Welcome John!" |
| 4 | NotificationRepository |
Saves notification with PENDING status | Entity Creation:java<br>Notification notification = Notification.builder()<br> .user(user)<br> .channel(request.getChannel())<br> .priority(request.getPriority())<br> .subject(subject)<br> .content(content)<br> .status(NotificationStatus.PENDING)<br> .build();<br>Database: ACID transaction ensures durability |
| 5 | KafkaTemplate |
Publishes notification ID to Kafka topic | Message Key: notification.getId().toString()Message Value: notification.getId().toString()Topic: Channel-specific (e.g., notifications.email)Purpose: Only ID sent to avoid large messages |
| 6 | API Response | Returns 201 Created with notification ID | Response:json<br>{<br> "success": true,<br> "message": "Notification queued successfully",<br> "data": {<br> "id": "550e8400-e29b-41d4-a716-446655440002",<br> "status": "PENDING"<br> }<br>}<br>HTTP Status: 201 (Created) - notification record created, delivery happens async |
| 7 | NotificationConsumer |
Picks up message from Kafka | Consumer Record: ConsumerRecord<String, String>Value: Notification ID string Processing: Parse UUID, fetch from database Status Update: PENDING → PROCESSING |
| 8 | ChannelDispatcher |
Routes to correct handler (Email/SMS/Push/In-App) | Routing Logic:java<br>ChannelHandler handler = handlers.get(notification.getChannel());<br>return handler.send(notification);<br>Strategy Pattern: O(1) lookup via HashMap |
| 9 | EmailChannelHandler (etc.) |
Sends via external provider (SendGrid/Twilio) | Handler Selection: Based on notification.getChannel()External API Call: SendGrid/Twilio/Firebase/etc. Data Passed: user.email, notification.subject, notification.contentReturn: true (success) or false (failure) |
| 10 | NotificationRepository |
Updates status to SENT or schedules retry | Success: status = SENTFailure: status = PENDING, retry_count++, next_retry_at set with exponential backoff |
Client Request:
POST /api/v1/notifications
{
"userId": "550e8400-e29b-41d4-a716-446655440001",
"channel": "EMAIL",
"templateName": "welcome-email",
"templateVariables": {
"userName": "John",
"activationLink": "https://example.com/activate/abc123"
}
}1. Controller → Service:
SendNotificationRequestobject passed tonotificationService.sendNotification(request)- Contains: userId, channel, templateName, templateVariables
2. Service Processing:
- User lookup:
userService.findById(request.getUserId())(Redis-cached, avoids DB hit per request) - Deduplication check:
deduplicationService.isDuplicate(eventId)(Redis keyevent:{eventId}) - Rate limit check: Redis counter increment
- Template processing: Variables substituted in template content
- Entity creation:
Notificationobject built with processed content
3. Database Persistence:
INSERT INTO notifications (id, user_id, channel, subject, content, status, created_at)
VALUES ('uuid', 'user-uuid', 'EMAIL', 'Welcome to Our Platform', 'Hi John, ...', 'PENDING', NOW());4. Kafka Publishing:
- Topic:
notifications.email - Key:
"550e8400-e29b-41d4-a716-446655440002" - Value:
"550e8400-e29b-41d4-a716-446655440002" - Purpose: Decouple API response from slow email sending
5. Consumer Processing:
- Kafka message received:
"550e8400-e29b-41d4-a716-446655440002" - Database fetch:
notificationRepository.findById(uuid) - Status update:
PENDING→PROCESSING
6. Channel Dispatch:
- Handler lookup:
handlers.get(ChannelType.EMAIL)→EmailChannelHandler - Method call:
emailHandler.send(notification)
7. Email Handler Execution:
- User data:
notification.getUser().getEmail() - Content data:
notification.getSubject(),notification.getContent() - External API: SendGrid/Mailgun/SES integration
- Result: Success/failure boolean
8. Final Status Update:
- Success:
notification.markAsSent()→status = SENT - Failure:
notification.scheduleRetry("Delivery failed")→status = PENDING, retry scheduled
Location: RateLimiterService.java
How it works:
┌────────────────────────────────────────────┐
│ User requests to send EMAIL notification │
└─────────────────┬──────────────────────────┘
▼
┌────────────────────────────────────────────┐
│ Build Redis key: "ratelimit:user123:EMAIL" │
└─────────────────┬──────────────────────────┘
▼
┌────────────────────────────────────────────┐
│ Get current count from Redis │
│ Example: 7 (user sent 7 emails this hour) │
└─────────────────┬──────────────────────────┘
▼
┌────────────────────────────────────────────┐
│ Compare: 7 < 10 (limit)? │
│ YES → Increment counter, allow request │
│ NO → Throw RateLimitExceededException │
└────────────────────────────────────────────┘
Key Code:
// Redis key format: "ratelimit:{userId}:{channel}"
String key = String.format("ratelimit:%s:%s", userId, channel);
// Atomic increment with TTL
Long newCount = redisTemplate.opsForValue().increment(key);
if (newCount == 1) {
redisTemplate.expire(key, Duration.ofSeconds(3600)); // 1 hour window
}Why Token Bucket?
- Simple to implement and explain
- Allows bursts (up to bucket capacity)
- Self-resets via Redis TTL (no cleanup job needed)
How it works:
// Service layer caching with Spring @Cacheable
@Service
public class UserService {
@Cacheable(value = "users", key = "'email:' + #email")
public User findByEmail(String email) {
// First call hits database, result cached
// Subsequent calls serve from Redis
return userRepository.findByEmail(email).orElseThrow();
}
@CacheEvict(value = "users", key = "'email:' + #oldEmail")
public void evictUserCacheByEmail(String oldEmail) {
// Removes stale cache entry when email changes
}
}Cache Configuration:
@Bean
public CacheManager cacheManager(RedisConnectionFactory connectionFactory) {
ObjectMapper objectMapper = new ObjectMapper();
objectMapper.registerModule(new JavaTimeModule());
objectMapper.enableDefaultTyping(ObjectMapper.DefaultTyping.NON_FINAL,
JsonTypeInfo.As.PROPERTY);
GenericJackson2JsonRedisSerializer serializer =
new GenericJackson2JsonRedisSerializer(objectMapper);
RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
.serializeKeysWith(new StringRedisSerializer())
.serializeValuesWith(serializer)
.entryTtl(Duration.ofHours(1)) // 1 hour TTL
.disableCachingNullValues();
return RedisCacheManager.builder(connectionFactory)
.cacheDefaults(config)
.build();
}Serialization Challenges Solved:
- ClassCastException: Fixed with Jackson default typing
- Lazy Loading: Prevented with @JsonIgnore on JPA relationships
- Complex Objects: Proper type information in JSON
Cache Keys:
users::id:{id}- User by ID lookup (used in notification hot path)users::email:{email}- User by email lookupusers::phone:{phone}- User by phone lookupusers::deviceTokens- List of users with push tokenstemplates::name:{name}- Template by name lookup
Why This Approach:
- Performance: Reduces database load by 60-80% for cached data
- Consistency: Cache eviction ensures data accuracy
- Scalability: Redis cluster-ready for horizontal scaling
- Simplicity: Spring Cache abstraction hides complexity
Why Kafka?
| Without Kafka | With Kafka |
|---|---|
| API waits for email to send (slow) | API returns immediately (fast) |
| If email fails, request fails | Worker retries independently |
| Can't handle traffic spikes | Queue absorbs spikes |
| Single point of failure | Decoupled, fault-tolerant |
Key Design Decisions (Alex Xu's Pattern):
- Channel-Specific Topics:
notifications.email,notifications.sms,notifications.push,notifications.in-app- Each channel scales independently
- Email issues don't affect push delivery
- Different partition counts per channel based on volume
- Key = Notification ID: Ensures ordering per notification
- Manual Acknowledgment: Only commit offset after successful processing
- Separate Consumer Groups: Each channel has its own consumer group for independent offset tracking
Configuration (KafkaConfig.java):
// Don't auto-commit - we'll manually acknowledge
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
// Manual acknowledgment mode
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);Retry Flow:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ PENDING │────▶│ PROCESSING │────▶│ SENT │ (Success!)
└─────────────┘ └──────┬───────┘ └─────────────┘
│
│ (Failure)
▼
┌──────────────┐
│ Schedule │
│ Retry with │
│ Backoff │
└──────┬───────┘
│
┌────────────┴────────────┐
▼ ▼
retry_count < max? retry_count >= max?
│ │
▼ ▼
Back to PENDING Status = FAILED
(with next_retry_at)
Exponential Backoff:
// Calculate next retry time with exponential backoff
// Retry 1: 5 min, Retry 2: 25 min, Retry 3: exceeds max → FAILED
long delayMinutes = (long) Math.pow(5, retryCount);
this.nextRetryAt = OffsetDateTime.now().plusMinutes(delayMinutes);Two Retry Mechanisms:
- Kafka Consumer: Immediate retry on failure
- RetryScheduler: Picks up PENDING notifications due for retry (runs every 60s)
Interface:
public interface ChannelHandler {
ChannelType getChannelType();
boolean send(Notification notification);
default boolean canHandle(Notification notification) { ... }
}Implementations:
EmailChannelHandler→ SendGrid/SESSmsChannelHandler→ TwilioPushChannelHandler→ Firebase FCMInAppChannelHandler→ Database storage
Dispatcher (O(1) Routing):
// Build handler map at startup
this.handlers = handlerList.stream()
.collect(Collectors.toMap(
ChannelHandler::getChannelType,
Function.identity()
));
// Dispatch: O(1) lookup
ChannelHandler handler = handlers.get(notification.getChannel());
return handler.send(notification);Why Strategy Pattern?
- Open/Closed Principle: Add new channels without modifying existing code
- Single Responsibility: Each handler knows only its channel
- Easy to test: Mock individual handlers
┌─────────────────┐ ┌─────────────────────┐
│ users │ │ user_preferences │
├─────────────────┤ ├─────────────────────┤
│ id (UUID) │◄──────┤ id (UUID) │
│ email (VARCHAR) │ │ user_id (FK) │
│ phone (VARCHAR) │ │ channel (VARCHAR) │
│ device_token │ │ enabled (BOOLEAN) │
│ created_at │ │ quiet_hours_start │
│ updated_at │ │ quiet_hours_end │
└─────────────────┘ └─────────────────────┘
│ │
│ │
▼ │
┌─────────────────┐ ┌─────────────────────┐
│ notifications │ │ notification_ │
├─────────────────┤ │ templates │
│ id (UUID) │ ├─────────────────────┤
│ user_id (FK) │──────►│ id (UUID) │
│ template_id (FK)│◄──────┤ name (VARCHAR) │
│ channel │ │ channel (VARCHAR) │
│ priority │ │ subject_template │
│ subject │ │ body_template (TEXT)│
│ content (TEXT) │ │ is_active (BOOLEAN) │
│ status │ └─────────────────────┘
│ retry_count │
│ max_retries │
│ next_retry_at │
│ error_message │
│ sent_at │
│ delivered_at │
│ read_at │
│ created_at │
└─────────────────┘
CREATE TABLE users (
id UUID PRIMARY KEY,
email VARCHAR(255) UNIQUE,
phone VARCHAR(20),
device_token VARCHAR(500),
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);CREATE TABLE user_preferences (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
channel VARCHAR(20) NOT NULL,
enabled BOOLEAN DEFAULT true,
quiet_hours_start TIME,
quiet_hours_end TIME,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
UNIQUE(user_id, channel)
);CREATE TABLE notification_templates (
id UUID PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
channel VARCHAR(20) NOT NULL,
subject_template VARCHAR(500),
body_template TEXT NOT NULL,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);CREATE TABLE notifications (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
template_id UUID REFERENCES notification_templates(id),
channel VARCHAR(20) NOT NULL,
priority VARCHAR(10) NOT NULL DEFAULT 'MEDIUM',
subject VARCHAR(500),
content TEXT NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3,
next_retry_at TIMESTAMP WITH TIME ZONE,
error_message TEXT,
sent_at TIMESTAMP WITH TIME ZONE,
delivered_at TIMESTAMP WITH TIME ZONE,
read_at TIMESTAMP WITH TIME ZONE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);Current implementation note: IDs are generated in application code using UUIDv7 in @PrePersist (time-ordered), not by database defaults.
- Globally Unique: No collisions across distributed systems
- Time-Ordered (UUIDv7): Better index locality and reduced page splits vs random UUIDv4
- No Sequence Gaps: Don't reveal how many records exist (security)
- Client Generation: Can generate IDs without database round-trip
- Microservices Ready: Perfect for distributed architectures
- Flexibility: Each user can have different preferences per channel
- Scalability: Preferences change less frequently than notifications
- Compliance: Easy to implement "Do Not Disturb" features
- Analytics: Track preference patterns across user base
- Consistency: Standardized messaging across the platform
- Maintainability: Change message content in one place
- Localization: Easy to support multiple languages
- Personalization: Template variables for dynamic content
- Testing: Templates can be tested independently
- Audit Trail: Complete history of all notifications sent
- Analytics: Track delivery rates, user engagement, failures
- Debugging: Detailed error messages and retry information
- Compliance: Prove notifications were sent (legal requirements)
- Business Intelligence: Understand user behavior patterns
PENDING → PROCESSING → SENT → DELIVERED
↓ ↓
FAILED READ (in-app only)
- PENDING: Queued for processing
- PROCESSING: Currently being sent (prevents duplicate processing)
- SENT: Successfully delivered to provider (SMS gateway, email service)
- DELIVERED: Confirmed received by user (webhook/callback)
- FAILED: All retries exhausted
- READ: User opened/acknowledged (in-app notifications only)
- Exponential Backoff:
next_retry_at = now + 5^retryCount minutes(5min, 25min) - Maximum Retries: Default 3 attempts before marking as FAILED
- Configurable: Different retry counts for different priority levels
- Smart Scheduling: Use database indexes for efficient retry queries
-- User inbox (most common query)
CREATE INDEX idx_notifications_user_created ON notifications(user_id, created_at DESC);
-- Processing queue (find pending notifications)
CREATE INDEX idx_notifications_status ON notifications(status);
-- Retry scheduling (find notifications ready for retry)
CREATE INDEX idx_notifications_retry ON notifications(next_retry_at)
WHERE status = 'PENDING' AND next_retry_at IS NOT NULL;
-- Analytics (filter by channel/time)
CREATE INDEX idx_notifications_channel ON notifications(channel);Why These Indexes?
- User Inbox:
user_id + created_at DESC- Fast pagination for user's notification history - Queue Processing:
status- Quickly find PENDING notifications to process - Retry Logic: Partial index on
next_retry_at- Only index rows that need retry - Analytics:
channel- Group notifications by delivery method
- UUID: Globally unique identifiers (vs auto-increment integers)
- TIMESTAMP WITH TIME ZONE: Proper timezone handling across regions
- TEXT: Unlimited message content (vs VARCHAR limits)
- VARCHAR(500): Reasonable limits for subjects/tokens
- BOOLEAN: Simple true/false flags (vs CHAR(1) 'Y'/'N')
- UNIQUE(email): Prevent duplicate user accounts
- UNIQUE(user_id, channel): One preference per user per channel
- UNIQUE(name): Template names must be unique
- NOT NULL: Critical fields that must always have values
- FOREIGN KEY: Maintain referential integrity
- ON DELETE CASCADE: Clean up related data when users are deleted
- Notification System: 90% reads (checking preferences, templates) vs 10% writes
- User Inbox: Frequent queries for user's notification history
- Analytics: Aggregate queries on notification data
- Notifications: High-volume inserts (millions per day)
- Status Updates: Frequent updates during processing
- Retry Logic: Updates to retry_count and next_retry_at
- Connection Pooling: Reuse database connections (HikariCP)
- Read Replicas: Separate read traffic from writes
- Partitioning: Partition notifications table by month/year
- Archiving: Move old notifications to cold storage
- UUIDv7 IDs: Time-ordered inserts improve B-tree write behavior compared to UUIDv4
| Question | Answer |
|---|---|
| Why UUIDv7? | Global uniqueness + time ordering for better write locality and index performance |
| Why separate preferences? | User control, compliance, different settings per channel |
| Why templates? | Consistency, maintainability, localization support |
| Why track everything? | Audit trail, analytics, debugging, compliance |
| Status flow? | PENDING→PROCESSING→SENT→DELIVERED, with FAILED/READ |
| Retry strategy? | Exponential backoff, max 3 retries, configurable |
| Indexing strategy? | User inbox, queue processing, retry scheduling, analytics |
| Performance? | Read-heavy, connection pooling, read replicas, partitioning |
| Data integrity? | Foreign keys, constraints, cascading deletes |
| Scalability? | Partitioning, archiving, read replicas |
| Decision | Why | Trade-off |
|---|---|---|
| Kafka over RabbitMQ | Better durability, replay capability, high throughput | More complex to operate |
| Redis for rate limiting | Fast (in-memory), TTL support, atomic operations | Additional infrastructure |
| PostgreSQL | ACID compliance, JSON support, reliability | Harder to scale horizontally |
| UUID primary keys | Globally unique, can generate client-side | Larger than integers, slower indexes |
| Save-then-publish | Notification survives Kafka failure | Possible duplicate sends |
| Channel-specific Kafka topics | Independent scaling, channel isolation, follows Alex Xu's design | More topics to manage |
| Polling retry scheduler | Simple, reliable | Less immediate than push-based |
Show interviewers you think about production. This covers tuning, monitoring, and detailed data flow.
| Layer | Setting | Value | Why |
|---|---|---|---|
| Tomcat | max-threads | 400 | Handle concurrent requests (up from default 200) |
| Tomcat | max-connections | 10,000 | Accept 10k simultaneous connections |
| Tomcat | connection-timeout | 5,000ms | Fail fast instead of holding resources |
| HikariCP | maximum-pool-size | 50 | At 5ms/query, 50 conns = ~10,000 queries/sec |
| HikariCP | connection-timeout | 5,000ms | Fail fast if pool exhausted |
| Hibernate | batch_size | 50 | Batch inserts reduce DB round-trips |
| Hibernate | provider_disables_autocommit | true | Avoids extra DB call per transaction |
| Kafka Producer | acks | 1 | Leader-only ack (5x faster than acks=all) |
| Kafka Producer | linger.ms | 5 | Wait 5ms to fill batches before sending |
| Kafka Producer | batch-size | 64KB | Larger batches = fewer network calls |
| Kafka Producer | compression | lz4 | Compress messages to reduce network I/O |
| Kafka Consumer | concurrency | 10 | 10 parallel consumer threads |
| Kafka Consumer | max-poll-records | 500 | Process 500 records per poll cycle |
| Kafka Topics | total partitions | 26 | email:8, push:8, in-app:6, sms:4 |
| Component | Per-Request Cost | Capacity |
|---|---|---|
| Tomcat accept | ~0ms | 10,000 conns |
| Redis (dedup + rate limit) | ~1ms (2 calls) | ~50,000/s |
| User lookup (Redis cache hit) | ~0.5ms | ~50,000/s |
| DB save (notification INSERT) | ~5ms | 50 conns / 5ms = 10,000/s |
| Kafka send (batched, acks=1) | ~0.1ms amortized | 50,000+/s |
Estimated single-instance throughput: ~8,000-12,000 req/sec
"I identified bottlenecks layer by layer:
-
User lookup: Each request was hitting PostgreSQL. I added
@CacheabletoUserService.findById()so user lookups go to Redis (~0.5ms vs ~5ms DB). -
DB connection pool: Increased HikariCP from 10 to 50. At 5ms per query, 50 connections supports ~10k queries/sec.
-
Kafka producer: Changed from
acks=alltoacks=1(leader-only). Added batching withlinger.ms=5andbatch-size=64KBwith lz4 compression. This makes sends effectively fire-and-forget at ~0.1ms amortized. -
Tomcat threads: Increased to 400 threads with 10k max connections and 5s timeout.
-
Logging: Moved hot-path logs from INFO to DEBUG to avoid I/O overhead at 10k/s.
The bottleneck shifts to PostgreSQL INSERTs at ~10k/sec. Beyond that, horizontal scaling (multiple instances behind a load balancer) is needed."
The project includes a complete Prometheus + Grafana monitoring setup:
Application (Micrometer) → Prometheus (scrape) → Grafana (dashboards)
Prometheus Metrics Exposed:
- HTTP request rate, latency percentiles (p95, p99)
- JVM heap usage, GC frequency
- HikariCP connection pool utilization
- Kafka consumer lag per topic
- Custom notification counters per channel
Grafana Dashboard Panels:
- HTTP throughput by endpoint
- Response time distribution
- JVM heap and GC pressure
- Database connection pool health
- Kafka consumer lag
Configuration: monitoring/prometheus/prometheus.yml scrapes the /actuator/prometheus endpoint.
The project includes three k6 load test profiles in stress-test/k6/:
| Script | Purpose | Peak VUs |
|---|---|---|
stress-test.js |
Steady-state throughput and latency baseline | 10,000 |
spike-test.js |
Burst handling and recovery behavior | 1,000 |
heap-test.js |
JVM heap growth and GC under pressure | — |
Thresholds defined:
http_req_failed: < 5% error ratehttp_req_duration: p95 < 750ms, p99 < 1200ms
How to run:
k6 run stress-test/k6/stress-test.js
# POST endpoint:
BASE_URL=http://localhost:8080 TARGET_PATH=/api/v1/notifications METHOD=POST k6 run stress-test/k6/stress-test.js1. Request DTO → Entity Transformation:
// Input: SendNotificationRequest (JSON)
{
"userId": "550e8400-e29b-41d4-a716-446655440001",
"channel": "EMAIL",
"templateName": "welcome-email",
"templateVariables": {"userName": "John"},
"priority": "HIGH"
}
// Transformation in NotificationService.sendNotification():
User user = userService.findById(request.getUserId()); // Redis-cached lookup
String subject = templateService.processTemplate(request.getTemplateName(),
request.getTemplateVariables()).getSubject(); // Template processing
// Output: Notification Entity
Notification notification = Notification.builder()
.id(UUID.randomUUID()) // Generated
.user(user) // Foreign key relationship
.channel(ChannelType.EMAIL) // Enum conversion
.priority(Priority.HIGH) // Enum conversion
.subject(subject) // Processed from template
.content(content) // Processed from template
.status(NotificationStatus.PENDING) // Initial status
.retryCount(0) // Initial retry count
.maxRetries(3) // Configurable
.createdAt(OffsetDateTime.now()) // Timestamp
.build();2. Entity → Database Record:
-- PostgreSQL INSERT (via JPA/Hibernate)
INSERT INTO notifications (
id, user_id, channel, priority, subject, content,
status, retry_count, max_retries, created_at
) VALUES (
'550e8400-e29b-41d4-a716-446655440002',
'550e8400-e29b-41d4-a716-446655440001',
'EMAIL', 'HIGH',
'Welcome to Our Platform, John!',
'Hi John,\n\nThank you for joining...',
'PENDING', 0, 3, '2024-01-27T10:30:00Z'
);3. Entity → Kafka Message:
// Only ID sent to Kafka (small message, decoupling)
String messageKey = notification.getId().toString();
String messageValue = notification.getId().toString();
String topic = "notifications.email"; // Channel-specific
kafkaTemplate.send(topic, messageKey, messageValue);4. Kafka Message → Consumer Processing:
// Consumer receives minimal message
ConsumerRecord<String, String> record = ...;
String notificationIdStr = record.value(); // "550e8400-e29b-41d4-a716-446655440002"
// Fetch full entity from database
UUID notificationId = UUID.fromString(notificationIdStr);
Notification notification = notificationRepository.findById(notificationId)
.orElseThrow(() -> new RuntimeException("Notification not found"));5. Entity → Channel Handler Data:
// EmailChannelHandler.send()
String recipientEmail = notification.getUser().getEmail(); // "john@example.com"
String subject = notification.getSubject(); // "Welcome to Our Platform, John!"
String body = notification.getContent(); // Full processed content
// External API call (SendGrid example)
SendGrid sg = new SendGrid(apiKey);
Email from = new Email("noreply@company.com");
Email to = new Email(recipientEmail);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);6. Status Updates → Database:
// Success path
notification.setStatus(NotificationStatus.SENT);
notification.setSentAt(OffsetDateTime.now());
notificationRepository.save(notification);
// Failure path
notification.setRetryCount(notification.getRetryCount() + 1);
notification.setNextRetryAt(OffsetDateTime.now().plusMinutes(
(long) Math.pow(5, notification.getRetryCount()))); // 5^n backoff (5min, 25min)
notification.setErrorMessage("SMTP connection failed");
notificationRepository.save(notification);Scenario 1: Rate Limit Exceeded
Client Request → Controller → Service.checkRateLimit()
↓
RateLimitExceededException thrown
↓
GlobalExceptionHandler catches → 429 Too Many Requests
↓
Response: {"error": "Rate limit exceeded", "retryAfter": "3600"}
Data State: No notification record created, no Kafka message sent
Scenario 2: Template Not Found
Service.processTemplate() → TemplateNotFoundException
↓
Wrapped in NotificationException
↓
GlobalExceptionHandler → 400 Bad Request
Data State: No notification record created
Scenario 3: Kafka Publishing Failure
kafkaTemplate.send() throws exception
↓
Logged as error, but transaction continues
↓
Notification saved to DB with PENDING status
↓
RetryScheduler will pick it up later
Data State: Notification exists in DB, will be retried
Scenario 4: Channel Handler Failure
EmailChannelHandler.send() returns false
↓
Consumer calls notification.scheduleRetry()
↓
Status: PENDING, retry_count++, next_retry_at set
↓
Kafka message acknowledged (prevents infinite retry)
Data State: Notification scheduled for retry with exponential backoff
Scenario 5: Database Connection Failure
notificationRepository.save() throws exception
↓
@Transactional rolls back entire operation
↓
No notification record created
↓
API returns 500 Internal Server Error
Data State: Consistent state maintained via ACID transactions
Exponential Backoff Schedule:
// RetryScheduler.processRetries()
List<Notification> readyNotifications = notificationRepository
.findReadyForProcessing(OffsetDateTime.now());
// For each failed notification:
int retryCount = notification.getRetryCount();
Duration backoff = Duration.ofMinutes((long) Math.pow(5, retryCount));
// retryCount=1: 5 minutes, retryCount=2: 25 minutes, retryCount=3: FAILED (max reached)
notification.setNextRetryAt(OffsetDateTime.now().plus(backoff));Retry State Machine:
PENDING → PROCESSING → SENT (success)
↓ ↓
FAIL FAIL → PENDING (retry scheduled)
↓
Max retries exceeded → FAILED
Dead Letter Queue Pattern:
// After maxRetries exceeded
if (notification.getRetryCount() >= notification.getMaxRetries()) {
notification.setStatus(NotificationStatus.FAILED);
notification.setErrorMessage("Max retries exceeded");
// Could publish to dead letter topic for manual review
}Optimistic Locking (Version Fields):
// Notification entity
@Version
private Long version; // Hibernate increments on update
// Prevents concurrent modifications
notificationRepository.save(notification); // Throws exception if version mismatchIdempotency for Duplicate Prevention:
// DeduplicationService
public boolean isDuplicate(String eventId) {
String key = "event:" + eventId;
Boolean exists = redisTemplate.hasKey(key);
if (Boolean.TRUE.equals(exists)) {
return true; // Duplicate!
}
// Mark as seen with 24-hour TTL
redisTemplate.opsForValue().set(key, "1",
Duration.ofSeconds(eventTtlSeconds));
return false;
}Transactional Boundaries:
@Transactional
public NotificationResponse sendNotification(SendNotificationRequest request) {
// User lookup, rate limiting, template processing
// ALL succeed or ALL fail together
Notification saved = notificationRepository.save(notification);
// Kafka publish (non-transactional)
sendToKafka(saved);
return NotificationResponse.from(saved);
}Metrics Collection:
// NotificationService
@Timed(value = "notification.send", histogram = true)
public NotificationResponse sendNotification(SendNotificationRequest request) {
// Implementation
}
// Custom metrics
meterRegistry.counter("notification.sent", "channel", channel.name()).increment();
meterRegistry.timer("notification.delivery.time", "channel", channel.name())
.record(() -> channelHandler.send(notification));Logging Data Points:
// Structured logging with MDC
MDC.put("notificationId", notification.getId().toString());
MDC.put("userId", notification.getUser().getId().toString());
MDC.put("channel", notification.getChannel().name());
log.info("Processing notification for delivery");Health Check Data:
// Actuator endpoint data
@Component
public class NotificationHealthIndicator implements HealthIndicator {
public Health health() {
long pendingCount = notificationRepository.countByStatus(PENDING);
if (pendingCount > 10000) { // Threshold
return Health.down()
.withDetail("pendingNotifications", pendingCount)
.build();
}
return Health.up().build();
}
}Notification Archival:
// Scheduled job for old notifications
@Scheduled(cron = "0 0 2 * * ?") // Daily at 2 AM
@Transactional
public void archiveOldNotifications() {
OffsetDateTime cutoff = OffsetDateTime.now().minusMonths(6);
List<Notification> oldNotifications = notificationRepository
.findByCreatedAtBeforeAndStatusIn(cutoff,
List.of(SENT, FAILED));
// Move to archive table or delete
notificationRepository.deleteAll(oldNotifications);
}Redis Key Expiration:
// Rate limiting keys auto-expire
redisTemplate.expire(rateLimitKey, Duration.ofHours(1));
// Cache keys have TTL
@Cacheable(value = "users", key = "'email:' + #email",
unless = "#result == null")
public User findByEmail(String email) {
return userRepository.findByEmail(email);
}Invalid User Scenarios:
- User deleted between API call and consumer processing
- User email/phone changed during processing
- User preferences disable channel during processing
Template Processing Edge Cases:
- Missing template variables
- Malformed template syntax
- Template channel mismatch
Concurrent Processing:
- Multiple consumers processing same notification
- Rate limit counters updated simultaneously
- Template cache invalidation during updates
Network & External Service Failures:
- Kafka broker unavailable
- Email provider rate limited
- Database connection pool exhausted
- Redis cluster failover
31 questions with model answers. Don't memorize word-for-word — understand the key points and explain in your own words.
Q1: Walk me through how a notification is sent in your system.
Answer:
"When a client sends a POST request to /api/v1/notifications:
- The Controller validates the request body using Bean Validation
- The Service checks for duplicate events via Redis, then checks rate limits - if exceeded, returns 429
- If using a template, the TemplateService processes variable substitution
- The notification is saved to PostgreSQL with PENDING status
- The notification ID is published to Kafka (channel-specific topic)
- We return 201 Created immediately (delivery happens async)
- A Kafka Consumer picks up the message
- The ChannelDispatcher routes to the right handler (Email/SMS/Push/In-App)
- The handler sends via the external provider
- Status is updated to SENT, or retry is scheduled if failed"
Q2: Why did you use a message queue? Why Kafka specifically?
Answer: "I used a message queue for three main reasons:
-
Decoupling: The API returns immediately instead of waiting for email to send. Users get a fast response.
-
Reliability: If SendGrid is down, the notification is safe in Kafka. The worker will retry later.
-
Handling spikes: If we get 10,000 notifications at once, Kafka absorbs the load. Workers process at their own pace.
I chose Kafka over RabbitMQ because:
- Durability: Messages are persisted to disk
- Replay: If something goes wrong, we can replay messages
- High throughput: Kafka handles millions of messages per second
- Consumer groups: Easy horizontal scaling of workers"
Q3: How does your rate limiting work?
Answer: "I implemented the Token Bucket algorithm using Redis:
- Each user+channel combination has a Redis key like
ratelimit:user123:EMAIL - The value is a counter of notifications sent in the current hour
- When a request comes in, I check: is counter < limit?
- If yes: increment counter and proceed
- If no: throw
RateLimitExceededException(HTTP 429)
The counter auto-expires after 1 hour via Redis TTL, so it self-resets.
Why Redis?
- Fast: In-memory, sub-millisecond lookups
- Atomic:
INCRoperation is thread-safe - TTL: Built-in expiration, no cleanup job needed
- Shared: Multiple app instances see the same counters"
Q4: How do you handle failures? What if an email fails to send?
Answer: "I have a two-layer retry mechanism:
Layer 1 - Immediate (Kafka Consumer): When delivery fails, I update the notification with:
retry_countincrementednext_retry_atset with exponential backoff (1min, 2min, 4min...)statusback to PENDINGlast_errorwith the failure reason
Layer 2 - Scheduled (RetryScheduler):
A cron job runs every 60 seconds, finds PENDING notifications where next_retry_at < now, and reprocesses them.
Safeguards:
- Max 3 retries with 5^n backoff (5min, 25min), then status = FAILED
- Stuck notification reset (if PROCESSING > 10 minutes, reset to PENDING)
- All failures are logged for monitoring"
Q5: What happens if Kafka is down when you try to publish?
Answer: "The notification is safe because of my save-then-publish pattern:
- I save to PostgreSQL first (transactional, durable)
- Then publish to Kafka (fire-and-forget with
acks=1)
If Kafka publish fails:
- I catch the exception and log it
- The transaction still commits (notification is in DB)
- The RetryScheduler will find PENDING notifications and reprocess them
This ensures at-least-once delivery - we might send duplicates, but we'll never lose a notification."
Q6: What design patterns did you use?
Answer: "Several patterns:
-
Strategy Pattern (Channel Handlers):
ChannelHandlerinterface withsend()method- Implementations:
EmailChannelHandler,SmsChannelHandler, etc. ChannelDispatcherroutes to the right handler- Benefit: Add new channels without changing existing code (Open/Closed Principle)
-
Repository Pattern:
- Spring Data JPA repositories abstract database access
- Service layer doesn't know SQL or JPA details
- Easy to switch databases or add caching
-
Builder Pattern:
- Used for constructing
NotificationandApiResponseobjects - Makes object creation readable:
Notification.builder().user(user).channel(EMAIL).build()
- Used for constructing
-
Template Method Pattern (implicit):
ChannelHandler.canHandle()has a default implementation- Subclasses can override for custom validation"
Q7: How did you ensure loose coupling between components?
Answer: "Three main techniques:
-
Dependency Injection: Constructor injection everywhere. Components receive their dependencies, they don't create them. Makes testing easy - just pass mocks.
-
Interface-based design:
ChannelHandlerinterface means the dispatcher doesn't know about specific handlers. I can addWhatsAppChannelHandlerwithout touching dispatcher code. -
Event-driven architecture: The API and worker are decoupled via Kafka. They don't call each other directly. The API publishes events, workers consume them."
Q8: How would you add a new notification channel (e.g., WhatsApp)?
Answer: "It's a 2-step process thanks to the Strategy Pattern:
- Create the handler:
@Component
public class WhatsAppChannelHandler implements ChannelHandler {
@Override
public ChannelType getChannelType() {
return ChannelType.WHATSAPP;
}
@Override
public boolean send(Notification notification) {
// Call WhatsApp Business API
}
}- Add the enum value:
public enum ChannelType {
EMAIL, SMS, PUSH, IN_APP, WHATSAPP
}That's it. Spring auto-discovers the new handler, the dispatcher registers it automatically."
Q9: Explain your database schema design.
Answer: "I have 4 main tables:
- users: Basic user info (email, phone, device_token)
- user_preferences: Channel preferences per user (enabled/disabled, quiet hours)
- notification_templates: Reusable message templates with variable placeholders
- notifications: The core table - one row per notification sent
Key design decisions:
- UUIDs as primary keys: Globally unique, can generate client-side
- Timestamps with timezone: All times stored in UTC
- Indexes on common queries: user_id, status, created_at
- Composite index: (user_id, created_at DESC) for inbox pagination
The notifications table is the busiest, so I indexed heavily:
CREATE INDEX idx_notifications_user_id ON notifications(user_id);
CREATE INDEX idx_notifications_status ON notifications(status);
CREATE INDEX idx_notifications_user_created ON notifications(user_id, created_at DESC);
```"
---
**Q10: How do you handle the notifications table growing very large?**
**Answer:**
"For a production system at scale, I would consider:
1. **Time-based partitioning:**
Partition by month. Old partitions can be archived or dropped.
```sql
CREATE TABLE notifications (...) PARTITION BY RANGE (created_at);-
Archival strategy: Move notifications older than 90 days to an archive table or cold storage (S3).
-
Read replicas: Route inbox queries to read replicas to reduce load on primary.
-
Caching hot data: Cache recent notifications in Redis for fast inbox loading.
However, for the current scope, the indexes I have are sufficient for millions of records."
Q11: Why UUIDs instead of auto-increment IDs?
Answer: "I use UUIDv7 (time-ordered UUIDs) instead of auto-increment IDs.
Core reasons:
- Globally unique: No collisions across distributed systems or merging databases
- Generate client-side: Don't need a database round-trip to get an ID
- Write performance: UUIDv7 is time-ordered, so inserts are much more index-friendly than random UUIDv4
- Security: Doesn't reveal how many records exist (ID enumeration attack prevention)
Trade-offs I accept:
- Larger (16 bytes vs 4 bytes for int)
- Still larger than BIGINT, but better locality than UUIDv4
- Not human-readable
For this use case, the benefits outweigh the costs."
Q12: How do you ensure messages aren't lost in Kafka?
Answer: "Multiple safeguards:
-
Producer side:
acks=1(leader acknowledges before confirming)- Idempotent producer enabled (
enable.idempotence=true) to prevent duplicates - Producer batching with retries=3 and retry.backoff.ms=1000
- Kafka persists to disk before acknowledging
-
Consumer side:
- Manual acknowledgment (
AckMode.MANUAL) - Only commit offset after successful processing
- If consumer crashes, message will be redelivered
- Manual acknowledgment (
-
Application level:
- Save to PostgreSQL before publishing to Kafka
- If Kafka fails, RetryScheduler picks up PENDING notifications
This gives at-least-once delivery. We might process duplicates, but never lose messages."
Q13: How do you handle duplicate messages?
Answer: "I implemented a two-layer deduplication strategy:
1. Event-Level Deduplication (Prevention):
Clients can provide an eventId in notification requests. Before creating any notification, I check:
if (request.getEventId() != null && deduplicationService.isDuplicate(request.getEventId())) {
return NotificationResponse.builder()
.status(FAILED)
.errorMessage("Duplicate event: notification already processed")
.build();
}- Redis Storage: Event IDs stored with 24-hour TTL using keys like
event:{eventId} - Atomic Check: Uses Redis
SET NXsemantics to prevent race conditions - Distributed Safe: Works across multiple application instances
2. Processing-Level Deduplication (Recovery): For Kafka message duplicates, I check status before processing:
if (notification.getStatus() != PENDING) {
log.info("Notification {} already processed", notificationId);
return; // Skip duplicate processing
}3. Database-Level Deduplication: PostgreSQL constraints prevent duplicate data insertion.
This gives me exactly-once delivery at the event level and at-least-once with deduplication at the processing level."
Q14: Why did you use channel-specific Kafka topics?
Answer: "I implemented channel-specific topics following Alex Xu's system design pattern:
notifications.email- 8 partitions (high volume, can tolerate delay)notifications.sms- 4 partitions (rate-limited by providers like Twilio)notifications.push- 8 partitions (needs to be fast for UX)notifications.in-app- 6 partitions (moderate volume)
Benefits of this approach:
- Independent Scaling: I can have 10 email consumers but only 4 SMS consumers based on volume
- Fault Isolation: If the email provider is down, push notifications still work
- Different SLAs: Push notifications can have higher processing priority
- Better Monitoring: Track lag and throughput per channel separately
Implementation:
// Route to correct topic based on channel
private String getTopicForChannel(ChannelType channel) {
return switch (channel) {
case EMAIL -> emailTopic;
case SMS -> smsTopic;
case PUSH -> pushTopic;
case IN_APP -> inAppTopic;
};
}Each channel also has its own consumer group for independent offset tracking."
Q15: Why Redis for rate limiting instead of in-memory?
Answer: "In a distributed system with multiple application instances:
In-memory (HashMap):
- Instance A counts: 5 requests
- Instance B counts: 5 requests
- User actually sent 10, but each instance thinks only 5
Redis (shared):
- Both instances read/write to same counter
- Accurate count across all instances
Also, Redis gives me:
- Atomic operations: INCR is thread-safe
- TTL: Counter auto-expires, no cleanup needed
- Persistence: Survives app restart"
Q16: What if Redis goes down?
Answer: "I have a few options:
-
Fail open (current): If Redis is unavailable, allow the request. Better user experience, but rate limiting is bypassed.
-
Fail closed: Reject all requests if Redis is down. Safer, but poor UX.
-
Local fallback: Maintain an in-memory rate limiter as backup. Less accurate but functional.
For production, I'd recommend:
- Redis cluster/sentinel for high availability
- Health check endpoint to monitor Redis
- Alerting when Redis becomes unavailable"
Q17: Tell me about your caching implementation. What do you cache and why?
Answer: "I implemented Redis caching for frequently accessed data to reduce database load:
What I cache:
- User lookups by ID:
users::id:{id}→ User entity (hot path for notifications) - User lookups by email:
users::email:{email}→ User entity - User lookups by phone:
users::phone:{phone}→ User entity - Users with device tokens:
users::deviceTokens→ List - Notification templates:
templates::name:{name}→ TemplateResponse
Why these specifically:
- User lookups by ID: Called on every notification send; Redis cache eliminates a DB round-trip per request
- User lookups by email/phone: Called frequently during user operations, users don't change often
- Device tokens: Push notifications need to find all users with tokens, expensive query
- Templates: Reusable content, read-heavy, write-rare
Cache strategy:
- TTL: 1 hour for all cached data
- Serialization: Jackson with default typing for complex objects
- Eviction: @CacheEvict when data changes (user email update, template modification)
- Cache misses: Only successful lookups are cached, exceptions are not"
Q18: How do you handle cache consistency when data changes?
Answer: "I use cache eviction strategies:
For user data:
@CacheEvict(value = "users", key = "'email:' + #oldEmail")
public void evictUserCacheByEmail(String oldEmail) {
// Removes stale cache entry
}For templates:
@CacheEvict(value = "templates", allEntries = true)
public TemplateResponse updateTemplate(...) {
// Clears all template cache on any change
}Why this approach:
- Immediate consistency: Cache is invalidated when data changes
- Simple: No complex cache update logic
- Safe: Next request will hit database and refresh cache
- Performance: Better than cache invalidation storms"
Q19: What serialization challenges did you face with Redis caching?
Answer: "Two main issues:
1. ClassCastException with LinkedHashMap:
- Problem: Jackson deserialized User objects as LinkedHashMap
- Root cause: Missing type information in JSON
- Solution: Enabled default typing in ObjectMapper:
objectMapper.enableDefaultTyping(ObjectMapper.DefaultTyping.NON_FINAL, JsonTypeInfo.As.PROPERTY);2. Lazy loading exceptions:
- Problem: @OneToMany relationships caused lazy loading in cached objects
- Root cause: JPA proxy objects can't be serialized
- Solution: Added @JsonIgnore to lazy relationships:
@OneToMany(fetch = FetchType.LAZY)
@JsonIgnore
private List<Notification> notifications;Result: Clean JSON serialization without database dependencies"
Q20: How do you test your caching implementation?
Answer: "I test both cache hits and cache misses:
Cache Miss Testing:
# Clear cache
docker exec notification-redis redis-cli FLUSHALL
# First request - hits database
http :8080/api/v1/users/email/john@example.com
# Logs show: Hibernate SQL query executed
# Check Redis - data is now cached
docker exec notification-redis redis-cli keys "*"
# Shows: users::email:john@example.com
# Second request - serves from cache
http :8080/api/v1/users/email/john@example.com
# No database query in logsFailed Lookup Testing:
# Non-existent user
http :8080/api/v1/users/email/nonexistent@example.com
# Returns 404, no cache entry created
# Redis keys still shows only successful lookupsCache Eviction Testing:
# Update user email (would trigger @CacheEvict)
# Verify old cache key is removed, new one is created
```"
---
### API Design Questions
---
**Q21: Why return 201 Created instead of 200 OK?**
**Answer:**
"HTTP 201 means 'a new resource was created as a result of the request.'
This is semantically correct because:
- A notification record is created in the database
- The notification ID is returned so clients can track it
- Actual delivery happens asynchronously via Kafka
200 OK would also be acceptable, but 201 better communicates that a new resource (the notification) was created.
I also return the notification ID so clients can check status later:
```json
{
"success": true,
"data": {
"id": "abc-123",
"status": "PENDING"
}
}
```"
---
**Q22: How do you handle validation errors?**
**Answer:**
"I use Bean Validation annotations and a global exception handler:
**Request DTO:**
```java
@NotNull(message = "User ID is required")
private UUID userId;
@NotNull(message = "Channel is required")
private ChannelType channel;Global Exception Handler:
@ExceptionHandler(MethodArgumentNotValidException.class)
public ResponseEntity<ApiResponse<Map<String, String>>> handleValidation(...) {
// Extract field errors
Map<String, String> errors = ex.getBindingResult()
.getFieldErrors()
.stream()
.collect(toMap(FieldError::getField, FieldError::getDefaultMessage));
return ResponseEntity.badRequest().body(ApiResponse.error(errors));
}Response:
{
"success": false,
"message": "Validation failed",
"errors": {
"userId": "User ID is required",
"channel": "Channel is required"
}
}
```"
---
### Error Handling & Reliability Questions
---
**Q23: How do you handle partial failures in bulk notifications?**
**Answer:**
"For bulk notifications, I use a **best-effort** approach:
1. Process each user independently in a loop
2. Catch exceptions per-user, don't fail the whole batch
3. Track successes and failures separately
4. Return a detailed report
```java
for (UUID userId : request.getUserIds()) {
try {
// Process notification
response.addSuccess(notificationId);
} catch (Exception e) {
response.addFailure(userId, e.getMessage());
}
}Response:
{
"totalRequested": 100,
"successCount": 95,
"failedCount": 5,
"successIds": ["abc", "def", ...],
"failures": [
{"userId": "xyz", "reason": "Rate limit exceeded"}
]
}
```"
---
**Q24: What monitoring did you implement for production readiness?**
**Answer:**
"I built a full observability stack into the project:
1. **Metrics (Prometheus + Micrometer) — implemented:**
- Spring Boot Actuator exposes `/actuator/prometheus`
- Prometheus scrapes metrics every 15s (configured in `monitoring/prometheus/prometheus.yml`)
- Grafana dashboards visualize HTTP throughput, JVM heap, DB pool health, and Kafka consumer lag
- Custom counters for notifications sent per channel, rate limit rejections, retry counts
2. **Stress Testing (k6) — implemented:**
- `stress-test.js`: 10,000 VU ramp-up to validate throughput target
- `spike-test.js`: Sudden burst to test circuit-breaker behavior
- `heap-test.js`: Long-running load to detect JVM memory leaks
- Thresholds: p95 < 750ms, error rate < 5%
3. **Health endpoints (Spring Boot Actuator) — implemented:**
- `/health` - Overall health with DB, Redis, Kafka checks
- `/actuator/prometheus` - Prometheus-format metrics
4. **Logging:**
- Hot-path logs set to DEBUG to avoid I/O overhead at scale
- WARN level for errors and anomalies
- Structured logging ready for ELK stack integration
5. **Alerting (Grafana — ready to configure):**
- Failed notification rate > 5%
- Kafka consumer lag > 1000
- HikariCP pool utilization > 80%
- Redis unavailable"
---
### Scaling & Performance Questions
---
**Q25: How would you scale this system to handle 10x traffic?**
**Answer:**
"I would scale horizontally at multiple layers:
1. **API Layer:**
- Run multiple instances behind a load balancer
- They're stateless, so easy to scale
- Tomcat tuned: 400 threads, 10k max connections
2. **Kafka:**
- Already configured with high partition counts (email:8, push:8, in-app:6, sms:4)
- More partitions = more parallel consumers
- Producer batching: linger.ms=5, batch-size=64KB, lz4 compression
3. **Workers:**
- Add more consumer instances (same group ID)
- Currently 10 concurrent consumer threads
- Kafka distributes partitions among them
4. **Database:**
- HikariCP pool at 50 connections with batch_size=50
- Read replicas for inbox queries
- Consider sharding by user_id for extreme scale
5. **Redis:**
- User lookups cached by ID (avoids DB hit per notification)
- Redis Cluster for distributed rate limiting
- Consistent hashing for key distribution"
---
**Q26: What's the bottleneck in your current design?**
**Answer:**
"The most likely bottlenecks:
1. **Database writes:** Every notification = one INSERT. At high volume, this could slow down.
**Current mitigation:** HikariCP pool=50, Hibernate batch_size=50, provider_disables_autocommit.
**Further solution:** Batch inserts, or use a write-optimized DB like Cassandra.
2. **External providers:** SendGrid/Twilio have their own rate limits.
**Solution:** Respect their limits, queue excess, retry with backoff.
3. **Kafka producer throughput:** Sync sends would bottleneck at high volume.
**Current mitigation:** Fire-and-forget sends with producer batching (linger.ms=5, batch-size=64KB, lz4 compression, acks=1).
With current tuning (50 DB conns, 400 Tomcat threads, 10 Kafka consumer threads), a single instance can sustain ~8,000-12,000 req/sec. Beyond that, horizontal scaling is needed."
---
### Code Quality & Testing Questions
---
**Q27: How do you test this system?**
**Answer:**
"Multiple test levels:
1. **Unit Tests:**
- Test services with mocked dependencies
- `RateLimiterServiceTest` - mock Redis
- `NotificationServiceTest` - mock repos, Kafka
2. **Integration Tests:**
- Use real PostgreSQL (via Docker)
- Test full request flow
- Verify database state
3. **Manual/E2E Testing:**
- Swagger UI for API testing
- Postman collection included
- Docker Compose for local environment
**Example unit test:**
```java
@Test
void shouldRejectWhenRateLimitExceeded() {
when(redisTemplate.opsForValue().get(anyString())).thenReturn("10");
assertThrows(RateLimitExceededException.class, () ->
rateLimiterService.checkAndIncrement(userId, EMAIL)
);
}
```"
---
**Q28: How do you handle configuration across environments?**
**Answer:**
"Spring profiles and externalized configuration:
1. **application.yml:** Default configuration
2. **application-test.yml:** Test overrides (different DB, disabled schedulers)
3. **Environment variables:** Override for production (DB_URL, KAFKA_SERVERS)
**Example:**
```yaml
# application.yml
notification:
rate-limit:
email: ${RATE_LIMIT_EMAIL:10} # Default 10, override via env varThis follows 12-factor app principles - config in environment, not code."
Q29: What was the most challenging part of building this?
Answer: "The retry mechanism with exactly-once semantics was tricky.
Challenge: Ensuring a failed notification gets retried without:
- Losing it (if worker crashes)
- Sending duplicates (if processed twice)
Solution:
- Save to DB before publishing to Kafka (notification survives Kafka failure)
- Check status before processing (skip if already SENT)
- Manual Kafka acknowledgment (only commit after success)
- Scheduled job as backup (catches anything missed)
This gives at-least-once delivery with idempotent processing."
Q30: What would you do differently if you rebuilt this?
Answer: "A few things:
-
Event Sourcing: Store notification events instead of just current state. Enables audit trails and replay.
-
Dead Letter Queue: Instead of max retries → FAILED, move to a DLQ for manual investigation.
-
Circuit Breaker: If SendGrid is down, stop trying and fail fast. Prevent cascade failures.
-
Outbox Pattern: Instead of publish after save, use a transactional outbox for guaranteed delivery.
These are production-grade improvements I'd add in a real system."
Q31: How did you decide what to include vs. exclude?
Answer: "I followed Alex Xu's approach: solve the core problem well, explicitly scope out complexity.
Included (core features):
- Multi-channel delivery ✓
- Rate limiting ✓
- Async processing ✓
- Retry mechanism ✓
- Template system ✓
Excluded (complexity traps):
- Analytics dashboard ✗
- A/B testing ✗
- Complex scheduling (cron) ✗
- Multi-tenancy ✗
This keeps the project explainable in an interview without inviting questions I can't answer well."
Advanced topic for principal/staff-level interviews. Shows you can plan a multi-region production deployment.
┌─────────────────────────────────────────────────────────────────┐
│ GLOBAL SERVICES (Single Region) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Route 53 │ │ CloudFront │ │ WAF & Shield │ │
│ │ (DNS) │ │ (CDN) │ │ (Security) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ REGIONAL SERVICES (Per Region) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Application │ │ ElastiCache │ │ Aurora Global │ │
│ │ Load Balancer │ │ (Redis) │ │ Database │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ ECS/EKS │ │ MSK (Kafka) │ │ S3 │ │
│ │ (Containers) │ │ (Messaging) │ │ (Storage) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Component | AWS Service | Configuration | Multi-Region |
|---|---|---|---|
| Application | ECS Fargate / EKS | Auto-scaling groups, health checks | Regional deployment |
| Database | Aurora Global Database | PostgreSQL 15, multi-master | Global write/read endpoints |
| Cache | ElastiCache Global Datastore | Redis 7, cluster mode | Global replication |
| Message Queue | Amazon MSK | Kafka 3.x, multi-AZ | Regional with cross-region replication |
| Load Balancer | Application Load Balancer | SSL termination, WAF integration | Regional |
| CDN | CloudFront | Edge locations, caching rules | Global |
| DNS | Route 53 | Geo-routing, health checks | Global |
| Storage | S3 | Versioning, cross-region replication | Multi-region replication |
| Security | AWS WAF + Shield | Rate limiting, DDoS protection | Global |
| Monitoring | CloudWatch + X-Ray | Metrics, logs, tracing | Cross-region aggregation |
US-East-1 (Primary) EU-West-1 (Secondary) AP-Southeast-1 (Secondary)
├── Aurora Writer ├── Aurora Reader ├── Aurora Reader
├── Redis Primary ├── Redis Replica ├── Redis Replica
├── MSK Primary ├── MSK Replica ├── MSK Replica
└── ECS Tasks └── ECS Tasks └── ECS Tasks
US-East-1 EU-West-1 AP-Southeast-1
├── Aurora Multi-Master ├── Aurora Multi-Master ├── Aurora Multi-Master
├── Redis Global ├── Redis Global ├── Redis Global
├── MSK Global ├── MSK Global ├── MSK Global
└── ECS Global └── ECS Global └── ECS Global
- Data Location: EU-West-1 (Ireland) primary
- Data Processing: Consent management, right to erasure
- Cross-Border: Explicit user consent required
- Retention: Configurable per regulation requirements
- Data Location: US-West-2 (Oregon) primary
- Data Subject Rights: Access, delete, opt-out
- Data Mapping: Track all personal data flows
- Breach Notification: <72 hours requirement
- Data Location: AP-Southeast-1 (Singapore) primary
- Data Protection: Consent-based collection
- Cross-Border: PDPC approval for transfers
- Retention: Business necessity principle
- Spot Instances: 70% savings for non-critical workloads
- Auto-scaling: Scale to zero during low traffic
- Graviton Processors: 20% cost reduction vs x86
- Serverless v2: Pay per usage, auto-scaling
- Aurora Optimized Reads: Up to 8x faster queries
- Storage Auto-scaling: No over-provisioning
- Reserved Nodes: 50% savings for predictable workloads
- Data Tiering: Automatic cost optimization
- Cluster Mode: Better memory utilization
- Edge Locations: Reduced latency, lower data transfer costs
- Compression: Gzip/Brotli for smaller payloads
- Caching: Reduce origin requests by 80%
CloudWatch Metrics:
├── Application: Response time, error rates, throughput
├── Database: Connection count, query latency, deadlocks
├── Cache: Hit rate, memory usage, eviction count
└── Queue: Message count, consumer lag, throughput
X-Ray Integration:
├── Service mesh tracing across regions
├── Performance bottleneck identification
├── Error root cause analysis
└── User journey tracking
Critical Alerts (Page immediately):
├── Service unavailable (>5 minutes)
├── Database connection failures
├── Queue backlog > 1M messages
└── Security incidents
Warning Alerts (Monitor trends):
├── High latency (>500ms p95)
├── Error rate > 1%
├── Cache hit rate < 80%
└── Storage utilization > 80%
- Critical Services: RTO < 1 hour, RPO < 5 minutes
- Standard Services: RTO < 4 hours, RPO < 1 hour
- Data Services: RTO < 2 hours, RPO < 15 minutes
Automatic Failover:
├── Route 53 health checks trigger DNS failover
├── Aurora Global Database promotes secondary region
├── ElastiCache Global Datastore switches primary
└── ECS services scale up in secondary region
VPC Design:
├── Public subnets: Load balancers only
├── Private subnets: Application and data layers
├── Isolated subnets: Database and cache
└── Transit Gateway: Cross-region connectivity
Encryption:
├── Data at rest: AWS KMS with customer keys
├── Data in transit: TLS 1.3 minimum
├── Database: Transparent Data Encryption (TDE)
└── Cache: Redis encryption in transit/at rest
IAM Strategy:
├── Least privilege principle
├── Service roles for ECS tasks
├── Cross-account access via IAM roles
└── Multi-factor authentication required
- CloudFront: 200+ edge locations worldwide
- Route 53: Geo-based routing to nearest region
- Global Accelerator: TCP/UDP optimization
- Aurora Global Database: Sub-second cross-region replication
Horizontal Scaling:
├── ECS: Auto-scaling based on CPU/memory
├── Aurora: Auto-scaling read replicas
├── ElastiCache: Cluster scaling
└── MSK: Broker scaling
Vertical Scaling:
├── Instance types: Graviton3 for cost/performance
├── Storage: Aurora I/O optimization
├── Network: Enhanced networking
└── Memory: Redis cluster mode
Compute (ECS Fargate): $12,000 - $25,000
Database (Aurora Global): $8,000 - $15,000
Cache (ElastiCache): $3,000 - $6,000
Message Queue (MSK): $2,000 - $4,000
Storage (S3): $500 - $1,000
CDN (CloudFront): $2,000 - $4,000
Monitoring (CloudWatch): $800 - $1,500
Security (WAF/Shield): $1,000 - $2,000
─────────────────────────────────────────────
Total Estimate: $29,300 - $58,500
- 1M daily notifications: +20% infrastructure cost
- 10M daily notifications: +100% infrastructure cost
- 100M daily notifications: +300% infrastructure cost
| Question | Answer |
|---|---|
| Why multi-region? | Compliance, latency, disaster recovery |
| Database choice? | Aurora Global - cross-region replication, PostgreSQL compatibility |
| Cache strategy? | ElastiCache Global - worldwide replication, Redis 7 |
| CDN choice? | CloudFront - 200+ locations, WAF integration |
| DNS routing? | Route 53 geo-routing, health-based failover |
| Security layers? | WAF + Shield (edge), Security Groups (VPC), IAM (access) |
| Monitoring stack? | CloudWatch + X-Ray, centralized logging |
| Cost optimization? | Reserved instances, spot instances, auto-scaling |
| Disaster recovery? | RTO < 1hr, RPO < 5min for critical services |
| Compliance handling? | Regional data residency, consent management |
Good luck with your interviews!