Enron Email Search System

A scalable, resilient system for cleaning, indexing, and searching the Enron email dataset. This project demonstrates advanced software architecture principles including the Scale Cube, fault isolation, and comprehensive monitoring.

Project Overview

The Enron Email Search System processes the publicly available Enron email dataset (approximately 1.7GB) and makes it searchable through a web interface. The system is designed with three main components:

Email Cleaner: Removes headers from raw email files and prepares them for indexing
Email Indexer: Creates a searchable index in a SQLite database
Search Interface: Provides a web UI for searching and viewing emails

The architecture follows clean architecture principles with separate Core, Infrastructure, and Presentation layers.
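
As a rough illustration of the cleaning step, the sketch below drops the header block of a raw message by splitting at the first blank line, which is where email headers conventionally end. It is written in Python for brevity and is not the project's .NET implementation; the function name `strip_headers` and the choice to return an empty body when no blank line exists are assumptions made for this example.

```python
def strip_headers(raw: str) -> str:
    """Return the body of a raw email, dropping everything up to the first blank line."""
    # Normalize line endings so CRLF and LF messages are handled the same way.
    text = raw.replace("\r\n", "\n")
    # Headers end at the first empty line. If there is none, this sketch
    # treats the whole message as headers and returns an empty body.
    _head, separator, body = text.partition("\n\n")
    return body if separator else ""


sample = (
    "Message-ID: <123@example.com>\n"
    "From: alice@example.com\n"
    "Subject: Q3 forecast\n"
    "\n"
    "Please review the attached forecast.\n"
)
print(strip_headers(sample))
```

Only the body text survives, which is what the indexer would then tokenize.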

Key Features

Scalable Architecture: Implementation of all three Scale Cube dimensions (X, Y, Z axes)
Fault Tolerance: Comprehensive resilience patterns using Polly
Containerization: Docker and docker-compose for easy deployment
Monitoring: Prometheus and Grafana for metrics and visualization
Performance: Optimized for speed in both indexing and searching

System Architecture

C4 Model

The system is designed using the C4 model approach with clear definitions of:

Context: System boundaries and external interactions
Containers: Major building blocks like web application, indexer, and database
Components: Internal components of each container
Code: Implementation details

Database Schema

We evolved the database schema from the one provided with the assignment. The resulting schema consists of the following entities and relationships:

Core Entities

Word: Stores unique terms found in the dataset
- WordId (PK): Unique identifier
- Text: The actual word/term
EmailFile: Stores information about email files
- FileId (PK): Unique identifier
- FileName: Original file name
- Content: Raw email content
Occurrence: Junction table tracking word occurrences in files
- WordId (PK, FK): Reference to Word
- FileId (PK, FK): Reference to EmailFile
- Count: Number of occurrences of the word in the file
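
To make the core entities concrete, here is a minimal sketch that creates the three tables in an in-memory SQLite database and runs a single-term lookup through the Occurrence junction table. Table and column names follow the list above; the column types, constraints, the `search` helper, and the sample rows are assumptions for illustration, not the project's actual migration or query code.

```python
import sqlite3

# Names follow the entity list above; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE Word (
    WordId INTEGER PRIMARY KEY,
    Text   TEXT NOT NULL UNIQUE
);
CREATE TABLE EmailFile (
    FileId   INTEGER PRIMARY KEY,
    FileName TEXT NOT NULL,
    Content  TEXT NOT NULL
);
CREATE TABLE Occurrence (
    WordId INTEGER NOT NULL REFERENCES Word(WordId),
    FileId INTEGER NOT NULL REFERENCES EmailFile(FileId),
    Count  INTEGER NOT NULL,
    PRIMARY KEY (WordId, FileId)
);
"""


def search(conn, term):
    """Return (FileName, Count) rows for files containing `term`, most frequent first."""
    return conn.execute(
        """
        SELECT f.FileName, o.Count
        FROM Word w
        JOIN Occurrence o ON o.WordId = w.WordId
        JOIN EmailFile f ON f.FileId = o.FileId
        WHERE w.Text = ?
        ORDER BY o.Count DESC, f.FileName
        """,
        (term,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample data: two files and two indexed words.
conn.executemany("INSERT INTO Word VALUES (?, ?)", [(1, "forecast"), (2, "meeting")])
conn.executemany(
    "INSERT INTO EmailFile VALUES (?, ?, ?)",
    [(1, "a.txt", "forecast forecast meeting"), (2, "b.txt", "forecast")],
)
conn.executemany(
    "INSERT INTO Occurrence VALUES (?, ?, ?)", [(1, 1, 2), (2, 1, 1), (1, 2, 1)]
)
print(search(conn, "forecast"))
```

The composite primary key on Occurrence keeps one row per word and file pair, so the count can serve directly as a simple ranking signal.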

Extended Entities

Contact: Email senders and recipients
- ContactId (PK): Unique identifier
- EmailAddress: Email address
- Name: Contact name
- Company: Company affiliation
- Position: Job position
EmailRecipient: Junction table for email recipients
- EmailId (PK, FK): Reference to EmailFile
- ContactId (PK, FK): Reference to Contact
- RecipientType: Type (To, Cc, Bcc)
Topic: Email topics for categorization
- TopicId (PK): Unique identifier
- TopicName: Topic name
- Keywords: Related keywords
TopicDocumentMapping: Junction table for email-topic relationships
- TopicId (PK, FK): Reference to Topic
- FileId (PK, FK): Reference to EmailFile
- RelevanceScore: Topic relevance score

Directory Structure

EnronEmailSearch/
├── Core/                # Domain models and business logic
│   ├── Interfaces/      # Core abstractions
│   ├── Models/          # Domain entities
│   ├── ScaleCube/       # Scaling implementations
│   └── Services/        # Core business services
├── Infrastructure/      # External concerns implementation
│   ├── Data/            # Database implementation
│   ├── DI/              # Dependency injection
│   └── Services/        # Infrastructure services
├── Indexer/             # Console application for indexing
│   ├── Program.cs       # Entry point for indexer
│   └── Dockerfile       # Container definition
├── Web/                 # Web interface
│   ├── Controllers/     # MVC controllers
│   ├── Views/           # Razor views
│   └── Dockerfile       # Container definition
└── Documentation/       # Project documentation
    └── C4Model.dsl      # C4 model definition

Scale Cube Implementation

X-Axis: Horizontal Duplication

The system implements X-axis scaling through multiple instances of the web application behind an NGINX load balancer. Configuration is managed through the XAxisConfiguration class and applied via docker-compose.

services:
  webapp:
    # ...
    deploy:
      replicas: 2  # X-axis scaling - horizontal duplication

Y-Axis: Functional Decomposition

The system can be deployed in a microservice configuration using docker-compose.microservice.yml, which separates functionality into distinct services:

Cleaner Service
Indexer Service
Search API Service
Web UI Service

Z-Axis: Data Partitioning

Data processing is partitioned using various sharding strategies implemented in the ZAxisScaling class:

Range-based sharding
Hash-based sharding
Directory-based sharding
Modulo hash sharding
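
Two of the strategies listed above can be sketched in a few lines. The Python below is an illustration only and does not reproduce the ZAxisScaling class; the function names, the choice of SHA-256 as the stable hash, and the example boundaries are all assumptions.

```python
import bisect
import hashlib


def modulo_hash_shard(file_name: str, shard_count: int) -> int:
    """Map a file name to a shard using a stable hash modulo the shard count."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would assign shards inconsistently.
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def range_shard(file_name: str, boundaries: list[str]) -> int:
    """Map a file name to a shard by comparing it against sorted upper bounds.

    With boundaries ["h", "p", "t"], names below "h" go to shard 0, names in
    ["h", "p") to shard 1, ["p", "t") to shard 2, and the rest to shard 3.
    """
    return bisect.bisect_right(boundaries, file_name.lower())


# Hypothetical mailbox paths, partitioned into four shards each way.
files = ["allen-p/inbox/1", "kaminski-v/sent/7", "taylor-m/inbox/3", "zipper-a/all/9"]
for name in files:
    print(name, modulo_hash_shard(name, 4), range_shard(name, ["h", "p", "t"]))
```

Hash-based assignment tends to spread load evenly regardless of naming, while range-based assignment keeps neighboring names together at the risk of uneven shard sizes.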

Fault Isolation

The system implements comprehensive fault isolation using Polly patterns:

Circuit Breakers: Prevent cascading failures
Retry Policies: Handle transient errors
Timeout Policies: Prevent hanging operations
Bulkheads: Isolate components from each other

Key resilience classes:

ResilienceService: Implements core resilience patterns
ResilientEmailIndexer: Decorates the indexer with resilience
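
The retry and circuit breaker patterns listed above can be illustrated without Polly. The following Python sketch is a simplified stand-in, not the ResilienceService implementation and not Polly's API; every class, function, and parameter name in it is hypothetical, and the thresholds are arbitrary.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying cannot help while the circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# A hypothetical indexing call that fails twice before succeeding.
calls = {"n": 0}


def flaky_index():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "indexed"


breaker = CircuitBreaker(failure_threshold=5)
print(retry(lambda: breaker.call(flaky_index), sleep=lambda seconds: None))
```

Wrapping the breaker inside the retry means transient errors are retried, while a tripped breaker fails fast instead of adding load to a struggling dependency.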

Monitoring

Monitoring is implemented with:

Prometheus: Metrics collection
Grafana: Visualization dashboards
Health Checks: Active monitoring of system components
Serilog: Structured logging

Getting Started

Prerequisites

Docker and Docker Compose
The Enron email dataset (downloadable from https://www.cs.cmu.edu/~enron/)

Installation and Running

Clone the repository:

git clone https://github.com/yourusername/EnronEmailSearch.git

Create directories for data:

mkdir -p enron_dataset cleaned_emails db

Download and extract the Enron dataset to the enron_dataset directory.

Start the system:

```
docker-compose up
```

For microservice deployment:

docker-compose -f docker-compose.microservice.yml up

Accessing the System

Web Interface: http://localhost:80
Grafana Dashboards: http://localhost:3000 (admin/admin)
Prometheus: http://localhost:9090

Performance

Performance metrics for the system:

| Configuration | Files/Second | Total Processing Time |
| --- | --- | --- |
| Single thread | 152 | 43 min |
| Multi-thread | 487 | 14 min |
| Z-axis (4 shards) | 1,892 | 3.5 min |
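
For a quick read of these numbers, the short Python sketch below computes each configuration's throughput relative to the single-threaded run and the number of files each row implies (rate multiplied by duration). The figures are copied from the table above; the variable names are only for this example.

```python
# (files per second, total minutes), copied from the performance table.
results = {
    "Single thread": (152, 43.0),
    "Multi-thread": (487, 14.0),
    "Z-axis (4 shards)": (1892, 3.5),
}

base_rate = results["Single thread"][0]
# Throughput relative to the single-threaded baseline.
speedups = {name: rate / base_rate for name, (rate, _) in results.items()}
# Files implied by each row: rate * minutes * 60 seconds.
implied_files = {name: rate * minutes * 60 for name, (rate, minutes) in results.items()}

for name in results:
    print(f"{name}: {speedups[name]:.1f}x, ~{implied_files[name]:,.0f} files")
```

The multi-threaded run comes out at roughly 3.2x the baseline and the four-shard run at roughly 12.4x, and all three rows imply a workload of about 400,000 files, so the rates and durations are consistent with one another.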
