
Commit fe0836b

docs: update readme and add architecture overview
1 parent 33f5107 commit fe0836b

2 files changed: 158 additions & 30 deletions


README.md

Lines changed: 86 additions & 30 deletions
@@ -1,8 +1,8 @@
 # LeadFlow Architecture 🚀

-![Python](https://img.shields.io/badge/python-3.8+-blue.svg)
+![Python](https://img.shields.io/badge/python-3.14+-blue.svg)
 ![License](https://img.shields.io/badge/license-MIT-green.svg)
-![Build Status](https://img.shields.io/badge/build-passing-brightgreen)
+![CI](https://github.com/PyDevDeep/LeadFlow-Architecture/actions/workflows/ci.yml/badge.svg)
 ![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)

 **LeadFlow Architecture** is a professional lead generation tool that automates the full pipeline — from data scraping to CRM integration via Webhooks and Make.com.
@@ -21,15 +21,19 @@ Built for developers and marketing teams who need to streamline lead collection
 - **Reliability & Error Handling** — Break directives with automatic retry on API failures.
 - **Lead Deduplication** — Built-in filters prevent duplicate records and skip leads without emails.
 - **Configurable Logging** — Flexible log levels: `DEBUG`, `INFO`, `ERROR`.
+- **CI Pipeline** — Automated linting (Ruff), type checking (Pyright), and tests on every push.

 ---

 ## 🛠 Tech Stack

 | Layer | Technology |
 |---|---|
-| Language | [Python 3.8+](https://www.python.org/) |
+| Language | [Python 3.14+](https://www.python.org/) |
 | Database | SQLite |
+| Validation | [Pydantic v2](https://docs.pydantic.dev/) |
+| Linting | [Ruff](https://docs.astral.sh/ruff/) |
+| Type Checking | [Pyright](https://github.com/microsoft/pyright) |
 | Testing | [Pytest](https://docs.pytest.org/) |
 | Automation | [Make.com](https://www.make.com/) |
 | Integrations | Airtable · OpenAI · Hunter.io · Instantly |
@@ -48,7 +52,7 @@ cd LeadFlow-Architecture
 ### 2. Set Up the Environment

 ```bash
-pip install -r requirements.txt
+pip install ".[dev]"
 cp .env.example .env
 ```

@@ -134,24 +138,13 @@ python main.py send

 ---

-### 🧪 Tests
+### 🧪 Run Tests

 ```bash
-# All tests with verbose output
-pytest tests/ -v
-
-# With coverage report
-pytest tests/ -v --cov=app --cov-report=term-missing
+pytest tests/test_suite.py
 ```

-| Module | Coverage |
-|--------------------|----------|
-| `validators.py` | 100% |
-| `manager.py` | 100% |
-| `serper_client.py` | 100% |
-| `worker.py` | 100% |
-| `database.py` | 100% |
-| **TOTAL** | **100%** |
+---

 ## ⚙️ Configuration

@@ -160,10 +153,33 @@ All settings are managed via the `.env` file. Copy `.env.example` and fill in th
 | Variable | Description |
 |---|---|
 | `DATABASE_PATH` | Path to the SQLite database file |
-| `SCRAPER_TIMEOUT` | HTTP request timeout (seconds) |
+| `LOG_LEVEL` | Logging verbosity (`DEBUG` / `INFO` / `ERROR`) |
+| `SERPER_API_KEY` | API key for Serper.dev (search & maps) |
+| `MAKE_LEAD_KEY` | Secret key for Make.com Webhook authentication |
 | `WEBHOOK_URL` | Destination endpoint for lead delivery |
 | `WEBHOOK_BATCH_SIZE` | Number of leads sent per batch |
-| `LOG_LEVEL` | Logging verbosity (`DEBUG` / `INFO` / `ERROR`) |
+| `SCRAPER_TIMEOUT` | HTTP request timeout in seconds |
+| `SCRAPER_RETRIES` | Number of retry attempts on request failure |
+| `SCRAPER_MAX_WORKERS` | Thread pool size for parallel scraping |
+| `SERPER_MAX_RESULTS` | Max results returned per Serper API call |
+
+---
+
+## 🔄 CI/CD
+
+The project uses **GitHub Actions** for continuous integration. The pipeline runs on every push and pull request to `main` / `master`.
+
+```
+push / PR → Ruff (lint) → Pyright (type check) → Pytest
+```
+
+| Step | Tool | Purpose |
+|---|---|---|
+| Lint | Ruff | Code style and import checks |
+| Type Check | Pyright (strict) | Static type safety across `app/` |
+| Tests | Pytest | Functional test suite with isolated env credentials |
+
+CI config: [`.github/workflows/ci.yml`](.github/workflows/ci.yml)

 ---

@@ -180,7 +196,7 @@ A ready-to-use blueprint is available in the `/automation` directory.
 - **Hunter.io** (Domain Search)
 - **OpenAI** (GPT-4o-mini)
 - **Instantly** (Lead Import)
-4. Replace all `YOUR_...` placeholders with your actual IDs (see table below)
+4. Replace all `YOUR_...` placeholders with your actual IDs (see tables below)
 5. Copy the generated Webhook URL from Module 1 → paste it into your `.env` as `WEBHOOK_URL`

 ---
@@ -235,29 +251,69 @@ For **Module 11 (Create a Record)** to function correctly, your Airtable table m

 ![Pipeline Overview](automation/Scenario_IMG.jpg)

+> For a full component breakdown and data flow diagram, see [architecture.md](architecture.md).
+
 ---

 ## 📂 Project Structure

 ```text
 .
+├── .github/
+│   └── workflows/
+│       └── ci.yml             # GitHub Actions CI pipeline
 ├── app/
-│   ├── scraper/               # Scraping modules (client, parser, logic)
-│   ├── sender/                # Webhook sending logic
-│   ├── utils/                 # Utilities and logging
-│   ├── config.py              # Environment configuration
-│   └── database.py            # Database interaction layer
-├── automation/                # Make.com blueprints and assets
+│   ├── scraper/               # Scraping modules (client, parser, logic)
+│   ├── sender/                # Webhook sending logic
+│   ├── utils/                 # Utilities and logging
+│   ├── config.py              # Environment configuration
+│   └── database.py            # Database interaction layer
+├── automation/                # Make.com blueprints and assets
 │   ├── Make.json
 │   ├── outreach_pipeline.json
 │   └── Scenario_IMG.jpg
-├── main.py                    # CLI entry point
-├── requirements.txt           # Project dependencies
-└── .env.example               # Environment variable template
+├── tests/
+│   └── test_suite.py          # Pytest test suite
+├── main.py                    # CLI entry point
+├── pyproject.toml             # Project metadata and dependencies
+├── architecture.md            # Data flow architecture and component docs
+└── .env.example               # Environment variable template
 ```

 ---

+## 🏗 Why This Architecture?
+
+LeadFlow is intentionally built around **simplicity of deployment** over distributed complexity.
+
+**SQLite over Redis or PostgreSQL:**
+- Zero infrastructure overhead — no separate server process to manage or monitor.
+- The scraping pipeline is inherently sequential per session; concurrent write pressure is minimal.
+- A single `.db` file is trivially portable, easy to back up, and inspectable without tooling.
+- Redis would add operational complexity (persistence config, eviction policy, connection pooling) with no meaningful throughput gain at this scale.
+
+**Python-side data normalization over Make.com:**
+- Make.com charges per **operation**. Pushing raw, unnormalized data and transforming it inside a scenario burns operations on every field mapping, filter, and iterator.
+- Normalizing in Python before the Webhook call means Make.com receives a clean, flat payload — one HTTP module fires, one Airtable record is created. No intermediate transformations.
+- Business logic stays in version-controlled code, not locked inside a visual no-code scenario that is harder to diff, test, or roll back.
+
+---
+
+## ⚖️ Trade-offs & Production Readiness
+
+| Dimension | Current State | Production Consideration |
+|---|---|---|
+| **Concurrency** | Multi-threaded scraping per run | No distributed task queue (Celery / RQ) — single-machine only |
+| **Database** | SQLite | Not suitable for multi-process writes or horizontal scaling |
+| **Error Recovery** | Make.com Break directives + retry | No dead-letter queue for leads that permanently fail |
+| **Observability** | File-based logging | No structured log aggregation (Datadog, Loki, etc.) |
+| **Rate Limiting** | Timeout config via `.env` | No adaptive back-off or proxy rotation built in |
+| **Auth** | API keys in `.env` | Secrets manager (Vault, AWS SSM) recommended for team deployments |
+
+> **Bottom line:** LeadFlow is optimized for **solo operators and small teams** running scheduled scraping jobs on a single machine. It is not designed for high-frequency, multi-tenant, or real-time production environments without the additions noted above.
+
+---
+
 ## 🤝 Contributing

 Contributions are welcome!

architecture.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Data Flow Architecture

## Pipeline Overview

This diagram illustrates the complete lifecycle of a lead, from the initial search query to the execution of a cold outreach campaign.

```mermaid
flowchart TD
    %% Source Layer
    CLI([CLI Input / URL File]) --> Orchestrator

    %% Orchestration & Scraping
    subgraph Worker["Python Worker Layer"]
        Orchestrator[ScrapeManager] -->|Search / Maps / Scrape| SerperAPI((Serper.dev API))
        SerperAPI -->|JSON / Markdown| Validator[Pydantic Validator & Regex]
        Validator -->|Clean Domain & Phone| DB[("SQLite: leads_queue")]
    end

    %% Delivery Layer
    subgraph Delivery["Delivery Layer"]
        DB -->|SELECT pending| Consumer[Consumer Worker]
        Consumer -->|HTTP POST Batch| Webhook((Make.com Webhook))
        Consumer -.->|UPDATE status| DB
    end

    %% Cloud Integration Layer
    subgraph Make["Make.com (SSOT & Enrichment)"]
        Webhook --> Iterator[Iterator]
        Iterator --> Filter{Domain valid?}
        Filter -- Yes --> Airtable[(Airtable SSOT)]
        Airtable --> Hunter((Hunter.io API))
        Hunter --> OpenAI(("OpenAI: First Line"))
        OpenAI --> Instantly((Instantly.ai))
    end
```

---

## 🧩 Component Breakdown

### 1. ScrapeManager
The central orchestrator. Determines the appropriate Serper API endpoint (`Maps`, `Search`, or `Scrape`) based on the selected CLI execution mode and dispatches work accordingly.
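
The routing described above can be sketched roughly as follows. The `Mode` enum, `SERPER_ENDPOINTS` mapping, and `resolve_endpoint` helper are illustrative names, not the repository's actual identifiers; the URLs shown are Serper.dev's public routes.

```python
from enum import Enum


class Mode(Enum):
    """CLI execution modes (hypothetical names)."""
    MAPS = "maps"
    SEARCH = "search"
    SCRAPE = "scrape"


# Hypothetical mode-to-endpoint mapping; URLs are Serper.dev's public routes.
SERPER_ENDPOINTS = {
    Mode.MAPS: "https://google.serper.dev/maps",
    Mode.SEARCH: "https://google.serper.dev/search",
    Mode.SCRAPE: "https://scrape.serper.dev",
}


def resolve_endpoint(mode: Mode) -> str:
    """Pick the Serper endpoint for the selected CLI mode."""
    return SERPER_ENDPOINTS[mode]
```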

### 2. Pydantic Validator & Regex
Stateless validation layer that normalizes raw API responses into clean, typed records — stripping noise, extracting domains, and formatting phone numbers before any data touches the database.

### 3. SQLite — Persistent Queue
Acts as a local buffer between the scraping and delivery layers. Ensures no leads are lost in the event of network failures, application crashes, or Make.com rate limiting. All records persist with a `status` field (`pending` / `sent` / `failed`).
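
A minimal sketch of such a queue table and its two core operations. The schema and function names are assumptions for illustration, not the project's actual `database.py`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS leads_queue (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    domain  TEXT NOT NULL UNIQUE,
    payload TEXT NOT NULL,  -- normalized lead as JSON
    status  TEXT NOT NULL DEFAULT 'pending'
            CHECK (status IN ('pending', 'sent', 'failed'))
);
"""


def fetch_pending(conn: sqlite3.Connection, batch_size: int) -> list[tuple]:
    """Read the next batch of undelivered leads for the consumer worker."""
    return conn.execute(
        "SELECT id, payload FROM leads_queue WHERE status = 'pending' LIMIT ?",
        (batch_size,),
    ).fetchall()


def mark_sent(conn: sqlite3.Connection, ids: list[int]) -> None:
    """Flip delivered rows to 'sent' so they are never re-sent."""
    conn.executemany(
        "UPDATE leads_queue SET status = 'sent' WHERE id = ?",
        [(i,) for i in ids],
    )
    conn.commit()
```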

### 4. Consumer Worker
Reads the `pending` queue in configurable batches and delivers payloads to the cloud webhook. Implements **exponential backoff** for resilient error handling on transient failures.
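
The retry loop can be sketched like this; `send_with_backoff` is an illustrative helper under assumed parameter names, not the worker's actual implementation.

```python
import random
import time


def send_with_backoff(send, payload, max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0) -> bool:
    """Retry a webhook delivery with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False  # caller marks the lead 'failed'
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay,
            # with up to 10% random jitter to avoid thundering-herd retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))
    return False
```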

### 5. Make.com — SSOT & Enrichment
Serves as the business logic orchestrator in the cloud:
- **Deduplication** — filters leads already present in Airtable.
- **Email Enrichment** — queries Hunter.io by domain to retrieve verified contact emails.
- **AI Personalization** — passes company context to OpenAI to generate a tailored cold outreach opening line.
- **Campaign Injection** — pushes the enriched, personalized lead into Instantly.ai for outreach execution.

---

## 📦 Layer Summary

| Layer | Technology | Responsibility |
|---|---|---|
| Input | CLI / `.txt` file | Query or URL list ingestion |
| Orchestration | ScrapeManager (Python) | Mode routing, API dispatch |
| Validation | Pydantic + Regex | Data normalization & typing |
| Storage | SQLite | Persistent lead queue |
| Delivery | Consumer Worker | Batched webhook dispatch |
| Enrichment | Make.com + Hunter.io | Deduplication, email lookup |
| Personalization | OpenAI (GPT-4o-mini) | AI-generated opening lines |
| Outreach | Instantly.ai | Cold email campaign execution |
