LeadFlow Architecture 🚀


LeadFlow Architecture is a professional lead generation tool that automates the full pipeline, from data scraping to CRM integration via webhooks and Make.com.

Built for developers and marketing teams who need to streamline lead collection and outreach workflows at scale.


✨ Features

  • Automated Scraping – Extract leads from Google Maps, Google Search, or custom URL lists.
  • 4 Scraping Modes – Surface search, deep scrape, hybrid, and file-based pipelines.
  • Data Management – Local SQLite storage for efficient processing and queue tracking.
  • Webhook Integration – Push validated leads to Make.com, Zapier, or any HTTP endpoint.
  • Outreach Pipeline – Native integration with Airtable, OpenAI, Hunter.io, and Instantly.
  • Reliability & Error Handling – Make.com Break directives with automatic retry on API failures.
  • Lead Deduplication – Built-in filters prevent duplicate records and skip leads without emails.
  • Configurable Logging – Flexible log levels: DEBUG, INFO, and ERROR.
  • CI Pipeline – Automated linting (Ruff), type checking (Pyright), and tests on every push.

🛠 Tech Stack

| Layer | Technology |
| --- | --- |
| Language | Python 3.14+ |
| Database | SQLite |
| Validation | Pydantic v2 |
| Linting | Ruff |
| Type Checking | Pyright |
| Testing | Pytest |
| Automation | Make.com |
| Integrations | Airtable · OpenAI · Hunter.io · Instantly |

📦 Installation

1. Clone the Repository

git clone https://github.com/PyDevDeep/LeadFlow-Architecture.git
cd LeadFlow-Architecture

2. Set Up the Environment

pip install ".[dev]"
cp .env.example .env

Edit .env with your credentials before proceeding.

3. Initialize the Database

python main.py init

🖥 CLI Usage Guide

All commands are executed from the project root directory.

Database Initialization

Must be run once before any scraping pipeline.

python main.py init
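For reference, initialization boils down to creating the local queue tables. A minimal sketch, assuming a hypothetical `leads` table; the real schema in app/database.py may differ:

```python
import sqlite3

def init_db(path="leads.db"):
    """Create the lead queue table if it does not exist (illustrative schema)."""
    con = sqlite3.connect(path)
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS leads (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            domain TEXT UNIQUE,            -- UNIQUE gives cheap deduplication
            email TEXT,
            status TEXT DEFAULT 'pending'  -- pending -> sent
        )
        """
    )
    con.commit()
    con.close()

init_db(":memory:")  # in-memory example; real runs would use DATABASE_PATH
```

A UNIQUE constraint like the one above lets the database itself reject duplicate domains at insert time.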

Scraping Pipelines

Scenario 1 – Google Maps Scraper

Extracts local business data from Google Maps based on a search query.

python main.py maps -q "<your_query>"

# Example
python main.py maps -q "dental clinics in Kyiv"
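Search and maps data come from Serper.dev (see SERPER_API_KEY in the configuration). The sketch below follows Serper's public API shape; the endpoint, payload keys, and the exact fields this project sends are assumptions to verify against your own account:

```python
import json
import urllib.request

SERPER_PLACES_URL = "https://google.serper.dev/places"  # assumed from Serper's public docs

def build_serper_request(query: str, api_key: str, max_results: int = 10):
    """Build the HTTP request for a Serper.dev local-business search."""
    return urllib.request.Request(
        SERPER_PLACES_URL,
        data=json.dumps({"q": query, "num": max_results}).encode(),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )

def serper_places(query: str, api_key: str, max_results: int = 10) -> dict:
    """Execute the search and return the decoded JSON response."""
    req = build_serper_request(query, api_key, max_results)
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)

# e.g. serper_places("dental clinics in Kyiv", os.environ["SERPER_API_KEY"])
```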

Scenario 2 – Google Search (Surface Level)

Extracts basic snippets and URLs from Google organic search results.

python main.py search -q "<your_query>"

# Example
python main.py search -q "top digital marketing agencies"

Scenario 3 – Hybrid (Search + Deep Scrape)

Runs an organic search, then performs multi-threaded deep scraping on each discovered domain to extract contact details and metadata.

python main.py hybrid -q "<your_query>"

# Example
python main.py hybrid -q "software development outsourcing ukraine"

Scenario 4 – File-Based Deep Scrape

Reads URLs from a .txt file and runs a multi-threaded deep scrape on each target.

python main.py file -f <filepath>

# Example
python main.py file -f urls_list.txt
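Scenarios 3 and 4 both describe a multi-threaded deep scrape. That pattern can be sketched with a standard thread pool; `deep_scrape` is a stand-in for the real fetch-and-parse step, and `max_workers` corresponds to SCRAPER_MAX_WORKERS:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def deep_scrape(url: str) -> dict:
    """Placeholder for the real per-URL scrape (HTTP fetch + contact parsing)."""
    return {"url": url, "emails": []}  # illustrative result shape

def scrape_all(urls, max_workers=8):
    """Scrape every URL concurrently, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(deep_scrape, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                pass  # one failed target should not abort the whole run
    return results
```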

Push Leads to Webhook

Processes the pending queue and pushes validated leads to the configured Webhook endpoint.

python main.py send
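A hedged sketch of that send step, matching the described batch-plus-retry behavior; the header name (`X-Lead-Key`) and payload shape are illustrative, not the project's actual contract:

```python
import json
import time
import urllib.request

def send_batch(webhook_url: str, leads: list, secret: str, retries: int = 3) -> bool:
    """POST one batch of leads to the webhook, retrying with exponential backoff.

    The auth header and payload shape are assumptions; match them to whatever
    your Make.com webhook actually expects.
    """
    body = json.dumps({"leads": leads}).encode()
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                webhook_url,
                data=body,
                headers={"Content-Type": "application/json", "X-Lead-Key": secret},
            )
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s ... between attempts
    return False
```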

🧪 Run Tests

pytest tests/test_suite.py

⚙️ Configuration

All settings are managed via the .env file. Copy .env.example and fill in the values:

| Variable | Description |
| --- | --- |
| DATABASE_PATH | Path to the SQLite database file |
| LOG_LEVEL | Logging verbosity (DEBUG / INFO / ERROR) |
| SERPER_API_KEY | API key for Serper.dev (search & maps) |
| MAKE_LEAD_KEY | Secret key for Make.com webhook authentication |
| WEBHOOK_URL | Destination endpoint for lead delivery |
| WEBHOOK_BATCH_SIZE | Number of leads sent per batch |
| SCRAPER_TIMEOUT | HTTP request timeout in seconds |
| SCRAPER_RETRIES | Number of retry attempts on request failure |
| SCRAPER_MAX_WORKERS | Thread pool size for parallel scraping |
| SERPER_MAX_RESULTS | Max results returned per Serper API call |
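The project validates these settings with Pydantic v2. As a dependency-free illustration of the same loading pattern, here is a stdlib-only loader for the variables above (the defaults are assumptions, not the project's real ones):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Typed view of the .env variables (defaults below are illustrative)."""
    database_path: str
    log_level: str
    serper_api_key: str
    webhook_url: str
    webhook_batch_size: int
    scraper_timeout: int
    scraper_retries: int
    scraper_max_workers: int

def load_settings(env=os.environ) -> Settings:
    """Read each variable from the environment, coercing numeric fields to int."""
    return Settings(
        database_path=env.get("DATABASE_PATH", "leads.db"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        serper_api_key=env.get("SERPER_API_KEY", ""),
        webhook_url=env.get("WEBHOOK_URL", ""),
        webhook_batch_size=int(env.get("WEBHOOK_BATCH_SIZE", "50")),
        scraper_timeout=int(env.get("SCRAPER_TIMEOUT", "15")),
        scraper_retries=int(env.get("SCRAPER_RETRIES", "3")),
        scraper_max_workers=int(env.get("SCRAPER_MAX_WORKERS", "8")),
    )
```

Pydantic adds stricter validation on top of this (e.g. rejecting non-numeric values with a clear error instead of a raw ValueError).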

🔄 CI/CD

The project uses GitHub Actions for continuous integration. The pipeline runs on every push and pull request to main / master.

push / PR → Ruff (lint) → Pyright (type check) → Pytest

| Step | Tool | Purpose |
| --- | --- | --- |
| Lint | Ruff | Code style and import checks |
| Type Check | Pyright (strict) | Static type safety across app/ |
| Tests | Pytest | Functional test suite with isolated env credentials |

CI config: .github/workflows/ci.yml


🤖 Make.com Automation

A ready-to-use blueprint is available in the /automation directory.

Quick Setup

  1. Download Make.json or outreach_pipeline.json from /automation
  2. In Make.com, create a new scenario → Import Blueprint
  3. Connect each module marked with a red !:
    • Airtable (Production)
    • Hunter.io (Domain Search)
    • OpenAI (GPT-4o-mini)
    • Instantly (Lead Import)
  4. Replace all YOUR_... placeholders with your actual IDs (see tables below)
  5. Copy the generated webhook URL from Module 1 and paste it into your .env as WEBHOOK_URL
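After step 5, it can help to smoke-test the imported scenario by posting one fake lead to the new webhook URL before running a full pipeline. A minimal sketch (the payload fields are illustrative):

```python
import json
import urllib.request

def ping_webhook(url: str) -> int:
    """Send a single test lead to the webhook and return the HTTP status code."""
    payload = {"domain": "example.com", "email": "test@example.com"}  # illustrative lead
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# status = ping_webhook(os.environ["WEBHOOK_URL"])  # a 200 means the scenario accepted it
```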

🔑 Service Identifiers

| Variable | Description | Where to Find |
| --- | --- | --- |
| YOUR_BASE_ID | Unique ID of your Airtable base | Airtable URL / API docs |
| YOUR_TABLE_ID | Name or ID of the target table | Airtable table URL |
| YOUR_INSTANTLY_CAMPAIGN_ID | ID of your outreach campaign | Instantly.ai dashboard |

🗂 Airtable Field Mapping

For Module 11 (Create a Record) to function correctly, your Airtable table must include columns matching these identifiers:

| Placeholder | Field Description |
| --- | --- |
| YOUR_FIELD_ID_DOMAIN | Company domain (e.g. example.com) |
| YOUR_FIELD_ID_EMAIL | Primary contact email found by Hunter |
| YOUR_FIELD_ID_COMPANY_NAME | Organization name |
| YOUR_FIELD_ID_URL | Full website URL |
| YOUR_FIELD_ID_DB_ID | Original ID from the lead source / webhook |
| YOUR_FIELD_ID_PHONE | Contact phone number |
| YOUR_FIELD_ID_DESCRIPTION | Company description used for AI context |
| YOUR_FIELD_ID_SOURCE_METHOD | Label indicating lead origin |
| YOUR_FIELD_ID_AI_RESPONSE | GPT-generated personalized opening line |
| YOUR_FIELD_ID_STATUS | Lead status (Ready / Done) |

🛡️ Reliability Features

  • Error Handling – Break directives on the OpenAI, Hunter, and Instantly modules; Make.com auto-retries on service failures.
  • Lead Filtering – Skips leads already present in Airtable and leads without a valid email, reducing unnecessary API token usage.

🔐 Security Audit

| Check | Status |
| --- | --- |
| API keys / passwords in JSON | ✅ None found |
| `__IMTCONN__` connection fields | ✅ All set to null |
| Sensitive placeholders | ✅ Correctly masked |

🖼 Pipeline Overview


For a full component breakdown and data flow diagram, see architecture.md.


📂 Project Structure

.
├── .github/
│   └── workflows/
│       └── ci.yml                # GitHub Actions CI pipeline
├── app/
│   ├── scraper/                  # Scraping modules (client, parser, logic)
│   ├── sender/                   # Webhook sending logic
│   ├── utils/                    # Utilities and logging
│   ├── config.py                 # Environment configuration
│   └── database.py               # Database interaction layer
├── automation/                   # Make.com blueprints and assets
│   ├── Make.json
│   ├── outreach_pipeline.json
│   └── Scenario_IMG.jpg
├── tests/
│   └── test_suite.py             # Pytest test suite
├── main.py                       # CLI entry point
├── pyproject.toml                # Project metadata and dependencies
├── architecture.md               # Data flow architecture and component docs
└── .env.example                  # Environment variable template

πŸ— Why This Architecture?

LeadFlow is intentionally built around simplicity of deployment over distributed complexity.

SQLite over Redis or PostgreSQL:

  • Zero infrastructure overhead – no separate server process to manage or monitor.
  • The scraping pipeline is inherently sequential per session; concurrent write pressure is minimal.
  • A single .db file is trivially portable, easy to back up, and inspectable without extra tooling.
  • Redis would add operational complexity (persistence config, eviction policy, connection pooling) with no meaningful throughput gain at this scale.

Python-side data normalization over Make.com:

  • Make.com charges per operation. Pushing raw, unnormalized data and transforming it inside a scenario burns operations on every field mapping, filter, and iterator.
  • Normalizing in Python before the webhook call means Make.com receives a clean, flat payload: one HTTP module fires, one Airtable record is created. No intermediate transformations.
  • Business logic stays in version-controlled code, not locked inside a visual no-code scenario that is harder to diff, test, or roll back.
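Concretely, "normalize in Python" means flattening raw scraper output into exactly the fields the scenario maps, so no transformation happens inside Make.com. A sketch with illustrative field names (the real pipeline's schema may differ):

```python
def normalize_lead(raw: dict) -> dict:
    """Flatten a raw scraped record into a clean, flat webhook payload."""
    contact = raw.get("contact") or {}
    return {
        "domain": (raw.get("domain") or "").lower().strip(),
        "email": (contact.get("email") or "").lower().strip(),
        "company_name": (raw.get("name") or "").strip(),
        "phone": contact.get("phone") or "",
        "source_method": raw.get("source", "unknown"),
    }

raw = {"domain": "Example.COM ", "name": " Acme ", "contact": {"email": "Hi@Example.com"}}
print(normalize_lead(raw))
```

Every key the scenario needs is present (empty string rather than missing), so the Airtable module can map fields one-to-one with no filters or iterators in between.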

⚖️ Trade-offs & Production Readiness

| Dimension | Current State | Production Consideration |
| --- | --- | --- |
| Concurrency | Multi-threaded scraping per run | No distributed task queue (Celery / RQ); single-machine only |
| Database | SQLite | Not suitable for multi-process writes or horizontal scaling |
| Error Recovery | Make.com Break directives + retry | No dead-letter queue for leads that permanently fail |
| Observability | File-based logging | No structured log aggregation (Datadog, Loki, etc.) |
| Rate Limiting | Timeout config via .env | No adaptive back-off or proxy rotation built in |
| Auth | API keys in .env | Secrets manager (Vault, AWS SSM) recommended for team deployments |

Bottom line: LeadFlow is optimized for solo operators and small teams running scheduled scraping jobs on a single machine. It is not designed for high-frequency, multi-tenant, or real-time production environments without the additions noted above.


🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit your changes: git commit -m 'Add some feature'
  4. Push to the branch: git push origin feature/your-feature
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License.

About

A resilient end-to-end lead generation pipeline featuring a local SQLite queue, exponential backoff delivery, and automated enrichment via Make.com & Airtable.
