Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion builders/scripts/faang-daily-close/0.1.0/builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ def build(
dependencies: dict[str, dict[datetime, list[dict]]], timestamp: datetime
) -> list[dict[str, Any]]:
"""Fetch NYSE daily close prices for FAANG stocks from EODHD."""
# APIClient is instantiated here, not at module level — isolated_worker.py
# APIClient is instantiated here, not at module level — subprocess_worker.py
# loads the .env before calling build() but after importing the module,
# so module-level env access would fail
api = APIClient(os.environ["EODHD_API_KEY"])
Expand Down
14 changes: 8 additions & 6 deletions dev-docs/SPEC-backend.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ Each entry in `rows` contains all data dicts for that timestamp (matching the DB
- It dynamically imports builder scripts to run them (see below).
- All builder scripts will be implemented by internal users, so we may trust that all code is safe.
- But we must not trust that builder scripts will not crash, so each builder invocation runs in an isolated subprocess.
- Each builder runs in its own `subprocess.Popen` process, communicating via JSON over stdin/stdout. A standalone worker script (`isolated_worker.py`) handles deserialization, builder import, and result serialization. It uses only stdlib so it works in any venv.
- Each builder runs in its own `subprocess.Popen` process, communicating via JSON over stdin/stdout. A standalone worker script (`subprocess_worker.py`) handles deserialization, builder import, and result serialization. It uses only stdlib so it works in any venv.
- If the subprocess crashes, the main process catches the failure cleanly without going down.
- Builders **never** have direct access to the database -- all reads and writes are handled by the server. For now, this is enforced by convention. TODO: add a runtime guard to enforce this.
- After builder scripts are run, we upload the data to the Postgres database (see below).
Expand Down Expand Up @@ -218,15 +218,16 @@ builders/server/
├── db/ # database connection management and queries
│ ├── connection.py
│ └── datasets.py
├── runtime/ # config loading, subprocess isolation, schema validation, venv management
├── runtime/ # config loading, schema validation, venv management
│ ├── config.py
│ ├── registry.py # startup preload + in-memory config registry
│ ├── isolated_worker.py # standalone worker script (stdlib-only, runs in builder subprocesses)
│ ├── loader.py
│ ├── runner.py
│ ├── serialization.py # JSON serialization for subprocess IPC
│ ├── validator.py
│ └── venv_management.py # per-builder venv creation and caching
├── workers/ # subprocess entry points
│ └── subprocess_worker.py # standalone worker script (stdlib-only, runs in builder subprocesses)
├── utils/ # shared utilities
│ ├── retry.py # generic retry with exponential backoff
│ └── semver.py
Expand All @@ -235,6 +236,7 @@ builders/server/
├── service/
├── db/
├── runtime/
├── workers/
└── utils/
```

Expand Down Expand Up @@ -474,7 +476,7 @@ Builder scripts may need secrets or config (API keys, credentials) passed via en
**Runtime behavior**:
- The `.env` file is validated at build time, not config load time. This means CI can load configs for datasets with `env-vars = true` without needing the actual `.env` file present.
- If `env-vars` is `true` and the `.env` file is missing at build time, a `FileNotFoundError` is raised.
- The main server process never reads the `.env` values. The `.env` file is parsed inside the subprocess only (using a minimal stdlib-only parser in `isolated_worker.py`), so secrets never enter the parent process memory.
- The main server process never reads the `.env` values. The `.env` file is parsed inside the subprocess only (using a minimal stdlib-only parser in `subprocess_worker.py`), so secrets never enter the parent process memory.
- Environment variables are scoped to the subprocess and do not leak to the parent.

### Per-builder virtual environments
Expand Down Expand Up @@ -514,8 +516,8 @@ builders/scripts/<dataset_name>/<version>/
Builder subprocesses use `subprocess.Popen` with JSON over stdin/stdout for IPC:

1. The runner serializes builder inputs (dependencies, timestamp, paths, env file) to JSON via `serialization.py`
2. The subprocess runs `isolated_worker.py` using the builder's venv python (or system python)
3. `isolated_worker.py` is stdlib-only: it deserializes input, imports and runs the builder, serializes output
2. The subprocess runs `subprocess_worker.py` using the builder's venv python (or system python)
3. `subprocess_worker.py` is stdlib-only: it deserializes input, imports and runs the builder, serializes output
4. The runner deserializes the JSON response from stdout

This model supports per-builder venvs since each subprocess uses its own Python interpreter. The worker script has no dependencies on server code.
Expand Down
Loading