Python Scripts

This directory contains scripts for the data pipeline.

Instructions

To install dependencies:

cd python
uv sync

When a new sitting is added to the Hansard, we need to:

  1. Ingest that sitting's transcript in our desired format using the Hansard API
  2. Generate summaries for the questions, bills, and motions in that sitting
  3. Update the summaries for the MPs' contributions based on any new involvements from this sitting

For example, if the sitting on 27 February 2026 has just been added, we would run:

uv run batch_process_sqlite.py 27-02-2026
uv run generate_summaries_sqlite.py --sittings 27-02-2026
uv run generate_summaries_sqlite.py --members

These scripts are described in more detail below.
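
If you run these three steps often, they can be chained from a small wrapper. The following is only a sketch: the helper names `build_commands` and `run_pipeline` are hypothetical and not part of this directory, and the wrapper assumes it is run from the `python` directory with `uv` on the PATH.

```python
import subprocess


def build_commands(sitting_date: str) -> list:
    """Return the three pipeline commands for one sitting date (DD-MM-YYYY)."""
    return [
        # 1. Ingest the sitting's transcript
        ["uv", "run", "batch_process_sqlite.py", sitting_date],
        # 2. Summarise the sections of that sitting
        ["uv", "run", "generate_summaries_sqlite.py", "--sittings", sitting_date],
        # 3. Refresh the MP summaries
        ["uv", "run", "generate_summaries_sqlite.py", "--members"],
    ]


def run_pipeline(sitting_date: str) -> None:
    """Run each step in order; check=True stops at the first failing step."""
    for command in build_commands(sitting_date):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("27-02-2026")` would then run the same three commands shown above, in order.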

Main Scripts

batch_process_sqlite.py

Ingests parliament sitting data for a given date range (inclusive of both start and end) into the SQLite database at data/parliament.db.

Usage

uv run batch_process_sqlite.py START_DATE [END_DATE]

Examples

# Single date
uv run batch_process_sqlite.py 14-01-2026

# Range of dates
uv run batch_process_sqlite.py 12-01-2026 14-01-2026
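
To illustrate what "inclusive of both start and end" means for DD-MM-YYYY arguments, here is a minimal sketch that expands a range into individual dates. The function `dates_in_range` is a hypothetical helper written for this example; the script's actual parsing may differ.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional

DATE_FORMAT = "%d-%m-%Y"  # matches arguments such as 14-01-2026


def parse_date(text: str) -> date:
    """Parse a DD-MM-YYYY string into a date."""
    return datetime.strptime(text, DATE_FORMAT).date()


def dates_in_range(start: str, end: Optional[str] = None) -> List[str]:
    """List every date from start to end, including both endpoints."""
    first = parse_date(start)
    last = parse_date(end) if end else first  # END_DATE is optional
    if last < first:
        raise ValueError("END_DATE must not be earlier than START_DATE")
    days = (last - first).days
    return [
        (first + timedelta(days=offset)).strftime(DATE_FORMAT)
        for offset in range(days + 1)  # +1 keeps the end date in the range
    ]


print(dates_in_range("12-01-2026", "14-01-2026"))
# ['12-01-2026', '13-01-2026', '14-01-2026']
```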

generate_summaries_sqlite.py

Generates AI summaries for sitting sections and MP profiles using Gemini. The --only-blank flag generates summaries only for entries that don't have one yet.

Usage

# For sittings
uv run generate_summaries_sqlite.py --sittings START_DATE [END_DATE] [--only-blank]

# For MPs
uv run generate_summaries_sqlite.py --members [--only-blank]

Examples

# Range of dates
uv run generate_summaries_sqlite.py --sittings 12-01-2026 14-01-2026

# MPs (based on last 20 contributions)
uv run generate_summaries_sqlite.py --members

# Only fill in missing summaries
uv run generate_summaries_sqlite.py --sittings 12-01-2026 --only-blank
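
The effect of --only-blank can be pictured as an extra filter on the rows selected for summarisation. The sketch below uses an in-memory database with a hypothetical `sections` table and `summary` column; the real schema in data/parliament.db and the real query are not shown in this README and may differ.

```python
import sqlite3

# Hypothetical schema, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sections (id INTEGER PRIMARY KEY, title TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO sections (title, summary) VALUES (?, ?)",
    [
        ("Oral question A", "Existing summary"),
        ("Bill B", None),  # never summarised
        ("Motion C", ""),  # empty summary
    ],
)


def sections_to_summarise(conn: sqlite3.Connection, only_blank: bool) -> list:
    """Return (id, title) rows that should receive a summary."""
    query = "SELECT id, title FROM sections"
    if only_blank:
        # Skip rows that already have a non-empty summary
        query += " WHERE summary IS NULL OR TRIM(summary) = ''"
    query += " ORDER BY id"
    return conn.execute(query).fetchall()


print(sections_to_summarise(conn, only_blank=True))
# [(2, 'Bill B'), (3, 'Motion C')]
```

Without the filter, all three rows would be selected and any existing summary regenerated.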

Supporting Modules

File                   Description
db_sqlite.py           Database connection and CRUD operations for SQLite
hansard_api.py         Client for fetching data from the Hansard API
parliament_sitting.py  Parsing and structuring of sitting data
prompts.py             Prompt templates for AI summary generation
util.py                Shared utility functions

Manual changes

Below are the manual changes that were made post-ingestion.

Changes in ministry names

  • Before 8 July 2024, the Ministry of Digital Development and Information was known as the Ministry of Communications and Information.
  • Before 25 July 2020, the Ministry of Sustainability and the Environment was known as the Ministry of the Environment and Water Resources.

Sections dated before these changes were re-categorised under the new ministry names after ingestion, using an ad hoc script.
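
The ad hoc script itself is not included here. A re-categorisation of this kind could look roughly like the sketch below, which assumes a hypothetical `sections` table with a `ministry` column holding the full ministry name; the table, the column, and the exact stored strings are all assumptions.

```python
import sqlite3

# Old name -> new name (stored strings are assumed, not taken from the database)
RENAMED_MINISTRIES = {
    "Ministry of Communications and Information":
        "Ministry of Digital Development and Information",
    "Ministry of the Environment and Water Resources":
        "Ministry of Sustainability and the Environment",
}


def recategorise(conn: sqlite3.Connection) -> int:
    """Move sections filed under an old ministry name to the new name."""
    updated = 0
    with conn:  # commits on success, rolls back if an update fails
        for old_name, new_name in RENAMED_MINISTRIES.items():
            cursor = conn.execute(
                "UPDATE sections SET ministry = ? WHERE ministry = ?",
                (new_name, old_name),
            )
            updated += cursor.rowcount
    return updated


# Small in-memory demonstration with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, ministry TEXT)")
conn.executemany(
    "INSERT INTO sections (ministry) VALUES (?)",
    [
        ("Ministry of Communications and Information",),
        ("Ministry of the Environment and Water Resources",),
        ("Ministry of Sustainability and the Environment",),
    ],
)
print(recategorise(conn))  # 2
```

Running the update inside a single transaction means a failure partway through leaves the table unchanged.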

Errors in the Hansard

This list covers only the errors we have found so far, and there are very likely others that we are not aware of. Feel free to contact us or raise an issue if you identify more that need correcting!