feat(backend): pdf + txt report generator #45
Pull request overview
This pull request adds PDF and text report generation functionality for project transcripts. The feature allows users to export formatted transcripts of their interview projects, including speaker utterances and AI analyses, in both plain text and PDF formats.
Changes:
- Added `reportlab` (v4.4.9) and `pillow` (v12.1.0) dependencies for PDF generation capabilities
- Created new standalone script `generate_transcript.py` with CLI interface for generating transcript reports
- Implemented PDF formatting with custom styles, colors, and layouts for professional-looking reports
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| backend/uv.lock | Added reportlab and pillow dependencies with all platform-specific wheels |
| backend/pyproject.toml | Added reportlab>=4.4.9 to project dependencies |
| backend/src/generate_transcript.py | New script implementing transcript generation logic with PDF and text output |
```python
    TableStyle,
)
from reportlab.lib.enums import TA_CENTER
from dateutil.relativedelta import relativedelta
```
Missing dependency: The code imports 'dateutil.relativedelta' but 'python-dateutil' is not listed as a dependency in pyproject.toml. While it may be a transitive dependency, it should be explicitly declared since the code directly imports from it.
```python
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    cursor = conn.cursor()

    # Fetch all transcriptions for the project, ordered by transcription_id (ULID)
    # Since ULID is chronologically sortable, sorting by transcription_id gives us time order
    query = """
        SELECT
            transcription_id,
            speaker,
            text_output,
            created_at,
            session_id,
            user_id
        FROM transcriptions
        WHERE project_id = ?
        ORDER BY transcription_id ASC
    """

    _ = cursor.execute(query, (project_id,))
    rows = cursor.fetchall()

    if not rows:
        conn.close()
        return f"No transcriptions found for project: {project_id}"

    # Get project name
    _ = cursor.execute("SELECT name FROM project WHERE project_id = ?", (project_id,))
    project_row = cursor.fetchone()  # pyright: ignore[reportAny]
    project_name: str = (
        cast(str, project_row["name"])
        if project_row and project_row["name"]
        else "Untitled Project"
    )

    # Fetch AI analyses for the project
    ai_analyses_query = """
        SELECT
            analysis_id,
            text,
            span,
            transcript_context_start,
            transcript_context_end,
            summary
        FROM ai_analyses
        WHERE project_id = ?
        ORDER BY transcript_context_end ASC
    """
    _ = cursor.execute(ai_analyses_query, (project_id,))
    ai_analyses = cursor.fetchall()

    # Build a map of transcription_id -> list of analyses that end at that transcription
    analyses_by_end: dict[str, list[dict[str, str]]] = {}
    for analysis in ai_analyses:  # pyright: ignore[reportAny]
        end_id = cast(str, analysis["transcript_context_end"])
        if end_id not in analyses_by_end:
            analyses_by_end[end_id] = []
        analyses_by_end[end_id].append(
            {
                "analysis_id": cast(str, analysis["analysis_id"]),
                "text": cast(str, analysis["text"]),
                "span": cast(str, analysis["span"]) if analysis["span"] else "",
                "summary": cast(str, analysis["summary"]),
                "start": cast(str, analysis["transcript_context_start"]),
                "end": end_id,
            }
        )

    conn.close()
```
Resource leak: The database connection is not closed on every code path. If an exception occurs after `conn = sqlite3.connect(db_path)` but before `conn.close()`, the connection stays open. Wrap the connection in `contextlib.closing(...)` or use `try`/`finally` so that `conn.close()` always runs. Note that `with sqlite3.connect(db_path) as conn:` on its own only commits or rolls back the transaction; it does not close the connection.
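A minimal sketch of the cleanup pattern, assuming the `project` table and column names shown in the diff. The helper name `fetch_project_name` is hypothetical and not part of the PR. It uses `contextlib.closing`, because the `sqlite3` connection's own context manager handles transactions rather than closing the connection.

```python
import sqlite3
from contextlib import closing


def fetch_project_name(db_path: str, project_id: str) -> str:
    """Return the project name, closing the connection even if a query raises.

    Hypothetical helper for illustration; table and column names follow the diff.
    """
    # closing() guarantees conn.close() runs when the block exits,
    # whether it exits normally or through an exception.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT name FROM project WHERE project_id = ?", (project_id,)
        ).fetchone()
    # The fetched row stays usable after the connection is closed.
    return row["name"] if row and row["name"] else "Untitled Project"
```

The same wrapper would fit around the whole query section of the script, which also removes the need for the separate `conn.close()` call on the early-return path.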
```python
    if output_file:
        output_path = Path(output_file)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        _ = output_path.write_text(transcript, encoding="utf-8")
        print(f"Transcript saved to: {output_file}")
```
File overwrite without warning: Both the text transcript and PDF files are written without checking if they already exist, and without asking for user confirmation. This could silently overwrite existing files. Consider adding a check or warning when files already exist, or provide a --force flag for the CLI.
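One way the `--force` idea could look, as a sketch rather than the PR's actual CLI. The names `write_report` and `build_parser` and the `--output` option are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def write_report(output_path: Path, content: str, force: bool = False) -> None:
    """Write content to output_path, refusing to overwrite unless force is set."""
    if output_path.exists() and not force:
        raise FileExistsError(
            f"{output_path} already exists; pass --force to overwrite it"
        )
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(content, encoding="utf-8")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI: positional project_id, --output path, optional --force."""
    parser = argparse.ArgumentParser(description="Generate a transcript report")
    parser.add_argument("project_id")
    parser.add_argument("--output", type=Path, required=True)
    parser.add_argument(
        "--force", action="store_true", help="overwrite existing output files"
    )
    return parser
```

Raising `FileExistsError` keeps the write helper usable outside the CLI; the entry point can catch it and report the message through `parser.error(...)`.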
```python
    args = parser.parse_args()

    project_id_arg = cast(str, args.project_id)
```
Missing validation for CLI arguments: The CLI doesn't validate the project_id format (should be a valid ULID). If an invalid project_id is provided, it will fail later with potentially confusing error messages. Consider validating the project_id format early and providing clear error messages.
Suggested change:

```python
    project_id_arg = cast(str, args.project_id)
    # Validate that the provided project_id is a well-formed ULID before proceeding.
    try:
        # This will raise ValueError if the format is invalid.
        _ = ulid.ULID.from_str(project_id_arg.upper())  # pyright: ignore[reportAny]
    except ValueError as exc:
        parser.error(f"Invalid project_id '{project_id_arg}': {exc}")
```
```python
    doc = SimpleDocTemplate(
        pdf_file, pagesize=letter, topMargin=0.75 * inch, bottomMargin=0.75 * inch
```
Magic number without explanation: The value 0.75 for top and bottom margins is not explained. Consider extracting this to a named constant at the module level with a descriptive name, such as DEFAULT_PAGE_MARGIN_INCHES = 0.75.
```python
    total_length_duration = extract_timestamp_from_ulid(
        rows[-1]["transcription_id"]  # pyright: ignore[reportAny]
    ) - extract_timestamp_from_ulid(rows[0]["transcription_id"])  # pyright: ignore[reportAny]
    humanh_readable_length = " ".join(
        human_readable(
            relativedelta(seconds=int(total_length_duration.total_seconds()))
        )
    )
```
Incorrect duration calculation: The total_length_duration is calculated as the difference between the last and first transcription timestamps. However, this only measures the time span from start to finish, not the actual accumulated duration of all transcriptions. If there were gaps in recording (e.g., pauses), this would incorrectly report the total time including those gaps. Consider clarifying the metric name to "Time Span" or "Session Duration" instead of "Total Time" to avoid confusion.
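To make the distinction concrete, here is a sketch of what the subtraction measures. `ulid_timestamp` and `time_span` are hypothetical stand-ins; the PR's own `extract_timestamp_from_ulid` is not shown in the diff, so its behavior is assumed. The decoding relies on the ULID layout: the first 10 characters are a Crockford base32 encoding of a 48-bit Unix timestamp in milliseconds.

```python
from datetime import datetime, timedelta, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, or U).
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ulid_timestamp(ulid_str: str) -> datetime:
    """Decode the millisecond timestamp held in the first 10 characters of a ULID."""
    millis = 0
    for char in ulid_str[:10].upper():
        millis = millis * 32 + CROCKFORD_ALPHABET.index(char)
    return UNIX_EPOCH + timedelta(milliseconds=millis)


def time_span(transcription_ids: list[str]) -> timedelta:
    """Wall-clock span from the first to the last ID (expects sorted IDs).

    This includes any pauses between utterances, so it is a session span,
    not the summed speaking time.
    """
    return ulid_timestamp(transcription_ids[-1]) - ulid_timestamp(transcription_ids[0])
```

Because only creation timestamps are available, an accumulated speaking time cannot be derived from the IDs alone, which supports labeling the value as a time span.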
```python
    total_length_duration = extract_timestamp_from_ulid(
        rows[-1]["transcription_id"]  # pyright: ignore[reportAny]
    ) - extract_timestamp_from_ulid(rows[0]["transcription_id"])  # pyright: ignore[reportAny]
    humanh_readable_length = " ".join(
```
Typo in variable name: 'humanh_readable_length' should be 'human_readable_length'. The variable name has an extra 'h'.
```python
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    cursor = conn.cursor()
```
Inconsistent database access pattern: This script uses raw sqlite3 connections directly, while the rest of the codebase (e.g., database.py) uses SQLAlchemy with the PersistentDatabase abstraction. This creates inconsistency in how the database is accessed and makes the code harder to maintain. Consider using the existing PersistentDatabase class and SQLAlchemy for consistency.
```python
        and current_speaker is not None
        and current_timestamp is not None
    ):
        timestamp_str = current_timestamp.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
```
Inconsistent date formatting: The code uses strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] to format timestamps in multiple places. This format string is repeated throughout the code. Consider extracting this to a constant or helper function for consistency and maintainability.
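A small sketch of the suggested extraction. The names `TIMESTAMP_FORMAT` and `format_timestamp` are hypothetical.

```python
from datetime import datetime

# Single source of truth for the report's timestamp layout.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def format_timestamp(value: datetime) -> str:
    """Format a datetime with millisecond precision.

    %f always yields six microsecond digits, so dropping the last three
    characters truncates (not rounds) to milliseconds.
    """
    return value.strftime(TIMESTAMP_FORMAT)[:-3]
```

Each call site would then read `timestamp_str = format_timestamp(current_timestamp)`, so a later change to the format touches one place only.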