
Re-branching after screwing up last PR. #4

Open
jreakin wants to merge 26 commits into main from jan25-polars

Conversation


@jreakin jreakin commented Jan 18, 2025

This is just the start of a new PR after I messed up the last one.

Summary by CodeRabbit

  • New Features

    • Parquet read/write support, Ohio downloader, Texas search workflow, terminal live progress tracker, unified PostgreSQL-backed data manager, and a production batch loader.
  • Documentation

    • README expanded with package overview, examples, and built-in data download notes.
  • Refactor

    • Dataflow moved to dataframe-first processing; Texas schemas, validators, and models reorganized for modularity.
  • Chores

    • VCS ignore/LFS rules, IDE and CI configs, and dependency metadata updated.
  • Tests

    • Added hypothesis-driven ingestion and consolidation tests.


Copilot AI review requested due to automatic review settings November 9, 2025 02:17

Copilot AI left a comment


Pull Request Overview

This PR adds a unified data model system for campaign finance data across multiple states, incorporating PostgreSQL support with SQLModel/SQLAlchemy ORM. The changes include dependency updates, new infrastructure for data loading and processing, and significant refactoring of Texas state validators.

Key changes:

  • Introduced unified SQLModel-based data models for cross-state campaign finance data normalization
  • Added production-ready data loader with batch processing and deduplication capabilities
  • Updated dependencies including pyarrow, selenium, icecream, and added web-scraping utilities
  • Refactored Texas validators to use base models and reduce code duplication
  • Added database management utilities for PostgreSQL operations
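
The cross-state normalization idea described above can be sketched with plain dataclasses. This is a simplified, hypothetical stand-in for the PR's SQLModel tables; the field names and mappings here are illustrative, not the PR's actual schema:

```python
from dataclasses import dataclass

@dataclass
class UnifiedTransaction:
    """Simplified stand-in for the SQLModel-backed unified table."""
    state: str
    amount: float
    transaction_date: str

# Hypothetical per-state field mappings into the unified shape.
STATE_FIELD_MAP = {
    "TX": {"amount": "expendAmount", "transaction_date": "expendDt"},
    "OH": {"amount": "AMOUNT", "transaction_date": "TRANS_DATE"},
}

def to_unified(state: str, record: dict) -> UnifiedTransaction:
    """Map a raw state-specific record onto the unified model."""
    mapping = STATE_FIELD_MAP[state]
    return UnifiedTransaction(
        state=state,
        amount=float(record[mapping["amount"]]),
        transaction_date=record[mapping["transaction_date"]],
    )
```

The real implementation adds an ORM layer, versioning, and associations on top of this mapping step.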

Reviewed Changes

Copilot reviewed 51 out of 64 changed files in this pull request and generated 35 comments.

Summary per file
File Description
uv.lock Updated package versions and added new dependencies (pyarrow, web-scrape-utils, websockets, undetected-chromedriver)
pyproject.toml Added pyarrow and web-scrape-utils as dependencies
app/states/unified_sqlmodels.py New comprehensive unified data model definitions with SQLModel/SQLAlchemy
app/states/unified_database.py New database manager for PostgreSQL operations
production_loader.py New production data loader with batch processing
recreate_tables.py Utility script for recreating database tables
app/states/texas/validators/*.py Refactored validators to use base models and reduce duplication
app/states/texas/texas_downloader.py Updated to use web-scraping utilities and polars
app/states/texas/texas_search.py Refactored search functionality with polars DataFrames
tmp/texas/*.parquet Binary data files (21 parquet files added)
tmp/.DS_Store macOS system file added
Files not reviewed (4)
  • .idea/campaignfinance.iml: Language not supported
  • .idea/csv-editor.xml: Language not supported
  • .idea/misc.xml: Language not supported
  • .idea/ruff.xml: Language not supported


Comment on lines +60 to +65
# class TECFilerCTALinkModel(TECBaseModel, table=True):
#     __tablename__ = "tx_filer_cta_link"
#     __table_args__ = {"schema": "texas"}
#     filerIdent: int = Field(foreign_key="texas.tx_filer_name.filerIdent", primary_key=True)
#     ctaSeekOfficeCd: Optional[str] = Field(foreign_key="texas.tx_cta.id", primary_key=True)


Copilot AI Nov 9, 2025


This comment appears to contain commented-out code.

Suggested change: delete these commented-out lines.

Comment on lines +89 to +96
# class TECFilerTreasurerLinkModel(TECBaseModel, table=True):
#     __tablename__ = "tx_filer_treasurer_link"
#     __table_args__ = {"schema": "texas"}
#     filer_name_id: int = Field(foreign_key="texas.tx_filer_name.filerIdent", primary_key=True)
#     treasurer_id: Optional[str] = Field(foreign_key="texas.tx_treasurer.id", primary_key=True)
#     assttreas_id: Optional[str] = Field(foreign_key="texas.tx_assistant_treasurer.id", primary_key=True)
#     chair_id: Optional[str] = Field(foreign_key="texas.tx_chair.id", primary_key=True)


Copilot AI Nov 9, 2025


This comment appears to contain commented-out code.

Suggested change: delete these commented-out lines.

Comment on lines +65 to +70
# @model_validator(mode='before')
# @classmethod
# def store_raw(cls, values):
#     values['raw_input'] = {k: v for k, v in values.items()}
#     return values


Copilot AI Nov 9, 2025


This comment appears to contain commented-out code.

Suggested change: delete these commented-out lines.

Comment on lines +53 to +57
# def __init__(self, **data):
#     _flags = TECFlags(**data)
#     if len(_flags.model_dump(exclude_none=True).values()) > 0:
#         self.flags = _flags
#     super().__init__(**data)

Copilot AI Nov 9, 2025


This comment appears to contain commented-out code.

unlinked_transactions = session.exec(
    select(UnifiedTransactionPerson).where(
        UnifiedTransactionPerson.person_id == cp.person_id,
        UnifiedTransactionPerson.committee_person_id == None

Copilot AI Nov 9, 2025


Testing for None should use the 'is' operator.

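
For SQLAlchemy/SQLModel column expressions, the idiomatic explicit form is `.is_(None)` rather than `== None` (the latter does compile to `IS NULL`, but trips E711-style lints like the one above). A minimal sketch with a stand-in table, assuming SQLAlchemy 1.4+:

```python
from sqlalchemy import Column, Integer, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TransactionPerson(Base):  # minimal stand-in for UnifiedTransactionPerson
    __tablename__ = "transaction_person"
    id = Column(Integer, primary_key=True)
    committee_person_id = Column(Integer, nullable=True)

# Explicit NULL test: compiles to "... WHERE committee_person_id IS NULL"
stmt = select(TransactionPerson).where(
    TransactionPerson.committee_person_id.is_(None)
)
```

Note that a plain Python `is None` would not work here: it would compare the column object itself, not build a SQL expression.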
if original_address is not ensured_address and original_address is not None:
    try:
        session.expunge(original_address)
    except InvalidRequestError:

Copilot AI Nov 9, 2025


'except' clause does nothing but pass and there is no explanatory comment.

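
One way to address this class of finding is to keep the intentional ignore but make it observable. A hedged sketch (the exception class is a stand-in for `sqlalchemy.exc.InvalidRequestError`; `safe_expunge` is an illustrative helper, not the PR's API):

```python
import logging

logger = logging.getLogger(__name__)

class NotAttachedError(Exception):
    """Stand-in for sqlalchemy.exc.InvalidRequestError in this sketch."""

def safe_expunge(session, obj) -> None:
    """Detach obj from the session, tolerating objects that were never attached."""
    try:
        session.expunge(obj)
    except NotAttachedError as exc:
        # The object is not in this session (already expunged or never added).
        # Ignoring is intentional, but logging at debug level keeps the
        # silent path observable and documents why the pass is safe.
        logger.debug("expunge skipped: %s", exc)
```
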
if original_address is not ensured_address and original_address is not None:
    try:
        session.expunge(original_address)
    except InvalidRequestError:

Copilot AI Nov 9, 2025


'except' clause does nothing but pass and there is no explanatory comment.

if original_person is not ensured_person:
    try:
        session.expunge(original_person)
    except InvalidRequestError:

Copilot AI Nov 9, 2025


'except' clause does nothing but pass and there is no explanatory comment.

result = self.choose_options(result)
if not result.selected_choices:
    return result
result.selected_choices = result.selected_choices

Copilot AI Nov 9, 2025


This assignment assigns a variable to itself.

Suggested change: delete this self-assignment.

}
if type_str in type_synonyms:
    return type_synonyms[type_str]
for transaction_type in TransactionType:

Copilot AI Nov 9, 2025


This for-loop may attempt to iterate over a non-iterable instance of class type.

- Modified app/funcs/__init__.py and csv_reader.py
- Updated Texas state modules (downloader, search, validators)
- Updated unified_database.py
- Updated test_validation_class.py
- Updated pyproject.toml and uv.lock
- Removed parquet files from git tracking (now in .gitignore)

coderabbitai bot commented Dec 4, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Parquet/TXT readers and Polars-based data flows, a Rich-based ProgressTracker, Texas downloader/search/consolidation refactors, an Ohio downloader scaffold, a unified SQLModel schema and DB manager, a production batch ingestion pipeline with deduplication, extensive validator/model restructures, and project/IDE/dependency updates.

Changes

Cohort / File(s) Summary
Repo config & IDE
.gitattributes, .gitignore, pyproject.toml, .idea/*
Replace LFS patterns (add *.parquet, remove *.zip/*.csv), reintroduce *.parquet and /tmp in .gitignore, add runtime/dev deps (psycopg2-binary, web-scrape-utils, pyarrow, hypothesis, pytest), update IDE SDK/exclusions and add Ruff config.
Progress UI
app/live_display.py
New ProgressTracker class: thread-safe Rich Live terminal task table with start/stop, add/update/remove tasks, background refresh and print API.
ABCs & download helpers
app/abcs/*
app/abcs/__init__.py, app/abcs/abc_download.py, app/abcs/abc_category.py, app/abcs/abc_db_loader.py, app/abcs/abc_validation.py, app/abcs/abc_validation_errors.py
Migrate imports to app.*, expose module-level progress (started at import), convert FileDownloaderABC to use ClassVar driver and classmethod interfaces, add ZIP extraction and CSV→Parquet helpers, and add headless control helper.
File readers & validators
app/funcs/*
app/funcs/__init__.py, app/funcs/csv_reader.py, app/funcs/validator_functions.py, app/funcs/db_loader.py
Switch to package-scoped imports; add Parquet and TXT readers (lazy + fallback), per-file try/except handling, metadata wrappers (file_origin/download_date), and new check_contains_factory validator factory.
Texas downloader, consolidation & search
app/states/texas/*
app/states/texas/texas_downloader.py, app/states/texas/texas_search.py, app/states/texas/__init__.py
Refactor TECDownloader to classmethod/shared-driver flow with headless control, safe cleanup, ZIP extraction and CSV→Parquet conversion, per-category consolidation (align columns, stack, write consolidated parquet), and add Polars-based TexasSearch pipeline and types; update imports and validator references.
Texas models & validators (large restructure)
app/states/texas/models.py, app/states/texas/validators/*
Add TECBaseModel and check_contains_factory; add flags models; refactor person/address/filer/expense models into base vs table variants, add link models and FK updates, consolidate payee handling, and change exported validator symbols.
Ohio downloader
app/states/ohio/ohio_downloader.py
New OhioDownloader with entry URL and initial driver-based FTP discovery/download scaffold; placeholders for consolidate/read.
Unified schema & DB manager
app/states/unified_sqlmodels.py, app/states/unified_database.py, recreate_tables.py
New unified SQLModel layer (many Unified* entities, versions, associations), builder/processor for mapping state input to unified models, and UnifiedDatabaseManager with engine/session management, ingest/query APIs and recreate utility.
Production ingestion
production_loader.py
New ProductionLoader with LoaderConfig/LoaderStats: batch processing, in-memory caches, deduplication (_dedupe_addresses, _dedupe_persons_and_entities), upserts, DB interactions, progress UI and summary display.
Main, docs & logs
app/main.py, README.md, app/logs/*
app/main.py converted to a Polars/pandas dataframe-centric processing path; README updated; new log file added.
Tests & housekeeping
app/tests/*, removed app/workflows/texas_download.py
Add Hypothesis-driven ingestion/consolidation tests and update validation tests for injected validator style; remove a commented example workflow file.
Field library & unified helpers
app/states/unified_field_library.py
New UnifiedFieldLibrary with FieldDefinition/StateFieldMapping, category/type enums, baked-in state mappings, and JSON import/export utilities.
Misc import path updates
app/states/oklahoma/*, many state validators and modules
Numerous relative/local imports converted to absolute app.* package imports across state modules and validators; behavior unchanged beyond import resolution.
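
The key-based deduplication described for production_loader.py can be sketched as an in-memory cache keyed on normalized fields. This is a hypothetical illustration (field names and normalization rules are assumptions, not the loader's actual `_dedupe_addresses` logic):

```python
def dedupe_addresses(records: list[dict]) -> list[dict]:
    """Collapse records that normalize to the same (street, zip5) key."""
    seen: dict[tuple, dict] = {}
    for rec in records:
        # Normalize: trim and upper-case the street, keep the 5-digit ZIP.
        key = (rec["street"].strip().upper(), rec["zip"][:5])
        seen.setdefault(key, rec)  # first occurrence wins
    return list(seen.values())
```

The real loader additionally rewires foreign-key references from the dropped duplicates to the surviving row, which is where most of the review risk lies.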

Sequence Diagram(s)

sequenceDiagram
    actor CLI as Client/CLI
    participant Loader as ProductionLoader
    participant Reader as GenericFileReader
    participant Processor as UnifiedSQLDataProcessor
    participant DB as UnifiedDatabaseManager
    participant Session as DB Session

    CLI->>Loader: load_file(file_path, state)
    Loader->>DB: load_existing_data(session)
    Loader->>Reader: _get_reader(state)
    Reader-->>Loader: reader

    loop per batch
        Loader->>Reader: read batch
        Reader-->>Loader: records
        Loader->>Processor: process_records(records, state)
        Processor-->>Loader: UnifiedTransaction objects
        Loader->>Loader: dedupe/cache & manage associations
        Loader->>DB: save_transactions(transactions)
        DB->>Session: persist & commit
        Session-->>DB: commit result
        Loader->>Loader: update progress
    end

    Loader->>Loader: _dedupe_persons_and_entities(session)
    Loader->>CLI: display_results()
sequenceDiagram
    actor User as Client
    participant TEC as TECDownloader
    participant Driver as CreateWebDriver
    participant Progress as ProgressTracker
    participant Extract as Zip/Parquet Processor
    participant FS as Filesystem

    User->>TEC: download(read_from_temp, headless)
    TEC->>Progress: add_task("cleanup")
    TEC->>FS: safe temp cleanup
    Progress->>Progress: update_task("cleanup","completed")

    TEC->>Driver: CreateWebDriver(...)
    Driver-->>TEC: driver instance
    TEC->>Progress: add_task("download")
    TEC->>Driver: navigate & click to download
    Driver-->>FS: save zip
    TEC->>Progress: update_task("download","completed")

    TEC->>Extract: extract_zipfile(zip, tmp)
    Extract->>Extract: _process_csv -> convert CSV→Parquet
    Extract-->>FS: write parquet files
    TEC->>TEC: consolidate_files() (align columns & stack)
    TEC->>Progress: stop()

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

  • Areas needing focused review:
    • app/states/unified_sqlmodels.py and app/states/unified_database.py — FK correctness, indexes, versioning, builder/processor mapping and JSON/raw storage.
    • production_loader.py — batch atomicity, deduplication correctness, reference rewiring, caching and error handling.
    • app/states/texas/texas_downloader.py and app/funcs/csv_reader.py — Parquet consolidation/read/write, column alignment, lazy-read fallbacks, and metadata handling.
    • app/states/texas/validators and models — model/table separation, Before/field validators, FK string targets, and link models.
    • app/abcs/abc_download.py — module-level side effects (ProgressTracker.start()) and shared driver lifecycle as ClassVar.
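
The "align columns & stack" consolidation step flagged above can be sketched in plain Python; Polars offers comparable behavior via `pl.concat(..., how="diagonal")`. The helper below is an illustrative sketch over list-of-dict rows, not the PR's code:

```python
def align_and_stack(tables: list[list[dict]]) -> list[dict]:
    """Pad each row out to the union of all column names, then stack the rows."""
    columns = sorted({col for rows in tables for row in rows for col in row})
    return [
        {col: row.get(col) for col in columns}  # missing columns become None
        for rows in tables
        for row in rows
    ]
```

Reviewing the Parquet consolidation mostly means checking that this alignment preserves dtypes and that null-filling missing columns is acceptable downstream.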

Poem

🐇
I hopped through schemas, columns neat and bright,
Parquets stacked softly in the cool night light.
Texas names and Ohio's FTP file,
Loaders and trackers marched on for a while.
A rabbit applauds — pipelines humming with style.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.91% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title is vague and self-referential, using informal language ('screwing up') that doesn't describe the actual technical changes in the changeset. Replace with a specific, professional title that describes the primary changes, such as 'Migrate to Polars-based data processing and unified ORM models' or similar.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95b4304 and eaedb1d.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (1 hunks)
🧰 Additional context used
🪛 GitHub Check: CodeQL
.github/workflows/ci.yml

[warning] 9-40: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
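
A minimal permissions block resolving this warning might look like the following, placed at the top level of the workflow file (widen the scopes only if the jobs actually need them):

```yaml
# .github/workflows/ci.yml
permissions:
  contents: read
```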

🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml

[warning] 3-3: truthy value should be one of [false, true]

(truthy)


[error] 41-41: too many blank lines (1 > 0)

(empty-lines)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test
  • GitHub Check: test

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 81

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
.idea/misc.xml (1)

3-9: Inconsistent SDK references between components.

The Black component (line 7) still references the old "Poetry (campaignfinance) (2)" SDK, while the ProjectRootManager (line 9) has been updated to "Python 3.12 UV (campaignfinance)". This inconsistency may cause Black to fail or use the wrong Python environment.

Apply this diff to align the Black SDK reference:

   <component name="Black">
     <option name="enabledOnReformat" value="true" />
     <option name="executionMode" value="BINARY" />
     <option name="pathToExecutable" value="/opt/homebrew/bin/black" />
-    <option name="sdkName" value="Poetry (campaignfinance) (2)" />
+    <option name="sdkName" value="Python 3.12 UV (campaignfinance)" />
   </component>
app/tests/test_validation_class.py (1)

3-25: Tests are incompatible with the StateFileValidation API.

Several issues will cause these tests to fail:

  1. StateFileValidation is an abstract class (abc.ABC) and cannot be instantiated directly
  2. validate_record(record) takes only one argument, not (record, MockModel)
  3. validate(records) takes only one argument, not (records, MockModel)
  4. The validator_to_use should be a class attribute, not passed to methods

Based on app/abcs/abc_validation.py, the tests need to create a concrete subclass:

-@given(st.dictionaries(st.text(), st.text()))
-def test_validate_record(record):
-    validator = StateFileValidation()
-    result = validator.validate_record(record, MockModel)
+class TestValidator(StateFileValidation):
+    validator_to_use = MockModel
+
+@given(st.dictionaries(st.text(), st.text()))
+def test_validate_record(record):
+    validator = TestValidator()
+    result = validator.validate_record(record)
     assert result[0] in ['passed', 'failed']
app/states/texas/validators/texas_filers.py (1)

186-638: Large block of commented-out legacy code.

Lines 186-638 contain extensive commented-out code. If this code serves as a reference during refactoring, consider documenting the migration plan or removing it entirely, relying on version control for history.

app/states/texas/validators/texas_address.py (1)

83-86: Critical logic bug: condition is always True.

The expression if "PO" or "PO BOX" in _address: is always truthy because "PO" (a non-empty string) is truthy by itself. This means the PO BOX parsing logic runs for every address, not just PO boxes.

-        if "PO" or "PO BOX" in _address:
+        if "PO" in _address or "PO BOX" in _address:
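
The always-truthy behavior is easy to demonstrate; this is a short illustration of the operator-precedence pitfall, not project code:

```python
addr = "123 MAIN ST"

# Parsed as ("PO") or ("PO BOX" in addr): the left operand is a non-empty
# string, so the whole expression is truthy for every address.
buggy = bool("PO" or "PO BOX" in addr)
assert buggy is True

# Correct membership tests:
assert ("PO" in addr or "PO BOX" in addr) is False
assert any(token in "PO BOX 12" for token in ("PO", "PO BOX")) is True
```
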
app/states/texas/validators/texas_expenses.py (1)

235-246: Incorrect field name in error message.

The validator validates expendDescr and expendDt, but the error message on line 243 incorrectly references 'filerName'. This will confuse users when validation fails.

     @field_validator('expendDescr', 'expendDt', mode='before')
     @classmethod
     def validate_required_fields(cls, v):
         if v == "" or v is None:
             raise PydanticCustomError(
                 'missing_required_value',
                 "Field is required",
                 {
-                    'column': 'filerName',
+                    'column': 'expendDescr/expendDt',
                     'value': v
                 }
             )
         return v

Alternatively, use Pydantic's validation info to get the actual field name dynamically.

app/states/texas/validators/texas_personname.py (2)

74-75: Potential AttributeError if no key contains 'NamePrefixCd'.

If _key_pfx is None, calling .split() on it will raise an AttributeError. Add a guard or handle the None case.

     @model_validator(mode='before')
     @classmethod
     def set_type_prefix(cls, values):
         _key_pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
+        if not _key_pfx:
+            values['field_pfx'] = None
+            return values
         values['field_pfx'] = _key_pfx.split('NamePrefixCd')[0]
         return values

92-94: Same potential AttributeError for _pfx in validate_address.

If no key contains 'NamePrefixCd', _pfx will be None, and line 94 will fail when calling .split().

         _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
-        if _pfx:
-            _pfx = _pfx.split('NamePrefixCd')[0]
+        if not _pfx:
+            return values
+        _pfx = _pfx.split('NamePrefixCd')[0]
♻️ Duplicate comments (20)
app/main.py (2)

4-5: Unused import re.

The re module is imported but not used anywhere in the active code.


30-32: Unused imports: Path and contextmanager.

Path and contextmanager are imported but not used in the active code. The polars import (pl) is used and should be kept.

app/states/texas/validators/texas_filers.py (2)

60-65: Commented-out code.

This block contains commented-out code that should be removed or documented.


89-96: Commented-out code.

This block contains commented-out code that should be removed or documented.

production_loader.py (4)

602-605: Empty except clause silently swallows InvalidRequestError.

The except InvalidRequestError: pass silently ignores the error without logging. Add a comment explaining why this is expected, or log at debug level.


628-631: Empty except clause silently swallows InvalidRequestError.


775-778: Empty except clause silently swallows InvalidRequestError.


796-799: Empty except clause silently swallows InvalidRequestError.

app/states/unified_database.py (3)

1078-1085: Use is None for None comparisons.

Testing for None should use the is operator rather than ==.

This was flagged in a previous review. Apply:

-                        UnifiedTransactionPerson.committee_person_id == None
+                        UnifiedTransactionPerson.committee_person_id.is_(None)

1176-1184: Use is None for None comparisons.

Testing for None should use the is operator rather than ==.

This was flagged in a previous review. Apply:

-                        UnifiedTransactionPerson.committee_person_id == None
+                        UnifiedTransactionPerson.committee_person_id.is_(None)

16-22: Unused import: PersonType.

PersonType is imported but not used in this file.

 from .unified_sqlmodels import (
     UnifiedTransaction, UnifiedPerson, UnifiedAddress, UnifiedCommittee,
-    UnifiedTransactionPerson, PersonRole, TransactionType, PersonType,
+    UnifiedTransactionPerson, PersonRole, TransactionType,
     unified_sql_processor, UnifiedTransactionVersion,
     UnifiedPersonVersion, UnifiedCommitteeVersion, UnifiedAddressVersion,
     UnifiedCommitteePerson, UnifiedCommitteePersonVersion, CommitteeRole, State
 )
app/states/texas/validators/texas_settings.py (1)

51-57: Remove commented-out code.

This commented-out code for flags handling should be removed. If it's intended for future implementation, track it in an issue instead.

app/states/ohio/ohio_downloader.py (1)

2-4: Unused imports detected.

CategoryTypes, RecordGen, CategoryConfig, progress, and CreateWebDriver are imported but not used in this file.

Remove unused imports:

-from abcs import (
-    FileDownloaderABC, StateConfig, CategoryTypes, RecordGen, CategoryConfig, progress, CSVReaderConfig)
-from web_scrape_utils import CreateWebDriver, By
+from abcs import FileDownloaderABC, StateConfig, CSVReaderConfig
+from web_scrape_utils import By
app/states/texas/models.py (1)

4-9: Multiple unused imports.

BaseModel, ConfigDict, model_validator (line 4), SQLModel (line 5), tx_funcs (line 7), and RecordKeyGenerator (line 9) are imported but not used.

app/states/texas/texas_downloader.py (2)

2-18: Remove unused imports.

Optional (line 2) and CreateWebDriver (line 18) are imported but never used in the code.

-from typing import Self, ClassVar, Optional
+from typing import Self, ClassVar
-from web_scrape_utils import CreateWebDriver, By
+from web_scrape_utils import By

103-103: Unused variables _file_count.

_file_count is assigned but never used on both lines 103 and 122.

-        _file_count = len(list(_folder.glob("*.parquet")))
-        _file_count = len(_file_types)

Also applies to: 122-122

app/states/texas/validators/texas_personname.py (2)

2-4: Remove unused imports.

Annotated (line 2), field_validator, create_model, AliasChoices (line 4), and re (line 12) are imported but never used.

-from typing import Optional, Annotated
+from typing import Optional
-from pydantic import field_validator, model_validator, BeforeValidator, create_model, AliasChoices
+from pydantic import model_validator, BeforeValidator
-import re

Also applies to: 12-12


65-70: Remove commented-out code.

Commented-out code should be removed to improve readability. If it's needed for reference, it should be tracked in version control or documented elsewhere.

app/states/texas/texas_search.py (2)

2-2: Use modern type hints and remove unused imports.

Type, Self, and Tuple are imported but never used. Also, List and Dict are deprecated in favor of built-in list and dict.

-from typing import Optional, List, Dict, Any, Type, NamedTuple, Self, Tuple
+from typing import Optional, Any, NamedTuple

Then update usages:

  • List[str]list[str]
  • Dict[str, List[str]]dict[str, list[str]]

258-258: Self-assignment has no effect.

result.selected_choices = result.selected_choices is a no-op.

-        result.selected_choices = result.selected_choices
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33af3b3 and 2b78c27.

⛔ Files ignored due to path filters (6)
  • .DS_Store is excluded by !**/.DS_Store
  • .idea/.DS_Store is excluded by !**/.DS_Store
  • app/.DS_Store is excluded by !**/.DS_Store
  • poetry.lock is excluded by !**/*.lock
  • tmp/.DS_Store is excluded by !**/.DS_Store
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • .gitattributes (1 hunks)
  • .gitignore (2 hunks)
  • .idea/campaignfinance.iml (1 hunks)
  • .idea/csv-editor.xml (0 hunks)
  • .idea/misc.xml (1 hunks)
  • .idea/ruff.xml (1 hunks)
  • README.md (1 hunks)
  • app/abcs/__init__.py (1 hunks)
  • app/abcs/abc_download.py (2 hunks)
  • app/abcs/abc_state_config.py (1 hunks)
  • app/abcs/abc_validation_errors.py (1 hunks)
  • app/funcs/__init__.py (1 hunks)
  • app/funcs/csv_reader.py (2 hunks)
  • app/funcs/validator_functions.py (1 hunks)
  • app/live_display.py (1 hunks)
  • app/logs/campaign_finance.log.2025-01-31 (1 hunks)
  • app/main.py (3 hunks)
  • app/states/ohio/ohio_downloader.py (1 hunks)
  • app/states/texas/__init__.py (2 hunks)
  • app/states/texas/models.py (1 hunks)
  • app/states/texas/texas_downloader.py (2 hunks)
  • app/states/texas/texas_search.py (1 hunks)
  • app/states/texas/validators/__init__.py (1 hunks)
  • app/states/texas/validators/texas_address.py (4 hunks)
  • app/states/texas/validators/texas_expenses.py (4 hunks)
  • app/states/texas/validators/texas_filers.py (5 hunks)
  • app/states/texas/validators/texas_flags.py (1 hunks)
  • app/states/texas/validators/texas_personname.py (2 hunks)
  • app/states/texas/validators/texas_settings.py (3 hunks)
  • app/states/unified_database.py (1 hunks)
  • app/states/unified_sqlmodels.py (1 hunks)
  • app/tests/test_validation_class.py (1 hunks)
  • app/workflows/texas_download.py (0 hunks)
  • production_loader.py (1 hunks)
  • pyproject.toml (1 hunks)
  • recreate_tables.py (1 hunks)
💤 Files with no reviewable changes (2)
  • .idea/csv-editor.xml
  • app/workflows/texas_download.py
🧰 Additional context used
🧬 Code graph analysis (19)
recreate_tables.py (1)
app/op.py (2)
  • database (91-92)
  • database_url (111-122)
app/abcs/abc_state_config.py (1)
app/funcs/toml_reader.py (1)
  • read_toml (6-8)
app/tests/test_validation_class.py (1)
app/abcs/abc_validation.py (1)
  • StateFileValidation (24-85)
app/main.py (2)
app/states/texas/texas_search.py (1)
  • TexasSearch (165-294)
app/states/texas/texas_downloader.py (2)
  • download (30-98)
  • dataframes (277-279)
app/abcs/__init__.py (1)
app/abcs/abc_download.py (1)
  • FileDownloaderABC (24-142)
app/states/texas/validators/texas_flags.py (3)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (2)
  • TECBaseModel (44-57)
  • check_contains_factory (15-20)
app/funcs/validator_functions.py (1)
  • check_contains_factory (166-171)
app/states/texas/__init__.py (2)
app/states/texas/texas_search.py (1)
  • TexasSearch (165-294)
app/states/texas/validators/texas_filers.py (1)
  • TECFilerName (126-184)
app/funcs/validator_functions.py (1)
app/states/texas/validators/texas_settings.py (2)
  • check_contains_factory (15-20)
  • check_contains (16-19)
app/states/texas/validators/texas_personname.py (3)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (2)
  • TECBaseModel (44-57)
  • check_contains_factory (15-20)
app/states/texas/validators/texas_address.py (1)
  • TECAddressBase (20-142)
app/states/ohio/ohio_downloader.py (2)
app/abcs/abc_download.py (4)
  • FileDownloaderABC (24-142)
  • download (109-110)
  • consolidate_files (114-115)
  • read (119-120)
app/abcs/abc_state_config.py (4)
  • StateConfig (140-180)
  • CategoryTypes (110-136)
  • CategoryConfig (65-106)
  • CSVReaderConfig (40-42)
app/states/texas/validators/texas_settings.py (2)
app/funcs/validator_functions.py (2)
  • check_contains_factory (166-171)
  • check_contains (167-170)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/funcs/csv_reader.py (3)
app/abcs/abc_validation.py (1)
  • logger (32-34)
app/abcs/abc_category.py (1)
  • logger (99-101)
app/logger.py (1)
  • Logger (10-71)
app/states/texas/validators/texas_address.py (2)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (1)
  • TECBaseModel (44-57)
app/states/texas/validators/__init__.py (4)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_filers.py (1)
  • TECFilerName (126-184)
app/states/texas/validators/texas_contributions.py (1)
  • TECContribution (10-294)
app/states/texas/validators/texas_address.py (1)
  • TECAddressBase (20-142)
app/funcs/__init__.py (4)
app/funcs/csv_reader.py (1)
  • FileReader (54-144)
app/funcs/toml_reader.py (1)
  • read_toml (6-8)
app/funcs/file_exporters.py (1)
  • write_records_to_csv_validation (8-38)
app/funcs/depreciated.py (1)
  • deprecated (5-17)
app/abcs/abc_download.py (4)
app/abcs/abc_state_config.py (3)
  • StateConfig (140-180)
  • CategoryTypes (110-136)
  • TEMP_FOLDER (149-150)
app/live_display.py (4)
  • ProgressTracker (9-106)
  • start (21-25)
  • add_task (34-44)
  • update_task (46-52)
app/states/texas/texas_downloader.py (3)
  • download (30-98)
  • consolidate_files (120-206)
  • read (209-274)
app/abcs/abc_category.py (1)
  • read (103-125)
app/states/texas/validators/texas_expenses.py (2)
app/states/texas/validators/texas_personname.py (1)
  • TECPersonNameBase (15-107)
app/states/texas/validators/texas_settings.py (1)
  • TECSettings (22-41)
app/states/unified_sqlmodels.py (2)
app/op.py (1)
  • role (107-108)
app/states/unified_database.py (1)
  • get_session (77-79)
production_loader.py (5)
app/states/unified_sqlmodels.py (2)
  • State (101-119)
  • process_record (1321-1444)
app/abcs/abc_validation.py (1)
  • failed_records (62-65)
app/logger.py (3)
  • error (61-62)
  • warning (58-59)
  • info (52-53)
app/states/unified_database.py (1)
  • get_session (77-79)
app/live_display.py (1)
  • add_task (34-44)
🪛 LanguageTool
README.md

[style] ~15-~15: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1026 characters long)
Context: ...r-the-badge&logo=python&logoColor=blue) Pydantic Pandas ![Selenium](https://img.shields.io/badge/...

(EN_EXCESSIVE_EXCLAMATION)

🪛 markdownlint-cli2 (0.18.1)
README.md

3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


7-7: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


10-10: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


14-14: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


18-18: Files should end with a single newline character

(MD047, single-trailing-newline)

🪛 Ruff (0.14.7)
recreate_tables.py

1-1: Shebang is present but file is not executable

(EXE001)


12-12: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


30-30: Do not catch blind exception: Exception

(BLE001)


65-65: Do not catch blind exception: Exception

(BLE001)


70-70: Do not catch blind exception: Exception

(BLE001)


81-81: Do not catch blind exception: Exception

(BLE001)


109-109: Do not catch blind exception: Exception

(BLE001)

app/states/texas/models.py

5-5: Redefinition of unused Field from line 1

(F811)


13-13: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/states/texas/validators/texas_flags.py

10-10: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/states/texas/texas_downloader.py

25-25: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


32-32: Boolean-typed positional argument in function definition

(FBT001)


32-32: Boolean default positional argument in function definition

(FBT002)


32-32: Unused class method argument: read_from_temp

(ARG003)


33-33: Boolean-typed positional argument in function definition

(FBT001)


33-33: Boolean default positional argument in function definition

(FBT002)


36-36: Unnecessary map() usage (rewrite using a generator expression)

Replace map() with a generator expression

(C417)


63-63: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


87-87: Unnecessary generator (rewrite as a list comprehension)

Rewrite as a list comprehension

(C400)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


95-95: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


113-113: Avoid specifying long messages outside the exception class

(TRY003)


120-120: Missing return type annotation for classmethod consolidate_files

(ANN206)


147-147: Do not catch blind exception: Exception

(BLE001)


178-178: Do not catch blind exception: Exception

(BLE001)


201-201: datetime.datetime.now() called without a tz argument

(DTZ005)


204-204: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


209-209: Boolean-typed positional argument in function definition

(FBT001)


209-209: Boolean default positional argument in function definition

(FBT002)


217-217: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Do not catch blind exception: Exception

(BLE001)


264-264: Do not catch blind exception: Exception

(BLE001)


269-269: Do not catch blind exception: Exception

(BLE001)


277-277: Missing return type annotation for classmethod dataframes

(ANN206)
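The several C417 findings above for texas_downloader.py all reduce to the same rewrite; a minimal stdlib-only sketch with illustrative data:

```python
# C417: map() over a lambda is clearer as a comprehension.
paths = ["a.csv", "b.csv", "c.csv"]

# Before (flagged by Ruff):
stems_map = list(map(lambda p: p.removesuffix(".csv"), paths))

# After (idiomatic):
stems = [p.removesuffix(".csv") for p in paths]

assert stems == stems_map == ["a", "b", "c"]
```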

app/funcs/validator_functions.py

169-169: Avoid specifying long messages outside the exception class

(TRY003)

app/live_display.py

4-4: typing.Dict is deprecated, use dict instead

(UP035)


10-10: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


40-40: datetime.datetime.now() called without a tz argument

(DTZ005)


41-41: datetime.datetime.now() called without a tz argument

(DTZ005)


51-51: datetime.datetime.now() called without a tz argument

(DTZ005)


74-74: datetime.datetime.now() called without a tz argument

(DTZ005)


94-94: Missing return type annotation for private function _queue_update

Add return type annotation: None

(ANN202)


96-96: Boolean positional value in function call

(FBT003)


98-98: Missing return type annotation for private function _update_display

Add return type annotation: None

(ANN202)

app/states/ohio/ohio_downloader.py

21-21: Boolean-typed positional argument in function definition

(FBT001)


21-21: Unused method argument: overwrite

(ARG002)


21-21: Boolean-typed positional argument in function definition

(FBT001)


21-21: Unused method argument: read_from_temp

(ARG002)


29-29: Unnecessary ... literal

Remove unnecessary ...

(PIE790)

app/states/texas/validators/texas_settings.py

5-5: Redefinition of unused Field from line 1

(F811)


18-18: Avoid specifying long messages outside the exception class

(TRY003)

app/funcs/csv_reader.py

100-101: Replace yield over for loop with yield from

Replace with yield from

(UP028)


102-102: Do not catch blind exception: Exception

(BLE001)


103-103: Use explicit conversion flag

Replace with conversion flag

(RUF010)


108-108: Missing type annotation for **kwargs

(ANN003)


121-122: Replace yield over for loop with yield from

Replace with yield from

(UP028)


124-124: Do not catch blind exception: Exception

(BLE001)


125-125: Use explicit conversion flag

Replace with conversion flag

(RUF010)


129-130: Replace yield over for loop with yield from

Replace with yield from

(UP028)


131-131: Do not catch blind exception: Exception

(BLE001)


132-132: Use explicit conversion flag

Replace with conversion flag

(RUF010)


136-136: Missing type annotation for **kwargs

(ANN003)


137-137: Unnecessary mode argument

Remove mode argument

(UP015)
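For the repeated UP028 findings in csv_reader.py, the fix is mechanical; a minimal sketch with a toy generator:

```python
# UP028: a for loop that only yields each item of an inner iterable
# can be replaced by `yield from`.
def rows_loop(chunks):
    for chunk in chunks:
        for row in chunk:   # flagged pattern
            yield row

def rows_delegate(chunks):
    for chunk in chunks:
        yield from chunk    # preferred: same behavior, less ceremony

data = [[1, 2], [3]]
assert list(rows_loop(data)) == list(rows_delegate(data)) == [1, 2, 3]
```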

app/states/texas/validators/texas_address.py

147-147: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/states/texas/validators/texas_filers.py

23-23: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


30-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


46-46: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


53-53: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


68-68: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


84-84: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


84-84: Missing type annotation for **data

(ANN003)


99-99: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


114-114: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


123-123: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


128-128: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


152-152: Missing return type annotation for classmethod validate_name

(ANN206)


158-158: Missing return type annotation for classmethod validate_addresses

(ANN206)


167-167: Missing return type annotation for classmethod validate_treasurer

(ANN206)


174-174: Missing return type annotation for classmethod validate_assistant_treasurer

(ANN206)


181-181: Missing return type annotation for classmethod validate_chair

(ANN206)
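The RUF012 findings above share one fix: annotate mutable class-level attributes with typing.ClassVar, so type checkers treat them as shared class state rather than per-instance fields. A minimal sketch (class and attribute names are illustrative, not the project's actual models):

```python
from typing import ClassVar

class TECRecordSketch:
    # Left bare, a mutable class attribute is flagged by RUF012.
    # ClassVar marks it as shared class state, not a model field:
    field_aliases: ClassVar[dict[str, str]] = {"filerIdent": "filer_id"}

a, b = TECRecordSketch(), TECRecordSketch()
assert a.field_aliases is b.field_aliases  # shared by design
```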

app/states/unified_database.py

5-5: typing.List is deprecated, use list instead

(UP035)


5-5: typing.Dict is deprecated, use dict instead

(UP035)


31-31: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


44-44: Avoid specifying long messages outside the exception class

(TRY003)


105-105: Avoid specifying long messages outside the exception class

(TRY003)


152-152: Boolean-typed positional argument in function definition

(FBT001)


152-152: Boolean default positional argument in function definition

(FBT002)


283-283: Naive datetime constructed using datetime.datetime.strptime() without %z

(DTZ007)


284-284: Naive datetime constructed using datetime.datetime.strptime() without %z

(DTZ007)


574-574: datetime.datetime.utcnow() used

(DTZ003)


583-583: datetime.datetime.utcnow() used

(DTZ003)


617-617: datetime.datetime.utcnow() used

(DTZ003)


651-651: datetime.datetime.utcnow() used

(DTZ003)


685-685: datetime.datetime.utcnow() used

(DTZ003)


738-738: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


744-744: datetime.datetime.utcnow() used

(DTZ003)


745-745: datetime.datetime.utcnow() used

(DTZ003)


753-753: Boolean-typed positional argument in function definition

(FBT001)


753-753: Boolean default positional argument in function definition

(FBT002)


760-760: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


768-768: Boolean-typed positional argument in function definition

(FBT001)


768-768: Boolean default positional argument in function definition

(FBT002)


777-777: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


817-817: datetime.datetime.utcnow() used

(DTZ003)


827-827: datetime.datetime.utcnow() used

(DTZ003)


855-855: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


865-865: Boolean-typed positional argument in function definition

(FBT001)


865-865: Boolean default positional argument in function definition

(FBT002)


878-878: Unused method argument: user

(ARG002)


891-891: datetime.datetime.utcnow() used

(DTZ003)


935-935: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


985-985: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


1059-1059: Unused method argument: user

(ARG002)


1069-1069: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


1081-1081: Comparison to None should be cond is None

Replace with cond is None

(E711)


1092-1092: datetime.datetime.utcnow() used

(DTZ003)


1107-1107: Unused method argument: user

(ARG002)


1132-1132: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


1148-1148: datetime.datetime.utcnow() used

(DTZ003)


1163-1163: Avoid equality comparisons to True; use `if UnifiedCommitteePerson.is_active:` for truth checks

Replace with UnifiedCommitteePerson.is_active

(E712)


1179-1179: Comparison to None should be cond is None

Replace with cond is None

(E711)


1203-1203: Do not catch blind exception: Exception

(BLE001)


1208-1208: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)
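Two patterns dominate the unified_database.py findings. For DTZ003, replace datetime.utcnow() with a timezone-aware call; for the E712 hits, note that in SQLAlchemy/SQLModel filter expressions == True is often intentional, and the idiomatic fix there is .is_(True) rather than the bare truthiness check Ruff suggests. A stdlib-only sketch of the datetime fix:

```python
from datetime import datetime, timezone

# DTZ003: utcnow() returns a naive datetime, which compares and
# serializes ambiguously. Prefer an aware UTC timestamp.
naive = datetime.utcnow()           # flagged (and deprecated in 3.12)
aware = datetime.now(timezone.utc)  # preferred

assert naive.tzinfo is None
assert aware.tzinfo is timezone.utc
```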

app/abcs/abc_download.py

6-6: Import from collections.abc instead: Generator

Import from collections.abc

(UP035)


6-6: typing.Dict is deprecated, use dict instead

(UP035)


30-30: Missing return type annotation for special method __post_init__

(ANN204)


36-36: Missing return type annotation for classmethod not_headless

(ANN206)


62-62: Missing return type annotation for classmethod extract_zipfile

Add return type annotation: None

(ANN206)


68-68: Do not catch blind exception: Exception

(BLE001)


73-73: Missing return type annotation for classmethod _process_csv

Add return type annotation: None

(ANN206)


85-85: datetime.datetime.now() called without a tz argument

(DTZ005)


109-109: Boolean-typed positional argument in function definition

(FBT001)


109-109: Boolean-typed positional argument in function definition

(FBT001)


114-114: Missing return type annotation for classmethod consolidate_files

(ANN206)


119-119: Missing return type annotation for classmethod read

(ANN206)

app/states/texas/validators/texas_expenses.py

161-161: Missing return type annotation for classmethod validate_payee

(ANN206)

app/states/texas/texas_search.py

2-2: typing.List is deprecated, use list instead

(UP035)


2-2: typing.Dict is deprecated, use dict instead

(UP035)


2-2: typing.Type is deprecated, use type instead

(UP035)


2-2: typing.Tuple is deprecated, use tuple instead

(UP035)


48-48: Missing return type annotation for special method __post_init__

(ANN204)


53-53: Missing return type annotation for private function _sort_fields

(ANN202)


63-63: Missing return type annotation for private function _recast_sorted_field_types

(ANN202)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


94-94: Missing return type annotation for private function _coalesce_names

(ANN202)


120-120: Missing return type annotation for special method __repr__

Add return type annotation: str

(ANN204)


169-169: Missing return type annotation for special method __repr__

Add return type annotation: str

(ANN204)


172-172: Missing return type annotation for special method __post_init__

(ANN204)


180-180: Unnecessary dict() call (rewrite as a literal)

Rewrite as a literal

(C408)


230-230: Missing type annotation for *args

(ANN002)


235-235: Unnecessary dict() call (rewrite as a literal)

Rewrite as a literal

(C408)

app/states/unified_sqlmodels.py

8-8: typing.List is deprecated, use list instead

(UP035)


8-8: typing.Dict is deprecated, use dict instead

(UP035)


136-136: Unnecessary UTF-8 encoding argument to encode

Remove unnecessary encoding argument

(UP012)


164-164: Missing return type annotation for special method __post_init__

(ANN204)


227-227: Missing return type annotation for special method __post_init__

(ANN204)


288-288: Missing return type annotation for special method __post_init__

(ANN204)


414-414: Class field state_id is defined multiple times

Remove duplicate field definition for state_id

(PIE794)


747-747: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


853-853: Unused method argument: entity_role

(ARG002)


953-953: Dynamically typed expressions (typing.Any) are disallowed in _get_field_value

(ANN401)


990-990: Consider moving this statement to an else block

(TRY300)


991-991: Do not catch blind exception: Exception

(BLE001)


1007-1007: Do not catch blind exception: Exception

(BLE001)


1036-1036: Do not catch blind exception: Exception

(BLE001)


1094-1094: Do not catch blind exception: Exception

(BLE001)


1183-1183: Possible SQL injection vector through string-based query construction

(S608)


1189-1189: Do not catch blind exception: Exception

(BLE001)


1201-1201: Do not catch blind exception: Exception

(BLE001)


1204-1204: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1218-1218: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1228-1228: Naive datetime constructed using datetime.datetime.strptime() without %z

(DTZ007)


1234-1234: Consider moving this statement to an else block

(TRY300)


1238-1238: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1293-1293: Dynamically typed expressions (typing.Any) are disallowed in obj

(ANN401)


1293-1293: Dynamically typed expressions (typing.Any) are disallowed in _json_default

(ANN401)


1304-1304: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)

production_loader.py

1-1: Shebang is present but file is not executable

(EXE001)


19-19: typing.Dict is deprecated, use dict instead

(UP035)


19-19: typing.List is deprecated, use list instead

(UP035)


19-19: typing.Tuple is deprecated, use tuple instead

(UP035)


152-152: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


203-203: Loop control variable slug not used within loop body

Rename unused slug to _slug

(B007)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


451-451: Possible SQL injection vector through string-based query construction

(S608)


469-469: Possible SQL injection vector through string-based query construction

(S608)


536-536: Missing return type annotation for private function _cache_entity

(ANN202)


552-552: Missing return type annotation for private function _cache_campaign

(ANN202)


568-568: Missing return type annotation for private function _ensure_address

(ANN202)


700-700: Avoid specifying long messages outside the exception class

(TRY003)


823-823: Do not catch blind exception: Exception

(BLE001)


832-832: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


832-832: Logging statement uses f-string

(G004)


855-855: Avoid specifying long messages outside the exception class

(TRY003)


876-876: Avoid specifying long messages outside the exception class

(TRY003)


931-931: Logging statement uses f-string

(G004)


1038-1038: f-string without any placeholders

Remove extraneous f prefix

(F541)


1044-1044: Logging .exception(...) should be used instead of .error(..., exc_info=True)

(G201)


1044-1044: Logging statement uses f-string

(G004)


1048-1048: f-string without any placeholders

Remove extraneous f prefix

(F541)


1049-1049: f-string without any placeholders

Remove extraneous f prefix

(F541)


1050-1050: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (12)
.gitattributes (1)

1-1: Verify removal of CSV and ZIP from LFS tracking.

The change removes *.csv and *.zip from Git LFS tracking and replaces them with *.parquet. Combined with the .gitignore changes that now actively ignore *.csv, CSV files would be completely untracked (neither in Git nor LFS).

Confirm this is intentional: Are CSV files temporary/transient, or do they need to be persisted in some form? If CSVs are generated/downloaded in CI pipelines or shared workflows, verify that removing LFS tracking won't cause issues.
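If the intent is to keep large generated files in LFS while still versioning CSVs, the two rules can coexist; a hypothetical .gitattributes keeping both under LFS:

```
# Illustrative .gitattributes: track both formats via LFS
*.parquet filter=lfs diff=lfs merge=lfs -text
*.csv filter=lfs diff=lfs merge=lfs -text
```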

app/abcs/abc_validation_errors.py (1)

7-7: LGTM!

The logging prefix standardization to 'campaignfinance|' aligns with the project-wide logging consistency effort.

recreate_tables.py (1)

12-12: The import is intentional for side-effect registration.

The # noqa: F401 comment is appropriate here since the import registers SQLModel metadata as a side effect. However, Ruff reports the directive as unused (RUF100) because the F401 rule is not enabled in the linter configuration; either enable F401 or drop the directive to silence the warning. The import itself is correctly used.
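A stdlib-only stand-in for the pattern in question: with a registering metaclass (as SQLModel uses under the hood), merely importing the module that defines the models populates the registry as a side effect, so the "unused" import is doing real work. Names here are illustrative:

```python
# Class definitions register themselves as a side effect, mirroring how
# importing a models module populates SQLModel.metadata before create_all().
registry: dict[str, type] = {}

class RegisteringMeta(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the abstract base itself
            registry[name] = cls

class Base(metaclass=RegisteringMeta):
    pass

class UnifiedEntity(Base):  # defining (or importing) this registers the "table"
    pass

assert "UnifiedEntity" in registry
```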

app/states/texas/validators/texas_filers.py (1)

28-37: Unusual composite primary key design with all-optional fields.

TECFilerIDNameLink defines multiple primary_key=True fields that are all Optional except filer_id. Having optional primary key columns is unconventional and may cause issues with NULL values in primary key constraints. Consider whether this link table design is intentional.
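The conventional link-table design makes every primary-key column NOT NULL; a stdlib sqlite3 sketch (table and column names illustrative) showing the constraint doing its job:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE filer_name_link (
        filer_id INTEGER NOT NULL,
        name_id  INTEGER NOT NULL,
        PRIMARY KEY (filer_id, name_id)
    )
""")
con.execute("INSERT INTO filer_name_link VALUES (1, 10)")

# A NULL in a NOT NULL primary-key column is rejected outright.
try:
    con.execute("INSERT INTO filer_name_link VALUES (NULL, 11)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

assert rejected
```

(SQLite permits NULLs in PRIMARY KEY columns unless NOT NULL is spelled out, which is why the sketch declares it; PostgreSQL enforces NOT NULL on primary keys regardless.)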

app/abcs/abc_state_config.py (1)

154-158: LGTM!

The reformatted multiline expression improves readability while maintaining identical functionality.

app/states/texas/__init__.py (2)

6-6: LGTM!

The new TexasSearch import exposes the search functionality from the module's public interface.


32-34: TECFilerName validator is correctly implemented with proper field mappings.

The change from TECFiler to TECFilerName is intentional and complete. TECFilerName is properly defined with comprehensive field mappings including filerIdent, filerJdiCd, committeeStatusCd, filerName, filerAddress, filerMailing, and relationships to related entities. The old TECFiler class is fully deprecated (commented out), with no active references remaining. The integration into the filers category configuration is correct.

app/abcs/__init__.py (1)

7-7: LGTM!

The progress object is now part of the public API, enabling progress tracking across downloader components. This aligns with its usage in FileDownloaderABC and the Texas downloader modules.

app/states/texas/validators/__init__.py (1)

1-10: LGTM on the restructured exports.

The updated public API surface correctly exposes the new base models (TECBaseModel, TECAddressBase) and renamed entities (TECFilerName), aligning with the refactored validator architecture.

app/live_display.py (1)

64-92: LGTM on the table generation logic.

The color-coded status display with duration tracking is well-implemented. Thread-safe access via self._lock during table generation ensures consistent reads.

app/states/texas/validators/texas_expenses.py (1)

18-20: table=True removed but table metadata retained.

TECExpense no longer has table=True but still has __tablename__ and __table_args__ defined. If this is intentionally not a table model, remove the metadata. If it should be a table, add table=True back.

app/states/ohio/ohio_downloader.py (1)

21-35: Method signature mismatch with abstract base class.

The download, consolidate_files, and read methods use self (instance methods), but FileDownloaderABC declares them as @classmethod with cls parameter. This inconsistency will cause issues when these methods are invoked.

Apply this diff to align with the ABC:

-    def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+    @classmethod
+    def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
-        _driver = self.driver.create_driver()
-        _wait = self.driver
+        _driver = cls.driver.create_driver()
+        _wait = cls.driver
         _driver.get(OHIO_ENTRY_URL)
         _wait.wait_until_clickable(By.LINK_TEXT, "FTP Site")
         _ftp_site_url = _driver.find_element(By.LINK_TEXT, "FTP Site").get_attribute("href")
         _driver.get(_ftp_site_url)
-
-        ...
+        return cls

-    def consolidate_files(self):
-        ...
+    @classmethod
+    def consolidate_files(cls):
+        pass

-    def read(self):
-        ...
+    @classmethod
+    def read(cls):
+        pass

Likely an incorrect or invalid review comment.

*.zip
*.DS_Store
*.env
# *.parquet

🛠️ Refactor suggestion | 🟠 Major

Consolidate redundant commented and uncommented ignore patterns.

Lines 7 and 10 contain commented versions of patterns, while lines 178–179 contain their uncommented equivalents. This duplication creates maintenance burden. Consider removing the commented versions or consolidating at the end of the file with a single definitive block.

Also applies to: 10-10, 178-179

🤖 Prompt for AI Agents
.gitignore lines 7, 10 and lines 178–179: there are duplicate commented ignore
patterns (lines 7 and 10) that match the uncommented equivalents at 178–179;
remove the redundant commented entries (or move any intentionally kept comments
into a single definitive explanatory block at the end of the file) so each
ignore pattern appears only once (keep the active/uncommented patterns where
they currently are and delete the commented duplicates).

Comment on lines 176 to 177
*.csv
*.txt

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

wc -l .gitignore

Repository: Abstract-Data/campaignfinance-2023

Length of output: 92


🏁 Script executed:

head -10 .gitignore | cat -n

Repository: Abstract-Data/campaignfinance-2023

Length of output: 244


🏁 Script executed:

sed -n '170,180p' .gitignore | cat -n

Repository: Abstract-Data/campaignfinance-2023

Length of output: 543


🏁 Script executed:

# Check if *.csv appears multiple times
rg "^\*\.csv$" .gitignore

Repository: Abstract-Data/campaignfinance-2023

Length of output: 89


🏁 Script executed:

# Check if *.txt already exists elsewhere
rg "^\*\.txt$" .gitignore

Repository: Abstract-Data/campaignfinance-2023

Length of output: 83


Remove duplicate CSV rule and clarify intent of broad *.txt pattern.

Line 176 duplicates the *.csv rule already present at line 2. Additionally, line 177's *.txt pattern is overly broad and could inadvertently ignore important text files (README, documentation, config templates, etc.).

Clarify the intent: Is *.txt meant to ignore temporary text outputs, or is this too broad for the repository?

🤖 Prompt for AI Agents
In .gitignore around lines 176-177, remove the duplicate "*.csv" entry (it's
already present at line 2) and replace or clarify the overly broad "*.txt" rule:
either delete it if not needed, scope it to a specific directory/pattern for
temporary outputs (e.g., temp/*.txt or *.out.txt) or add explicit negations for
important files (e.g., !README.txt, !docs/*.txt) and include a short comment
above the rule explaining the intent (temporary outputs vs generated artifacts).
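A scoped alternative to the blanket *.txt rule, sketched with illustrative paths:

```
# Ignore only generated text outputs; keep documentation tracked
tmp/*.txt
*.out.txt
!README.txt
!docs/*.txt
```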

Comment on lines 6 to +8
<excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
<excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
<excludeFolder url="file://$MODULE_DIR$/.venv" />

🧹 Nitpick | 🔵 Trivial

Redundant folder exclusions.

Lines 6-7 exclude specific subdirectories of .venv, but line 8 excludes the entire .venv directory. When a parent directory is excluded, all its subdirectories are implicitly excluded, making lines 6-7 redundant.

Apply this diff to remove the redundant exclusions:

     <content url="file://$MODULE_DIR$">
       <sourceFolder url="file://$MODULE_DIR$/app" isTestSource="false" />
-      <excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
-      <excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
       <excludeFolder url="file://$MODULE_DIR$/.venv" />
     </content>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
<excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
<excludeFolder url="file://$MODULE_DIR$/.venv" />
<excludeFolder url="file://$MODULE_DIR$/.venv" />
🤖 Prompt for AI Agents
In .idea/campaignfinance.iml around lines 6 to 8, remove the two redundant
excludeFolder entries for file://$MODULE_DIR$/.venv/bin and
file://$MODULE_DIR$/.venv/lib and keep only the single excludeFolder entry that
excludes file://$MODULE_DIR$/.venv so that the parent .venv exclusion covers its
subdirectories.

<excludeFolder url="file://$MODULE_DIR$/.venv" />
</content>
<orderEntry type="jdk" jdkName="Poetry (campaignfinance) (2)" jdkType="Python SDK" />
<orderEntry type="jdk" jdkName="Python 3.12 UV (campaignfinance)" jdkType="Python SDK" />

🧹 Nitpick | 🔵 Trivial

Hardcoded SDK reference is machine-specific.

The SDK name "Python 3.12 UV (campaignfinance)" is an IDE-specific identifier that may differ across team members' machines. IDE configuration files in .idea/ are typically added to .gitignore because they contain machine-specific settings like SDK paths, tool locations, and personal preferences.

Consider adding .idea/ to .gitignore (keeping only shared configurations like code style if needed), or document that developers must configure their local SDK to match this exact name.

🤖 Prompt for AI Agents
In .idea/campaignfinance.iml around line 10 the file contains a hardcoded,
machine-specific SDK entry "Python 3.12 UV (campaignfinance)"; remove that
SDK-specific orderEntry (or delete the whole .iml) and add .idea/ to .gitignore
(or keep only shared IDE files) so personal SDK names aren't committed;
alternatively document required local SDK name if you must keep a shared
reference.

@@ -6,7 +6,7 @@
<option name="pathToExecutable" value="/opt/homebrew/bin/black" />

🧹 Nitpick | 🔵 Trivial

Hardcoded absolute path is machine-specific.

The path /opt/homebrew/bin/black is specific to macOS with Homebrew and won't work on Linux, Windows, or systems with different package managers. This will cause Black integration to fail for other developers.

As noted for the previous file, .idea/ configurations are typically machine-specific and should be excluded from version control. If you choose to commit them, document the expected tool locations or use IDE features to detect tools from the project's virtual environment automatically.

🤖 Prompt for AI Agents
In .idea/misc.xml around line 6 the Black executable path is hardcoded to
/opt/homebrew/bin/black which is machine-specific; replace this by removing the
absolute path and either rely on IDE's auto-detection (use the project's
virtualenv/interpreter setting) or store a relative/variable-based path (e.g.,
use a configurable IDEA path macro or project interpreter) and add .idea/ files
to .gitignore or document expected tool locations in repo docs so other
developers don't get a broken config.

Comment on lines +1 to +18
# Texas Campaign Finance Package

## Overview
This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples
Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

## Ability to Download TEC File Data Built-In
Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.


## Dependencies
![Python](https://img.shields.io/badge/Python-FFD43B?style=for-the-badge&logo=python&logoColor=blue)
![Pydantic](https://img.shields.io/badge/Pydantic-E92063?style=for-the-badge&logo=Pydantic&logoColor=white)
![Pandas](https://img.shields.io/badge/Pandas-2C2D72?style=for-the-badge&logo=pandas&logoColor=white)
![Selenium](https://img.shields.io/badge/Selenium-43B02A?style=for-the-badge&logo=Selenium&logoColor=white) No newline at end of file

🧹 Nitpick | 🔵 Trivial

Minor markdown formatting improvements.

Add blank lines after headings and ensure file ends with a newline for proper markdown rendering and linting compliance.

 # Texas Campaign Finance Package
 
 ## Overview
+
 This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
 It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.
 
 ## Examples
+
 Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.
 
 ## Ability to Download TEC File Data Built-In
+
 Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.
 
 
 ## Dependencies
+
 ![Python](https://img.shields.io/badge/Python-FFD43B?style=for-the-badge&logo=python&logoColor=blue)
 ![Pydantic](https://img.shields.io/badge/Pydantic-E92063?style=for-the-badge&logo=Pydantic&logoColor=white)
 ![Pandas](https://img.shields.io/badge/Pandas-2C2D72?style=for-the-badge&logo=pandas&logoColor=white)
 ![Selenium](https://img.shields.io/badge/Selenium-43B02A?style=for-the-badge&logo=Selenium&logoColor=white)
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Texas Campaign Finance Package

## Overview

This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples

Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

## Ability to Download TEC File Data Built-In

Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.

## Dependencies

![Python](https://img.shields.io/badge/Python-FFD43B?style=for-the-badge&logo=python&logoColor=blue)
![Pydantic](https://img.shields.io/badge/Pydantic-E92063?style=for-the-badge&logo=Pydantic&logoColor=white)
![Pandas](https://img.shields.io/badge/Pandas-2C2D72?style=for-the-badge&logo=pandas&logoColor=white)
![Selenium](https://img.shields.io/badge/Selenium-43B02A?style=for-the-badge&logo=Selenium&logoColor=white)
🧰 Tools
🪛 LanguageTool

[style] ~15-~15: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1026 characters long)
Context: ...r-the-badge&logo=python&logoColor=blue) Pydantic Pandas ![Selenium](https://img.shields.io/badge/...

(EN_EXCESSIVE_EXCLAMATION)

🪛 markdownlint-cli2 (0.18.1)

3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


7-7: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


10-10: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


14-14: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


18-18: Files should end with a single newline character

(MD047, single-trailing-newline)

🤖 Prompt for AI Agents
In README.md around lines 1 to 18, headings are missing blank lines after them
and the file likely does not end with a newline; add a single blank line
immediately after each Markdown heading (e.g., after "# Texas Campaign Finance
Package", "## Overview", "## Examples", "## Ability to Download TEC File Data
Built-In", and "## Dependencies") and ensure the file ends with a trailing
newline character.

It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples
Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

⚠️ Potential issue | 🟡 Minor

Placeholder text needs to be replaced.

The text [number of columns] appears to be a placeholder that should be replaced with the actual column count.

🤖 Prompt for AI Agents
In README.md around line 8, the placeholder "[number of columns]" must be
replaced with the actual column count; determine the correct number (e.g., 317
or the current accurate count) and replace the placeholder with that numeric
value, keeping the sentence grammatical and updating any surrounding punctuation
or wording as needed.

Comment on lines +30 to +32
except Exception as e:
console.print(f"[red]❌ Setup failed: {e}[/red]")
return

🧹 Nitpick | 🔵 Trivial

Catching bare Exception hides root causes.

Broad exception handling makes debugging difficult. Catch specific exceptions:

-    except Exception as e:
-        console.print(f"[red]❌ Setup failed: {e}[/red]")
+    except (ImportError, ConnectionError, OperationalError) as e:
+        console.print(f"[red]❌ Setup failed: {e}[/red]")

At minimum, import traceback at the top of the module and log the full traceback for debugging:

import traceback

try:
    ...
except Exception as e:
    console.print(f"[red]❌ Setup failed: {e}[/red]")
    console.print(f"[dim]{traceback.format_exc()}[/dim]")

Also applies to: 70-72, 81-83

🧰 Tools
🪛 Ruff (0.14.7)

30-30: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In recreate_tables.py around lines 30-32 (and similarly at 70-72 and 81-83), the
code catches a bare Exception and only prints the exception message; update each
except block to catch more specific exceptions where possible and at minimum
include the full traceback in the output for debugging: import the traceback
module at the top, and in those except blocks print the original error message
and the formatted traceback (traceback.format_exc()) so root causes are visible;
prefer replacing broad Exception with the specific exception types raised by the
failing operations if known.

Comment on lines +34 to +83
    # Drop existing unified tables
    console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")
    try:
        with engine.connect() as conn:
            # Drop tables in reverse dependency order
            tables_to_drop = [
                'unified_committee_person_versions',
                'unified_transaction_versions',
                'unified_person_versions',
                'unified_committee_versions',
                'unified_address_versions',
                'unified_campaign_entities',
                'unified_entity_associations',
                'unified_transaction_persons',
                'unified_committee_persons',
                'unified_contributions',
                'unified_loans',
                'unified_transactions',
                'file_origins',
                'unified_campaigns',
                'unified_entities',
                'unified_persons',
                'unified_committees',
                'unified_addresses',
                'states'
            ]

            for table in tables_to_drop:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    console.print(f"[green]✅ Dropped {table}[/green]")
                except Exception as e:
                    console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")

            conn.commit()

    except Exception as e:
        console.print(f"[red]❌ Error dropping tables: {e}[/red]")
        return

    # Create new tables with correct schema
    console.print("\n[bold blue]Creating new unified tables...[/bold blue]")
    try:
        # Create all tables
        SQLModel.metadata.create_all(engine)
        console.print("[green]✅ All unified tables created successfully![/green]")

    except Exception as e:
        console.print(f"[red]❌ Error creating tables: {e}[/red]")
        return

⚠️ Potential issue | 🟠 Major

Schema migration lacks transactional safety.

If table creation fails after tables are dropped, the database is left in an inconsistent state with missing tables. Consider:

  1. Wrap the entire operation in a transaction
  2. Or use a backup/restore approach
  3. Or at minimum, warn users this is destructive
+    console.print("[yellow]⚠️ WARNING: This operation is destructive and non-recoverable![/yellow]")
+    console.print("[yellow]Consider backing up your database first.[/yellow]")
+    
     # Drop existing unified tables
     console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")

For production use, consider migration tools like Alembic that provide rollback capabilities.
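A minimal transactional sketch using plain SQLAlchemy (SQLModel.metadata is a SQLAlchemy MetaData, so it works the same way); the function name and the `cascade` flag are illustrative, not part of the PR:

```python
from sqlalchemy import MetaData, text

def recreate_tables(engine, metadata: MetaData, tables_to_drop, cascade: bool = True) -> None:
    """Drop and recreate tables atomically.

    engine.begin() opens a transaction that commits only if every
    statement succeeds; any exception rolls back the drops along with
    the creates, so the database is never left half-migrated.
    """
    suffix = " CASCADE" if cascade else ""
    with engine.begin() as conn:
        for table in tables_to_drop:
            conn.execute(text(f'DROP TABLE IF EXISTS "{table}"{suffix}'))
        # Bind create_all to the same connection so it joins the open
        # transaction instead of autocommitting separately.
        metadata.create_all(conn)
```

Against PostgreSQL this would be called as `recreate_tables(engine, SQLModel.metadata, tables_to_drop)`; `cascade=False` is for backends like SQLite that do not accept the CASCADE clause.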

📝 Committable suggestion


Suggested change
    # Drop existing unified tables
    console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")
    console.print("[yellow]⚠️ WARNING: This operation is destructive and non-recoverable![/yellow]")
    console.print("[yellow]Consider backing up your database first.[/yellow]")
    try:
        with engine.connect() as conn:
            # Drop tables in reverse dependency order
            tables_to_drop = [
                'unified_committee_person_versions',
                'unified_transaction_versions',
                'unified_person_versions',
                'unified_committee_versions',
                'unified_address_versions',
                'unified_campaign_entities',
                'unified_entity_associations',
                'unified_transaction_persons',
                'unified_committee_persons',
                'unified_contributions',
                'unified_loans',
                'unified_transactions',
                'file_origins',
                'unified_campaigns',
                'unified_entities',
                'unified_persons',
                'unified_committees',
                'unified_addresses',
                'states'
            ]
            for table in tables_to_drop:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    console.print(f"[green]✅ Dropped {table}[/green]")
                except Exception as e:
                    console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")
            conn.commit()
    except Exception as e:
        console.print(f"[red]❌ Error dropping tables: {e}[/red]")
        return

    # Create new tables with correct schema
    console.print("\n[bold blue]Creating new unified tables...[/bold blue]")
    try:
        # Create all tables
        SQLModel.metadata.create_all(engine)
        console.print("[green]✅ All unified tables created successfully![/green]")
    except Exception as e:
        console.print(f"[red]❌ Error creating tables: {e}[/red]")
        return
🧰 Tools
🪛 Ruff (0.14.7)

65-65: Do not catch blind exception: Exception

(BLE001)


70-70: Do not catch blind exception: Exception

(BLE001)


81-81: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In recreate_tables.py around lines 34 to 83, the script drops tables then
creates them without transactional safety; change the flow so the destructive
drops and subsequent create_all occur inside a single transaction (use
engine.begin()/engine.connect().begin() or a Session/transaction context) and
only commit after SQLModel.metadata.create_all succeeds, bind create_all to the
same connection/transaction; additionally add a pre-check/backup step (e.g.,
prompt the user and optionally dump or rename existing tables/schemas) and
explicit destructive-warning messaging, and consider switching this workflow to
Alembic migrations for production to enable proper rollbacks.

Comment on lines +61 to +68
            for table in tables_to_drop:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    console.print(f"[green]✅ Dropped {table}[/green]")
                except Exception as e:
                    console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")

            conn.commit()

🧹 Nitpick | 🔵 Trivial

Use parameterized queries or identifier quoting to avoid SQL injection patterns.

While the table names are hardcoded in tables_to_drop, directly interpolating into SQL strings establishes a risky pattern. Use SQLAlchemy's quoted_name or text().bindparams() for safety:

+from sqlalchemy import quoted_name
+from sqlalchemy.sql import text
 ...
 for table in tables_to_drop:
     try:
-        conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
+        # Use quoted identifier to prevent SQL injection patterns
+        conn.execute(text(f"DROP TABLE IF EXISTS {quoted_name(table, quote=True)} CASCADE"))

Alternatively, use SQLAlchemy's Table.drop() API with the metadata for type-safe operations.
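A sketch of the identifier-preparer route; the helper names are illustrative, and `IdentifierPreparer.quote` is SQLAlchemy's dialect-aware quoting entry point:

```python
from sqlalchemy import text
from sqlalchemy.engine import Connection

def quote_ident(dialect, name: str) -> str:
    """Return `name` quoted for the target database when quoting is needed."""
    return dialect.identifier_preparer.quote(name)

def drop_table_quoted(conn: Connection, table_name: str) -> None:
    # A quoted identifier cannot break out of the identifier position,
    # so a malformed name produces a harmless error instead of
    # injected SQL.
    conn.execute(text(f"DROP TABLE IF EXISTS {quote_ident(conn.dialect, table_name)} CASCADE"))
```

On the SQLite dialect, for example, `quote_ident` leaves a plain lowercase name untouched and doubles any embedded quote characters in a hostile one.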

🧰 Tools
🪛 Ruff (0.14.7)

65-65: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In recreate_tables.py around lines 61 to 68, the code interpolates table names
directly into a DROP TABLE SQL string which encourages SQL-injection-prone
patterns; replace this by using SQLAlchemy-safe identifier handling — either
construct Table objects from your MetaData and call table.drop(bind=conn,
checkfirst=True) for each name, or if you must use text(), quote identifiers
with sqlalchemy.sql.elements.quoted_name or use sql.text() with proper
identifier binding via metadata/engine.dialect.identifier_preparer.quote to
produce a safe quoted identifier before executing; update the try/except branch
to execute the safe API and keep the conn.commit() after successful drops.

@coderabbitai coderabbitai bot left a comment


Review continued from previous batch...

Comment on lines +20 to +21
progress = ProgressTracker()
progress.start()

⚠️ Potential issue | 🟠 Major

Side effect at module import time.

Starting ProgressTracker at module load (progress.start()) means the live display begins whenever this module is imported, even if no download is happening. This can cause unexpected console output and resource usage.

Consider lazy initialization or explicit start:

 progress = ProgressTracker()
-progress.start()

Then call progress.start() explicitly in the download() method when needed.

Committable suggestion skipped: line range outside the PR's diff.
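A generic lazy-start wrapper sketch; the `ProgressTracker` API with `.start()`/`.stop()` is assumed from the PR, and the wrapper name is illustrative:

```python
class LazyProgress:
    """Defers creating and starting a tracker until first use,
    so importing the module produces no console output."""

    def __init__(self, factory):
        self._factory = factory  # e.g. ProgressTracker
        self._tracker = None

    def get(self):
        # Create and start the live display only on first access.
        if self._tracker is None:
            self._tracker = self._factory()
            self._tracker.start()
        return self._tracker

    def stop(self):
        # Tear down the live display and allow a fresh start later.
        if self._tracker is not None:
            self._tracker.stop()
            self._tracker = None
```

The module level then becomes `progress = LazyProgress(ProgressTracker)`, and `download()` calls `progress.get()` before its first `add_task`.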

🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 20-21, creating and starting
ProgressTracker at module import time (progress = ProgressTracker();
progress.start()) causes an unwanted live display and resource use whenever the
module is imported; change to lazy initialization by removing progress.start()
from module scope, instantiate ProgressTracker (or call progress.start()) inside
the download() method when an actual download begins (or use a helper that
creates/starts the tracker on first use), and ensure the tracker is properly
stopped/cleaned up after the download completes or on errors.

Comment on lines +66 to +69
try:
cls._process_csv(zip_ref, file, tmp)
except Exception as e:
ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")

⚠️ Potential issue | 🟡 Minor

Broad exception catch may hide critical errors.

Catching bare Exception and only logging with ic() can hide serious issues like disk full, permission errors, or corrupted archives. Consider catching specific exceptions or at least logging the full traceback.

             try:
                 cls._process_csv(zip_ref, file, tmp)
-            except Exception as e:
-                ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+            except (OSError, IOError, pl.exceptions.ComputeError) as e:
+                ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+                raise
📝 Committable suggestion


Suggested change
            try:
                cls._process_csv(zip_ref, file, tmp)
            except (OSError, IOError, pl.exceptions.ComputeError) as e:
                ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
                raise
🧰 Tools
🪛 Ruff (0.14.7)

68-68: Do not catch blind exception: Exception

(BLE001)

Comment on lines +79 to +83
_csv_task = progress.add_task("T5", f"Extract CSV {file_name.stem}", "Started")
zip_ref.extract(file, tmp)

if file_name.suffix == '.txt':
return

⚠️ Potential issue | 🟡 Minor

Progress task not completed for .txt files.

When the file is a .txt, the function returns early on line 83 without calling progress.update_task(_csv_task, "Complete"). This leaves the task in "Started" status indefinitely.

         if file_name.suffix == '.txt':
+            progress.update_task(_csv_task, "Complete")
             return
📝 Committable suggestion


Suggested change
        _csv_task = progress.add_task("T5", f"Extract CSV {file_name.stem}", "Started")
        zip_ref.extract(file, tmp)

        if file_name.suffix == '.txt':
            progress.update_task(_csv_task, "Complete")
            return
🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 79 to 83, the progress task _csv_task
is left in "Started" when a .txt file causes an early return; before returning
for file_name.suffix == '.txt' call progress.update_task(_csv_task, "Complete")
(or an equivalent completion/cleanup call) so the task status is set to
"Complete" before the function returns.

if file_name.suffix == '.txt':
return

rename = f"{file_name.stem}_{datetime.now():%Y%m%d}dl"

🧹 Nitpick | 🔵 Trivial

datetime.now() called without timezone.

Using naive datetime for file naming can cause inconsistencies across different environments or if the system timezone changes.

+from datetime import datetime, timezone
...
-        rename = f"{file_name.stem}_{datetime.now():%Y%m%d}dl"
+        rename = f"{file_name.stem}_{datetime.now(timezone.utc):%Y%m%d}dl"
📝 Committable suggestion


Suggested change
        rename = f"{file_name.stem}_{datetime.now(timezone.utc):%Y%m%d}dl"
🧰 Tools
🪛 Ruff (0.14.7)

85-85: datetime.datetime.now() called without a tz argument

(DTZ005)

🤖 Prompt for AI Agents
In app/abcs/abc_download.py around line 85, the filename uses a naive datetime
(datetime.now()) which lacks timezone information; replace it with a
timezone-aware timestamp (for example datetime.now(tz=timezone.utc) or
datetime.now().astimezone()) when formatting the rename string and ensure the
appropriate timezone import (from datetime import datetime, timezone) is added
at the top of the file so the generated filename is consistently based on an
explicit timezone (e.g., UTC).

Comment on lines +107 to 120
 @classmethod
 @abc.abstractmethod
 def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
     ...

 @classmethod
 @abc.abstractmethod
-def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+def consolidate_files(cls):
     ...

 @classmethod
 @abc.abstractmethod
-def read(self):
+def read(cls):
     ...

🧹 Nitpick | 🔵 Trivial

Consider keyword-only arguments for boolean parameters.

The download method has two boolean positional parameters which can be confusing at call sites (e.g., download(True, False)).

     @classmethod
     @abc.abstractmethod
-    def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+    def download(cls, *, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
         ...
🧰 Tools
🪛 Ruff (0.14.7)

109-109: Boolean-typed positional argument in function definition

(FBT001)


109-109: Boolean-typed positional argument in function definition

(FBT001)


114-114: Missing return type annotation for classmethod consolidate_files

(ANN206)


119-119: Missing return type annotation for classmethod read

(ANN206)

🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 107 to 120, the download method
currently accepts two boolean positional arguments which are ambiguous at call
sites; change the abstractmethod signature to require keyword-only booleans
(e.g., introduce a positional-only marker by adding a lone * so overwrite and
read_from_temp must be passed by name), update all concrete implementations to
match the new signature, and update all call sites to use explicit keywords
(overwrite=..., read_from_temp=...) to avoid confusion; ensure type hints and
return type remain unchanged and run tests to catch any mismatches.
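A sketch of the keyword-only signature; the class and return names follow the PR, while the concrete subclass is hypothetical:

```python
from __future__ import annotations
import abc

class FileDownloaderABC(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def download(cls, *, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
        """The lone * forces callers to name both flags."""

class TexasDownloader(FileDownloaderABC):
    @classmethod
    def download(cls, *, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
        # Hypothetical concrete implementation; real logic omitted.
        return cls()
```

With this signature, `download(True, False)` raises TypeError, and call sites must spell out `TexasDownloader.download(overwrite=True, read_from_temp=False)`.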

Comment on lines +1 to +5
from dataclasses import Field
from typing import Optional
from datetime import date
from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
from sqlmodel import SQLModel, Field

⚠️ Potential issue | 🟡 Minor

Remove unused and redefined Field import.

Line 1 imports Field from dataclasses, which is immediately shadowed by the Field import from sqlmodel on line 5. The dataclasses.Field is never used. Additionally, BaseModel from pydantic (line 4) appears unused since you're using SQLModel.

-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import ConfigDict, model_validator, BeforeValidator
 from sqlmodel import SQLModel, Field
📝 Committable suggestion


Suggested change
from typing import Optional
from datetime import date
from pydantic import ConfigDict, model_validator, BeforeValidator
from sqlmodel import SQLModel, Field
🧰 Tools
🪛 Ruff (0.14.7)

5-5: Redefinition of unused Field from line 1

(F811)

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 1 to 5, remove the
unused and shadowed imports: drop "Field" from dataclasses (line 1) since it is
immediately shadowed by sqlmodel.Field, and remove "BaseModel" from pydantic
(line 4) if it is not used elsewhere in the file; also trim any other unused
pydantic imports (e.g., BeforeValidator) so that only the actually used symbols
(like model_validator or ConfigDict) remain, leaving the sqlmodel Field import
intact.

Comment on lines +15 to +20
def check_contains_factory(match_string: str):
    def check_contains(value: str) -> str:
        if value and match_string not in value:
            raise ValueError(f"Value must contain '{match_string}'")
        return value
    return check_contains

🛠️ Refactor suggestion | 🟠 Major

Duplicate of check_contains_factory in validator_functions.py.

This function is identical to the one in app/funcs/validator_functions.py (lines 166-171). Import it from there instead of redefining it.

-def check_contains_factory(match_string: str):
-    def check_contains(value: str) -> str:
-        if value and match_string not in value:
-            raise ValueError(f"Value must contain '{match_string}'")
-        return value
-    return check_contains
+from funcs.validator_functions import check_contains_factory

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

18-18: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 15 to 20 there is
a duplicate definition of check_contains_factory; remove the local function and
instead import check_contains_factory from app/funcs/validator_functions.py (add
the import at the top of the file), update any local references to use the
imported name, and ensure imports are organized (and tests/types still pass).

Comment on lines +44 to +50
class TECBaseModel(TECSettings):
    recordType: Optional[str] = Field(BeforeValidator(check_contains_factory('recordType')), description="Record type code")
    formTypeCd: Optional[str] = Field(BeforeValidator(check_contains_factory('formType')), description="TEC form used")
    schedFormTypeCd: Optional[str] = Field(BeforeValidator(check_contains_factory('schedFormType')), description="TEC Schedule Used")
    receivedDt: Optional[date] = Field(BeforeValidator(check_contains_factory('receivedDt')), description="Date report received by TEC")
    filerIdent: Optional[str] = Field(default=None, description="Filer account #", max_length=100)
    filerTypeCd: Optional[str] = Field(default=None, description="Type of filer", max_length=30)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the check_contains_factory function definition
rg -n "def check_contains_factory|check_contains_factory" --type py -B 2 -A 10

Repository: Abstract-Data/campaignfinance-2023

Length of output: 12330


🏁 Script executed:

# Search for receivedDt usage patterns throughout the codebase
rg -n "receivedDt" --type py -C 3

Repository: Abstract-Data/campaignfinance-2023

Length of output: 9386


🏁 Script executed:

# Check if there are test files that show how data is instantiated
fd -e "test_" -o -e "_test.py" | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 194


🏁 Script executed:

# Let me examine the check_contains function more closely and see if there's type coercion
cat -n app/funcs/validator_functions.py | sed -n '166,171p'

Repository: Abstract-Data/campaignfinance-2023

Length of output: 373


🏁 Script executed:

# Check if there are any Pydantic validators that handle date parsing before the BeforeValidator
cat -n app/states/texas/validators/texas_settings.py | sed -n '1,60p'

Repository: Abstract-Data/campaignfinance-2023

Length of output: 3032


🏁 Script executed:

# Look for how data is loaded/instantiated to understand input format
fd -type f -name "*.py" | xargs rg -l "TECBaseModel\|receivedDt" | head -10

Repository: Abstract-Data/campaignfinance-2023

Length of output: 248


🏁 Script executed:

# Search for validate_dates function to understand what it does
rg -n "def validate_dates" --type py -A 15

Repository: Abstract-Data/campaignfinance-2023

Length of output: 3023


🏁 Script executed:

# Also check if there are any tests that show actual instantiation patterns
fd -type f -name "*.py" | xargs rg -l "test.*TECBaseModel\|TECBaseModel.*test" | head -5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 248


🏁 Script executed:

# Verify the execution order - check if there are any tests showing the error
rg -n "receivedDt" app/states/texas/validators/texas_settings.py -B 5 -A 5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 971


🏁 Script executed:

# Double-check the Pydantic v2 behavior by looking at imports and config
head -20 app/states/texas/validators/texas_settings.py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 758


Remove BeforeValidator(check_contains_factory('receivedDt')) from the receivedDt field.

The validate_dates model validator (mode='before') runs before field-level validators and converts the receivedDt string to a date object. When check_contains_factory subsequently runs, it receives a date object but expects a string, causing a TypeError at runtime. The substring-check logic is also semantically incorrect for date fields. Apply the same fix to any other non-string fields using this validator pattern.

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 44 to 50, remove
the Field-level BeforeValidator(check_contains_factory('receivedDt')) from the
receivedDt field because validate_dates (mode='before') already converts the
incoming string to a date and the check_contains validator expects a string;
update receivedDt to be a plain Optional[date] Field (no BeforeValidator) and
audit other non-string fields to remove any
BeforeValidator(check_contains_factory(...)) uses so substring checks only run
on string fields (keep check_contains_factory on string-typed fields only or
move checks into appropriate pre-validation for strings).
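The ordering conflict is easy to reproduce outside the codebase. Below is a minimal sketch (assuming Pydantic v2; the `Report` model, the inline `check_contains`, and the `%Y%m%d` date format are illustrative, not the project's actual classes): the `mode='before'` model validator converts the string first, so the field-level `BeforeValidator` receives a `date` and its substring check fails.

```python
from datetime import date, datetime
from typing import Annotated, Optional

from pydantic import BaseModel, BeforeValidator, model_validator


def check_contains(value):
    # Mirrors check_contains_factory('receivedDt'): assumes `value` is a str.
    if value and 'receivedDt' not in value:
        raise ValueError("Value must contain 'receivedDt'")
    return value


class Report(BaseModel):
    receivedDt: Annotated[Optional[date], BeforeValidator(check_contains)] = None

    @model_validator(mode='before')
    @classmethod
    def validate_dates(cls, values):
        # Runs before any field-level validator: string -> date conversion.
        raw = values.get('receivedDt')
        if isinstance(raw, str):
            values['receivedDt'] = datetime.strptime(raw, '%Y%m%d').date()
        return values


raised = False
try:
    Report(receivedDt='20240115')
except Exception:
    # The BeforeValidator evaluated `'receivedDt' not in <date object>`.
    raised = True
print(raised)
```

Dropping the field-level validator (as suggested above) removes the conflict, since the model-level before-validator already owns the string-to-date conversion.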

Comment on lines +554 to +589
# Get current version number
version_count = session.exec(select(UnifiedTransactionVersion).where(UnifiedTransactionVersion.transaction_id == tx.id)).count()
# Save current state as version
# Convert data to JSON-serializable format
data_dict = {}
for k in tx.__fields__.keys():
    value = getattr(tx, k)
    if isinstance(value, date):
        data_dict[k] = value.isoformat()
    elif isinstance(value, datetime):
        data_dict[k] = value.isoformat()
    elif isinstance(value, Decimal):
        data_dict[k] = float(value)
    else:
        data_dict[k] = value

version = UnifiedTransactionVersion(
    transaction_id=tx.id,
    version_number=version_count + 1,
    data=json.dumps(data_dict),
    changed_at=datetime.utcnow(),
    changed_by=user,
    change_reason=reason,
    amendment_details=amendment_details
)
session.add(version)
# Update fields
for k, v in updates.items():
    setattr(tx, k, v)
tx.last_modified_at = datetime.utcnow()
tx.last_modified_by = user
tx.change_reason = reason
tx.amendment_details = amendment_details
session.add(tx)
session.commit()
session.refresh(tx)

⚠️ Potential issue | 🟠 Major

Race condition in version counting and deprecated datetime.utcnow().

  1. The version count query followed by version creation is not atomic—concurrent updates could result in duplicate version numbers.
  2. datetime.utcnow() is deprecated; use datetime.now(timezone.utc) instead.

For the race condition, consider using a database sequence or SELECT ... FOR UPDATE. For the datetime issue:

+from datetime import datetime, date, timezone
...
-            changed_at=datetime.utcnow(),
+            changed_at=datetime.now(timezone.utc),
...
-            tx.last_modified_at = datetime.utcnow()
+            tx.last_modified_at = datetime.now(timezone.utc)

The same pattern applies to all other datetime.utcnow() calls at lines 583, 617, 651, 685, 744, 745, 817, 827, 891, 1092, 1148.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

574-574: datetime.datetime.utcnow() used

(DTZ003)


583-583: datetime.datetime.utcnow() used

(DTZ003)

🤖 Prompt for AI Agents
In app/states/unified_database.py around lines 554 to 589, the version_count
query followed by creating a new UnifiedTransactionVersion is not atomic (race
can yield duplicate version_number) and uses deprecated datetime.utcnow(); to
fix, perform the version calculation and insert inside the same DB transaction
while acquiring a lock on the related transaction row (e.g., SELECT ... FOR
UPDATE or ORM equivalent) or use a dedicated DB sequence/auto-increment for
version_number so concurrent workers cannot generate the same number; also
replace datetime.utcnow() with datetime.now(timezone.utc) (and apply that
replacement to the other occurrences at lines 583, 617, 651, 685, 744, 745, 817,
827, 891, 1092, 1148) so timestamps are timezone-aware.
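As a sketch of the single-statement alternative (stdlib `sqlite3` only; the table and column names here are invented for illustration and do not match the project's schema), the next version number can be computed inside the INSERT itself, so there is no read-then-write window for a concurrent worker to race:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE transaction_versions (
        transaction_id INTEGER NOT NULL,
        version_number INTEGER NOT NULL,
        changed_at TEXT NOT NULL,
        UNIQUE (transaction_id, version_number)
    )
""")

def add_version(tx_id: int) -> None:
    # The read (MAX) and the write (INSERT) are one atomic statement,
    # and the UNIQUE constraint rejects any duplicate that slips through.
    conn.execute(
        """
        INSERT INTO transaction_versions (transaction_id, version_number, changed_at)
        SELECT ?, COALESCE(MAX(version_number), 0) + 1, ?
        FROM transaction_versions WHERE transaction_id = ?
        """,
        (tx_id, datetime.now(timezone.utc).isoformat(), tx_id),
    )

for _ in range(3):
    add_version(42)
rows = conn.execute(
    "SELECT version_number FROM transaction_versions "
    "WHERE transaction_id = 42 ORDER BY version_number"
).fetchall()
print(rows)  # [(1,), (2,), (3,)]
```

With an ORM, `SELECT ... FOR UPDATE` on the parent transaction row achieves the same serialization without restructuring the insert.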

Comment on lines +613 to +628
version = UnifiedPersonVersion(
    person_id=person.id,
    version_number=version_count + 1,
    data=json.dumps({k: getattr(person, k) for k in person.__fields__.keys()}),
    changed_at=datetime.utcnow(),
    changed_by=user,
    change_reason=reason,
    amendment_details=amendment_details
)
session.add(version)
for k, v in updates.items():
    setattr(person, k, v)
session.add(person)
session.commit()
session.refresh(person)
return person

⚠️ Potential issue | 🟠 Major

Non-serializable data in person version snapshot.

Unlike update_transaction, the update_person method doesn't handle non-JSON-serializable types (e.g., date, datetime, Decimal) before calling json.dumps(). This will raise TypeError if the person model contains such fields.

Apply consistent serialization logic:

-            version = UnifiedPersonVersion(
-                person_id=person.id,
-                version_number=version_count + 1,
-                data=json.dumps({k: getattr(person, k) for k in person.__fields__.keys()}),
+            data_dict = {}
+            for k in person.__fields__.keys():
+                value = getattr(person, k)
+                if isinstance(value, date):
+                    data_dict[k] = value.isoformat()
+                elif isinstance(value, datetime):
+                    data_dict[k] = value.isoformat()
+                elif isinstance(value, Decimal):
+                    data_dict[k] = float(value)
+                else:
+                    data_dict[k] = value
+            version = UnifiedPersonVersion(
+                person_id=person.id,
+                version_number=version_count + 1,
+                data=json.dumps(data_dict),

The same issue exists in update_committee (line 650) and update_address (line 684).

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

617-617: datetime.datetime.utcnow() used

(DTZ003)

🤖 Prompt for AI Agents
In app/states/unified_database.py around lines 613-628, the snapshot JSON for
UnifiedPersonVersion uses json.dumps on model fields directly which will raise
TypeError for non-serializable types (date, datetime, Decimal, UUID, etc.);
update the code to serialize the person dict using the same approach as
update_transaction (use a shared helper or json.dumps(..., default=str) or
explicitly convert known field types to ISO strings/primitive types) so all
values are JSON-serializable before creating the version. Apply the same fix to
the analogous snapshots in update_committee (around line 650) and update_address
(around line 684) so they also use the shared serialization helper or json.dumps
with a default converter.
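A shared converter along the lines the prompt suggests might look like this (a hedged sketch; the helper name and the snapshot fields are illustrative, not the project's actual models):

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_jsonable(value):
    # Shared `default=` converter for version snapshots: covers the types
    # the review calls out (date, datetime, Decimal); anything else must
    # already be JSON-serializable, otherwise json.dumps reports it.
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Unserializable type: {type(value).__name__}")


snapshot = {
    'amount': Decimal('125.50'),
    'filed_on': date(2024, 1, 15),
    'name': 'ACME PAC',
}
payload = json.dumps(snapshot, default=to_jsonable, sort_keys=True)
print(payload)
```

Routing `update_transaction`, `update_person`, `update_committee`, and `update_address` through one such helper keeps the four snapshot paths consistent.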

@coderabbitai coderabbitai bot left a comment


Review continued from previous batch...

Comment on lines +201 to +202
                file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
                compression='lz4')

🧹 Nitpick | 🔵 Trivial

Use timezone-aware datetime.

datetime.now() without a timezone argument produces a naive datetime, which can cause issues in systems with multiple timezones.

+from datetime import datetime, timezone
...
             df.write_parquet(
-                file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
+                file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now(timezone.utc):%Y%m%d}w.parquet",
                 compression='lz4')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
compression='lz4')
file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now(timezone.utc):%Y%m%d}w.parquet",
compression='lz4')
🧰 Tools
🪛 Ruff (0.14.7)

201-201: datetime.datetime.now() called without a tz argument

(DTZ005)

🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around lines 201-202, the code uses
datetime.now() to generate a filename which produces a naive datetime; update
this to a timezone-aware timestamp by using datetime.now(timezone.utc) (or
datetime.now(tz=your_app_tz) if a local zone is required) and ensure you import
timezone from datetime (or use datetime.utcnow().replace(tzinfo=timezone.utc) if
you prefer); replace the current call in the filename construction with the
timezone-aware call so the generated filename includes an aware datetime.
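For reference, the difference between naive and aware stamps is observable directly (the `contribs_` dataset key below is a placeholder, not the downloader's real key):

```python
from datetime import datetime, timezone

naive = datetime.now()
aware = datetime.now(timezone.utc)

# Naive stamps carry no tzinfo; mixing the two in a comparison raises.
try:
    _ = naive < aware
    comparable = True
except TypeError:
    comparable = False

# An aware stamp formats identically in the filename pattern:
filename = f"contribs_{aware:%Y%m%d}w.parquet"
print(comparable, filename)
```

Since the stamp only feeds a `%Y%m%d` filename, the practical risk is low, but standardizing on aware datetimes keeps it consistent with the `utcnow()` fixes elsewhere in this PR.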

- Updated all imports to use app. prefix for proper module resolution
- Fixed app/abcs/*.py imports
- Fixed app/funcs/*.py imports
- Fixed app/states/texas/**/*.py imports
- Fixed app/states/oklahoma/**/*.py imports
- Fixed app/states/ohio/*.py imports
- Fixed app/main.py imports
- Updated test_validation_class.py to use new StateFileValidation API
- Updated test_ingest_hypothesis.py with Hypothesis health check suppression
- Cast pl.lit(None) to String type when adding missing columns to fix
  'type String is incompatible with expected type Null' error
- Add cleanup of leftover parquet files at start of test_texas_consolidate_files
  to prevent flaky test failures from tmp_path reuse between Hypothesis examples
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
app/abcs/abc_category.py (1)

79-94: The validation setter recursively assigns to itself.

In the setter, self.validation = value invokes the same setter again, causing infinite recursion whenever validation is set. Either drop the setter or have it assign to a private backing attribute (e.g. _validation) instead.
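The backing-attribute fix looks like this in miniature (the class name is illustrative, not the actual abc_category API):

```python
class Category:
    def __init__(self):
        self._validation = None

    @property
    def validation(self):
        return self._validation

    @validation.setter
    def validation(self, value):
        # Assign to the private backing attribute; assigning to
        # self.validation here would re-enter this setter forever.
        self._validation = value


c = Category()
c.validation = 'ready'
print(c.validation)
```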

app/states/texas/validators/texas_personname.py (1)

87-101: Guard against missing NamePrefixCd when deriving address prefixes.

If no key containing 'NamePrefixCd' is present, _pfx will be None, and key.startswith(_pfx) will raise a TypeError. While your data likely always includes a prefix field, a defensive check would avoid surprising failures:

-        _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
-        if _pfx:
-            _pfx = _pfx.split('NamePrefixCd')[0]
-        _mail_keys = [key for key in values.keys() if 'Mail' in key and key.startswith(_pfx)]
+        _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
+        if not _pfx:
+            return values
+        _pfx = _pfx.split('NamePrefixCd')[0]
+        _mail_keys = [key for key in values.keys() if 'Mail' in key and key.startswith(_pfx)]

This keeps the happy path untouched while making the validator resilient to missing prefix fields.

app/states/texas/validators/direct_expenditures.py (1)

1-44: Import is fine, but most TECDirectExpenses fields lack type annotations and won’t be modeled.

The new import of TECSettings from app.states.texas.validators.texas_settings is correct, but in TECDirectExpenses only the first few attributes are typed. All of the bare names (e.g. infoOnlyFlag, filerIdent, expendDt, etc.) will not become SQLModel/Pydantic fields and therefore won’t be persisted or validated.

If this model is meant to represent the full direct‑expenses record, you’ll want to add explicit type annotations (and Field(...) where appropriate) for each of those attributes, for example:

-    infoOnlyFlag
-    filerIdent
+    infoOnlyFlag: bool | None = None
+    filerIdent: str | None = None
    # ...and so on for the remaining attributes, with appropriate types...

Adjust the concrete types/defaults to match your TEC schema.

app/states/texas/validators/texas_address.py (1)

83-83: Logic bug: condition always evaluates to True.

The expression if "PO" or "PO BOX" in _address: is incorrect. It's parsed as if ("PO") or ("PO BOX" in _address):, and since "PO" is a non-empty string (truthy), this condition is always True regardless of _address content.

Apply this diff:

-        if "PO" or "PO BOX" in _address:
+        if "PO" in _address or "PO BOX" in _address:

Or more succinctly:

-        if "PO" or "PO BOX" in _address:
+        if any(x in _address for x in ("PO", "PO BOX")):
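A short demonstration of why the original condition can never be False (the helper names are invented for illustration):

```python
def has_po_box_buggy(addr: str) -> bool:
    # Parses as ("PO") or ("PO BOX" in addr): the literal "PO" is a
    # non-empty string and therefore truthy, so this is always True.
    return bool("PO" or "PO BOX" in addr)


def has_po_box_fixed(addr: str) -> bool:
    return any(x in addr for x in ("PO", "PO BOX"))


print(has_po_box_buggy("123 MAIN ST"))  # True (wrong)
print(has_po_box_fixed("123 MAIN ST"))  # False
print(has_po_box_fixed("PO BOX 99"))    # True
```

Note that since "PO BOX" contains "PO", the fixed version still only needs the "PO" membership test; keeping both is harmless but redundant.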
app/states/texas/validators/texas_expenses.py (1)

235-247: Minor: Incorrect field name in error message.

Line 243 references 'column': 'filerName' in the error dict, but this validator is for expendDescr and expendDt fields. The error message should reference the actual field being validated.

     @field_validator('expendDescr', 'expendDt', mode='before')
     @classmethod
-    def validate_required_fields(cls, v):
+    def validate_required_fields(cls, v, info):
         if v == "" or v is None:
             raise PydanticCustomError(
                 'missing_required_value',
                 "Field is required",
                 {
-                    'column': 'filerName',
+                    'column': info.field_name,
                     'value': v
                 }
             )
         return v
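In Pydantic v2, `info.field_name` carries the name of the field currently being validated, so the error context stays accurate for both fields. A minimal sketch (assuming Pydantic v2; the `Expense` model here is illustrative, not the project's full validator):

```python
from pydantic import BaseModel, ValidationError, field_validator
from pydantic_core import PydanticCustomError


class Expense(BaseModel):
    expendDescr: str
    expendDt: str

    @field_validator('expendDescr', 'expendDt', mode='before')
    @classmethod
    def validate_required_fields(cls, v, info):
        if v == "" or v is None:
            raise PydanticCustomError(
                'missing_required_value',
                "Field is required",
                # info.field_name names whichever field is being checked.
                {'column': info.field_name, 'value': v},
            )
        return v


try:
    Expense(expendDescr="", expendDt="20240101")
except ValidationError as e:
    err = e.errors()[0]
    print(err['ctx']['column'])
```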
♻️ Duplicate comments (27)
app/states/texas/validators/texas_personname.py (1)

51-55: field_pfx validator still appears to be a copy‑paste error.

field_pfx is populated by set_type_prefix from a key containing NamePrefixCd, but its BeforeValidator checks for 'NameSuffixCd', which doesn’t match the field’s purpose and duplicates the suffix logic. This is almost certainly unintended and can also interfere if a value is ever passed in.

Recommend removing the BeforeValidator from field_pfx (or changing it to the correct key if you truly want validation there), e.g.:

-    field_pfx: Optional[str] = Field(
-        BeforeValidator(funcs.check_contains_factory('NameSuffixCd')),
-        description="Field prefix"
-    )
+    field_pfx: Optional[str] = Field(
+        default=None,
+        description="Field prefix",
+    )
app/states/texas/validators/texas_flags.py (2)

9-10: Annotate __table_args__ with ClassVar.

Mutable class attributes should be annotated with typing.ClassVar to indicate they are class-level configuration, not instance fields.

-from typing import Optional
+from typing import Optional, ClassVar

 class TECFlagsBase(TECBaseModel):
     __tablename__ = "tx_flags"
-    __table_args__ = {"schema": "texas"}
+    __table_args__: ClassVar[dict] = {"schema": "texas"}

12-14: Type mismatch: BeforeValidator expects str, but field is Optional[bool].

The check_contains_factory returns a validator operating on str values, but primaryUsaPhoneFlag is typed as Optional[bool]. This will cause a runtime error when the validator attempts string operations on a boolean.

Either remove the BeforeValidator if the field should be boolean, or change the type to Optional[str] if it holds string values like "Y"/"N":

-    primaryUsaPhoneFlag: Optional[bool] = Field(
-        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag')),
-        description="Y indicates that the phone is a USA")
+    primaryUsaPhoneFlag: Optional[str] = Field(
+        default=None,
+        description="Y indicates that the phone is a USA")
app/states/texas/validators/texas_settings.py (3)

1-5: Remove unused Field import from dataclasses and unused BaseModel import.

Line 1 imports Field from dataclasses, which is immediately shadowed by the Field import from sqlmodel on line 5. The dataclasses.Field is never used. Additionally, BaseModel from pydantic appears unused.

-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import ConfigDict, model_validator, BeforeValidator
 from sqlmodel import SQLModel, Field

15-20: Duplicate of check_contains_factory in validator_functions.py.

This function is identical to the one in app/funcs/validator_functions.py (lines 165-170). Import it from there instead of redefining it to maintain DRY principles.

-def check_contains_factory(match_string: str):
-    def check_contains(value: str) -> str:
-        if value and match_string not in value:
-            raise ValueError(f"Value must contain '{match_string}'")
-        return value
-    return check_contains
+from app.funcs.validator_functions import check_contains_factory

44-50: Remove BeforeValidator from receivedDt field - type mismatch will cause runtime error.

The validate_dates model validator (mode='before') converts receivedDt from string to date object before field-level validators run. When check_contains_factory subsequently executes, it receives a date object but expects a string, causing a TypeError. The substring-check logic is semantically incorrect for date fields.

-    receivedDt: Optional[date] = Field(BeforeValidator(check_contains_factory('receivedDt')), description="Date report received by TEC")
+    receivedDt: Optional[date] = Field(default=None, description="Date report received by TEC")
app/main.py (2)

71-73: doug_deason is created but not included in donation_list.

Line 71 creates doug_deason filtering for DOUG DEASON contributions, but line 73 only concatenates tim_dunn, farris_wilks, and don_dyer.

Apply this diff to include doug_deason:

-donation_list = pd.concat([tim_dunn, farris_wilks, don_dyer])
+donation_list = pd.concat([tim_dunn, farris_wilks, don_dyer, doug_deason])

83-348: Large block of commented-out code should be removed.

This extensive commented-out code clutters the file and reduces readability. Rely on version control history for reference instead.

app/states/ohio/ohio_downloader.py (1)

21-29: Incomplete implementation with unused parameters.

The overwrite and read_from_temp parameters are declared but never used. The trailing ... on line 29 is unnecessary since there's preceding code. Either implement the logic using these parameters or remove them if not needed.

app/states/texas/validators/texas_address.py (2)

32-32: Type annotation mismatch with default value.

standardized has default=None but is typed as str (non-optional). This will cause validation errors when the field is None.

-    standardized: str = Field(default=None, description="Address standardized")
+    standardized: Optional[str] = Field(default=None, description="Address standardized")

145-151: Nullable primary key fields violate database integrity.

TECPersonAddressLinkModel uses Optional fields with primary_key=True. NULL values in primary key columns prevent SQL from enforcing uniqueness, potentially allowing duplicate rows.

Consider creating separate link tables with non-nullable primary keys, or add a required discriminator column with validation ensuring exactly one optional field is set per row.
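SQLite makes the hazard easy to see, since (unlike PostgreSQL) it tolerates NULLs in primary-key columns unless NOT NULL is explicit, which is exactly the situation nullable PK fields invite (table names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Nullable composite PK: SQLite accepts NULLs, and since NULLs never
# compare equal, the "duplicate" row is not rejected.
conn.execute("""
    CREATE TABLE link_nullable (
        person_id INTEGER,
        address_id INTEGER,
        PRIMARY KEY (person_id, address_id)
    )
""")
conn.execute("INSERT INTO link_nullable VALUES (NULL, 7)")
conn.execute("INSERT INTO link_nullable VALUES (NULL, 7)")
count = conn.execute("SELECT COUNT(*) FROM link_nullable").fetchone()[0]
print(count)  # 2

# Non-nullable PK columns reject the NULL outright.
conn.execute("""
    CREATE TABLE link_strict (
        person_id INTEGER NOT NULL,
        address_id INTEGER NOT NULL,
        PRIMARY KEY (person_id, address_id)
    )
""")
try:
    conn.execute("INSERT INTO link_strict VALUES (NULL, 7)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```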

app/abcs/abc_download.py (3)

20-21: Side effect at module import time.

Starting ProgressTracker at module load (progress.start()) means the live display begins whenever this module is imported, even if no download is happening. This can cause unexpected console output and resource usage.

Consider lazy initialization:

 progress = ProgressTracker()
-progress.start()

Then call progress.start() explicitly in the download() method when needed.


82-83: Progress task not completed for .txt files.

When the file is a .txt, the function returns early on line 83 without calling progress.update_task(_csv_task, "Complete"). This leaves the task in "Started" status indefinitely.

         if file_name.suffix == '.txt':
+            progress.update_task(_csv_task, "Complete")
             return

66-69: Broad exception catch may hide critical errors.

Catching bare Exception and only logging with ic() can hide serious issues like disk full, permission errors, or corrupted archives.

             try:
                 cls._process_csv(zip_ref, file, tmp)
-            except Exception as e:
-                ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+            except (OSError, IOError, pl.exceptions.ComputeError) as e:
+                ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+                raise
app/states/texas/texas_downloader.py (4)

36-36: Consuming the iterator prematurely causes deletion to fail.

_existing_files is an iterator from itertools.chain. It's consumed when checking _safe_to_delete, but the actual deletion at line 83 happens after the download completes. By that time, the iterator is exhausted and no files get deleted.

-        _existing_files = itertools.chain(*map(lambda x: tmp.glob(x), ["*.csv", "*.parquet", "*.txt"]))
+        _existing_files = list(tmp.glob("*.csv")) + list(tmp.glob("*.parquet")) + list(tmp.glob("*.txt"))
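The exhaustion behavior is easy to verify in isolation (a standalone sketch; the file names are arbitrary):

```python
import itertools
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
for name in ("a.csv", "b.parquet"):
    (tmp / name).touch()

# An itertools.chain over glob results is a one-shot iterator:
existing = itertools.chain(*(tmp.glob(p) for p in ("*.csv", "*.parquet", "*.txt")))
first_pass = sum(1 for _ in existing)   # the safety check consumes it here
second_pass = sum(1 for _ in existing)  # nothing left for the deletion loop
print(first_pass, second_pass)  # 2 0

# A concrete list survives repeated iteration:
existing_list = [p for pat in ("*.csv", "*.parquet", "*.txt") for p in tmp.glob(pat)]
print(len(existing_list), len(list(existing_list)))  # 2 2
```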

218-227: _data is reassigned to empty dict, discarding previously collected files.

Line 218-221 populates _data with file mappings, but line 223 immediately reassigns _data = {}, discarding all the collected data. The subsequent loop iterates over the now-empty dictionary, making the entire parquet reading logic a no-op.

             _data = {}
             for file in parquet_files:
                 _type = file.stem.rsplit('_', 1)[0]
                 _data.setdefault(_type, []).append(file)

-            _data = {}
             for k, v in _data.items():

143-149: Redundant file reads during consolidation from mixing the iterator with list slices.

The loop at lines 143-149 iterates v[1:] to collect columns, while line 157 iterates _files. Because next(_files) already consumed the first file at line 132 and the v[1:] loop never touches the iterator, iterating _files at line 157 starts at v[1] — so every file after the first is opened twice, once for column discovery and once for processing.

Use the list directly instead of an iterator:

-                _files = iter(v)
-                # Start with scanning the first file instead of empty DataFrame
-                _first_file = next(_files)
                 df = (
-                    pl.read_parquet(_first_file)
+                    pl.read_parquet(v[0])
                     .with_columns(
-                        pl.lit(_first_file.stem)
+                        pl.lit(v[0].stem)
                         .alias('file_origin')
                     )
                 )
...
-                for _file in _files:
+                for _file in v[1:]:

Also applies to: 157-157


32-33: Unused parameter read_from_temp and boolean positional arguments.

read_from_temp is declared but never used. Additionally, boolean positional arguments can be confusing at call sites.

Either implement the logic using read_from_temp or remove it, and consider keyword-only arguments:

     @classmethod
     def download(
             cls,
-            read_from_temp: bool = True,
-            headless: bool = True
+            *,
+            headless: bool = True
     ) -> Self:
app/states/texas/models.py (3)

1-9: Critical: Remove dataclasses Field import and clean up unused imports.

Line 1 imports Field from dataclasses, but Line 5 redefines it by importing from sqlmodel. The dataclasses import is incorrect and unused. Additionally, several imports appear unused: ConfigDict, model_validator, BaseModel from pydantic (Line 4), SQLModel from sqlmodel (Line 5), tx_funcs (Line 7), and RecordKeyGenerator (Line 9).

Apply this diff to fix the import issues:

-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import BeforeValidator
-from sqlmodel import SQLModel, Field
+from sqlmodel import Field
 from .validators.texas_settings import TECSettings
-import app.states.texas.funcs.tx_validation_funcs as tx_funcs
 import app.funcs.validator_functions as funcs
-from app.funcs.record_keygen import RecordKeyGenerator

15-17: Critical: Incorrect BeforeValidator usage with Field().

BeforeValidator must be used with Annotated type hints in Pydantic v2, not as a positional argument to Field(). This will cause runtime errors.

Apply this pattern to fix:

+from typing import Annotated
...
-    primaryUsaPhoneFlag: Optional[bool] = Field(
-        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag')),
-        description="Y indicates that the phone is a USA")
+    primaryUsaPhoneFlag: Annotated[
+        Optional[bool],
+        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag'))
+    ] = Field(default=None, description="Y indicates that the phone is a USA")

Apply the same fix to Lines 59-62 in TECBaseModel for recordType, formTypeCd, schedFormTypeCd, and receivedDt.
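For reference, the Annotated pattern wires the validator in cleanly (a sketch assuming Pydantic v2; the Y/N-to-bool coercion below is an illustrative guess at the intended flag semantics, not the project's actual validator):

```python
from typing import Annotated, Optional

from pydantic import BaseModel, BeforeValidator


def yn_to_bool(value):
    # Illustrative assumption: TEC flag columns arrive as 'Y'/'N' strings.
    if isinstance(value, str):
        return value.strip().upper() == 'Y'
    return value


class Flags(BaseModel):
    primaryUsaPhoneFlag: Annotated[
        Optional[bool], BeforeValidator(yn_to_bool)
    ] = None


print(Flags(primaryUsaPhoneFlag='Y').primaryUsaPhoneFlag)  # True
print(Flags(primaryUsaPhoneFlag='N').primaryUsaPhoneFlag)  # False
print(Flags().primaryUsaPhoneFlag)                         # None
```

Passing BeforeValidator positionally to Field() instead leaves it stored as the field's default, so it never runs, which is why the Annotated form is required.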


58-64: Major: Potential duplicate TECBaseModel definition.

According to past review comments, TECBaseModel appears to be already defined in app/states/texas/validators/texas_settings.py. The AI summary also references TECBaseModel being introduced in texas_settings.py as part of this PR. This could create code duplication and inconsistency.

Verify whether TECBaseModel already exists in texas_settings.py:

#!/bin/bash
# Check for TECBaseModel definitions across the codebase
rg -n "class TECBaseModel" --type py -C3
app/states/texas/validators/texas_expenses.py (3)

71-135: Remove commented-out code.

This large block of commented payee field definitions (65 lines) should be removed rather than left in the codebase. Git history preserves the old implementation if needed.


159-173: Critical: Field extraction incompatible with BeforeValidator checks in TECPersonNameBase.

The validate_payee validator extracts payee fields by key prefix (k.startswith("payee")), creating a dictionary like {'payeeNameFirst': 'John', 'payeeNameLast': 'Doe'}. It then passes this to TECPersonNameBase(**_payee). However, TECPersonNameBase fields use BeforeValidator(funcs.check_contains_factory(...)) which expect the VALUE to contain substrings like 'NameFirst'. After extraction, _payee['payeeNameFirst'] = "John", but the validator expects "John" to contain 'NameFirst', which will fail.

The check_contains_factory validators are designed to validate raw input keys/metadata, not cleaned extracted values. The validate_name method in TECPersonNameBase (shown in relevant snippets) extracts values by finding keys containing substrings (e.g., 'NameFirst' in key), which works with raw input but conflicts with the BeforeValidator checks.

Consider one of these solutions:

Solution 1 (Recommended): Pass raw values dict directly and let TECPersonNameBase handle extraction:

     @model_validator(mode='before')
     @classmethod
     def validate_payee(cls, values):
-        _payee = {k: v for k, v in values.items() if k.startswith("payee")}
         if not _payee:
             raise PydanticCustomError(
                 'missing_required_value',
                 "Payee data is required",
                 {
                     'column': 'payee',
                     'value': _payee
                 }
             )
-        values['payee'] = TECPersonNameBase(**_payee)
+        # Pass all values; TECPersonNameBase will extract payee fields via validate_name
+        values['payee'] = TECPersonNameBase(**values)
         return values

Solution 2: Remove BeforeValidator constraints from TECPersonNameBase fields that are populated via validate_name.


174-233: Remove commented-out code.

This 60-line block of commented validation logic should be removed. Git history preserves it if needed.

app/states/texas/validators/texas_filers.py (3)

110-110: Critical: Broken back_populates references on relationships.

Lines 145-147 in TECFilerName define relationships with back_populates="campaign" pointing to TECTreasurer, TECAssistantTreasurer, and TECCampaignChair. However, the corresponding campaign relationships are commented out (Lines 110, 119, 124). This will cause SQLAlchemy relationship configuration errors at runtime.

Either uncomment the campaign relationships in the target classes:

 class TECTreasurer(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="treasurer", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="treasurer", link_model=TECFilerIDNameLink)

 class TECAssistantTreasurer(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="asstTreasurer", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="asstTreasurer", link_model=TECFilerIDNameLink)

 class TECCampaignChair(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="chair", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="chair", link_model=TECFilerIDNameLink)

Or remove the back_populates parameters from Lines 145-147.

Also applies to: 119-119, 124-124


150-154: Critical: Type mismatch - assigning TECPersonName object to str field.

The validate_name validator assigns a TECPersonName object to values['filerName'] (Line 153), but filerName is defined as str = Field(...) on Line 135. This will cause validation errors or type mismatches at runtime.

Consider one of these fixes:

Solution 1: Change filerName field type to TECPersonName:

-    filerName: str = Field(..., description="Filer name")
+    filerName: TECPersonName = Field(..., description="Filer name", sa_type=JSON)

Solution 2: Extract string representation before assignment:

     @model_validator(mode='before')
     @classmethod
     def validate_name(cls, values):
-        values['filerName'] = TECPersonName(**values)
+        person_name = TECPersonName(**values)
+        values['filerName'] = person_name.nameFull or f"{person_name.first} {person_name.last}"
         return values

Solution 3: Store in a separate field:

+    filer_person_name: TECPersonName = Field(default=None, description="Filer person name", sa_type=JSON)
     ...
     @model_validator(mode='before')
     @classmethod
     def validate_name(cls, values):
-        values['filerName'] = TECPersonName(**values)
+        values['filer_person_name'] = TECPersonName(**values)
         return values

60-65: Remove hundreds of lines of commented-out code.

This file contains massive blocks of commented-out code:

  • Lines 60-65: Old link model
  • Lines 89-96: Old link model
  • Lines 186-729: 500+ lines of commented class definitions and fields

These should be removed entirely. Git history preserves the old implementations if needed. Keeping this much commented code severely harms readability and maintainability.

Also applies to: 89-96, 186-729

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b78c27 and 1fb1a7c.

📒 Files selected for processing (30)
  • app/abcs/__init__.py (1 hunks)
  • app/abcs/abc_category.py (2 hunks)
  • app/abcs/abc_db_loader.py (1 hunks)
  • app/abcs/abc_download.py (2 hunks)
  • app/abcs/abc_state_config.py (2 hunks)
  • app/abcs/abc_validation.py (1 hunks)
  • app/funcs/db_loader.py (1 hunks)
  • app/main.py (3 hunks)
  • app/states/ohio/ohio_downloader.py (1 hunks)
  • app/states/oklahoma/funcs/ok_validation_funcs.py (1 hunks)
  • app/states/oklahoma/oklahoma.py (1 hunks)
  • app/states/oklahoma/validators/ok_contribution.py (1 hunks)
  • app/states/oklahoma/validators/ok_expenditure.py (1 hunks)
  • app/states/oklahoma/validators/ok_lobbyist_expenditure.py (1 hunks)
  • app/states/oklahoma/validators/ok_settings.py (1 hunks)
  • app/states/texas/__init__.py (2 hunks)
  • app/states/texas/funcs/tx_validation_funcs.py (1 hunks)
  • app/states/texas/models.py (1 hunks)
  • app/states/texas/texas_downloader.py (1 hunks)
  • app/states/texas/validators/direct_expenditures.py (1 hunks)
  • app/states/texas/validators/texas_address.py (3 hunks)
  • app/states/texas/validators/texas_contributions.py (1 hunks)
  • app/states/texas/validators/texas_debtdata.py (1 hunks)
  • app/states/texas/validators/texas_expenses.py (4 hunks)
  • app/states/texas/validators/texas_filers.py (5 hunks)
  • app/states/texas/validators/texas_flags.py (1 hunks)
  • app/states/texas/validators/texas_personname.py (2 hunks)
  • app/states/texas/validators/texas_settings.py (2 hunks)
  • app/tests/test_ingest_hypothesis.py (1 hunks)
  • app/tests/test_validation_class.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (16)
app/states/ohio/ohio_downloader.py (3)
app/abcs/abc_download.py (4)
  • FileDownloaderABC (24-142)
  • download (109-110)
  • consolidate_files (114-115)
  • read (119-120)
app/abcs/abc_state_config.py (4)
  • StateConfig (140-180)
  • CategoryTypes (110-136)
  • CategoryConfig (65-106)
  • CSVReaderConfig (40-42)
app/abcs/abc_category.py (1)
  • read (103-125)
app/funcs/db_loader.py (3)
app/abcs/abc_category.py (1)
  • logger (99-101)
app/abcs/abc_validation.py (1)
  • logger (32-34)
app/logger.py (1)
  • Logger (10-71)
app/abcs/abc_validation.py (4)
app/abcs/abc_validation_errors.py (1)
  • ValidationErrorList (29-95)
app/funcs/validator_functions.py (1)
  • create_record_id (16-27)
app/abcs/abc_category.py (1)
  • logger (99-101)
app/logger.py (1)
  • Logger (10-71)
app/abcs/abc_state_config.py (1)
app/funcs/toml_reader.py (1)
  • read_toml (6-8)
app/states/texas/texas_downloader.py (5)
app/funcs/csv_reader.py (3)
  • FileReader (54-144)
  • read_parquet (108-132)
  • read_folder (82-104)
app/abcs/abc_download.py (6)
  • FileDownloaderABC (24-142)
  • download (109-110)
  • not_headless (36-39)
  • extract_zipfile (62-70)
  • consolidate_files (114-115)
  • read (119-120)
app/abcs/abc_state_config.py (4)
  • StateConfig (140-180)
  • CategoryTypes (110-136)
  • CategoryConfig (65-106)
  • TEMP_FOLDER (149-150)
app/states/ohio/ohio_downloader.py (3)
  • download (21-29)
  • consolidate_files (31-32)
  • read (34-35)
app/live_display.py (2)
  • add_task (34-44)
  • update_task (46-52)
app/tests/test_validation_class.py (1)
app/abcs/abc_validation.py (3)
  • StateFileValidation (24-85)
  • validate_record (36-47)
  • validate (49-55)
app/abcs/abc_db_loader.py (3)
app/abcs/abc_category.py (1)
  • logger (99-101)
app/abcs/abc_validation.py (1)
  • logger (32-34)
app/logger.py (1)
  • Logger (10-71)
app/states/texas/validators/direct_expenditures.py (1)
app/states/texas/validators/texas_settings.py (1)
  • TECSettings (22-41)
app/states/texas/__init__.py (4)
app/logger.py (1)
  • Logger (10-71)
app/states/texas/texas_downloader.py (1)
  • TECDownloader (22-279)
app/states/texas/texas_search.py (1)
  • TexasSearch (165-294)
app/states/texas/validators/texas_filers.py (1)
  • TECFilerName (126-184)
app/states/texas/validators/texas_flags.py (3)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (2)
  • TECBaseModel (44-57)
  • check_contains_factory (15-20)
app/funcs/validator_functions.py (1)
  • check_contains_factory (166-171)
app/abcs/__init__.py (6)
app/abcs/abc_state_config.py (4)
  • CSVReaderConfig (40-42)
  • StateConfig (140-180)
  • CategoryConfig (65-106)
  • CategoryTypes (110-136)
app/abcs/abc_category.py (1)
  • StateCategoryClass (32-169)
app/abcs/abc_validation.py (1)
  • StateFileValidation (24-85)
app/abcs/abc_validation_errors.py (1)
  • ValidationErrorList (29-95)
app/abcs/abc_db_loader.py (1)
  • DBLoaderClass (13-162)
app/abcs/abc_download.py (1)
  • FileDownloaderABC (24-142)
app/states/texas/validators/texas_address.py (3)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (1)
  • TECBaseModel (44-57)
app/funcs/record_keygen.py (1)
  • RecordKeyGenerator (10-56)
app/states/texas/validators/texas_expenses.py (2)
app/states/texas/validators/texas_personname.py (1)
  • TECPersonNameBase (15-107)
app/states/texas/validators/texas_settings.py (1)
  • TECSettings (22-41)
app/main.py (2)
app/states/texas/texas_search.py (1)
  • TexasSearch (165-294)
app/states/texas/texas_downloader.py (2)
  • download (30-98)
  • dataframes (277-279)
app/states/texas/validators/texas_personname.py (4)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
app/states/texas/validators/texas_settings.py (2)
  • TECBaseModel (44-57)
  • check_contains_factory (15-20)
app/states/texas/validators/texas_address.py (1)
  • TECAddressBase (20-142)
app/funcs/validator_functions.py (1)
  • check_contains_factory (166-171)
app/states/texas/validators/texas_settings.py (3)
app/funcs/record_keygen.py (1)
  • RecordKeyGenerator (10-56)
app/funcs/validator_functions.py (2)
  • check_contains_factory (166-171)
  • check_contains (167-170)
app/states/texas/models.py (1)
  • TECBaseModel (58-64)
🪛 Ruff (0.14.7)
app/states/ohio/ohio_downloader.py

21-21: Boolean-typed positional argument in function definition

(FBT001)


21-21: Unused method argument: overwrite

(ARG002)


21-21: Boolean-typed positional argument in function definition

(FBT001)


21-21: Unused method argument: read_from_temp

(ARG002)


29-29: Unnecessary ... literal

Remove unnecessary ...

(PIE790)

app/states/texas/texas_downloader.py

25-25: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


32-32: Boolean-typed positional argument in function definition

(FBT001)


32-32: Boolean default positional argument in function definition

(FBT002)


32-32: Unused class method argument: read_from_temp

(ARG003)


33-33: Boolean-typed positional argument in function definition

(FBT001)


33-33: Boolean default positional argument in function definition

(FBT002)


36-36: Unnecessary map() usage (rewrite using a generator expression)

Replace map() with a generator expression

(C417)


63-63: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


87-87: Unnecessary generator (rewrite as a list comprehension)

Rewrite as a list comprehension

(C400)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


95-95: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


113-113: Avoid specifying long messages outside the exception class

(TRY003)


120-120: Missing return type annotation for classmethod consolidate_files

(ANN206)


147-147: Do not catch blind exception: Exception

(BLE001)


178-178: Do not catch blind exception: Exception

(BLE001)


201-201: datetime.datetime.now() called without a tz argument

(DTZ005)


204-204: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


209-209: Boolean-typed positional argument in function definition

(FBT001)


209-209: Boolean default positional argument in function definition

(FBT002)


217-217: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Do not catch blind exception: Exception

(BLE001)


264-264: Do not catch blind exception: Exception

(BLE001)


269-269: Do not catch blind exception: Exception

(BLE001)


277-277: Missing return type annotation for classmethod dataframes

(ANN206)

app/abcs/abc_download.py

6-6: Import from collections.abc instead: Generator

Import from collections.abc

(UP035)


6-6: typing.Dict is deprecated, use dict instead

(UP035)


30-30: Missing return type annotation for special method __post_init__

(ANN204)


36-36: Missing return type annotation for classmethod not_headless

(ANN206)


62-62: Missing return type annotation for classmethod extract_zipfile

Add return type annotation: None

(ANN206)


68-68: Do not catch blind exception: Exception

(BLE001)


73-73: Missing return type annotation for classmethod _process_csv

Add return type annotation: None

(ANN206)


85-85: datetime.datetime.now() called without a tz argument

(DTZ005)


109-109: Boolean-typed positional argument in function definition

(FBT001)


109-109: Boolean-typed positional argument in function definition

(FBT001)


114-114: Missing return type annotation for classmethod consolidate_files

(ANN206)


119-119: Missing return type annotation for classmethod read

(ANN206)

app/states/texas/validators/texas_flags.py

10-10: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/tests/test_ingest_hypothesis.py

7-7: typing.Dict is deprecated, use dict instead

(UP035)


7-7: typing.List is deprecated, use list instead

(UP035)


7-7: typing.Tuple is deprecated, use tuple instead

(UP035)


122-122: datetime.datetime.fromtimestamp() called without a tz argument

(DTZ006)


144-144: datetime.datetime.fromtimestamp() called without a tz argument

(DTZ006)


174-174: Unused lambda argument: self

(ARG005)


206-206: Loop control variable files not used within loop body

(B007)

app/states/texas/validators/texas_address.py

147-147: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/states/texas/models.py

5-5: Redefinition of unused Field from line 1

(F811)


13-13: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

app/states/texas/validators/texas_expenses.py

161-161: Missing return type annotation for classmethod validate_payee

(ANN206)

app/states/texas/validators/texas_filers.py

23-23: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


30-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


46-46: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


53-53: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


68-68: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


84-84: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


84-84: Missing type annotation for **data

(ANN003)


99-99: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


114-114: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


123-123: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


128-128: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


152-152: Missing return type annotation for classmethod validate_name

(ANN206)


158-158: Missing return type annotation for classmethod validate_addresses

(ANN206)


167-167: Missing return type annotation for classmethod validate_treasurer

(ANN206)


174-174: Missing return type annotation for classmethod validate_assistant_treasurer

(ANN206)


181-181: Missing return type annotation for classmethod validate_chair

(ANN206)

app/states/texas/validators/texas_settings.py

5-5: Redefinition of unused Field from line 1

(F811)


18-18: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (25)
app/abcs/abc_db_loader.py (1)

4-4: LGTM: Import path standardization.

The update to absolute import (from app.logger import Logger) improves consistency across the codebase and aligns with the project-wide refactor to app-level imports.

app/states/oklahoma/validators/ok_lobbyist_expenditure.py (1)

7-7: LGTM: Import path standardization.

The import path update to app.funcs.validator_functions aligns with the project-wide migration to absolute imports and is consistent with other Oklahoma validator modules.

app/states/texas/funcs/tx_validation_funcs.py (1)

2-2: LGTM: Import path standardization.

The update to app.funcs.validator_functions maintains consistency with the project-wide refactor to absolute imports.

app/abcs/abc_state_config.py (2)

15-15: LGTM: Import path standardization.

The import path update to app.funcs aligns with the project-wide migration to absolute imports.


154-158: LGTM: Formatting improvement.

The multi-line formatting of the return statement improves readability without changing functionality.

app/states/texas/validators/texas_contributions.py (1)

7-7: LGTM: Import path standardization.

The import path update to app.states.texas.funcs.tx_validation_funcs maintains consistency with the project-wide refactor to absolute imports.

app/states/oklahoma/validators/ok_settings.py (1)

3-3: LGTM: Import path standardization.

The import path update to app.funcs.validator_functions aligns with other Oklahoma validator modules and the project-wide migration to absolute imports.

app/funcs/db_loader.py (1)

3-3: LGTM: Import path standardization.

The update to absolute import (from app.logger import Logger) maintains consistency with the project-wide refactor to app-level imports.

app/states/oklahoma/validators/ok_contribution.py (1)

7-7: LGTM: Import path standardization.

The import path update to app.funcs.validator_functions completes the consistent migration to absolute imports across Oklahoma validator modules.

app/states/oklahoma/validators/ok_expenditure.py (1)

6-6: Import path refactor looks correct and consistent.

Switching to app.funcs.validator_functions keeps existing usages (validate_date, person_name_parser) intact and aligns with the new app‑scoped import style.

app/states/oklahoma/funcs/ok_validation_funcs.py (1)

1-1: OK to shift validator_functions to app‑scoped import.

This keeps ok_date_validation behavior the same while matching the new app.funcs namespace.

app/states/texas/validators/texas_debtdata.py (1)

7-8: Texas validator import realignment looks good.

Using app.funcs.validator_functions and app.states.texas.funcs.tx_validation_funcs keeps the existing DebtData validators wired through the new app‑level paths.

app/abcs/abc_category.py (1)

6-20: App‑scoped ABC imports wired correctly.

The shift to app.abcs.*, app.logger, and app.funcs keeps existing usages (StateFileValidation, DBLoaderClass, FileReader, logging) consistent with the new package layout.

app/states/oklahoma/oklahoma.py (1)

5-9: Oklahoma configuration imports correctly re‑namespaced.

Using app.funcs, app.logger, app.abcs and a relative .validators keeps the Oklahoma configuration self‑contained under app.states.oklahoma and consistent with the rest of the project.

app/tests/test_ingest_hypothesis.py (1)

26-145: GenericFileReader CSV/Parquet property‑based tests look solid.

The schema builder + _expected_records_from_schema pattern, together with Hypothesis strategies, gives good coverage of type conversion and metadata (file_origin, download_date) for both CSV and Parquet inputs. The assertions align well with the expected schema behavior.

app/abcs/__init__.py (1)

2-7: Import path updates and new progress export look good.

The import paths are consistently updated to use the app.abcs.* prefix, aligning with the project-wide module structure changes. The addition of progress to the public exports makes the ProgressTracker instance accessible to downstream modules.

app/abcs/abc_validation.py (1)

9-11: Import path updates are consistent.

The import paths are correctly updated to use the app.* prefix, consistent with the project-wide module structure changes.

app/states/texas/__init__.py (2)

2-7: Import path updates look good.

The import paths are correctly updated to use app.* prefixes and relative imports where appropriate. The addition of TexasSearch import aligns with the module's expanded functionality.


32-34: Unable to locate the file or validators referenced in the review comment.

After comprehensive searching, I could not find the file app/states/texas/__init__.py or any references to TECFiler/TECFilerName validators in the repository. The verification cannot be completed because:

  1. The file path app/states/texas/__init__.py does not appear to exist in the repository
  2. No search results returned any definitions or usages of TECFiler or TECFilerName validators
  3. The validators module cannot be located to confirm the validator structure

Recommendation: Verify that the file path is correct and the validators module exists in the codebase before proceeding with this review comment. If this is a newly added file or the path has changed, please provide the correct location.

app/states/texas/validators/texas_flags.py (1)

15-53: Remaining boolean flag fields look correct.

The other flag fields are properly defined as Optional[bool] with default=None and descriptive documentation. Note that the file is missing a trailing newline.

app/tests/test_validation_class.py (3)

7-10: MockModel definition is appropriate for testing.

The MockModel correctly extends SQLModel with an optional id field and required field, suitable for testing the validation logic.


14-22: Test correctly validates single record handling.

The test properly instantiates StateFileValidation with the new validator_to_use parameter and correctly asserts the tuple structure of passed/failed results.


25-36: Test correctly handles generator-based validation.

The test properly iterates over the generator returned by validate() and validates each (status, record) tuple, aligning with the updated API.

app/states/ohio/ohio_downloader.py (1)

18-35: Method signatures don't match the ABC abstract methods.

The ABC FileDownloaderABC defines download, consolidate_files, and read as @classmethod, but OhioDownloader implements them as instance methods. This violates the abstract interface contract and will cause issues when called as classmethods.

Apply this diff to align with the ABC:

 class OhioDownloader(FileDownloaderABC):
     config: StateConfig = OHIO_CONFIGURATION

-    def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+    @classmethod
+    def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
-        _driver = self.driver.create_driver()
-        _wait = self.driver
+        _driver = cls.driver.create_driver()
+        _wait = cls.driver
         _driver.get(OHIO_ENTRY_URL)
         _wait.wait_until_clickable(By.LINK_TEXT, "FTP Site")
         _ftp_site_url = _driver.find_element(By.LINK_TEXT, "FTP Site").get_attribute("href")
         _driver.get(_ftp_site_url)
-
-        ...

-    def consolidate_files(self):
+    @classmethod
+    def consolidate_files(cls):
         ...

-    def read(self):
+    @classmethod
+    def read(cls):
         ...

Likely an incorrect or invalid review comment.
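For reference, the classmethod-abstract pattern the diff above moves toward can be sketched as follows (class and method names follow the review; the bodies are stubs, not the real Selenium logic):

```python
from abc import ABC, abstractmethod


class FileDownloaderABC(ABC):
    @classmethod
    @abstractmethod
    def download(cls, overwrite: bool, read_from_temp: bool) -> "FileDownloaderABC":
        """Subclasses must provide a classmethod with this signature."""


class OhioDownloader(FileDownloaderABC):
    @classmethod
    def download(cls, overwrite: bool, read_from_temp: bool) -> "FileDownloaderABC":
        # A real implementation would drive the browser here; stubbed out.
        return cls()


# Callable directly on the class, matching how the ABC is consumed elsewhere.
downloader = OhioDownloader.download(overwrite=False, read_from_temp=False)
```

Note that Python does not enforce that an override is also a classmethod, so keeping the decorator and signature aligned in subclasses is a convention the ABC's callers rely on.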

app/states/texas/validators/texas_filers.py (1)

165-184: Code snippet does not exist in the repository.

The review comment references lines 165-184 in app/states/texas/validators/texas_filers.py, but the file contains only 149 lines. The validators validate_treasurer, validate_assistant_treasurer, and validate_chair with the shown implementation do not exist in the codebase. Additionally, the referenced classes TECTreasurer, TECAssistantTreasurer, and TECCampaignChair are not defined or imported anywhere in the repository.

This appears to be a review comment for a different version or branch of the code than what is currently present.

Likely an incorrect or invalid review comment.

Comment on lines +77 to +80
donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna('') \
    .where(donation_campaign_expenses['payeeNameOrganization'].notna(),
           (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
            donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip())

🧹 Nitpick | 🔵 Trivial

Chained .where() with .fillna() is confusing and potentially incorrect.

The logic here is convoluted. You use .fillna('') then .where() with the same notna() condition, which is redundant since fillna('') already handles NaN values. The intent seems to be: use organization name if present, otherwise concatenate first + last name.

Consider using a clearer approach:

-donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna('') \
-    .where(donation_campaign_expenses['payeeNameOrganization'].notna(),
-           (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
-            donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip())
+donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna(
+    (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
+     donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip()
+)
🤖 Prompt for AI Agents
In app/main.py around lines 77 to 80, the chained use of .fillna('') then
.where(...notna()) is redundant and confusing; replace it with a clear
conditional that picks payeeNameOrganization when it is non-empty/non-null,
otherwise concatenates payeeNameFirst + ' ' + payeeNameLast (ensuring first/last
are filled with '' before concatenation and the result is stripped). Implement
this by using a single np.where (or Series.where with a boolean mask) that
checks for organization presence (notna and not empty) and otherwise uses the
concatenated first+last name after fillna and strip; remove the redundant
fillna->where chaining.
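The single-mask approach described in the prompt above can be sketched with np.where (column names are taken from the diff; the sample data is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "payeeNameOrganization": ["Acme PAC", None, ""],
    "payeeNameFirst": [None, "Jane", "John"],
    "payeeNameLast": [None, "Doe", "Smith"],
})

org = df["payeeNameOrganization"]
# Fallback: concatenated first + last name, with NaNs blanked and whitespace stripped.
person = (df["payeeNameFirst"].fillna("") + " " + df["payeeNameLast"].fillna("")).str.strip()
# Single boolean mask: organization must be present and non-empty.
df["new_PayeeName"] = np.where(org.notna() & (org != ""), org, person)
```

This expresses the intent (organization name if present, otherwise person name) in one conditional rather than a redundant fillna/where chain.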


class TECFlags(TECSettings, table=True):
    __tablename__ = "tx_flags"
    __table_args__ = {"schema": "texas"}

🧹 Nitpick | 🔵 Trivial

Minor: Annotate table_args with ClassVar.

SQLAlchemy's __table_args__ is a class-level configuration attribute and should be annotated with typing.ClassVar to indicate it's not an instance attribute.

+from typing import ClassVar
...
-    __table_args__ = {"schema": "texas"}
+    __table_args__: ClassVar[dict] = {"schema": "texas"}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    __table_args__ = {"schema": "texas"}
+from typing import ClassVar
+    __table_args__: ClassVar[dict] = {"schema": "texas"}
🧰 Tools
🪛 Ruff (0.14.7)

13-13: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In app/states/texas/models.py around line 13, __table_args__ is defined as a
class-level SQLAlchemy config but lacks a typing.ClassVar annotation; update the
declaration to use typing.ClassVar (e.g., ClassVar[dict] or
ClassVar[Mapping[str, Any]]) and add the corresponding import from typing at the
top of the file so the attribute is clearly marked as a class variable rather
than an instance attribute.

@classmethod
def dataframes(cls):
    _files = cls._create_file_type_dict()
    return {k: pl.scan_parquet(v) for k, v in _files.items()}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does polars scan_parquet accept a list of files

💡 Result:

Short answer: No — polars.scan_parquet does not accept a Python list of file paths. It expects a single path/string (a file, directory, or a glob pattern) and can read multiple files via a directory or glob, or by using read_parquet which accepts multiple paths. [1][2]

References

  • scan_parquet signature and "Lazily read ... file or multiple files via glob patterns." [1]
  • Detailed scan_parquet params and behavior (source is a path/str; use glob or directory to read multiple files). [2]

🏁 Script executed:

# Find the file and examine context around line 279
cd app/states/texas && head -290 texas_downloader.py | tail -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 892


🏁 Script executed:

# Look at the _files variable and understand its structure
cd app/states/texas && grep -n "_files" texas_downloader.py | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1034


🏁 Script executed:

cd app/states/texas && grep -n "_create_file_type_dict" texas_downloader.py -A 30 | head -50

Repository: Abstract-Data/campaignfinance-2023

Length of output: 2444


pl.scan_parquet does not accept a list of paths—it only accepts a single path, directory, or glob pattern.

The _create_file_type_dict() method returns values as lists of file paths (line 114: _file_type_dict.setdefault(parts, []).append(file)), but pl.scan_parquet(v) at line 279 will fail when v is a list. Use a glob pattern or convert to a string path, or iterate through files individually like the consolidate_files() method does (lines 133–146).

🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around line 279, pl.scan_parquet is
being called with v which is a list of file paths but scan_parquet accepts a
single path/directory/glob; replace the single call with logic that handles
lists: either convert the list to an appropriate glob/string path if files share
a directory/pattern, or iterate the list and call pl.scan_parquet for each file
and then combine the resulting LazyFrames (mirroring the consolidate_files()
approach at lines 133–146) so the function returns a mapping of keys to a single
concatenated LazyFrame per key rather than attempting to pass a Python list into
pl.scan_parquet.



-class TECExpense(TECSettings, table=True):
+class TECExpense(TECSettings):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Get context around the class declaration
cat -n app/states/texas/validators/texas_expenses.py | head -30

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1413


🏁 Script executed:

# Search for TECExpense usage in the codebase
rg "TECExpense" --type py -B2 -A2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1646


🏁 Script executed:

# Check what TECSettings looks like
rg "class TECSettings" --type py -B2 -A10

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1046


🏁 Script executed:

# Check if tx_expenses table is referenced anywhere else
rg "tx_expenses" --type py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 157


🏁 Script executed:

# Check other validators to see if they have table=True
rg "class TEC\w+\(TECSettings.*\)" --type py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 826


Remove dead code: __tablename__ and __table_args__ have no effect without table=True.

The TECExpense class lost its table=True parameter (line 18), which means it is no longer a SQLModel table. However, lines 19-20 still define __tablename__ and __table_args__, which are only meaningful for table models in SQLModel and now serve no purpose. Either restore table=True if TECExpense should remain an ORM table, or remove these two attributes to clean up dead code.

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_expenses.py around line 18, the TECExpense
class no longer declares table=True so its following __tablename__ and
__table_args__ are dead; either restore table=True on the TECExpense class
definition if it is intended to be a SQLModel ORM table, or remove the
__tablename__ and __table_args__ attributes entirely to eliminate dead code;
make the change consistent with other models (if you restore table=True, ensure
imports and any DB metadata usage remain correct; if you remove them, run
tests/lint to confirm no references rely on those attributes).

Comment on lines +153 to +157
# clear_blank_strings = model_validator(mode='before')(funcs.clear_blank_strings)
# check_dates = model_validator(mode='before')(tx_funcs.validate_dates)
# check_zipcodes = model_validator(mode='before')(tx_funcs.check_zipcodes)
# address_formatting = model_validator(mode='before')(tx_funcs.address_formatting)
# phone_number_validation = model_validator(mode='before')(tx_funcs.phone_number_validation)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "clear_blank_strings|check_dates|check_zipcodes|phone_number_validation|address_formatting" app/states/texas/validators/texas_settings.py -A2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 534


🏁 Script executed:

rg -n "class TECExpense" app/states/texas/validators/texas_expenses.py -A5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 674


Commented-out validators are redundant and may mask inheritance issues.

These validators are inherited from TECSettings (via class inheritance on line 18), so commenting them out here is redundant. However, the situation is more nuanced:

  • clear_blank_strings and check_dates are active in TECSettings (lines 32-33), so commenting them in TECExpense serves no purpose and could confuse maintainers.
  • phone_number_validation in TECExpense corresponds to check_phone_numbers in TECSettings (line 35, active), with a naming mismatch that adds confusion.
  • check_zipcodes and address_formatting are also commented in TECSettings (lines 34, 36), making these redundant comments match the parent class state.

If duplicate execution is a concern, remove these comments entirely and rely on inheritance. If they're disabled intentionally, document why in a comment explaining the specific reason (e.g., "disabled due to performance concerns").

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_expenses.py around lines 153 to 157,
commented-out model validators duplicate or conflict with validators inherited
from TECSettings and create confusion (and a naming mismatch for phone
validator). Remove these commented validator lines so the class relies on
inherited validators, or if you intentionally want to disable or rename any
validator, add a concise inline comment explaining the reason (e.g., "disabled
due to performance" or "renamed to check_phone_numbers in parent"); ensure names
match the parent (phone_number_validation -> check_phone_numbers) if you
reintroduce them.

    __tablename__ = "tx_person_names"
    __table_args__ = {"schema": "texas"}
    addresses: list["TECAddress"] = Relationship(back_populates="person_name", link_model=TECPersonAddressLinkModel)
    # filer_name: list["TECFilerName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink)

⚠️ Potential issue | 🟠 Major

Major: Commented-out filer_name relationship may cause issues.

Line 48 has a commented-out relationship: # filer_name: list["TECFilerName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink). However, Line 144 in TECFilerName references this with filer_name: Optional["TECPersonName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink). This asymmetry will cause SQLAlchemy relationship configuration errors.

Either uncomment and fix Line 48, or remove/update the relationship on Line 144.

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around line 48, the filer_name
relationship is commented out which creates an asymmetric relationship because
TECFilerName (line 144) declares filer_name with back_populates and
link_model=TECFilerIDNameLink; restore symmetry by uncommenting and correcting
the relationship on line 48 so the attribute name and back_populates value
exactly match the counterpart on TECFilerName, use the proper typing (e.g.,
list["TECFilerName"] or Optional depending on cardinality) and include
link_model=TECFilerIDNameLink and back_populates="filer_name"; alternatively, if
you intend a different cardinality, update TECFilerName’s back_populates to
match the relationship name used here and ensure forward refs/imports are
present.

Comment on lines +84 to +86
def __init__(self, **data):
super().__init__(**data)
self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)

🧹 Nitpick | 🔵 Trivial

Add type annotations to init method.

The __init__ method is missing return type annotation (-> None) and parameter type annotation for **data.

-    def __init__(self, **data):
+    def __init__(self, **data: Any) -> None:
         super().__init__(**data)
         self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)

Add to imports:

 from typing import Optional
+from typing import Any
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    def __init__(self, **data):
+    def __init__(self, **data: Any) -> None:
         super().__init__(**data)
         self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)
🧰 Tools
🪛 Ruff (0.14.7)

84-84: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


84-84: Missing type annotation for **data

(ANN003)

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around lines 84 to 86, the
__init__ method lacks type annotations; change its signature to include
parameter and return types (e.g., def __init__(self, **data: Any) -> None:) and
update imports to include Any from typing (add "from typing import Any"). Ensure
the rest of the method remains unchanged and type-checks under these
annotations.

filer_names: Optional["TECFilerID"] = Relationship(back_populates="other_names", link_model=TECFilerIDNameLink)
@model_validator(mode='before')
@classmethod
def validate_name(cls, values):

🧹 Nitpick | 🔵 Trivial

Add return type annotations to validators.

All the model_validator classmethods are missing return type annotations. While this is a minor issue, adding -> dict or -> Any would improve code clarity and satisfy type checkers.

     @model_validator(mode='before')
     @classmethod
-    def validate_name(cls, values):
+    def validate_name(cls, values: dict) -> dict:
         ...

Apply the same pattern to all validators in this file.

Also applies to: 158-158, 167-167, 174-174, 181-181

🧰 Tools
🪛 Ruff (0.14.7)

152-152: Missing return type annotation for classmethod validate_name

(ANN206)

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around lines 152, 158, 167, 174
and 181, the model_validator classmethods (e.g., validate_name and the other
validators) lack return type annotations; update each validator signature to
include an explicit return type (-> dict or -> Any) consistent across the file,
import Any from typing if using Any, and run type checks to ensure signatures
match their actual return values.

Comment on lines +51 to +57
# flags: Optional[TECFlags] = Field(default=None, description="Flags")
#
# def __init__(self, **data):
# _flags = TECFlags(**data)
# if len(_flags.model_dump(exclude_none=True).values()) > 0:
# self.flags = _flags
# super().__init__(**data)

🧹 Nitpick | 🔵 Trivial

Remove or address commented-out code.

The commented-out flags field and __init__ method represent incomplete functionality. Either implement this feature or remove the dead code to avoid confusion.

🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 51-57, remove the
commented-out flags field and __init__ or implement them properly; either delete
the dead code to avoid confusion, or restore and implement the flags feature by
adding a typed Optional[TECFlags] Field and an initializer that constructs
TECFlags from incoming data only when flag values are present (using Pydantic v2
model validation/creation patterns and model_dump/exclude_none to detect
values), then call super().__init__(**data); ensure imports and typing match the
project Pydantic version.

import datetime as dt
from decimal import Decimal
from pathlib import Path
from typing import Dict, List, Tuple

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

# First, let's check GenericFileReader's handling of download_date and timezone
rg -n "download_date|fromtimestamp" app/ingest/file_reader.py -A 3 -B 1

Repository: Abstract-Data/campaignfinance-2023

Length of output: 143


🏁 Script executed:

# Now let's examine the test file at the specific lines mentioned
cat -n app/tests/test_ingest_hypothesis.py | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 819


🏁 Script executed:

# Check lines 122-123, 144-145, 174-175 in the test file
sed -n '120,125p; 142,147p; 172,177p' app/tests/test_ingest_hypothesis.py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1026


🏁 Script executed:

# Also check the Python version requirement
fd -t f "setup.py|setup.cfg|pyproject.toml|.python-version" -x head -20 {} \;

Repository: Abstract-Data/campaignfinance-2023

Length of output: 576


🏁 Script executed:

# Search for GenericFileReader implementation across all files
find . -name "*.py" -type f -exec grep -l "class GenericFileReader\|download_date" {} \;

Repository: Abstract-Data/campaignfinance-2023

Length of output: 854


🏁 Script executed:

# Get more context around line 174
sed -n '168,180p' app/tests/test_ingest_hypothesis.py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 684


🏁 Script executed:

# Check if there are any datetime timezone usages in GenericFileReader or related files
find . -name "*.py" -type f -exec grep -l "timezone\|UTC\|utc" {} \; | grep -E "(file_reader|ingest)" | head -5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


🏁 Script executed:

# Try to find if GenericFileReader is defined in app directory structure
find ./app -name "*.py" -type f | xargs grep -l "class GenericFileReader"

Repository: Abstract-Data/campaignfinance-2023

Length of output: 62


Address minor Ruff hints: built‑in generics, tz‑aware datetime, and unused lambda arg.

  • Line 7: Switch to built‑in generics for annotations (UP035): dict[str, object], list[str], tuple[...] instead of Dict, List, Tuple.
  • Lines 122 & 144: Add timezone information to dt.datetime.fromtimestamp() (DTZ006), e.g., dt.datetime.fromtimestamp(file_path.stat().st_mtime, tz=dt.timezone.utc) to make tests explicit with tz-aware datetimes rather than naive local times.
  • Line 174: Change lambda parameter from self to _ (ARG005): property(lambda _: tmp_path) since the parameter is unused.

These are optional cleanups; functionality is otherwise fine.

🧰 Tools
🪛 Ruff (0.14.7)

7-7: typing.Dict is deprecated, use dict instead

(UP035)


7-7: typing.List is deprecated, use list instead

(UP035)


7-7: typing.Tuple is deprecated, use tuple instead

(UP035)

🤖 Prompt for AI Agents
In app/tests/test_ingest_hypothesis.py around lines 7, 122, 144 and 174, address
Ruff hints by switching typing imports to built-in generics on line 7 (use
dict[str, object], list[str], tuple[...] instead of Dict, List, Tuple), make
timestamps tz-aware on lines 122 and 144 by passing a timezone (e.g.,
tz=dt.timezone.utc) to dt.datetime.fromtimestamp(...), and rename the unused
lambda parameter on line 174 from self to _ (property(lambda _: tmp_path)) to
indicate it's unused.
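The timezone point in particular is worth making concrete: a naive `fromtimestamp()` result depends on the local timezone of whatever machine runs the tests, while the tz-aware form is reproducible everywhere. A minimal sketch:

```python
import datetime as dt
import tempfile
from pathlib import Path

# Create a throwaway file so we have a real mtime to convert.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    file_path = Path(handle.name)

# Naive: interpreted in the local timezone, tzinfo is None.
naive = dt.datetime.fromtimestamp(file_path.stat().st_mtime)
# Aware: pinned to UTC regardless of where the test runs.
aware = dt.datetime.fromtimestamp(file_path.stat().st_mtime, tz=dt.timezone.utc)

assert naive.tzinfo is None
assert aware.tzinfo is dt.timezone.utc
file_path.unlink()
```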

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (10)
app/states/texas/texas_downloader.py (7)

32-33: Unused parameter flagged in previous review.

The read_from_temp parameter is never used in the method body and should be removed as indicated in the previous review.


36-36: Iterator consumption issue flagged in previous review.

The iterator _existing_files will be exhausted when checking _safe_to_delete and cannot be reused for deletion later, as detailed in the previous review.


80-84: Deletion logic issue flagged in previous review.

The _existing_files iterator is exhausted by this point and the deletion will be a no-op, as explained in the previous review comment.


143-180: File reading inefficiency flagged in previous review.

The iterator consumption pattern causes files to be read multiple times, as detailed in the previous review.


201-202: Timezone-aware datetime recommended in previous review.

Using datetime.now() without a timezone argument was flagged in the previous review.


276-279: pl.scan_parquet does not accept a list, as flagged in previous review.

The _create_file_type_dict() method returns lists of file paths, but pl.scan_parquet(v) at line 279 expects a single path, directory, or glob pattern, not a list. This was detailed in the previous review.


208-274: Critical bug: _data dictionary is cleared before use.

Line 218 initializes _data = {}, lines 219-221 populate it with file mappings, but line 223 immediately reassigns _data = {}, discarding all collected data. The subsequent loop at line 224 iterates over an empty dictionary, making the entire read() method non-functional.

Apply this diff to remove the redundant reinitialization:

             _data = {}
             for file in parquet_files:
                 _type = file.stem.rsplit('_', 1)[0]
                 _data.setdefault(_type, []).append(file)

-            _data = {}
             for k, v in _data.items():
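With the reassignment removed, the grouping step works as intended. A standalone sketch of the same pattern (the file names are illustrative):

```python
from pathlib import Path

# Hypothetical consolidated file names like those produced by the downloader.
parquet_files = [
    Path("contribs_2023.parquet"),
    Path("contribs_2024.parquet"),
    Path("expend_2023.parquet"),
]

# Group files by the category prefix before the final underscore, as the
# corrected read() loop does.
data: dict[str, list[Path]] = {}
for file in parquet_files:
    category = file.stem.rsplit("_", 1)[0]
    data.setdefault(category, []).append(file)

assert sorted(data) == ["contribs", "expend"]
assert len(data["contribs"]) == 2
```

Reassigning `data = {}` between the grouping loop and the consumption loop, as the buggy version does, would leave the second loop iterating over nothing.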
app/tests/test_ingest_hypothesis.py (3)

7-7: Deprecated typing imports flagged in previous review.

The previous review recommended switching to built-in generics (dict, list, tuple) instead of typing.Dict, typing.List, typing.Tuple.


122-122: Timezone-aware datetime recommended in previous review.

The previous review flagged dt.datetime.fromtimestamp() calls without a tz argument at lines 122 and 144.

Also applies to: 144-144


174-174: Unused lambda parameter flagged in previous review.

The previous review noted that the lambda parameter should be renamed from self to _ to indicate it's unused.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fb1a7c and 8037b27.

📒 Files selected for processing (2)
  • app/states/texas/texas_downloader.py (1 hunks)
  • app/tests/test_ingest_hypothesis.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
app/states/texas/texas_downloader.py (3)
app/funcs/csv_reader.py (3)
  • FileReader (54-144)
  • read_parquet (108-132)
  • read_folder (82-104)
app/abcs/abc_download.py (6)
  • FileDownloaderABC (24-142)
  • download (109-110)
  • not_headless (36-39)
  • extract_zipfile (62-70)
  • consolidate_files (114-115)
  • read (119-120)
app/live_display.py (2)
  • add_task (34-44)
  • update_task (46-52)
app/tests/test_ingest_hypothesis.py (4)
app/abcs/abc_state_config.py (1)
  • StateConfig (140-180)
app/states/texas/texas_downloader.py (1)
  • consolidate_files (120-206)
app/abcs/abc_download.py (1)
  • consolidate_files (114-115)
app/states/ohio/ohio_downloader.py (1)
  • consolidate_files (31-32)
🪛 Ruff (0.14.7)
app/states/texas/texas_downloader.py

25-25: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


32-32: Boolean-typed positional argument in function definition

(FBT001)


32-32: Boolean default positional argument in function definition

(FBT002)


32-32: Unused class method argument: read_from_temp

(ARG003)


33-33: Boolean-typed positional argument in function definition

(FBT001)


33-33: Boolean default positional argument in function definition

(FBT002)


36-36: Unnecessary map() usage (rewrite using a generator expression)

Replace map() with a generator expression

(C417)


63-63: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


87-87: Unnecessary generator (rewrite as a list comprehension)

Rewrite as a list comprehension

(C400)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


95-95: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


113-113: Avoid specifying long messages outside the exception class

(TRY003)


120-120: Missing return type annotation for classmethod consolidate_files

(ANN206)


147-147: Do not catch blind exception: Exception

(BLE001)


178-178: Do not catch blind exception: Exception

(BLE001)


201-201: datetime.datetime.now() called without a tz argument

(DTZ005)


204-204: Unnecessary map() usage (rewrite using a list comprehension)

Replace map() with a list comprehension

(C417)


209-209: Boolean-typed positional argument in function definition

(FBT001)


209-209: Boolean default positional argument in function definition

(FBT002)


217-217: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Do not catch blind exception: Exception

(BLE001)


264-264: Do not catch blind exception: Exception

(BLE001)


269-269: Do not catch blind exception: Exception

(BLE001)


277-277: Missing return type annotation for classmethod dataframes

(ANN206)

app/tests/test_ingest_hypothesis.py

7-7: typing.Dict is deprecated, use dict instead

(UP035)


7-7: typing.List is deprecated, use list instead

(UP035)


7-7: typing.Tuple is deprecated, use tuple instead

(UP035)


122-122: datetime.datetime.fromtimestamp() called without a tz argument

(DTZ006)


144-144: datetime.datetime.fromtimestamp() called without a tz argument

(DTZ006)


174-174: Unused lambda argument: self

(ARG005)


211-211: Loop control variable files not used within loop body

(B007)

🔇 Additional comments (1)
app/tests/test_ingest_hypothesis.py (1)

1-224: LGTM: Comprehensive hypothesis-driven tests for ingestion and consolidation.

The test suite effectively validates:

  • GenericFileReader for both CSV and Parquet formats with schema mapping and type conversions
  • Metadata augmentation (file_origin, download_date)
  • Texas consolidation workflow with dynamic schema alignment and file lifecycle management

The property-based testing approach with Hypothesis provides strong coverage of edge cases and various input combinations.

Comment on lines +39 to +40
if not headless or not cls.driver.headless:
cls.not_headless()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /repo && cat -n app/states/texas/texas_downloader.py | head -80

Repository: Abstract-Data/campaignfinance-2023

Length of output: 133


🏁 Script executed:

cd /repo && rg -n "not_headless" --type=py -A 3 -B 3

Repository: Abstract-Data/campaignfinance-2023

Length of output: 133


🏁 Script executed:

cd /repo && rg -n "def not_headless|driver\.headless\s*=" --type=py -B 2 -A 5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 133


🏁 Script executed:

pwd && git ls-files | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 529


🏁 Script executed:

find . -name "texas_downloader.py" -type f 2>/dev/null

Repository: Abstract-Data/campaignfinance-2023

Length of output: 116


🏁 Script executed:

cat -n app/states/texas/texas_downloader.py | head -80

Repository: Abstract-Data/campaignfinance-2023

Length of output: 3225


🏁 Script executed:

rg -n "not_headless|def not_headless" --type=py -B 3 -A 5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1239


🏁 Script executed:

rg -n "driver\.headless" --type=py -B 2 -A 2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 647


🏁 Script executed:

rg -n "class CreateWebDriver|def __init__" app/abcs/abc_download.py -B 2 -A 10 | head -50

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


🏁 Script executed:

rg -n "CreateWebDriver" --type=py -B 2 -A 8 | grep -A 8 "class CreateWebDriver"

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


🏁 Script executed:

grep -r "class CreateWebDriver" --include="*.py"

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


🏁 Script executed:

cat -n app/abcs/abc_download.py | head -50

Repository: Abstract-Data/campaignfinance-2023

Length of output: 2099


🏁 Script executed:

rg -n "self.driver|cls.driver" app/abcs/abc_download.py -B 2 -A 2 | head -40

Repository: Abstract-Data/campaignfinance-2023

Length of output: 222


🏁 Script executed:

rg -n "FileDownloaderABC.driver.*=" --type=py -B 2 -A 2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 407


Simplify the headless mode logic.

The conditional logic is unnecessarily complex:

if not headless or not cls.driver.headless:
    cls.not_headless()

When headless=True, this evaluates to if False or not cls.driver.headless:, which redundantly calls not_headless() only if the driver is already in non-headless mode. When headless=False, the first condition is always True, making the second condition irrelevant. Simplify to directly handle the user's requested mode:

-if not headless or not cls.driver.headless:
-    cls.not_headless()
+if not headless:
+    cls.not_headless()
🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around lines 39-40, the conditional "if
not headless or not cls.driver.headless: cls.not_headless()" is overly complex;
change it to directly act on the user's requested mode by calling
cls.not_headless() only when headless is False (i.e., use "if not headless:
cls.not_headless()") and remove the redundant driver.headless check so the code
clearly follows the requested headless parameter.
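The claimed equivalence can be checked exhaustively. A quick truth-table sketch, independent of the downloader code:

```python
# Compare the original compound condition with the simplified one across all
# four flag combinations.
differing = []
for headless in (True, False):
    for driver_headless in (True, False):
        original = (not headless) or (not driver_headless)
        simplified = not headless
        if original != simplified:
            differing.append((headless, driver_headless))

# The only divergence is headless=True with a driver already in non-headless
# mode, where calling not_headless() again would be a no-op anyway.
assert differing == [(True, False)]
```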


if not parquet:
_reader = FileReader()
cls.data = _reader.read_folder(cls.folder, file_counts=cls.config.FILE_COUNTS)

⚠️ Potential issue | 🔴 Critical

cls.folder may not exist as a class attribute.

The read classmethod references cls.folder at line 213, but folder is defined as an instance attribute (field(init=False)) in the base class FileDownloaderABC. Accessing it as a class attribute will fail with an AttributeError.

Replace with the class-level cls.config.TEMP_FOLDER:

         if not parquet:
             _reader = FileReader()
-            cls.data = _reader.read_folder(cls.folder, file_counts=cls.config.FILE_COUNTS)
+            cls.data = _reader.read_folder(cls.config.TEMP_FOLDER, file_counts=cls.config.FILE_COUNTS)
         else:

Suggested change
-            cls.data = _reader.read_folder(cls.folder, file_counts=cls.config.FILE_COUNTS)
+            cls.data = _reader.read_folder(cls.config.TEMP_FOLDER, file_counts=cls.config.FILE_COUNTS)
🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around line 213, the classmethod uses
cls.folder which is an instance-only attribute and will raise AttributeError;
replace the reference with the class-level temp folder (cls.config.TEMP_FOLDER)
so the classmethod uses a valid class attribute, i.e., call _reader.read_folder
using cls.config.TEMP_FOLDER and keep passing cls.config.FILE_COUNTS.
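The underlying pitfall is that a dataclass field with no default leaves nothing on the class object itself. A minimal sketch (the class name is a stand-in for FileDownloaderABC):

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DownloaderBase:
    """Stand-in for FileDownloaderABC's folder attribute."""
    folder: Path = field(init=False)

    def __post_init__(self) -> None:
        self.folder = Path("downloads")


# The attribute exists on instances, because __post_init__ sets it...
instance = DownloaderBase()
assert instance.folder == Path("downloads")

# ...but not on the class, so cls.folder raises AttributeError in a classmethod.
class_access_failed = False
try:
    DownloaderBase.folder
except AttributeError:
    class_access_failed = True
assert class_access_failed
```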

Comment on lines +211 to +213
for category, files in category_files.items():
consolidated = list(tmp_path.glob(f"{category}_*.parquet"))
assert len(consolidated) == 1, f"Expected a single consolidated file for {category}"

🧹 Nitpick | 🔵 Trivial

Unused loop variable files.

The loop unpacks category_files.items() into (category, files) but only category is used in the loop body. The files variable can be replaced with _ to indicate it's intentionally unused.

Apply this diff:

-    for category, files in category_files.items():
+    for category, _ in category_files.items():
         consolidated = list(tmp_path.glob(f"{category}_*.parquet"))
         assert len(consolidated) == 1, f"Expected a single consolidated file for {category}"

Suggested change
-    for category, files in category_files.items():
+    for category, _ in category_files.items():
         consolidated = list(tmp_path.glob(f"{category}_*.parquet"))
         assert len(consolidated) == 1, f"Expected a single consolidated file for {category}"
🧰 Tools
🪛 Ruff (0.14.7)

211-211: Loop control variable files not used within loop body

(B007)

🤖 Prompt for AI Agents
In app/tests/test_ingest_hypothesis.py around lines 211 to 213, the loop unpacks
category_files.items() into (category, files) but never uses files; replace the
unused variable with an underscore (e.g., for category, _ in
category_files.items():) to indicate it is intentionally unused and silence
linters.

- Add direct lookup for unified field names before fuzzy matching
- Skip fuzzy matching on fields with unified prefixes (person_, address_, etc.)
- Make transaction fields optional in field library (required=False)

This fixes the issue where campaign names like 'LYTAL LYTAL LYTAL'
appeared instead of 'JAMES H. LYTAL' due to fuzzy matching incorrectly
matching person_last_name when looking for person_first_name.
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (6)
app/states/unified_sqlmodels.py (6)

129-129: datetime.utcnow() is deprecated in Python 3.12+.

This issue was previously flagged. Consider using datetime.now(timezone.utc) for forward compatibility.
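A minimal illustration of the replacement:

```python
from datetime import datetime, timezone

# datetime.utcnow() returns a naive datetime (tzinfo=None) and is deprecated
# since Python 3.12; datetime.now(timezone.utc) returns an aware one that
# compares and serializes unambiguously.
aware = datetime.now(timezone.utc)
assert aware.tzinfo is timezone.utc
assert aware.utcoffset().total_seconds() == 0
```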


413-414: Duplicate field definition: state_id is defined twice.

This bug was previously flagged. Remove Line 414 to fix.


853-853: Unused parameter entity_role.

This was previously flagged. Either remove the parameter or use it to differentiate address field lookups by role (e.g., contributor vs payee address fields).


1001-1002: Broad exception handling returns False silently.

This was previously flagged. Consider catching specific exceptions or logging unexpected errors.


1017-1018: Broad exception handling returns None silently.

This was previously flagged. Consider logging unexpected exceptions at debug level.


1191-1196: SQL injection risk and incorrect session.exec usage.

The SQL injection risk was previously flagged. Additionally, Line 1194 uses session.exec(text(query), params), but SQLModel's exec() method signature doesn't accept positional parameters this way. This will likely raise a TypeError or ignore the parameters.

Use SQLModel's query builder as suggested in the previous review, or fix the raw SQL approach:

-                if len(conditions) >= 2:
-                    query = f"SELECT * FROM unified_addresses WHERE {' AND '.join(conditions)} LIMIT 1"
-                    result = session.exec(text(query), params).first()
-                    if result:
-                        return UnifiedAddress(**dict(result))
+                if len(params) >= 2:
+                    stmt = select(UnifiedAddress)
+                    if params.get("street_1"):
+                        stmt = stmt.where(UnifiedAddress.street_1 == params["street_1"])
+                    if params.get("city"):
+                        stmt = stmt.where(UnifiedAddress.city == params["city"])
+                    if params.get("state"):
+                        stmt = stmt.where(UnifiedAddress.state == params["state"])
+                    if params.get("zip_code"):
+                        stmt = stmt.where(UnifiedAddress.zip_code == params["zip_code"])
+                    return session.exec(stmt).first()
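If raw SQL must stay, the key is to keep the statement constant and bind the values. The same principle shown with stdlib sqlite3 (SQLAlchemy's text() clause accepts named bind parameters analogously, e.g. :street_1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unified_addresses (street_1 TEXT, city TEXT)")
conn.execute("INSERT INTO unified_addresses VALUES ('1 Main St', 'Austin')")

# Bound parameters are matched as data, never parsed as SQL, so a classic
# injection payload finds nothing.
payload = "x' OR '1'='1"
rows = conn.execute(
    "SELECT * FROM unified_addresses WHERE street_1 = ? LIMIT 1",
    (payload,),
).fetchall()
assert rows == []

rows = conn.execute(
    "SELECT * FROM unified_addresses WHERE street_1 = ? AND city = ? LIMIT 1",
    ("1 Main St", "Austin"),
).fetchall()
assert rows == [("1 Main St", "Austin")]
```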
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8037b27 and ff9a994.

📒 Files selected for processing (2)
  • app/states/unified_field_library.py (1 hunks)
  • app/states/unified_sqlmodels.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/states/unified_sqlmodels.py (2)
app/states/unified_field_library.py (1)
  • get_state_mappings (380-382)
app/states/unified_database.py (1)
  • get_session (77-79)
🪛 Ruff (0.14.7)
app/states/unified_field_library.py

8-8: typing.Dict is deprecated, use dict instead

(UP035)


8-8: typing.List is deprecated, use list instead

(UP035)


8-8: typing.Set is deprecated, use set instead

(UP035)


8-8: typing.Tuple is deprecated, use tuple instead

(UP035)


108-108: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


115-115: Missing return type annotation for private function _initialize_unified_fields

Add return type annotation: None

(ANN202)


296-296: Missing return type annotation for private function _initialize_state_mappings

Add return type annotation: None

(ANN202)


465-465: Unnecessary mode argument

Remove mode argument

(UP015)

app/states/unified_sqlmodels.py

8-8: typing.List is deprecated, use list instead

(UP035)


8-8: typing.Dict is deprecated, use dict instead

(UP035)


136-136: Unnecessary UTF-8 encoding argument to encode

Remove unnecessary encoding argument

(UP012)


164-164: Missing return type annotation for special method __post_init__

(ANN204)


227-227: Missing return type annotation for special method __post_init__

(ANN204)


288-288: Missing return type annotation for special method __post_init__

(ANN204)


414-414: Class field state_id is defined multiple times

Remove duplicate field definition for state_id

(PIE794)


747-747: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)


853-853: Unused method argument: entity_role

(ARG002)


953-953: Dynamically typed expressions (typing.Any) are disallowed in _get_field_value

(ANN401)


1000-1000: Consider moving this statement to an else block

(TRY300)


1001-1001: Do not catch blind exception: Exception

(BLE001)


1017-1017: Do not catch blind exception: Exception

(BLE001)


1046-1046: Do not catch blind exception: Exception

(BLE001)


1104-1104: Do not catch blind exception: Exception

(BLE001)


1193-1193: Possible SQL injection vector through string-based query construction

(S608)


1199-1199: Do not catch blind exception: Exception

(BLE001)


1211-1211: Do not catch blind exception: Exception

(BLE001)


1214-1214: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1228-1228: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1238-1238: Naive datetime constructed using datetime.datetime.strptime() without %z

(DTZ007)


1244-1244: Consider moving this statement to an else block

(TRY300)


1248-1248: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)


1303-1303: Dynamically typed expressions (typing.Any) are disallowed in obj

(ANN401)


1303-1303: Dynamically typed expressions (typing.Any) are disallowed in _json_default

(ANN401)


1314-1314: Missing return type annotation for special method __init__

Add return type annotation: None

(ANN204)

🔇 Additional comments (13)
app/states/unified_field_library.py (3)

15-77: LGTM!

The enum definitions are comprehensive and well-organized with clear semantic groupings for campaign finance data categories and field types.


80-99: LGTM!

The dataclass definitions are well-designed with appropriate defaults. The confidence field in StateFieldMapping is a thoughtful addition for handling approximate field mappings.


296-374: LGTM!

The state-specific field mappings are comprehensive and well-structured. The varying confidence scores (e.g., 0.9 for receivedDt → filed_date, 0.8 for Purpose → description) appropriately indicate mapping uncertainty.

app/states/unified_sqlmodels.py (10)

23-99: LGTM!

The enum definitions are comprehensive and well-organized, covering all necessary transaction types, person types, roles, and association types for campaign finance data modeling.


140-189: LGTM!

The UnifiedAddress model is well-structured with appropriate field types and normalization in __post_init__. Note: The datetime.utcnow deprecation applies to multiple models throughout this file (Lines 157-158, 214-215, 278-279, etc.)—consider a bulk update if addressing the deprecation.


438-546: LGTM!

The version/history tables follow a consistent pattern for audit trails with JSON snapshots. The UnifiedCommitteePerson junction table properly models committee-person relationships with role information and change tracking.


548-622: LGTM!

The entity association and campaign models are well-designed with proper relationship definitions. The UnifiedEntityAssociation pattern effectively models various relationship types (treasurer-of, vendor-for, etc.) between entities.


625-688: LGTM!

The UnifiedContribution and UnifiedLoan models properly normalize specialized transaction data with appropriate entity relationships and unique constraints on transaction_id.


741-786: LGTM!

The UnifiedSQLModelBuilder initialization and build_transaction method are well-implemented. Storing raw data as JSON provides valuable debugging/audit capability.


788-851: LGTM!

The build_person method has good logic for detecting person types, including handling special placeholder cases like "NON-ITEMIZED CONTRIBUTOR" and "ANONYMOUS".


885-951: LGTM!

The build_committee method has sensible fallback logic for handling committees without names, using candidate name or filer_id as alternatives.


1264-1301: LGTM!

The _determine_transaction_type method has good logic for inferring transaction types from both explicit type fields and field name patterns, with appropriate fallbacks.


1309-1477: LGTM!

The UnifiedSQLDataProcessor effectively orchestrates the building of complete transaction graphs with related entities (persons, committees, campaigns) and specialized financial records (contributions, loans). The process_record method properly creates all necessary junction table entries.

across different states, enabling consistent data processing and analysis.
"""

from typing import Dict, List, Set, Optional, Any, Tuple

🧹 Nitpick | 🔵 Trivial

Consider using built-in generic types instead of typing imports.

Python 3.9+ supports dict, list, set, and tuple as generic types directly, making typing.Dict, typing.List, etc. unnecessary.

-from typing import Dict, List, Set, Optional, Any, Tuple
+from typing import Optional, Any

Then update type hints throughout the file (e.g., Dict[str, FieldDefinition] → dict[str, FieldDefinition]).

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

8-8: typing.Dict is deprecated, use dict instead

(UP035)


8-8: typing.List is deprecated, use list instead

(UP035)


8-8: typing.Set is deprecated, use set instead

(UP035)


8-8: typing.Tuple is deprecated, use tuple instead

(UP035)
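The migration above is mechanical under PEP 585: the built-in containers are themselves generic on Python 3.9+. A small sketch with a stubbed `FieldDefinition` (a stand-in for illustration, not the project's class):

```python
# Built-in generics (PEP 585, Python 3.9+): no typing.Dict/List/Tuple needed.
from dataclasses import dataclass, field
from typing import Any, Optional  # Optional and Any still come from typing


@dataclass
class FieldDefinition:  # minimal stand-in for illustration
    name: str
    examples: list[str] = field(default_factory=list)
    validation_rules: dict[str, Any] = field(default_factory=dict)


# dict[str, FieldDefinition] replaces typing.Dict[str, FieldDefinition]
unified_fields: dict[str, FieldDefinition] = {
    "committee_name": FieldDefinition(name="committee_name", examples=["filerName"]),
}


def lookup(fields: dict[str, FieldDefinition], key: str) -> Optional[FieldDefinition]:
    return fields.get(key)
```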

Comment on lines +250 to +268
        # Committee fields
        self.unified_fields.update({
            "committee_name": FieldDefinition(
                name="committee_name",
                category=FieldCategory.COMMITTEE_NAME,
                field_type=FieldType.STRING,
                description="Name of the political committee",
                examples=["filerName", "Committee Name"],
                validation_rules={"max_length": 200}
            ),
            "committee_type": FieldDefinition(
                name="committee_type",
                category=FieldCategory.COMMITTEE_TYPE,
                field_type=FieldType.CODE,
                description="Type of political committee",
                examples=["filerTypeCd", "Committee Type"],
                validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
            ),
        })

⚠️ Potential issue | 🟠 Major

Missing committee_filer_id unified field definition.

The state mappings reference committee_filer_id (Lines 332, 369), but this unified field is not defined in _initialize_unified_fields. This will cause get_unified_field("committee_filer_id") to return None and may break field lookups.

Add the missing field definition:

         "committee_type": FieldDefinition(
             name="committee_type",
             category=FieldCategory.COMMITTEE_TYPE,
             field_type=FieldType.CODE,
             description="Type of political committee",
             examples=["filerTypeCd", "Committee Type"],
             validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
         ),
+        "committee_filer_id": FieldDefinition(
+            name="committee_filer_id",
+            category=FieldCategory.COMMITTEE_IDENTIFICATION,
+            field_type=FieldType.IDENTIFIER,
+            description="Unique identifier for the committee from the state filing system",
+            examples=["filerIdent", "Org ID"],
+            validation_rules={"required": False}
+        ),
     })
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        # Committee fields
        self.unified_fields.update({
            "committee_name": FieldDefinition(
                name="committee_name",
                category=FieldCategory.COMMITTEE_NAME,
                field_type=FieldType.STRING,
                description="Name of the political committee",
                examples=["filerName", "Committee Name"],
                validation_rules={"max_length": 200}
            ),
            "committee_type": FieldDefinition(
                name="committee_type",
                category=FieldCategory.COMMITTEE_TYPE,
                field_type=FieldType.CODE,
                description="Type of political committee",
                examples=["filerTypeCd", "Committee Type"],
                validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
            ),
        })
        # Committee fields
        self.unified_fields.update({
            "committee_name": FieldDefinition(
                name="committee_name",
                category=FieldCategory.COMMITTEE_NAME,
                field_type=FieldType.STRING,
                description="Name of the political committee",
                examples=["filerName", "Committee Name"],
                validation_rules={"max_length": 200}
            ),
            "committee_type": FieldDefinition(
                name="committee_type",
                category=FieldCategory.COMMITTEE_TYPE,
                field_type=FieldType.CODE,
                description="Type of political committee",
                examples=["filerTypeCd", "Committee Type"],
                validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
            ),
            "committee_filer_id": FieldDefinition(
                name="committee_filer_id",
                category=FieldCategory.COMMITTEE_IDENTIFICATION,
                field_type=FieldType.IDENTIFIER,
                description="Unique identifier for the committee from the state filing system",
                examples=["filerIdent", "Org ID"],
                validation_rules={"required": False}
            ),
        })
🤖 Prompt for AI Agents
In app/states/unified_field_library.py around lines 250 to 268, add a unified
field definition for "committee_filer_id" because state mappings reference it
but it's missing; create a FieldDefinition with name="committee_filer_id", an
appropriate category (e.g., FieldCategory.COMMITTEE_ID or COMMITTEE_FILER_ID),
field_type=FieldType.STRING, a short description like "Unique filer identifier
for the committee", useful examples (e.g., ["filerId", "committeeFilerId"]), and
validation_rules such as {"max_length": 100} (and an optional pattern if your
IDs follow a specific regex); insert this into the
self.unified_fields.update({...}) block alongside the other committee fields so
get_unified_field("committee_filer_id") returns the new definition.

Comment on lines +463 to +466
    def import_mappings(self, file_path: Path):
        """Import mappings from a JSON file"""
        with open(file_path, 'r') as f:
            import_data = json.load(f)

🧹 Nitpick | 🔵 Trivial

Minor: Unnecessary mode argument.

The 'r' mode is the default for open() and can be omitted.

-        with open(file_path, 'r') as f:
+        with open(file_path) as f:
             import_data = json.load(f)
📝 Committable suggestion


Suggested change
    def import_mappings(self, file_path: Path):
        """Import mappings from a JSON file"""
        with open(file_path, 'r') as f:
            import_data = json.load(f)
    def import_mappings(self, file_path: Path):
        """Import mappings from a JSON file"""
        with open(file_path) as f:
            import_data = json.load(f)
🧰 Tools
🪛 Ruff (0.14.7)

465-465: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In app/states/unified_field_library.py around lines 463 to 466, the
open(file_path, 'r') call uses the explicit 'r' mode which is unnecessary;
remove the mode argument (use open(file_path) or preferably file_path.open()) or
even simpler use file_path.read_text() and then json.loads to read the JSON, so
the function reads the file without the redundant 'r' argument.

    """Normalized file origin references for ingested data."""
    __tablename__ = "file_origins"

    id: str = Field(default=None, primary_key=True, max_length=64)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for FileOrigin instantiation patterns
rg -n "FileOrigin\(" --type py -A 3 -B 1

Repository: Abstract-Data/campaignfinance-2023

Length of output: 750


🏁 Script executed:

# Look for the build_key method implementation
rg -n "def build_key" --type py -A 10

Repository: Abstract-Data/campaignfinance-2023

Length of output: 928


🏁 Script executed:

# Search for how FileOrigin instances are created in tests or usage
rg -n "FileOrigin\s*=" --type py -A 2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


FileOrigin.id has default=None but is a primary key—remove the default or use default_factory.

All instantiations in the codebase explicitly set id via the build_key() result (e.g., FileOrigin(id=key, ...)), so the default is never actually used. However, defining default=None on a primary key field is poor practice and confusing. Either remove the default entirely or use default_factory to compute the key automatically if instances are created without an id.

🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around line 126, the FileOrigin.id field is
declared as a primary_key with default=None which is confusing and poor
practice; remove the default=None so callers must supply an id, or replace it
with a default_factory that generates the key (e.g., call the existing build_key
function) if you want automatic ids—pick one approach, update the Field
declaration accordingly, and ensure any tests/instantiations that relied on
implicit defaults still pass.
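One way to realize the automatic-id option is to compute the key from the record's own fields. Since a `default_factory` cannot see the other field values, the sketch below uses `__post_init__` instead; the sha256-over-path-and-size scheme is an assumption for illustration, and the project's actual `build_key()` may use different inputs:

```python
# Sketch: deterministic primary-key generation when the caller omits an id.
# The hashing scheme here is assumed for illustration only.
import hashlib
from dataclasses import dataclass


def build_key(path: str, size: int) -> str:
    # sha256 hexdigest is exactly 64 chars, matching max_length=64
    return hashlib.sha256(f"{path}:{size}".encode()).hexdigest()


@dataclass
class FileOrigin:
    path: str
    size: int
    id: str = ""

    def __post_init__(self) -> None:
        # Compute the key when the caller did not supply one,
        # instead of carrying a default=None primary key.
        if not self.id:
            self.id = build_key(self.path, self.size)


origin = FileOrigin(path="data/tx/contribs.csv", size=1024)
```

The same id is produced for the same inputs, so re-ingesting a file maps to the same row; an explicitly supplied `id` is left untouched.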

    updated_at: datetime = Field(default_factory=datetime.utcnow)

    # Relationships
    address: Optional[UnifiedAddress] = Relationship()

⚠️ Potential issue | 🟡 Minor

Missing back_populates on UnifiedCommittee.address relationship.

Unlike other address relationships (e.g., UnifiedPerson.address, UnifiedEntity.address), this relationship doesn't specify back_populates. This inconsistency may cause bidirectional relationship sync issues.

-    address: Optional[UnifiedAddress] = Relationship()
+    address: Optional[UnifiedAddress] = Relationship(back_populates="committees")

Note: You'll also need to add a corresponding relationship on UnifiedAddress:

# In UnifiedAddress class
committees: List["UnifiedCommittee"] = Relationship(back_populates="address")
🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around line 282, the UnifiedCommittee.address
Relationship is missing back_populates which breaks bidirectional sync; update
UnifiedCommittee.address to include back_populates="committees" and add the
corresponding Relationship on the UnifiedAddress model (e.g., committees:
List["UnifiedCommittee"] = Relationship(back_populates="address") ) so both
sides reference each other consistently.

Comment on lines +690 to +739
# Database indexes for performance
class UnifiedTransactionIndexes:
    """Database indexes for the unified transaction system"""

    # Transaction indexes
    idx_transactions_state = Index("idx_transactions_state", UnifiedTransaction.state_id)
    idx_transactions_type = Index("idx_transactions_type", UnifiedTransaction.transaction_type)
    idx_transactions_date = Index("idx_transactions_date", UnifiedTransaction.transaction_date)
    idx_transactions_amount = Index("idx_transactions_amount", UnifiedTransaction.amount)
    idx_transactions_committee = Index("idx_transactions_committee", UnifiedTransaction.committee_id)
    idx_transactions_id = Index("idx_transactions_id", UnifiedTransaction.transaction_id)
    idx_transactions_file_origin = Index("idx_transactions_file_origin", UnifiedTransaction.file_origin_id)

    # Person indexes
    idx_persons_name = Index("idx_persons_name", UnifiedPerson.last_name, UnifiedPerson.first_name)
    idx_persons_organization = Index("idx_persons_organization", UnifiedPerson.organization)
    idx_persons_type = Index("idx_persons_type", UnifiedPerson.person_type)

    # Transaction-Person relationship indexes
    idx_transaction_persons_role = Index("idx_transaction_persons_role", UnifiedTransactionPerson.role)
    idx_transaction_persons_transaction = Index("idx_transaction_persons_transaction", UnifiedTransactionPerson.transaction_id)
    idx_transaction_persons_person = Index("idx_transaction_persons_person", UnifiedTransactionPerson.person_id)

    # Address indexes
    idx_addresses_state = Index("idx_addresses_state", UnifiedAddress.state)
    idx_addresses_city = Index("idx_addresses_city", UnifiedAddress.city)

    # Committee indexes
    idx_committees_name = Index("idx_committees_name", UnifiedCommittee.name)
    idx_committees_type = Index("idx_committees_type", UnifiedCommittee.committee_type)
    # Note: filer_id is already indexed as primary key

    # Entity indexes
    idx_entities_type = Index("idx_entities_type", UnifiedEntity.entity_type)
    idx_entities_name = Index("idx_entities_name", UnifiedEntity.normalized_name)

    # Campaign indexes
    idx_campaigns_year = Index("idx_campaigns_year", UnifiedCampaign.election_year)
    idx_campaigns_office = Index("idx_campaigns_office", UnifiedCampaign.office_sought)
    idx_campaigns_name = Index("idx_campaigns_name", UnifiedCampaign.normalized_name)
    idx_campaign_entity_role = Index("idx_campaign_entity_role", UnifiedCampaignEntity.role)

    # Contribution indexes
    idx_contributions_date = Index("idx_contributions_date", UnifiedContribution.receipt_date)
    idx_contributions_amount = Index("idx_contributions_amount", UnifiedContribution.amount)

    # Loan indexes
    idx_loans_date = Index("idx_loans_date", UnifiedLoan.loan_date)
    idx_loans_due_date = Index("idx_loans_due_date", UnifiedLoan.due_date)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for references to UnifiedTransactionIndexes and its indexes
rg -n "UnifiedTransactionIndexes|idx_transactions_|idx_persons_|idx_addresses_|idx_committees_" --type py -B 2 -A 2

Repository: Abstract-Data/campaignfinance-2023

Length of output: 2773


🏁 Script executed:

# Check if indexes are used in __table_args__ of model classes
rg -n "__table_args__" --type py -A 5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 17028


🏁 Script executed:

# Look for migration files or initialization code that might apply these indexes
fd -e py -e sql | xargs grep -l "idx_" 2>/dev/null | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 109


🏁 Script executed:

# Check the import statements and usage of Index from SQLAlchemy
rg -n "from sqlalchemy|from sqlmodel" app/states/unified_sqlmodels.py -A 3

Repository: Abstract-Data/campaignfinance-2023

Length of output: 498


🏁 Script executed:

# Search for Index usage within __table_args__
rg -n "__table_args__.*Index|Index.*__table_args__" --type py -B 5 -A 5

Repository: Abstract-Data/campaignfinance-2023

Length of output: 60


🏁 Script executed:

# Check if there are any migration files
fd -e py -path "*/migrations/*" | head -20

Repository: Abstract-Data/campaignfinance-2023

Length of output: 246


🏁 Script executed:

# Look for alembic or db initialization files
fd -e py | xargs grep -l "alembic\|migrate\|create_all\|metadata" 2>/dev/null | head -10

Repository: Abstract-Data/campaignfinance-2023

Length of output: 302


🏁 Script executed:

# Check unified_database.py for index-related code
cat -n app/states/unified_database.py | head -100

Repository: Abstract-Data/campaignfinance-2023

Length of output: 4198


🏁 Script executed:

# Check recreate_tables.py
cat -n recreate_tables.py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 5343


🏁 Script executed:

# Search for any reference to UnifiedTransactionIndexes specifically
rg -n "UnifiedTransactionIndexes" --type py

Repository: Abstract-Data/campaignfinance-2023

Length of output: 146


🏁 Script executed:

# Check if any model class has __table_args__ with Index to confirm pattern
rg -n "class Unified" app/states/unified_sqlmodels.py -A 10 | head -50

Repository: Abstract-Data/campaignfinance-2023

Length of output: 2934


🏁 Script executed:

# Verify the specific model definitions that need indexes
rg -n "class UnifiedTransaction\(|class UnifiedPerson\(|class UnifiedAddress\(" app/states/unified_sqlmodels.py -A 8

Repository: Abstract-Data/campaignfinance-2023

Length of output: 1434


Index definitions in UnifiedTransactionIndexes won't be automatically applied to the database.

The UnifiedTransactionIndexes class defines indexes as class attributes, but SQLModel/SQLAlchemy doesn't automatically apply indexes defined this way. These Index objects need to be registered with each model's __table_args__. Currently, the models use individual index=True flags on fields but don't include composite or additional indexes via __table_args__.

Move index definitions to the respective model classes. For example, on UnifiedTransaction:

class UnifiedTransaction(SQLModel, table=True):
    __tablename__ = "unified_transactions"
    __table_args__ = (
        Index("idx_transactions_state", "state_id"),
        Index("idx_transactions_type", "transaction_type"),
        Index("idx_transactions_date", "transaction_date"),
        # ... other indexes
    )

Repeat for UnifiedPerson, UnifiedAddress, UnifiedCommittee, and other models with indexes defined in UnifiedTransactionIndexes.

Comment on lines +1214 to +1226
    def _parse_amount(self, value: Any) -> Optional[Decimal]:
        """Parse an amount value to Decimal"""
        if value is None:
            return None

        try:
            # Remove currency symbols and commas
            if isinstance(value, str):
                value = re.sub(r'[^\d.-]', '', value)

            return Decimal(str(value))
        except (ValueError, TypeError):
            return None

⚠️ Potential issue | 🟡 Minor

Amount parsing may produce invalid results.

The regex on Line 1222 removes all characters except digits, ., and -, but doesn't validate the result. Inputs like "$1.2.3" or "--100" would produce invalid decimals that could raise InvalidOperation when converting to Decimal.

Consider adding validation after stripping:

     def _parse_amount(self, value: Any) -> Optional[Decimal]:
         """Parse an amount value to Decimal"""
         if value is None:
             return None
         
         try:
             # Remove currency symbols and commas
             if isinstance(value, str):
                 value = re.sub(r'[^\d.-]', '', value)
+                # Validate the cleaned string has valid decimal format
+                if not re.match(r'^-?\d*\.?\d+$', value):
+                    return None
             
             return Decimal(str(value))
-        except (ValueError, TypeError):
+        except (ValueError, TypeError, InvalidOperation):
             return None

Note: Import InvalidOperation from decimal module.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.7)

1214-1214: Dynamically typed expressions (typing.Any) are disallowed in value

(ANN401)

🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around lines 1214 to 1226, the current
_parse_amount strips non-numeric characters but doesn't validate the cleaned
string so inputs like "$1.2.3" or "--100" can produce InvalidOperation; after
stripping, validate the result with a regex that only allows an optional single
leading '-' and at most one decimal point (e.g. ^-?\d+(\.\d+)?$), ensure the
cleaned string is non-empty, then convert to Decimal while catching
decimal.InvalidOperation (import InvalidOperation) and return None on any
validation or conversion failure.
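The hardened parser suggested above can be sketched as a standalone helper (the function name mirrors the review comment, not the exact project code; a stricter regex could also accept leading-dot forms like ".5"):

```python
# Sketch: strip currency formatting, then validate before calling Decimal,
# so malformed leftovers like "1.2.3" or "--100" return None instead of raising.
import re
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

# Optional single leading '-', digits, at most one decimal point
_AMOUNT_RE = re.compile(r'^-?\d+(\.\d+)?$')


def parse_amount(value: Any) -> Optional[Decimal]:
    if value is None:
        return None
    if isinstance(value, str):
        value = re.sub(r'[^\d.-]', '', value)
        if not _AMOUNT_RE.match(value):
            return None
    try:
        return Decimal(str(value))
    except (ValueError, TypeError, InvalidOperation):
        return None
```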

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff9a994 and 26ea85c.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml

[warning] 3-3: truthy value should be one of [false, true]

(truthy)


[error] 35-35: too many blank lines (1 > 0)

(empty-lines)

🔇 Additional comments (3)
.github/workflows/ci.yml (3)

17-17: Verify Python 3.12 dependency compatibility.

The workflow specifies Python 3.12. Given the extensive dependency updates in this PR (Parquet support, Polars, Rich, SQLModel, etc.), please verify that all project dependencies are compatible with Python 3.12. If you haven't tested against 3.12 locally, consider doing so before merging.


24-25: Confirm CODECOV_TOKEN secret is configured.

The workflow uploads coverage to Codecov using ${{ secrets.CODECOV_TOKEN }}. Ensure this secret is already configured in the repository settings; otherwise, the upload step will fail silently or error out at runtime.


1-35: LGTM—solid CI foundation.

The workflow provides a clean, modern CI setup that aligns well with the PR's test infrastructure additions:

  • Modern Python 3.12 with uv package manager
  • Pytest with coverage reporting
  • Codecov integration for tracking coverage over time
  • Triggers on both push and PR events for comprehensive testing

Pending verification of Python 3.12 compatibility and Codecov secret setup, this is ready to go.

Comment on lines +3 to +5
on:
  push:
  pull_request:

🧹 Nitpick | 🔵 Trivial

Consider explicit trigger configuration for YAML linting.

The push: and pull_request: events without explicit configuration are valid GitHub Actions syntax and will trigger on all push/PR events. However, YAMLlint prefers explicit values. This is a minor style note—the current syntax works fine, but if you want to satisfy stricter linting, consider using explicit empty dicts or null:

 on:
-  push:
-  pull_request:
+  push: {}
+  pull_request: {}
📝 Committable suggestion


Suggested change
on:
  push:
  pull_request:
on:
  push: {}
  pull_request: {}
🧰 Tools
🪛 YAMLlint (1.37.1)

[warning] 3-3: truthy value should be one of [false, true]

(truthy)

🤖 Prompt for AI Agents
.github/workflows/ci.yml lines 3-5: the workflow triggers use bare "push:" and
"pull_request:" which are valid but fail stricter YAML lint rules; change each
to an explicit empty mapping or null (e.g., "push: {}" and "pull_request: {}" or
"push: null" and "pull_request: null") so the YAML linter accepts the syntax
while preserving the same trigger behavior.

        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          slug: Abstract-Data/campaignfinance-2023


🧹 Nitpick | 🔵 Trivial

Remove trailing blank line.

YAMLlint flags the trailing blank line after line 34. Apply this cleanup:

          slug: Abstract-Data/campaignfinance-2023
-
📝 Committable suggestion


Suggested change
slug: Abstract-Data/campaignfinance-2023
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 35-35: too many blank lines (1 > 0)

(empty-lines)

🤖 Prompt for AI Agents
In .github/workflows/ci.yml around line 35, there is a trailing blank line
flagged by YAMLlint; remove the empty line at the end of the file so the file
ends on line 34 without an extra newline-only line, then save the file (ensure
the file still ends with a single newline character if your editor enforces it).

@sentry

sentry bot commented Dec 4, 2025

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
View the top 1 failed test(s) by shortest run time
app/tests/test_ingest_hypothesis.py::app.tests.test_ingest_hypothesis
Stack Traces | 0s run time
ImportError while importing test module '.../app/tests/test_ingest_hypothesis.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../../..../uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app/tests/test_ingest_hypothesis.py:20: in <module>
    from app.ingest.file_reader import FieldSpec, FieldType, GenericFileReader, SchemaDefinition
E   ModuleNotFoundError: No module named 'app.ingest'

To view more test analytics, go to the Prevent Tests Dashboard

Comment on lines +9 to +40
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Install dependencies
        run: |
          uv pip install --system .
          uv pip install --system pytest pytest-cov

      - name: Run tests with coverage
        run: pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy

      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          slug: Abstract-Data/campaignfinance-2023

      - name: Upload test results to Codecov
        if: ${{ !cancelled() }}
        uses: codecov/test-results-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 3 months ago

The recommended fix is to explicitly set a permissions block for the workflow or for the individual job(s). In this case, since there’s only one job and the workflow’s steps do not require write access to repository contents or other privileged scopes, the minimal permission should be set: contents: read. This should be added at the root level of the workflow file (just below name: and before on:), to apply to all jobs. No other imports, method changes, or additional code definitions are needed; only a single YAML key addition to .github/workflows/ci.yml.

Suggested changeset 1
.github/workflows/ci.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,4 +1,6 @@
 name: CI
+permissions:
+  contents: read
 
 on:
   push:
EOF
