Conversation
Pull Request Overview
This PR adds a unified data model system for campaign finance data across multiple states, incorporating PostgreSQL support with SQLModel/SQLAlchemy ORM. The changes include dependency updates, new infrastructure for data loading and processing, and significant refactoring of Texas state validators.
Key changes:
- Introduced unified SQLModel-based data models for cross-state campaign finance data normalization
- Added production-ready data loader with batch processing and deduplication capabilities
- Updated dependencies including pyarrow, selenium, icecream, and added web-scraping utilities
- Refactored Texas validators to use base models and reduce code duplication
- Added database management utilities for PostgreSQL operations
Reviewed Changes
Copilot reviewed 51 out of 64 changed files in this pull request and generated 35 comments.
| File | Description |
|---|---|
| uv.lock | Updated package versions and added new dependencies (pyarrow, web-scrape-utils, websockets, undetected-chromedriver) |
| pyproject.toml | Added pyarrow and web-scrape-utils as dependencies |
| app/states/unified_sqlmodels.py | New comprehensive unified data model definitions with SQLModel/SQLAlchemy |
| app/states/unified_database.py | New database manager for PostgreSQL operations |
| production_loader.py | New production data loader with batch processing |
| recreate_tables.py | Utility script for recreating database tables |
| app/states/texas/validators/*.py | Refactored validators to use base models and reduce duplication |
| app/states/texas/texas_downloader.py | Updated to use web-scraping utilities and polars |
| app/states/texas/texas_search.py | Refactored search functionality with polars DataFrames |
| tmp/texas/*.parquet | Binary data files (21 parquet files added) |
| tmp/.DS_Store | macOS system file added |
Files not reviewed (4)
- .idea/campaignfinance.iml: Language not supported
- .idea/csv-editor.xml: Language not supported
- .idea/misc.xml: Language not supported
- .idea/ruff.xml: Language not supported
```python
# class TECFilerCTALinkModel(TECBaseModel, table=True):
#     __tablename__ = "tx_filer_cta_link"
#     __table_args__ = {"schema": "texas"}
#     filerIdent: int = Field(foreign_key="texas.tx_filer_name.filerIdent", primary_key=True)
#     ctaSeekOfficeCd: Optional[str] = Field(foreign_key="texas.tx_cta.id", primary_key=True)
```

This comment appears to contain commented-out code.
```python
# class TECFilerTreasurerLinkModel(TECBaseModel, table=True):
#     __tablename__ = "tx_filer_treasurer_link"
#     __table_args__ = {"schema": "texas"}
#     filer_name_id: int = Field(foreign_key="texas.tx_filer_name.filerIdent", primary_key=True)
#     treasurer_id: Optional[str] = Field(foreign_key="texas.tx_treasurer.id", primary_key=True)
#     assttreas_id: Optional[str] = Field(foreign_key="texas.tx_assistant_treasurer.id", primary_key=True)
#     chair_id: Optional[str] = Field(foreign_key="texas.tx_chair.id", primary_key=True)
```

This comment appears to contain commented-out code.
```python
# @model_validator(mode='before')
# @classmethod
# def store_raw(cls, values):
#     values['raw_input'] = {k: v for k, v in values.items()}
#     return values
```

This comment appears to contain commented-out code.
```python
# def __init__(self, **data):
#     _flags = TECFlags(**data)
#     if len(_flags.model_dump(exclude_none=True).values()) > 0:
#         self.flags = _flags
#     super().__init__(**data)
```

This comment appears to contain commented-out code.
```python
unlinked_transactions = session.exec(
    select(UnifiedTransactionPerson).where(
        UnifiedTransactionPerson.person_id == cp.person_id,
        UnifiedTransactionPerson.committee_person_id == None
```

Testing for None should use the 'is' operator.
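For ordinary Python values the advice above holds, but inside a SQLModel/SQLAlchemy `.where()` clause the `==` builds a SQL expression (it compiles to `IS NULL`), and a literal `is None` would collapse to a Python boolean and break the query; the explicit spelling is `.is_(None)`, which later review comments also suggest. A small sketch with an illustrative table, not the project's real models:

```python
# Hedged sketch: `== None` vs `.is_(None)` on a SQLAlchemy column expression.
from sqlalchemy import Column, Integer, MetaData, Table, select

utp = Table("utp", MetaData(), Column("committee_person_id", Integer))

# Both of these compile to "committee_person_id IS NULL"; `.is_(None)`
# states the intent explicitly and avoids E711-style lint warnings.
expr_eq = utp.c.committee_person_id == None  # noqa: E711
expr_is = utp.c.committee_person_id.is_(None)

print(str(select(utp).where(expr_is)))
```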
```python
if original_address is not ensured_address and original_address is not None:
    try:
        session.expunge(original_address)
    except InvalidRequestError:
```

'except' clause does nothing but pass and there is no explanatory comment.
```python
if original_address is not ensured_address and original_address is not None:
    try:
        session.expunge(original_address)
    except InvalidRequestError:
```

'except' clause does nothing but pass and there is no explanatory comment.
```python
if original_person is not ensured_person:
    try:
        session.expunge(original_person)
    except InvalidRequestError:
```

'except' clause does nothing but pass and there is no explanatory comment.
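The remediation the reviewer asks for can be sketched generically: either document why the exception is expected, or log it at debug level instead of silently passing. The `FakeSession` and local `InvalidRequestError` below are stand-ins for the SQLAlchemy ones, used only to make the sketch self-contained:

```python
import logging

logger = logging.getLogger("loader")


class InvalidRequestError(Exception):
    """Stand-in for sqlalchemy.exc.InvalidRequestError."""


class FakeSession:
    def expunge(self, obj):
        raise InvalidRequestError("object is not in this session")


def detach(session, obj):
    try:
        session.expunge(obj)
    except InvalidRequestError as exc:
        # Expected when the object was never attached to this session;
        # record the fact instead of swallowing it silently.
        logger.debug("expunge skipped: %s", exc)
        return False
    return True


print(detach(FakeSession(), object()))  # False: expunge raised and was logged
```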
```python
result = self.choose_options(result)
if not result.selected_choices:
    return result
result.selected_choices = result.selected_choices
```

This assignment assigns a variable to itself.
```python
}
if type_str in type_synonyms:
    return type_synonyms[type_str]
for transaction_type in TransactionType:
```

This for-loop may attempt to iterate over a non-iterable instance of class type.
- Modified app/funcs/__init__.py and csv_reader.py
- Updated Texas state modules (downloader, search, validators)
- Updated unified_database.py
- Updated test_validation_class.py
- Updated pyproject.toml and uv.lock
- Removed parquet files from git tracking (now in .gitignore)
Note
Other AI code review bot(s) detected: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough
Adds Parquet/TXT readers and Polars-based data flows, a Rich-based ProgressTracker, Texas downloader/search/consolidation refactors, an Ohio downloader scaffold, a unified SQLModel schema and DB manager, a production batch ingestion pipeline with deduplication, extensive validator/model restructures, and project/IDE/dependency updates.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor CLI as Client/CLI
    participant Loader as ProductionLoader
    participant Reader as GenericFileReader
    participant Processor as UnifiedSQLDataProcessor
    participant DB as UnifiedDatabaseManager
    participant Session as DB Session
    CLI->>Loader: load_file(file_path, state)
    Loader->>DB: load_existing_data(session)
    Loader->>Reader: _get_reader(state)
    Reader-->>Loader: reader
    loop per batch
        Loader->>Reader: read batch
        Reader-->>Loader: records
        Loader->>Processor: process_records(records, state)
        Processor-->>Loader: UnifiedTransaction objects
        Loader->>Loader: dedupe/cache & manage associations
        Loader->>DB: save_transactions(transactions)
        DB->>Session: persist & commit
        Session-->>DB: commit result
        Loader->>Loader: update progress
    end
    Loader->>Loader: _dedupe_persons_and_entities(session)
    Loader->>CLI: display_results()
```

```mermaid
sequenceDiagram
    actor User as Client
    participant TEC as TECDownloader
    participant Driver as CreateWebDriver
    participant Progress as ProgressTracker
    participant Extract as Zip/Parquet Processor
    participant FS as Filesystem
    User->>TEC: download(read_from_temp, headless)
    TEC->>Progress: add_task("cleanup")
    TEC->>FS: safe temp cleanup
    Progress->>Progress: update_task("cleanup", "completed")
    TEC->>Driver: CreateWebDriver(...)
    Driver-->>TEC: driver instance
    TEC->>Progress: add_task("download")
    TEC->>Driver: navigate & click to download
    Driver-->>FS: save zip
    TEC->>Progress: update_task("download", "completed")
    TEC->>Extract: extract_zipfile(zip, tmp)
    Extract->>Extract: _process_csv -> convert CSV→Parquet
    Extract-->>FS: write parquet files
    TEC->>TEC: consolidate_files() (align columns & stack)
    TEC->>Progress: stop()
```

Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)

📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📒 Files selected for processing (1)
🧰 Additional context used

🪛 GitHub Check: CodeQL
.github/workflows/ci.yml
[warning] 9-40: Workflow does not contain permissions

🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[warning] 3-3: truthy value should be one of [false, true] (truthy)
[error] 41-41: too many blank lines (1 > 0) (empty-lines)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
Actionable comments posted: 81
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
.idea/misc.xml (1)
3-9: Inconsistent SDK references between components. The Black component (line 7) still references the old "Poetry (campaignfinance) (2)" SDK, while the ProjectRootManager (line 9) has been updated to "Python 3.12 UV (campaignfinance)". This inconsistency may cause Black to fail or use the wrong Python environment.
Apply this diff to align the Black SDK reference:
```diff
 <component name="Black">
   <option name="enabledOnReformat" value="true" />
   <option name="executionMode" value="BINARY" />
   <option name="pathToExecutable" value="/opt/homebrew/bin/black" />
-  <option name="sdkName" value="Poetry (campaignfinance) (2)" />
+  <option name="sdkName" value="Python 3.12 UV (campaignfinance)" />
 </component>
```

app/tests/test_validation_class.py (1)
3-25: Tests are incompatible with the `StateFileValidation` API. Several issues will cause these tests to fail:

- `StateFileValidation` is an abstract class (`abc.ABC`) and cannot be instantiated directly
- `validate_record(record)` takes only one argument, not `(record, MockModel)`
- `validate(records)` takes only one argument, not `(records, MockModel)`
- The `validator_to_use` should be a class attribute, not passed to methods

Based on `app/abcs/abc_validation.py`, the tests need to create a concrete subclass:

```diff
-@given(st.dictionaries(st.text(), st.text()))
-def test_validate_record(record):
-    validator = StateFileValidation()
-    result = validator.validate_record(record, MockModel)
+class TestValidator(StateFileValidation):
+    validator_to_use = MockModel
+
+@given(st.dictionaries(st.text(), st.text()))
+def test_validate_record(record):
+    validator = TestValidator()
+    result = validator.validate_record(record)
     assert result[0] in ['passed', 'failed']
```
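The abstract-base-class constraint behind this comment can be demonstrated independently of the project code. All class names below are illustrative stand-ins, not the real `StateFileValidation`:

```python
import abc


class StateFileValidation(abc.ABC):  # illustrative stand-in, not the real class
    validator_to_use = None

    @abc.abstractmethod
    def validate_record(self, record):
        ...


class MockModel:
    pass


class TestValidator(StateFileValidation):
    validator_to_use = MockModel  # class attribute, per the review advice

    def validate_record(self, record):
        return ("passed", record)


try:
    StateFileValidation()  # instantiating an ABC with abstract methods fails
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

print(TestValidator().validate_record({})[0])  # passed
```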
app/states/texas/validators/texas_filers.py (1)
186-638: Large block of commented-out legacy code. Lines 186-638 contain extensive commented-out code. If this code serves as a reference during refactoring, consider documenting the migration plan or removing it entirely, relying on version control for history.
app/states/texas/validators/texas_address.py (1)
83-86: Critical logic bug: condition is always `True`. The expression `if "PO" or "PO BOX" in _address:` is always truthy because `"PO"` (a non-empty string) is truthy by itself. This means the PO BOX parsing logic runs for every address, not just PO boxes.
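The operator-precedence slip flagged here can be checked directly: the condition parses as `("PO") or ("PO BOX" in _address)`, and the first operand, a non-empty string, decides the result. A quick sketch:

```python
# Precedence demo: the buggy condition evaluates ("PO") or (...), so the
# non-empty string "PO" makes it truthy for every input.
def buggy(address: str) -> bool:
    return bool("PO" or "PO BOX" in address)


def fixed(address: str) -> bool:
    return "PO" in address or "PO BOX" in address


print(buggy("123 MAIN ST"))  # True (bug: every address matches)
print(fixed("123 MAIN ST"))  # False
print(fixed("PO BOX 12"))    # True
```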
```diff
- if "PO" or "PO BOX" in _address:
+ if "PO" in _address or "PO BOX" in _address:
```

app/states/texas/validators/texas_expenses.py (1)
235-246: Incorrect field name in error message. The validator validates `expendDescr` and `expendDt`, but the error message on line 243 incorrectly references `'filerName'`. This will confuse users when validation fails.

```diff
 @field_validator('expendDescr', 'expendDt', mode='before')
 @classmethod
 def validate_required_fields(cls, v):
     if v == "" or v is None:
         raise PydanticCustomError(
             'missing_required_value',
             "Field is required",
             {
-                'column': 'filerName',
+                'column': 'expendDescr/expendDt',
                 'value': v
             }
         )
     return v
```

Alternatively, use Pydantic's validation info to get the actual field name dynamically.
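That last suggestion can be sketched with Pydantic v2's `ValidationInfo`, whose `field_name` attribute reports the field being validated; the `Expense` model below is an illustrative stand-in for the TEC validator, not the project's code:

```python
from pydantic import BaseModel, ValidationError, field_validator
from pydantic_core import PydanticCustomError


class Expense(BaseModel):  # illustrative stand-in
    expendDescr: str
    expendDt: str

    @field_validator("expendDescr", "expendDt", mode="before")
    @classmethod
    def validate_required_fields(cls, v, info):
        if v == "" or v is None:
            # info.field_name names the actual field, so the error message
            # is always accurate regardless of which field failed.
            raise PydanticCustomError(
                "missing_required_value",
                "Field is required",
                {"column": info.field_name, "value": v},
            )
        return v


try:
    Expense(expendDescr="", expendDt="2024-01-01")
except ValidationError as e:
    print(e.errors()[0]["ctx"]["column"])  # expendDescr
```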
app/states/texas/validators/texas_personname.py (2)
74-75: Potential `AttributeError` if no key contains 'NamePrefixCd'. If `_key_pfx` is `None`, calling `.split()` on it will raise an `AttributeError`. Add a guard or handle the None case.

```diff
 @model_validator(mode='before')
 @classmethod
 def set_type_prefix(cls, values):
     _key_pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
+    if not _key_pfx:
+        values['field_pfx'] = None
+        return values
     values['field_pfx'] = _key_pfx.split('NamePrefixCd')[0]
     return values
```
92-94: Same potential `AttributeError` for `_pfx` in `validate_address`. If no key contains `'NamePrefixCd'`, `_pfx` will be `None`, and line 94 will fail when calling `.split()`.

```diff
 _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
-if _pfx:
-    _pfx = _pfx.split('NamePrefixCd')[0]
+if not _pfx:
+    return values
+_pfx = _pfx.split('NamePrefixCd')[0]
```
♻️ Duplicate comments (20)
app/main.py (2)
4-5: Unused import `re`. The `re` module is imported but not used anywhere in the active code.
30-32: Unused imports: `Path` and `contextmanager`. `Path` and `contextmanager` are imported but not used in the active code. The `polars` import (`pl`) is used and should be kept.

app/states/texas/validators/texas_filers.py (2)
60-65: Commented-out code. This block contains commented-out code that should be removed or documented.
89-96: Commented-out code. This block contains commented-out code that should be removed or documented.
production_loader.py (4)
602-605: Empty except clause silently swallows `InvalidRequestError`. The `except InvalidRequestError: pass` silently ignores the error without logging. Add a comment explaining why this is expected, or log at debug level.
628-631: Empty except clause silently swallows `InvalidRequestError`.
775-778: Empty except clause silently swallows `InvalidRequestError`.
796-799: Empty except clause silently swallows `InvalidRequestError`.

app/states/unified_database.py (3)
1078-1085: Use `is None` for None comparisons. Testing for `None` should use the `is` operator rather than `==`.
Noneshould use theisoperator rather than==.This was flagged in a previous review. Apply:
```diff
- UnifiedTransactionPerson.committee_person_id == None
+ UnifiedTransactionPerson.committee_person_id.is_(None)
```
1176-1184: Use `is None` for None comparisons. Testing for `None` should use the `is` operator rather than `==`.
Noneshould use theisoperator rather than==.This was flagged in a previous review. Apply:
```diff
- UnifiedTransactionPerson.committee_person_id == None
+ UnifiedTransactionPerson.committee_person_id.is_(None)
```
16-22: Unused import: `PersonType`. `PersonType` is imported but not used in this file.

```diff
 from .unified_sqlmodels import (
     UnifiedTransaction, UnifiedPerson, UnifiedAddress, UnifiedCommittee,
-    UnifiedTransactionPerson, PersonRole, TransactionType, PersonType,
+    UnifiedTransactionPerson, PersonRole, TransactionType,
     unified_sql_processor, UnifiedTransactionVersion, UnifiedPersonVersion,
     UnifiedCommitteeVersion, UnifiedAddressVersion, UnifiedCommitteePerson,
     UnifiedCommitteePersonVersion, CommitteeRole, State
 )
```

app/states/texas/validators/texas_settings.py (1)
51-57: Remove commented-out code. This commented-out code for flags handling should be removed. If it's intended for future implementation, track it in an issue instead.
app/states/ohio/ohio_downloader.py (1)
2-4: Unused imports detected.
`CategoryTypes`, `RecordGen`, `CategoryConfig`, `progress`, and `CreateWebDriver` are imported but not used in this file. Remove unused imports:
```diff
-from abcs import (
-    FileDownloaderABC, StateConfig, CategoryTypes, RecordGen, CategoryConfig, progress, CSVReaderConfig)
-from web_scrape_utils import CreateWebDriver, By
+from abcs import FileDownloaderABC, StateConfig, CSVReaderConfig
+from web_scrape_utils import By
```
4-9: Multiple unused imports.
BaseModel,ConfigDict,model_validator(line 4),SQLModel(line 5),tx_funcs(line 7), andRecordKeyGenerator(line 9) are imported but not used.app/states/texas/texas_downloader.py (2)
2-18: Remove unused imports.
Optional(line 2) andCreateWebDriver(line 18) are imported but never used in the code.-from typing import Self, ClassVar, Optional +from typing import Self, ClassVar-from web_scrape_utils import CreateWebDriver, By +from web_scrape_utils import By
103-103: Unused variables_file_count.
_file_countis assigned but never used on both lines 103 and 122.- _file_count = len(list(_folder.glob("*.parquet")))- _file_count = len(_file_types)Also applies to: 122-122
app/states/texas/validators/texas_personname.py (2)
2-4: Remove unused imports.
Annotated(line 2),field_validator,create_model,AliasChoices(line 4), andre(line 12) are imported but never used.-from typing import Optional, Annotated +from typing import Optional -from pydantic import field_validator, model_validator, BeforeValidator, create_model, AliasChoices +from pydantic import model_validator, BeforeValidator -import reAlso applies to: 12-12
65-70: Remove commented-out code.Commented-out code should be removed to improve readability. If it's needed for reference, it should be tracked in version control or documented elsewhere.
app/states/texas/texas_search.py (2)
2-2: Use modern type hints and remove unused imports.
Type,Self, andTupleare imported but never used. Also,ListandDictare deprecated in favor of built-inlistanddict.-from typing import Optional, List, Dict, Any, Type, NamedTuple, Self, Tuple +from typing import Optional, Any, NamedTupleThen update usages:
List[str]→list[str]Dict[str, List[str]]→dict[str, list[str]]
258-258: Self-assignment has no effect.
result.selected_choices = result.selected_choicesis a no-op.- result.selected_choices = result.selected_choices
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (6)
- `.DS_Store` is excluded by `!**/.DS_Store`
- `.idea/.DS_Store` is excluded by `!**/.DS_Store`
- `app/.DS_Store` is excluded by `!**/.DS_Store`
- `poetry.lock` is excluded by `!**/*.lock`
- `tmp/.DS_Store` is excluded by `!**/.DS_Store`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (36)
- .gitattributes (1 hunks)
- .gitignore (2 hunks)
- .idea/campaignfinance.iml (1 hunks)
- .idea/csv-editor.xml (0 hunks)
- .idea/misc.xml (1 hunks)
- .idea/ruff.xml (1 hunks)
- README.md (1 hunks)
- app/abcs/__init__.py (1 hunks)
- app/abcs/abc_download.py (2 hunks)
- app/abcs/abc_state_config.py (1 hunks)
- app/abcs/abc_validation_errors.py (1 hunks)
- app/funcs/__init__.py (1 hunks)
- app/funcs/csv_reader.py (2 hunks)
- app/funcs/validator_functions.py (1 hunks)
- app/live_display.py (1 hunks)
- app/logs/campaign_finance.log.2025-01-31 (1 hunks)
- app/main.py (3 hunks)
- app/states/ohio/ohio_downloader.py (1 hunks)
- app/states/texas/__init__.py (2 hunks)
- app/states/texas/models.py (1 hunks)
- app/states/texas/texas_downloader.py (2 hunks)
- app/states/texas/texas_search.py (1 hunks)
- app/states/texas/validators/__init__.py (1 hunks)
- app/states/texas/validators/texas_address.py (4 hunks)
- app/states/texas/validators/texas_expenses.py (4 hunks)
- app/states/texas/validators/texas_filers.py (5 hunks)
- app/states/texas/validators/texas_flags.py (1 hunks)
- app/states/texas/validators/texas_personname.py (2 hunks)
- app/states/texas/validators/texas_settings.py (3 hunks)
- app/states/unified_database.py (1 hunks)
- app/states/unified_sqlmodels.py (1 hunks)
- app/tests/test_validation_class.py (1 hunks)
- app/workflows/texas_download.py (0 hunks)
- production_loader.py (1 hunks)
- pyproject.toml (1 hunks)
- recreate_tables.py (1 hunks)
💤 Files with no reviewable changes (2)
- .idea/csv-editor.xml
- app/workflows/texas_download.py
🧰 Additional context used
🧬 Code graph analysis (19)
recreate_tables.py (1)
app/op.py (2)
database (91-92)
database_url (111-122)
app/abcs/abc_state_config.py (1)
app/funcs/toml_reader.py (1)
read_toml(6-8)
app/tests/test_validation_class.py (1)
app/abcs/abc_validation.py (1)
StateFileValidation(24-85)
app/main.py (2)
app/states/texas/texas_search.py (1)
TexasSearch(165-294)app/states/texas/texas_downloader.py (2)
download (30-98)
dataframes (277-279)
app/abcs/__init__.py (1)
app/abcs/abc_download.py (1)
FileDownloaderABC(24-142)
app/states/texas/validators/texas_flags.py (3)
app/states/texas/models.py (1)
TECBaseModel(58-64)app/states/texas/validators/texas_settings.py (2)
TECBaseModel(44-57)check_contains_factory(15-20)app/funcs/validator_functions.py (1)
check_contains_factory(166-171)
app/states/texas/__init__.py (2)
app/states/texas/texas_search.py (1)
TexasSearch(165-294)app/states/texas/validators/texas_filers.py (1)
TECFilerName(126-184)
app/funcs/validator_functions.py (1)
app/states/texas/validators/texas_settings.py (2)
check_contains_factory(15-20)check_contains(16-19)
app/states/texas/validators/texas_personname.py (3)
app/states/texas/models.py (1)
TECBaseModel(58-64)app/states/texas/validators/texas_settings.py (2)
TECBaseModel(44-57)check_contains_factory(15-20)app/states/texas/validators/texas_address.py (1)
TECAddressBase(20-142)
app/states/ohio/ohio_downloader.py (2)
app/abcs/abc_download.py (4)
FileDownloaderABC(24-142)download(109-110)consolidate_files(114-115)read(119-120)app/abcs/abc_state_config.py (4)
StateConfig(140-180)CategoryTypes(110-136)CategoryConfig(65-106)CSVReaderConfig(40-42)
app/states/texas/validators/texas_settings.py (2)
app/funcs/validator_functions.py (2)
check_contains_factory(166-171)check_contains(167-170)app/states/texas/models.py (1)
TECBaseModel(58-64)
app/funcs/csv_reader.py (3)
app/abcs/abc_validation.py (1)
logger(32-34)app/abcs/abc_category.py (1)
logger(99-101)app/logger.py (1)
Logger(10-71)
app/states/texas/validators/texas_address.py (2)
app/states/texas/models.py (1)
TECBaseModel(58-64)app/states/texas/validators/texas_settings.py (1)
TECBaseModel(44-57)
app/states/texas/validators/__init__.py (4)
app/states/texas/models.py (1)
TECBaseModel(58-64)app/states/texas/validators/texas_filers.py (1)
TECFilerName(126-184)app/states/texas/validators/texas_contributions.py (1)
TECContribution(10-294)app/states/texas/validators/texas_address.py (1)
TECAddressBase(20-142)
app/funcs/__init__.py (4)
app/funcs/csv_reader.py (1)
FileReader(54-144)app/funcs/toml_reader.py (1)
read_toml(6-8)app/funcs/file_exporters.py (1)
write_records_to_csv_validation(8-38)app/funcs/depreciated.py (1)
deprecated(5-17)
app/abcs/abc_download.py (4)
app/abcs/abc_state_config.py (3)
StateConfig(140-180)CategoryTypes(110-136)TEMP_FOLDER(149-150)app/live_display.py (4)
ProgressTracker(9-106)start(21-25)add_task(34-44)update_task(46-52)app/states/texas/texas_downloader.py (3)
download(30-98)consolidate_files(120-206)read(209-274)app/abcs/abc_category.py (1)
read(103-125)
app/states/texas/validators/texas_expenses.py (2)
app/states/texas/validators/texas_personname.py (1)
TECPersonNameBase(15-107)app/states/texas/validators/texas_settings.py (1)
TECSettings(22-41)
app/states/unified_sqlmodels.py (2)
app/op.py (1)
role(107-108)app/states/unified_database.py (1)
get_session(77-79)
production_loader.py (5)
app/states/unified_sqlmodels.py (2)
State(101-119)process_record(1321-1444)app/abcs/abc_validation.py (1)
failed_records(62-65)app/logger.py (3)
error(61-62)warning(58-59)info(52-53)app/states/unified_database.py (1)
get_session(77-79)app/live_display.py (1)
add_task(34-44)
🪛 LanguageTool
README.md
[style] ~15-~15: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1026 characters long)
Context: ...r-the-badge&logo=python&logoColor=blue)

🪛 markdownlint-cli2 (0.18.1)
README.md
3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
7-7: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
10-10: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
14-14: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
18-18: Files should end with a single newline character
(MD047, single-trailing-newline)
🪛 Ruff (0.14.7)
recreate_tables.py
1-1: Shebang is present but file is not executable
(EXE001)
12-12: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
30-30: Do not catch blind exception: Exception
(BLE001)
65-65: Do not catch blind exception: Exception
(BLE001)
70-70: Do not catch blind exception: Exception
(BLE001)
81-81: Do not catch blind exception: Exception
(BLE001)
109-109: Do not catch blind exception: Exception
(BLE001)
app/states/texas/models.py
5-5: Redefinition of unused Field from line 1
(F811)
13-13: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/states/texas/validators/texas_flags.py
10-10: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/states/texas/texas_downloader.py
25-25: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
32-32: Boolean-typed positional argument in function definition
(FBT001)
32-32: Boolean default positional argument in function definition
(FBT002)
32-32: Unused class method argument: read_from_temp
(ARG003)
33-33: Boolean-typed positional argument in function definition
(FBT001)
33-33: Boolean default positional argument in function definition
(FBT002)
36-36: Unnecessary map() usage (rewrite using a generator expression)
Replace map() with a generator expression
(C417)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
83-83: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
87-87: Unnecessary generator (rewrite as a list comprehension)
Rewrite as a list comprehension
(C400)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
95-95: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
113-113: Avoid specifying long messages outside the exception class
(TRY003)
120-120: Missing return type annotation for classmethod consolidate_files
(ANN206)
147-147: Do not catch blind exception: Exception
(BLE001)
178-178: Do not catch blind exception: Exception
(BLE001)
201-201: datetime.datetime.now() called without a tz argument
(DTZ005)
204-204: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
209-209: Boolean-typed positional argument in function definition
(FBT001)
209-209: Boolean default positional argument in function definition
(FBT002)
217-217: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Do not catch blind exception: Exception
(BLE001)
264-264: Do not catch blind exception: Exception
(BLE001)
269-269: Do not catch blind exception: Exception
(BLE001)
277-277: Missing return type annotation for classmethod dataframes
(ANN206)
app/funcs/validator_functions.py
169-169: Avoid specifying long messages outside the exception class
(TRY003)
app/live_display.py
4-4: typing.Dict is deprecated, use dict instead
(UP035)
10-10: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
40-40: datetime.datetime.now() called without a tz argument
(DTZ005)
41-41: datetime.datetime.now() called without a tz argument
(DTZ005)
51-51: datetime.datetime.now() called without a tz argument
(DTZ005)
74-74: datetime.datetime.now() called without a tz argument
(DTZ005)
94-94: Missing return type annotation for private function _queue_update
Add return type annotation: None
(ANN202)
96-96: Boolean positional value in function call
(FBT003)
98-98: Missing return type annotation for private function _update_display
Add return type annotation: None
(ANN202)
app/states/ohio/ohio_downloader.py
21-21: Boolean-typed positional argument in function definition
(FBT001)
21-21: Unused method argument: overwrite
(ARG002)
21-21: Boolean-typed positional argument in function definition
(FBT001)
21-21: Unused method argument: read_from_temp
(ARG002)
29-29: Unnecessary ... literal
Remove unnecessary ...
(PIE790)
app/states/texas/validators/texas_settings.py
5-5: Redefinition of unused Field from line 1
(F811)
18-18: Avoid specifying long messages outside the exception class
(TRY003)
app/funcs/csv_reader.py
100-101: Replace yield over for loop with yield from
Replace with yield from
(UP028)
102-102: Do not catch blind exception: Exception
(BLE001)
103-103: Use explicit conversion flag
Replace with conversion flag
(RUF010)
108-108: Missing type annotation for **kwargs
(ANN003)
121-122: Replace yield over for loop with yield from
Replace with yield from
(UP028)
124-124: Do not catch blind exception: Exception
(BLE001)
125-125: Use explicit conversion flag
Replace with conversion flag
(RUF010)
129-130: Replace yield over for loop with yield from
Replace with yield from
(UP028)
131-131: Do not catch blind exception: Exception
(BLE001)
132-132: Use explicit conversion flag
Replace with conversion flag
(RUF010)
136-136: Missing type annotation for **kwargs
(ANN003)
137-137: Unnecessary mode argument
Remove mode argument
(UP015)
app/states/texas/validators/texas_address.py
147-147: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/states/texas/validators/texas_filers.py
23-23: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
30-30: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
46-46: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
53-53: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
68-68: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
84-84: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
84-84: Missing type annotation for **data
(ANN003)
99-99: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
114-114: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
123-123: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
128-128: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
152-152: Missing return type annotation for classmethod validate_name
(ANN206)
158-158: Missing return type annotation for classmethod validate_addresses
(ANN206)
167-167: Missing return type annotation for classmethod validate_treasurer
(ANN206)
174-174: Missing return type annotation for classmethod validate_assistant_treasurer
(ANN206)
181-181: Missing return type annotation for classmethod validate_chair
(ANN206)
app/states/unified_database.py
5-5: typing.List is deprecated, use list instead
(UP035)
5-5: typing.Dict is deprecated, use dict instead
(UP035)
31-31: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
44-44: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Avoid specifying long messages outside the exception class
(TRY003)
152-152: Boolean-typed positional argument in function definition
(FBT001)
152-152: Boolean default positional argument in function definition
(FBT002)
283-283: Naive datetime constructed using datetime.datetime.strptime() without %z
(DTZ007)
284-284: Naive datetime constructed using datetime.datetime.strptime() without %z
(DTZ007)
574-574: datetime.datetime.utcnow() used
(DTZ003)
583-583: datetime.datetime.utcnow() used
(DTZ003)
617-617: datetime.datetime.utcnow() used
(DTZ003)
651-651: datetime.datetime.utcnow() used
(DTZ003)
685-685: datetime.datetime.utcnow() used
(DTZ003)
738-738: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
744-744: datetime.datetime.utcnow() used
(DTZ003)
745-745: datetime.datetime.utcnow() used
(DTZ003)
753-753: Boolean-typed positional argument in function definition
(FBT001)
753-753: Boolean default positional argument in function definition
(FBT002)
760-760: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
768-768: Boolean-typed positional argument in function definition
(FBT001)
768-768: Boolean default positional argument in function definition
(FBT002)
777-777: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
817-817: datetime.datetime.utcnow() used
(DTZ003)
827-827: datetime.datetime.utcnow() used
(DTZ003)
855-855: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
865-865: Boolean-typed positional argument in function definition
(FBT001)
865-865: Boolean default positional argument in function definition
(FBT002)
878-878: Unused method argument: user
(ARG002)
891-891: datetime.datetime.utcnow() used
(DTZ003)
935-935: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
985-985: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
1059-1059: Unused method argument: user
(ARG002)
1069-1069: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
1081-1081: Comparison to None should be cond is None
Replace with cond is None
(E711)
1092-1092: datetime.datetime.utcnow() used
(DTZ003)
1107-1107: Unused method argument: user
(ARG002)
1132-1132: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
1148-1148: datetime.datetime.utcnow() used
(DTZ003)
1163-1163: Avoid equality comparisons to True; use UnifiedCommitteePerson.is_active: for truth checks
Replace with UnifiedCommitteePerson.is_active
(E712)
1179-1179: Comparison to None should be cond is None
Replace with cond is None
(E711)
1203-1203: Do not catch blind exception: Exception
(BLE001)
1208-1208: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
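The DTZ003, E712, and FBT001/FBT002 findings above share a pattern; a stdlib-only sketch of the fixes (names are illustrative; note that in a SQLAlchemy `.where()` clause the E712 fix is to pass the column expression, e.g. `UnifiedCommitteePerson.is_active`, directly):

```python
from datetime import datetime, timezone

# DTZ003: datetime.utcnow() yields a naive datetime; request an aware one.
now = datetime.now(timezone.utc)
assert now.tzinfo is not None


# FBT001/FBT002: make boolean flags keyword-only so call sites read clearly.
def active_ids(records: list[dict], *, include_inactive: bool = False) -> list[int]:
    # E712: test truthiness directly rather than comparing `== True`.
    return [r["id"] for r in records if r["is_active"] or include_inactive]


rows = [{"id": 1, "is_active": True}, {"id": 2, "is_active": False}]
print(active_ids(rows))  # [1]
print(active_ids(rows, include_inactive=True))  # [1, 2]
```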
app/abcs/abc_download.py
6-6: Import from collections.abc instead: Generator
Import from collections.abc
(UP035)
6-6: typing.Dict is deprecated, use dict instead
(UP035)
30-30: Missing return type annotation for special method __post_init__
(ANN204)
36-36: Missing return type annotation for classmethod not_headless
(ANN206)
62-62: Missing return type annotation for classmethod extract_zipfile
Add return type annotation: None
(ANN206)
68-68: Do not catch blind exception: Exception
(BLE001)
73-73: Missing return type annotation for classmethod _process_csv
Add return type annotation: None
(ANN206)
85-85: datetime.datetime.now() called without a tz argument
(DTZ005)
109-109: Boolean-typed positional argument in function definition
(FBT001)
109-109: Boolean-typed positional argument in function definition
(FBT001)
114-114: Missing return type annotation for classmethod consolidate_files
(ANN206)
119-119: Missing return type annotation for classmethod read
(ANN206)
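The UP035 findings here follow the same recipe: import container ABCs such as `Generator` from `collections.abc`, and use builtin generics in annotations. A small sketch (the helper is invented for illustration):

```python
# UP035: import Generator from collections.abc, not typing; use builtin
# generics (list[int]) instead of deprecated typing.List/typing.Dict.
from collections.abc import Generator


def chunks(items: list[int], size: int) -> Generator[list[int], None, None]:
    # Yield successive fixed-size slices of the input list.
    for i in range(0, len(items), size):
        yield items[i : i + size]


print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```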
app/states/texas/validators/texas_expenses.py
161-161: Missing return type annotation for classmethod validate_payee
(ANN206)
app/states/texas/texas_search.py
2-2: typing.List is deprecated, use list instead
(UP035)
2-2: typing.Dict is deprecated, use dict instead
(UP035)
2-2: typing.Type is deprecated, use type instead
(UP035)
2-2: typing.Tuple is deprecated, use tuple instead
(UP035)
48-48: Missing return type annotation for special method __post_init__
(ANN204)
53-53: Missing return type annotation for private function _sort_fields
(ANN202)
63-63: Missing return type annotation for private function _recast_sorted_field_types
(ANN202)
85-85: Avoid specifying long messages outside the exception class
(TRY003)
94-94: Missing return type annotation for private function _coalesce_names
(ANN202)
120-120: Missing return type annotation for special method __repr__
Add return type annotation: str
(ANN204)
169-169: Missing return type annotation for special method __repr__
Add return type annotation: str
(ANN204)
172-172: Missing return type annotation for special method __post_init__
(ANN204)
180-180: Unnecessary dict() call (rewrite as a literal)
Rewrite as a literal
(C408)
230-230: Missing type annotation for *args
(ANN002)
235-235: Unnecessary dict() call (rewrite as a literal)
Rewrite as a literal
(C408)
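The C408 items above are mechanical rewrites; a sketch combining them with the UP035 builtin-generics fix (function and keys are invented):

```python
# C408: a dict literal is clearer (and slightly faster) than a dict(...) call;
# UP035: builtin dict[str, int] / list[str] replace typing.Dict/List/Tuple.
def field_summary(names: list[str]) -> dict[str, int]:
    # Before: return dict(count=len(names), unique=len(set(names)))
    return {"count": len(names), "unique": len(set(names))}


print(field_summary(["filerName", "filerIdent", "filerName"]))
```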
app/states/unified_sqlmodels.py
8-8: typing.List is deprecated, use list instead
(UP035)
8-8: typing.Dict is deprecated, use dict instead
(UP035)
136-136: Unnecessary UTF-8 encoding argument to encode
Remove unnecessary encoding argument
(UP012)
164-164: Missing return type annotation for special method __post_init__
(ANN204)
227-227: Missing return type annotation for special method __post_init__
(ANN204)
288-288: Missing return type annotation for special method __post_init__
(ANN204)
414-414: Class field state_id is defined multiple times
Remove duplicate field definition for state_id
(PIE794)
747-747: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
853-853: Unused method argument: entity_role
(ARG002)
953-953: Dynamically typed expressions (typing.Any) are disallowed in _get_field_value
(ANN401)
990-990: Consider moving this statement to an else block
(TRY300)
991-991: Do not catch blind exception: Exception
(BLE001)
1007-1007: Do not catch blind exception: Exception
(BLE001)
1036-1036: Do not catch blind exception: Exception
(BLE001)
1094-1094: Do not catch blind exception: Exception
(BLE001)
1183-1183: Possible SQL injection vector through string-based query construction
(S608)
1189-1189: Do not catch blind exception: Exception
(BLE001)
1201-1201: Do not catch blind exception: Exception
(BLE001)
1204-1204: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1218-1218: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1228-1228: Naive datetime constructed using datetime.datetime.strptime() without %z
(DTZ007)
1234-1234: Consider moving this statement to an else block
(TRY300)
1238-1238: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1293-1293: Dynamically typed expressions (typing.Any) are disallowed in obj
(ANN401)
1293-1293: Dynamically typed expressions (typing.Any) are disallowed in _json_default
(ANN401)
1304-1304: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
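For the S608 finding, the fix is to bind values as query parameters instead of interpolating them into the SQL string. A minimal sketch using stdlib `sqlite3` standing in for the project's PostgreSQL engine (table and data are invented; with SQLAlchemy the equivalent is `text("... WHERE name = :name")` with bound parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER, name TEXT)")
conn.execute("INSERT INTO entities VALUES (1, 'ACME PAC')")

name = "ACME PAC"  # in the flagged code path this value comes from input data
# Vulnerable: conn.execute(f"SELECT id FROM entities WHERE name = '{name}'")
# Safe: the driver escapes the bound value, closing the injection vector.
row = conn.execute("SELECT id FROM entities WHERE name = ?", (name,)).fetchone()
print(row)  # (1,)
```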
production_loader.py
1-1: Shebang is present but file is not executable
(EXE001)
19-19: typing.Dict is deprecated, use dict instead
(UP035)
19-19: typing.List is deprecated, use list instead
(UP035)
19-19: typing.Tuple is deprecated, use tuple instead
(UP035)
152-152: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
203-203: Loop control variable slug not used within loop body
Rename unused slug to _slug
(B007)
206-206: Avoid specifying long messages outside the exception class
(TRY003)
451-451: Possible SQL injection vector through string-based query construction
(S608)
469-469: Possible SQL injection vector through string-based query construction
(S608)
536-536: Missing return type annotation for private function _cache_entity
(ANN202)
552-552: Missing return type annotation for private function _cache_campaign
(ANN202)
568-568: Missing return type annotation for private function _ensure_address
(ANN202)
700-700: Avoid specifying long messages outside the exception class
(TRY003)
823-823: Do not catch blind exception: Exception
(BLE001)
832-832: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
832-832: Logging statement uses f-string
(G004)
855-855: Avoid specifying long messages outside the exception class
(TRY003)
876-876: Avoid specifying long messages outside the exception class
(TRY003)
931-931: Logging statement uses f-string
(G004)
1038-1038: f-string without any placeholders
Remove extraneous f prefix
(F541)
1044-1044: Logging .exception(...) should be used instead of .error(..., exc_info=True)
(G201)
1044-1044: Logging statement uses f-string
(G004)
1048-1048: f-string without any placeholders
Remove extraneous f prefix
(F541)
1049-1049: f-string without any placeholders
Remove extraneous f prefix
(F541)
1050-1050: f-string without any placeholders
Remove extraneous f prefix
(F541)
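The TRY400/G201/G004 logging findings above reduce to one idiom; a sketch (the function and logger name are illustrative):

```python
# TRY400/G201: inside an except block, logger.exception() records the
# traceback automatically; G004: pass arguments lazily instead of f-strings
# so formatting only happens when the record is actually emitted.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("campaignfinance")


def load_batch(batch_id: int) -> bool:
    try:
        raise ValueError("bad row")
    except ValueError:
        # Before: logger.error(f"batch {batch_id} failed", exc_info=True)
        logger.exception("batch %s failed", batch_id)
        return False


print(load_batch(7))  # False
```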
🔇 Additional comments (12)
.gitattributes (1)
1-1: Verify removal of CSV and ZIP from LFS tracking.

The change removes `*.csv` and `*.zip` from Git LFS tracking and replaces them with `*.parquet`. Combined with the `.gitignore` changes that now actively ignore `*.csv`, CSV files would be completely untracked (neither in Git nor LFS). Confirm this is intentional: are CSV files temporary/transient, or do they need to be persisted in some form? If CSVs are generated/downloaded in CI pipelines or shared workflows, verify that removing LFS tracking won't cause issues.
app/abcs/abc_validation_errors.py (1)
7-7: LGTM!

The logging prefix standardization to `'campaignfinance|'` aligns with the project-wide logging consistency effort.

recreate_tables.py (1)
12-12: The import is intentional for side-effect registration.

The `# noqa: F401` comment is appropriate here since the import registers SQLModel metadata as a side effect. However, static analysis flags the noqa directive itself. This is a linter configuration issue; the import is correctly used.

app/states/texas/validators/texas_filers.py (1)
28-37: Unusual composite primary key design with all-optional fields.
`TECFilerIDNameLink` defines multiple `primary_key=True` fields that are all `Optional` except `filer_id`. Having optional primary key columns is unconventional and may cause issues with NULL values in primary key constraints. Consider whether this link table design is intentional.

app/abcs/abc_state_config.py (1)
154-158: LGTM!

The reformatted multiline expression improves readability while maintaining identical functionality.
app/states/texas/__init__.py (2)
6-6: LGTM!

The new `TexasSearch` import exposes the search functionality from the module's public interface.
32-34: TECFilerName validator is correctly implemented with proper field mappings.

The change from `TECFiler` to `TECFilerName` is intentional and complete. `TECFilerName` is properly defined with comprehensive field mappings including filerIdent, filerJdiCd, committeeStatusCd, filerName, filerAddress, filerMailing, and relationships to related entities. The old `TECFiler` class is fully deprecated (commented out), with no active references remaining. The integration into the filers category configuration is correct.

app/abcs/__init__.py (1)
7-7: LGTM!

The `progress` object is now part of the public API, enabling progress tracking across downloader components. This aligns with its usage in `FileDownloaderABC` and the Texas downloader modules.

app/states/texas/validators/__init__.py (1)
1-10: LGTM on the restructured exports.

The updated public API surface correctly exposes the new base models (`TECBaseModel`, `TECAddressBase`) and renamed entities (`TECFilerName`), aligning with the refactored validator architecture.

app/live_display.py (1)
64-92: LGTM on the table generation logic.

The color-coded status display with duration tracking is well-implemented. Thread-safe access via `self._lock` during table generation ensures consistent reads.

app/states/texas/validators/texas_expenses.py (1)
18-20: `table=True` removed but table metadata retained.

`TECExpense` no longer has `table=True` but still has `__tablename__` and `__table_args__` defined. If this is intentionally not a table model, remove the metadata. If it should be a table, add `table=True` back.

app/states/ohio/ohio_downloader.py (1)
21-35: Method signature mismatch with abstract base class.

The `download`, `consolidate_files`, and `read` methods use `self` (instance methods), but `FileDownloaderABC` declares them as `@classmethod` with a `cls` parameter. This inconsistency will cause issues when these methods are invoked. Apply this diff to align with the ABC:

```diff
-    def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+    @classmethod
+    def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
-        _driver = self.driver.create_driver()
-        _wait = self.driver
+        _driver = cls.driver.create_driver()
+        _wait = cls.driver
         _driver.get(OHIO_ENTRY_URL)
         _wait.wait_until_clickable(By.LINK_TEXT, "FTP Site")
         _ftp_site_url = _driver.find_element(By.LINK_TEXT, "FTP Site").get_attribute("href")
         _driver.get(_ftp_site_url)
-
-        ...
+        return cls

-    def consolidate_files(self):
-        ...
+    @classmethod
+    def consolidate_files(cls):
+        pass

-    def read(self):
-        ...
+    @classmethod
+    def read(cls):
+        pass
```

Likely an incorrect or invalid review comment.
```gitignore
*.zip
*.DS_Store
*.env
# *.parquet
```
🛠️ Refactor suggestion | 🟠 Major
Consolidate redundant commented and uncommented ignore patterns.
Lines 7 and 10 contain commented versions of patterns, while lines 178–179 contain their uncommented equivalents. This duplication creates maintenance burden. Consider removing the commented versions or consolidating at the end of the file with a single definitive block.
Also applies to: 10-10, 178-179
🤖 Prompt for AI Agents
.gitignore lines 7, 10 and lines 178–179: there are duplicate commented ignore
patterns (lines 7 and 10) that match the uncommented equivalents at 178–179;
remove the redundant commented entries (or move any intentionally kept comments
into a single definitive explanatory block at the end of the file) so each
ignore pattern appears only once (keep the active/uncommented patterns where
they currently are and delete the commented duplicates).
```gitignore
*.csv
*.txt
```
🧩 Analysis chain
🏁 Script executed:
wc -l .gitignoreRepository: Abstract-Data/campaignfinance-2023
Length of output: 92
🏁 Script executed:
head -10 .gitignore | cat -nRepository: Abstract-Data/campaignfinance-2023
Length of output: 244
🏁 Script executed:
sed -n '170,180p' .gitignore | cat -nRepository: Abstract-Data/campaignfinance-2023
Length of output: 543
🏁 Script executed:
# Check if *.csv appears multiple times
rg "^\*\.csv$" .gitignoreRepository: Abstract-Data/campaignfinance-2023
Length of output: 89
🏁 Script executed:
# Check if *.txt already exists elsewhere
rg "^\*\.txt$" .gitignoreRepository: Abstract-Data/campaignfinance-2023
Length of output: 83
Remove duplicate CSV rule and clarify intent of broad *.txt pattern.
Line 176 duplicates the *.csv rule already present at line 2. Additionally, line 177's *.txt pattern is overly broad and could inadvertently ignore important text files (README, documentation, config templates, etc.).
Clarify the intent: Is *.txt meant to ignore temporary text outputs, or is this too broad for the repository?
🤖 Prompt for AI Agents
In .gitignore around lines 176-177, remove the duplicate "*.csv" entry (it's
already present at line 2) and replace or clarify the overly broad "*.txt" rule:
either delete it if not needed, scope it to a specific directory/pattern for
temporary outputs (e.g., temp/*.txt or *.out.txt) or add explicit negations for
important files (e.g., !README.txt, !docs/*.txt) and include a short comment
above the rule explaining the intent (temporary outputs vs generated artifacts).
```xml
<excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
<excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
<excludeFolder url="file://$MODULE_DIR$/.venv" />
```
🧹 Nitpick | 🔵 Trivial
Redundant folder exclusions.
Lines 6-7 exclude specific subdirectories of .venv, but line 8 excludes the entire .venv directory. When a parent directory is excluded, all its subdirectories are implicitly excluded, making lines 6-7 redundant.
Apply this diff to remove the redundant exclusions:
```diff
 <content url="file://$MODULE_DIR$">
   <sourceFolder url="file://$MODULE_DIR$/app" isTestSource="false" />
-  <excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
-  <excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
   <excludeFolder url="file://$MODULE_DIR$/.venv" />
 </content>
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-<excludeFolder url="file://$MODULE_DIR$/.venv/bin" />
-<excludeFolder url="file://$MODULE_DIR$/.venv/lib" />
 <excludeFolder url="file://$MODULE_DIR$/.venv" />
```
🤖 Prompt for AI Agents
In .idea/campaignfinance.iml around lines 6 to 8, remove the two redundant
excludeFolder entries for file://$MODULE_DIR$/.venv/bin and
file://$MODULE_DIR$/.venv/lib and keep only the single excludeFolder entry that
excludes file://$MODULE_DIR$/.venv so that the parent .venv exclusion covers its
subdirectories.
```xml
<excludeFolder url="file://$MODULE_DIR$/.venv" />
</content>
<orderEntry type="jdk" jdkName="Poetry (campaignfinance) (2)" jdkType="Python SDK" />
<orderEntry type="jdk" jdkName="Python 3.12 UV (campaignfinance)" jdkType="Python SDK" />
```
🧹 Nitpick | 🔵 Trivial
Hardcoded SDK reference is machine-specific.
The SDK name "Python 3.12 UV (campaignfinance)" is an IDE-specific identifier that may differ across team members' machines. IDE configuration files in .idea/ are typically added to .gitignore because they contain machine-specific settings like SDK paths, tool locations, and personal preferences.
Consider adding .idea/ to .gitignore (keeping only shared configurations like code style if needed), or document that developers must configure their local SDK to match this exact name.
🤖 Prompt for AI Agents
In .idea/campaignfinance.iml around line 10 the file contains a hardcoded,
machine-specific SDK entry "Python 3.12 UV (campaignfinance)"; remove that
SDK-specific orderEntry (or delete the whole .iml) and add .idea/ to .gitignore
(or keep only shared IDE files) so personal SDK names aren't committed;
alternatively document required local SDK name if you must keep a shared
reference.
```
@@ -6,7 +6,7 @@
<option name="pathToExecutable" value="/opt/homebrew/bin/black" />
```
🧹 Nitpick | 🔵 Trivial
Hardcoded absolute path is machine-specific.
The path /opt/homebrew/bin/black is specific to macOS with Homebrew and won't work on Linux, Windows, or systems with different package managers. This will cause Black integration to fail for other developers.
As noted for the previous file, .idea/ configurations are typically machine-specific and should be excluded from version control. If you choose to commit them, document the expected tool locations or use IDE features to detect tools from the project's virtual environment automatically.
🤖 Prompt for AI Agents
In .idea/misc.xml around line 6 the Black executable path is hardcoded to
/opt/homebrew/bin/black which is machine-specific; replace this by removing the
absolute path and either rely on IDE's auto-detection (use the project's
virtualenv/interpreter setting) or store a relative/variable-based path (e.g.,
use a configurable IDEA path macro or project interpreter) and add .idea/ files
to .gitignore or document expected tool locations in repo docs so other
developers don't get a broken config.
```markdown
# Texas Campaign Finance Package

## Overview
This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples
Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

## Ability to Download TEC File Data Built-In
Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.

## Dependencies
(dependency badge images)
```

No newline at end of file
🧹 Nitpick | 🔵 Trivial
Minor markdown formatting improvements.
Add blank lines after headings and ensure file ends with a newline for proper markdown rendering and linting compliance.
```diff
 # Texas Campaign Finance Package

 ## Overview
+
 This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
 It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

 ## Examples
+
 Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

 ## Ability to Download TEC File Data Built-In
+
 Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.

 ## Dependencies
+
 (dependency badge images)
+
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
# Texas Campaign Finance Package

## Overview

This package is designed to provide a simple interface for accessing campaign finance data from the Texas Ethics Commission.
It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples

Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.

## Ability to Download TEC File Data Built-In

Using [Selenium](https://www.selenium.dev/), this package can download the latest campaign finance data from the Texas Ethics Commission website. The data is then processed and saved as CSV files.

## Dependencies

(dependency badge images)
```
🧰 Tools
🪛 LanguageTool
[style] ~15-~15: Using many exclamation marks might seem excessive (in this case: 3 exclamation marks for a text that’s 1026 characters long)
Context: ...r-the-badge&logo=python&logoColor=blue)

🪛 markdownlint-cli2 (0.18.1)
3-3: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
7-7: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
10-10: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
14-14: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
18-18: Files should end with a single newline character
(MD047, single-trailing-newline)
🤖 Prompt for AI Agents
In README.md around lines 1 to 18, headings are missing blank lines after them
and the file likely does not end with a newline; add a single blank line
immediately after each Markdown heading (e.g., after "# Texas Campaign Finance
Package", "## Overview", "## Examples", "## Ability to Download TEC File Data
Built-In", and "## Dependencies") and ensure the file ends with a trailing
newline character.
```markdown
It also reduces duplication of fields and joins data from multiple files into a single table to reduce the size of the data.

## Examples
Across all files, there are over 317 columns. This package reduces the number of columns to [number of columns] by joining data from multiple files.
```
Placeholder text needs to be replaced.
The text [number of columns] appears to be a placeholder that should be replaced with the actual column count.
🤖 Prompt for AI Agents
In README.md around line 8, the placeholder "[number of columns]" must be
replaced with the actual column count; determine the correct number (e.g., 317
or the current accurate count) and replace the placeholder with that numeric
value, keeping the sentence grammatical and updating any surrounding punctuation
or wording as needed.
```python
    except Exception as e:
        console.print(f"[red]❌ Setup failed: {e}[/red]")
        return
```
🧹 Nitpick | 🔵 Trivial
Catching bare Exception hides root causes.
Broad exception handling makes debugging difficult. Catch specific exceptions:
```diff
-    except Exception as e:
-        console.print(f"[red]❌ Setup failed: {e}[/red]")
+    except (ImportError, ConnectionError, OperationalError) as e:
+        console.print(f"[red]❌ Setup failed: {e}[/red]")
```

At minimum, log the full traceback for debugging:

```python
import traceback

    except Exception as e:
        console.print(f"[red]❌ Setup failed: {e}[/red]")
        console.print(f"[dim]{traceback.format_exc()}[/dim]")
```

Also applies to: 70-72, 81-83
🧰 Tools
🪛 Ruff (0.14.7)
30-30: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In recreate_tables.py around lines 30-32 (and similarly at 70-72 and 81-83), the
code catches a bare Exception and only prints the exception message; update each
except block to catch more specific exceptions where possible and at minimum
include the full traceback in the output for debugging: import the traceback
module at the top, and in those except blocks print the original error message
and the formatted traceback (traceback.format_exc()) so root causes are visible;
prefer replacing broad Exception with the specific exception types raised by the
failing operations if known.
```python
# Drop existing unified tables
console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")
try:
    with engine.connect() as conn:
        # Drop tables in reverse dependency order
        tables_to_drop = [
            'unified_committee_person_versions',
            'unified_transaction_versions',
            'unified_person_versions',
            'unified_committee_versions',
            'unified_address_versions',
            'unified_campaign_entities',
            'unified_entity_associations',
            'unified_transaction_persons',
            'unified_committee_persons',
            'unified_contributions',
            'unified_loans',
            'unified_transactions',
            'file_origins',
            'unified_campaigns',
            'unified_entities',
            'unified_persons',
            'unified_committees',
            'unified_addresses',
            'states'
        ]

        for table in tables_to_drop:
            try:
                conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                console.print(f"[green]✅ Dropped {table}[/green]")
            except Exception as e:
                console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")

        conn.commit()

except Exception as e:
    console.print(f"[red]❌ Error dropping tables: {e}[/red]")
    return

# Create new tables with correct schema
console.print("\n[bold blue]Creating new unified tables...[/bold blue]")
try:
    # Create all tables
    SQLModel.metadata.create_all(engine)
    console.print("[green]✅ All unified tables created successfully![/green]")

except Exception as e:
    console.print(f"[red]❌ Error creating tables: {e}[/red]")
    return
```
Schema migration lacks transactional safety.
If table creation fails after tables are dropped, the database is left in an inconsistent state with missing tables. Consider:
- Wrap the entire operation in a transaction
- Or use a backup/restore approach
- Or at minimum, warn users this is destructive
```diff
+    console.print("[yellow]⚠️ WARNING: This operation is destructive and non-recoverable![/yellow]")
+    console.print("[yellow]Consider backing up your database first.[/yellow]")
+
     # Drop existing unified tables
     console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")
```
console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")For production use, consider migration tools like Alembic that provide rollback capabilities.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Drop existing unified tables
console.print("\n[bold blue]Dropping existing unified tables...[/bold blue]")
console.print("[yellow]⚠️ WARNING: This operation is destructive and non-recoverable![/yellow]")
console.print("[yellow]Consider backing up your database first.[/yellow]")
try:
    with engine.connect() as conn:
        # Drop tables in reverse dependency order
        tables_to_drop = [
            'unified_committee_person_versions',
            'unified_transaction_versions',
            'unified_person_versions',
            'unified_committee_versions',
            'unified_address_versions',
            'unified_campaign_entities',
            'unified_entity_associations',
            'unified_transaction_persons',
            'unified_committee_persons',
            'unified_contributions',
            'unified_loans',
            'unified_transactions',
            'file_origins',
            'unified_campaigns',
            'unified_entities',
            'unified_persons',
            'unified_committees',
            'unified_addresses',
            'states'
        ]

        for table in tables_to_drop:
            try:
                conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                console.print(f"[green]✅ Dropped {table}[/green]")
            except Exception as e:
                console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")

        conn.commit()

except Exception as e:
    console.print(f"[red]❌ Error dropping tables: {e}[/red]")
    return

# Create new tables with correct schema
console.print("\n[bold blue]Creating new unified tables...[/bold blue]")
try:
    # Create all tables
    SQLModel.metadata.create_all(engine)
    console.print("[green]✅ All unified tables created successfully![/green]")
except Exception as e:
    console.print(f"[red]❌ Error creating tables: {e}[/red]")
    return
```
🧰 Tools
🪛 Ruff (0.14.7)
65-65: Do not catch blind exception: Exception
(BLE001)
70-70: Do not catch blind exception: Exception
(BLE001)
81-81: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In recreate_tables.py around lines 34 to 83, the script drops tables then
creates them without transactional safety; change the flow so the destructive
drops and subsequent create_all occur inside a single transaction (use
engine.begin()/engine.connect().begin() or a Session/transaction context) and
only commit after SQLModel.metadata.create_all succeeds, bind create_all to the
same connection/transaction; additionally add a pre-check/backup step (e.g.,
prompt the user and optionally dump or rename existing tables/schemas) and
explicit destructive-warning messaging, and consider switching this workflow to
Alembic migrations for production to enable proper rollbacks.
```python
for table in tables_to_drop:
    try:
        conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
        console.print(f"[green]✅ Dropped {table}[/green]")
    except Exception as e:
        console.print(f"[yellow]⚠️ Could not drop {table}: {e}[/yellow]")
conn.commit()
```
🧹 Nitpick | 🔵 Trivial
Use parameterized queries or identifier quoting to avoid SQL injection patterns.
While the table names are hardcoded in tables_to_drop, directly interpolating into SQL strings establishes a risky pattern. Use SQLAlchemy's quoted_name or text().bindparams() for safety:
```diff
+from sqlalchemy import quoted_name
+from sqlalchemy.sql import text
 ...
 for table in tables_to_drop:
     try:
-        conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
+        # Use quoted identifier to prevent SQL injection patterns
+        conn.execute(text(f"DROP TABLE IF EXISTS {quoted_name(table, quote=True)} CASCADE"))
```

Alternatively, use SQLAlchemy's `Table.drop()` API with the metadata for type-safe operations.
🧰 Tools
🪛 Ruff (0.14.7)
65-65: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In recreate_tables.py around lines 61 to 68, the code interpolates table names
directly into a DROP TABLE SQL string which encourages SQL-injection-prone
patterns; replace this by using SQLAlchemy-safe identifier handling — either
construct Table objects from your MetaData and call table.drop(bind=conn,
checkfirst=True) for each name, or if you must use text(), quote identifiers
with sqlalchemy.sql.elements.quoted_name or use sql.text() with proper
identifier binding via metadata/engine.dialect.identifier_preparer.quote to
produce a safe quoted identifier before executing; update the try/except branch
to execute the safe API and keep the conn.commit() after successful drops.
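As a plain-Python illustration of what identifier quoting buys (SQLAlchemy's dialect-aware `identifier_preparer` does this for real; `quote_ident` below is a hypothetical helper, not a library API):

```python
def quote_ident(name: str) -> str:
    # Standard SQL identifier quoting: wrap the name in double quotes
    # and double any embedded quote characters.
    return '"' + name.replace('"', '""') + '"'

# A quoted identifier can no longer terminate the statement early.
stmt = f"DROP TABLE IF EXISTS {quote_ident('unified_persons')} CASCADE"
```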
```python
progress = ProgressTracker()
progress.start()
```
Side effect at module import time.
Starting ProgressTracker at module load (progress.start()) means the live display begins whenever this module is imported, even if no download is happening. This can cause unexpected console output and resource usage.
Consider lazy initialization or explicit start:
```diff
 progress = ProgressTracker()
-progress.start()
```

Then call `progress.start()` explicitly in the `download()` method when needed.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 20-21, creating and starting
ProgressTracker at module import time (progress = ProgressTracker();
progress.start()) causes an unwanted live display and resource use whenever the
module is imported; change to lazy initialization by removing progress.start()
from module scope, instantiate ProgressTracker (or call progress.start()) inside
the download() method when an actual download begins (or use a helper that
creates/starts the tracker on first use), and ensure the tracker is properly
stopped/cleaned up after the download completes or on errors.
```python
try:
    cls._process_csv(zip_ref, file, tmp)
except Exception as e:
    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
```
Broad exception catch may hide critical errors.
Catching bare Exception and only logging with ic() can hide serious issues like disk full, permission errors, or corrupted archives. Consider catching specific exceptions or at least logging the full traceback.
```diff
 try:
     cls._process_csv(zip_ref, file, tmp)
-except Exception as e:
-    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+except (OSError, IOError, pl.exceptions.ComputeError) as e:
+    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+    raise
```

📝 Committable suggestion

Before:

```python
try:
    cls._process_csv(zip_ref, file, tmp)
except Exception as e:
    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
```

After:

```python
try:
    cls._process_csv(zip_ref, file, tmp)
except (OSError, IOError, pl.exceptions.ComputeError) as e:
    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
    raise
```
🧰 Tools
🪛 Ruff (0.14.7)
68-68: Do not catch blind exception: Exception
(BLE001)
```python
_csv_task = progress.add_task("T5", f"Extract CSV {file_name.stem}", "Started")
zip_ref.extract(file, tmp)

if file_name.suffix == '.txt':
    return
```
Progress task not completed for .txt files.
When the file is a .txt, the function returns early on line 83 without calling progress.update_task(_csv_task, "Complete"). This leaves the task in "Started" status indefinitely.
```diff
 if file_name.suffix == '.txt':
+    progress.update_task(_csv_task, "Complete")
     return
```

📝 Committable suggestion

Before:

```python
_csv_task = progress.add_task("T5", f"Extract CSV {file_name.stem}", "Started")
zip_ref.extract(file, tmp)
if file_name.suffix == '.txt':
    return
```

After:

```python
_csv_task = progress.add_task("T5", f"Extract CSV {file_name.stem}", "Started")
zip_ref.extract(file, tmp)
if file_name.suffix == '.txt':
    progress.update_task(_csv_task, "Complete")
    return
```
🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 79 to 83, the progress task _csv_task
is left in "Started" when a .txt file causes an early return; before returning
for file_name.suffix == '.txt' call progress.update_task(_csv_task, "Complete")
(or an equivalent completion/cleanup call) so the task status is set to
"Complete" before the function returns.
```python
if file_name.suffix == '.txt':
    return

rename = f"{file_name.stem}_{datetime.now():%Y%m%d}dl"
```
🧹 Nitpick | 🔵 Trivial
datetime.now() called without timezone.
Using naive datetime for file naming can cause inconsistencies across different environments or if the system timezone changes.
```diff
+from datetime import datetime, timezone
 ...
-    rename = f"{file_name.stem}_{datetime.now():%Y%m%d}dl"
+    rename = f"{file_name.stem}_{datetime.now(timezone.utc):%Y%m%d}dl"
```

📝 Committable suggestion

Before:

```python
rename = f"{file_name.stem}_{datetime.now():%Y%m%d}dl"
```

After:

```python
rename = f"{file_name.stem}_{datetime.now(timezone.utc):%Y%m%d}dl"
```
🧰 Tools
🪛 Ruff (0.14.7)
85-85: datetime.datetime.now() called without a tz argument
(DTZ005)
🤖 Prompt for AI Agents
In app/abcs/abc_download.py around line 85, the filename uses a naive datetime
(datetime.now()) which lacks timezone information; replace it with a
timezone-aware timestamp (for example datetime.now(tz=timezone.utc) or
datetime.now().astimezone()) when formatting the rename string and ensure the
appropriate timezone import (from datetime import datetime, timezone) is added
at the top of the file so the generated filename is consistently based on an
explicit timezone (e.g., UTC).
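A quick sketch of the timezone-aware variant (the stem name is illustrative):

```python
from datetime import datetime, timezone

def dated_name(stem: str) -> str:
    # UTC keeps generated names stable regardless of the host timezone,
    # so two machines processing the same batch agree on the filename date.
    return f"{stem}_{datetime.now(timezone.utc):%Y%m%d}dl"

name = dated_name("contribs")
```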
```diff
 @classmethod
 @abc.abstractmethod
 def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
     ...

 @classmethod
 @abc.abstractmethod
-def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+def consolidate_files(cls):
     ...

 @classmethod
 @abc.abstractmethod
-def read(self):
+def read(cls):
     ...
```
🧹 Nitpick | 🔵 Trivial
Consider keyword-only arguments for boolean parameters.
The download method has two boolean positional parameters which can be confusing at call sites (e.g., download(True, False)).
```diff
 @classmethod
 @abc.abstractmethod
-def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+def download(cls, *, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
     ...
```

🧰 Tools
🪛 Ruff (0.14.7)
109-109: Boolean-typed positional argument in function definition
(FBT001)
109-109: Boolean-typed positional argument in function definition
(FBT001)
114-114: Missing return type annotation for classmethod consolidate_files
(ANN206)
119-119: Missing return type annotation for classmethod read
(ANN206)
🤖 Prompt for AI Agents
In app/abcs/abc_download.py around lines 107 to 120, the download method
currently accepts two boolean positional arguments which are ambiguous at call
sites; change the abstractmethod signature to require keyword-only booleans
(e.g., introduce a positional-only marker by adding a lone * so overwrite and
read_from_temp must be passed by name), update all concrete implementations to
match the new signature, and update all call sites to use explicit keywords
(overwrite=..., read_from_temp=...) to avoid confusion; ensure type hints and
return type remain unchanged and run tests to catch any mismatches.
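A minimal sketch of the keyword-only signature (the classes here are stand-ins for the PR's `FileDownloaderABC` hierarchy, not its actual code):

```python
import abc

class FileDownloader(abc.ABC):
    """Stand-in ABC; only the signature shape matters for this example."""
    @classmethod
    @abc.abstractmethod
    def download(cls, *, overwrite: bool, read_from_temp: bool) -> "FileDownloader":
        ...

class TexasDownloader(FileDownloader):
    @classmethod
    def download(cls, *, overwrite: bool, read_from_temp: bool) -> "FileDownloader":
        return cls()

# Flags must now be named at the call site; download(True, False) is a TypeError.
dl = TexasDownloader.download(overwrite=True, read_from_temp=False)
```

The lone `*` makes both flags keyword-only, so ambiguous positional calls fail loudly instead of silently swapping the two booleans.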
```python
from dataclasses import Field
from typing import Optional
from datetime import date
from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
from sqlmodel import SQLModel, Field
```
Remove unused and redefined Field import.
Line 1 imports Field from dataclasses, which is immediately shadowed by the Field import from sqlmodel on line 5. The dataclasses.Field is never used. Additionally, BaseModel from pydantic (line 4) appears unused since you're using SQLModel.
```diff
-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import ConfigDict, model_validator, BeforeValidator
 from sqlmodel import SQLModel, Field
```

📝 Committable suggestion

Before:

```python
from dataclasses import Field
from typing import Optional
from datetime import date
from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
from sqlmodel import SQLModel, Field
```

After:

```python
from typing import Optional
from datetime import date
from pydantic import ConfigDict, model_validator, BeforeValidator
from sqlmodel import SQLModel, Field
```
🧰 Tools
🪛 Ruff (0.14.7)
5-5: Redefinition of unused Field from line 1
(F811)
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 1 to 5, remove the
unused and shadowed imports: drop "Field" from dataclasses (line 1) since it is
immediately shadowed by sqlmodel.Field, and remove "BaseModel" from pydantic
(line 4) if it is not used elsewhere in the file; also trim any other unused
pydantic imports (e.g., BeforeValidator) so that only the actually used symbols
(like model_validator or ConfigDict) remain, leaving the sqlmodel Field import
intact.
```python
def check_contains_factory(match_string: str):
    def check_contains(value: str) -> str:
        if value and match_string not in value:
            raise ValueError(f"Value must contain '{match_string}'")
        return value
    return check_contains
```
🛠️ Refactor suggestion | 🟠 Major
Duplicate of check_contains_factory in validator_functions.py.
This function is identical to the one in app/funcs/validator_functions.py (lines 166-171). Import it from there instead of redefining it.
```diff
-def check_contains_factory(match_string: str):
-    def check_contains(value: str) -> str:
-        if value and match_string not in value:
-            raise ValueError(f"Value must contain '{match_string}'")
-        return value
-    return check_contains
+from funcs.validator_functions import check_contains_factory
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.7)
18-18: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 15 to 20 there is
a duplicate definition of check_contains_factory; remove the local function and
instead import check_contains_factory from app/funcs/validator_functions.py (add
the import at the top of the file), update any local references to use the
imported name, and ensure imports are organized (and tests/types still pass).
```python
class TECBaseModel(TECSettings):
    recordType: Optional[str] = Field(BeforeValidator(check_contains_factory('recordType')), description="Record type code")
    formTypeCd: Optional[str] = Field(BeforeValidator(check_contains_factory('formType')), description="TEC form used")
    schedFormTypeCd: Optional[str] = Field(BeforeValidator(check_contains_factory('schedFormType')), description="TEC Schedule Used")
    receivedDt: Optional[date] = Field(BeforeValidator(check_contains_factory('receivedDt')), description="Date report received by TEC")
    filerIdent: Optional[str] = Field(default=None, description="Filer account #", max_length=100)
    filerTypeCd: Optional[str] = Field(default=None, description="Type of filer", max_length=30)
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, find the check_contains_factory function definition
rg -n "def check_contains_factory|check_contains_factory" --type py -B 2 -A 10
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 12330

🏁 Script executed:

```shell
# Search for receivedDt usage patterns throughout the codebase
rg -n "receivedDt" --type py -C 3
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 9386

🏁 Script executed:

```shell
# Check if there are test files that show how data is instantiated
fd -e "test_" -o -e "_test.py" | head -20
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 194

🏁 Script executed:

```shell
# Let me examine the check_contains function more closely and see if there's type coercion
cat -n app/funcs/validator_functions.py | sed -n '166,171p'
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 373

🏁 Script executed:

```shell
# Check if there are any Pydantic validators that handle date parsing before the BeforeValidator
cat -n app/states/texas/validators/texas_settings.py | sed -n '1,60p'
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 3032

🏁 Script executed:

```shell
# Look for how data is loaded/instantiated to understand input format
fd -type f -name "*.py" | xargs rg -l "TECBaseModel\|receivedDt" | head -10
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 248

🏁 Script executed:

```shell
# Search for validate_dates function to understand what it does
rg -n "def validate_dates" --type py -A 15
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 3023

🏁 Script executed:

```shell
# Also check if there are any tests that show actual instantiation patterns
fd -type f -name "*.py" | xargs rg -l "test.*TECBaseModel\|TECBaseModel.*test" | head -5
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 248

🏁 Script executed:

```shell
# Verify the execution order - check if there are any tests showing the error
rg -n "receivedDt" app/states/texas/validators/texas_settings.py -B 5 -A 5
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 971

🏁 Script executed:

```shell
# Double-check the Pydantic v2 behavior by looking at imports and config
head -20 app/states/texas/validators/texas_settings.py
```

Repository: Abstract-Data/campaignfinance-2023
Length of output: 758
Remove BeforeValidator(check_contains_factory('receivedDt')) from the receivedDt field.
The validate_dates model validator (mode='before') runs before field-level validators and converts the receivedDt string to a date object. When check_contains_factory subsequently runs, it receives a date object but expects a string, causing a TypeError at runtime. The substring-check logic is also semantically incorrect for date fields. Apply the same fix to any other non-string fields using this validator pattern.
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 44 to 50, remove
the Field-level BeforeValidator(check_contains_factory('receivedDt')) from the
receivedDt field because validate_dates (mode='before') already converts the
incoming string to a date and the check_contains validator expects a string;
update receivedDt to be a plain Optional[date] Field (no BeforeValidator) and
audit other non-string fields to remove any
BeforeValidator(check_contains_factory(...)) uses so substring checks only run
on string fields (keep check_contains_factory on string-typed fields only or
move checks into appropriate pre-validation for strings).
```python
# Get current version number
version_count = session.exec(select(UnifiedTransactionVersion).where(UnifiedTransactionVersion.transaction_id == tx.id)).count()
# Save current state as version
# Convert data to JSON-serializable format
data_dict = {}
for k in tx.__fields__.keys():
    value = getattr(tx, k)
    if isinstance(value, date):
        data_dict[k] = value.isoformat()
    elif isinstance(value, datetime):
        data_dict[k] = value.isoformat()
    elif isinstance(value, Decimal):
        data_dict[k] = float(value)
    else:
        data_dict[k] = value

version = UnifiedTransactionVersion(
    transaction_id=tx.id,
    version_number=version_count + 1,
    data=json.dumps(data_dict),
    changed_at=datetime.utcnow(),
    changed_by=user,
    change_reason=reason,
    amendment_details=amendment_details
)
session.add(version)
# Update fields
for k, v in updates.items():
    setattr(tx, k, v)
tx.last_modified_at = datetime.utcnow()
tx.last_modified_by = user
tx.change_reason = reason
tx.amendment_details = amendment_details
session.add(tx)
session.commit()
session.refresh(tx)
```
Race condition in version counting and deprecated `datetime.utcnow()`.

- The version count query followed by version creation is not atomic; concurrent updates could result in duplicate version numbers.
- `datetime.utcnow()` is deprecated; use `datetime.now(timezone.utc)` instead.

For the race condition, consider using a database sequence or SELECT ... FOR UPDATE. For the datetime issue:

```diff
+from datetime import datetime, date, timezone
 ...
-            changed_at=datetime.utcnow(),
+            changed_at=datetime.now(timezone.utc),
 ...
-        tx.last_modified_at = datetime.utcnow()
+        tx.last_modified_at = datetime.now(timezone.utc)
```

The same pattern applies to all other `datetime.utcnow()` calls at lines 583, 617, 651, 685, 744, 745, 817, 827, 891, 1092, 1148.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.7)
574-574: datetime.datetime.utcnow() used
(DTZ003)
583-583: datetime.datetime.utcnow() used
(DTZ003)
🤖 Prompt for AI Agents
In app/states/unified_database.py around lines 554 to 589, the version_count
query followed by creating a new UnifiedTransactionVersion is not atomic (race
can yield duplicate version_number) and uses deprecated datetime.utcnow(); to
fix, perform the version calculation and insert inside the same DB transaction
while acquiring a lock on the related transaction row (e.g., SELECT ... FOR
UPDATE or ORM equivalent) or use a dedicated DB sequence/auto-increment for
version_number so concurrent workers cannot generate the same number; also
replace datetime.utcnow() with datetime.now(timezone.utc) (and apply that
replacement to the other occurrences at lines 583, 617, 651, 685, 744, 745, 817,
827, 891, 1092, 1148) so timestamps are timezone-aware.
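One way to close the count-then-insert race is to compute the next version number inside the insert itself, so no separate read happens first. A SQLite sketch (the PR targets PostgreSQL, where a unique constraint on `(transaction_id, version_number)` or `SELECT ... FOR UPDATE` would back this up; the table and columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE versions (transaction_id INTEGER, version_number INTEGER)")

def add_version(conn: sqlite3.Connection, tx_id: int) -> None:
    # MAX(...)+1 is evaluated in the same statement as the INSERT, so two
    # writers cannot both read the same count before either row lands.
    conn.execute(
        "INSERT INTO versions (transaction_id, version_number) "
        "SELECT ?, COALESCE(MAX(version_number), 0) + 1 FROM versions "
        "WHERE transaction_id = ?",
        (tx_id, tx_id),
    )

add_version(conn, 1)
add_version(conn, 1)
rows = conn.execute("SELECT version_number FROM versions ORDER BY 1").fetchall()
```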
```python
version = UnifiedPersonVersion(
    person_id=person.id,
    version_number=version_count + 1,
    data=json.dumps({k: getattr(person, k) for k in person.__fields__.keys()}),
    changed_at=datetime.utcnow(),
    changed_by=user,
    change_reason=reason,
    amendment_details=amendment_details
)
session.add(version)
for k, v in updates.items():
    setattr(person, k, v)
session.add(person)
session.commit()
session.refresh(person)
return person
```
Non-serializable data in person version snapshot.
Unlike update_transaction, the update_person method doesn't handle non-JSON-serializable types (e.g., date, datetime, Decimal) before calling json.dumps(). This will raise TypeError if the person model contains such fields.
Apply consistent serialization logic:
```diff
-        version = UnifiedPersonVersion(
-            person_id=person.id,
-            version_number=version_count + 1,
-            data=json.dumps({k: getattr(person, k) for k in person.__fields__.keys()}),
+        data_dict = {}
+        for k in person.__fields__.keys():
+            value = getattr(person, k)
+            if isinstance(value, date):
+                data_dict[k] = value.isoformat()
+            elif isinstance(value, datetime):
+                data_dict[k] = value.isoformat()
+            elif isinstance(value, Decimal):
+                data_dict[k] = float(value)
+            else:
+                data_dict[k] = value
+        version = UnifiedPersonVersion(
+            person_id=person.id,
+            version_number=version_count + 1,
+            data=json.dumps(data_dict),
```

The same issue exists in update_committee (line 650) and update_address (line 684).
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.7)
617-617: datetime.datetime.utcnow() used
(DTZ003)
🤖 Prompt for AI Agents
In app/states/unified_database.py around lines 613-628, the snapshot JSON for
UnifiedPersonVersion uses json.dumps on model fields directly which will raise
TypeError for non-serializable types (date, datetime, Decimal, UUID, etc.);
update the code to serialize the person dict using the same approach as
update_transaction (use a shared helper or json.dumps(..., default=str) or
explicitly convert known field types to ISO strings/primitive types) so all
values are JSON-serializable before creating the version. Apply the same fix to
the analogous snapshots in update_committee (around line 650) and update_address
(around line 684) so they also use the shared serialization helper or json.dumps
with a default converter.
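A shared helper along the lines the prompt suggests (the function and field names here are illustrative, not the PR's):

```python
import json
from datetime import date, datetime
from decimal import Decimal

def snapshot_value(value):
    # datetime is a subclass of date, so one isinstance check covers both.
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)
    return value

def snapshot(record: dict) -> str:
    """Serialize a version snapshot, converting non-JSON types first."""
    return json.dumps({k: snapshot_value(v) for k, v in record.items()})

payload = snapshot({"amount": Decimal("12.50"), "received": date(2023, 1, 5), "name": "ACME PAC"})
```

Routing `update_transaction`, `update_person`, `update_committee`, and `update_address` through one such helper keeps the snapshots consistent.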
```python
file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
compression='lz4')
```
🧹 Nitpick | 🔵 Trivial
Use timezone-aware datetime.
datetime.now() without a timezone argument produces a naive datetime, which can cause issues in systems with multiple timezones.
```diff
+from datetime import datetime, timezone
 ...
 df.write_parquet(
-    file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
+    file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now(timezone.utc):%Y%m%d}w.parquet",
     compression='lz4')
```

📝 Committable suggestion

Before:

```python
file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now():%Y%m%d}w.parquet",
compression='lz4')
```

After:

```python
file=cls.config.TEMP_FOLDER / f"{k}_{datetime.now(timezone.utc):%Y%m%d}w.parquet",
compression='lz4')
```
🧰 Tools
🪛 Ruff (0.14.7)
201-201: datetime.datetime.now() called without a tz argument
(DTZ005)
🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around lines 201-202, the code uses
datetime.now() to generate a filename which produces a naive datetime; update
this to a timezone-aware timestamp by using datetime.now(timezone.utc) (or
datetime.now(tz=your_app_tz) if a local zone is required) and ensure you import
timezone from datetime (or use datetime.utcnow().replace(tzinfo=timezone.utc) if
you prefer); replace the current call in the filename construction with the
timezone-aware call so the generated filename includes an aware datetime.
- Updated all imports to use app. prefix for proper module resolution
  - Fixed app/abcs/*.py imports
  - Fixed app/funcs/*.py imports
  - Fixed app/states/texas/**/*.py imports
  - Fixed app/states/oklahoma/**/*.py imports
  - Fixed app/states/ohio/*.py imports
  - Fixed app/main.py imports
- Updated test_validation_class.py to use new StateFileValidation API
- Updated test_ingest_hypothesis.py with Hypothesis health check suppression
- Cast pl.lit(None) to String type when adding missing columns to fix 'type String is incompatible with expected type Null' error
- Add cleanup of leftover parquet files at start of test_texas_consolidate_files to prevent flaky test failures from tmp_path reuse between Hypothesis examples
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
app/abcs/abc_category.py (1)
79-94: `validation` setter is recursively assigning to itself. In the setter,
`self.validation = value` calls the same setter again, leading to infinite recursion if `validation` is ever set. Either drop the setter or have it assign to a private backing attribute (e.g. `_validation`) instead.

app/states/texas/validators/texas_personname.py (1)
87-101: Guard against missing `NamePrefixCd` when deriving address prefixes. If no key containing
`'NamePrefixCd'` is present, `_pfx` will be `None`, and `key.startswith(_pfx)` will raise a `TypeError`. While your data likely always includes a prefix field, a defensive check would avoid surprising failures:

```diff
-    _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
-    if _pfx:
-        _pfx = _pfx.split('NamePrefixCd')[0]
-        _mail_keys = [key for key in values.keys() if 'Mail' in key and key.startswith(_pfx)]
+    _pfx = next((key for key in values.keys() if 'NamePrefixCd' in key), None)
+    if not _pfx:
+        return values
+    _pfx = _pfx.split('NamePrefixCd')[0]
+    _mail_keys = [key for key in values.keys() if 'Mail' in key and key.startswith(_pfx)]
```

This keeps the happy path untouched while making the validator resilient to missing prefix fields.

app/states/texas/validators/direct_expenditures.py (1)
app/states/texas/validators/direct_expenditures.py (1)
1-44: Import is fine, but most TECDirectExpenses fields lack type annotations and won’t be modeled.The new import of
TECSettingsfromapp.states.texas.validators.texas_settingsis correct, but inTECDirectExpensesonly the first few attributes are typed. All of the bare names (e.g.infoOnlyFlag,filerIdent,expendDt, etc.) will not become SQLModel/Pydantic fields and therefore won’t be persisted or validated.If this model is meant to represent the full direct‑expenses record, you’ll want to add explicit type annotations (and
Field(...)where appropriate) for each of those attributes, for example:- infoOnlyFlag - filerIdent + infoOnlyFlag: bool | None = None + filerIdent: str | None = None # ...and so on for the remaining attributes, with appropriate types...Adjust the concrete types/defaults to match your TEC schema.
app/states/texas/validators/texas_address.py (1)
83-83: Logic bug: condition always evaluates to True.The expression
if "PO" or "PO BOX" in _address:is incorrect. It's parsed asif ("PO") or ("PO BOX" in _address):, and since"PO"is a non-empty string (truthy), this condition is always True regardless of_addresscontent.Apply this diff:
- if "PO" or "PO BOX" in _address: + if "PO" in _address or "PO BOX" in _address:Or more succinctly:
- if "PO" or "PO BOX" in _address: + if any(x in _address for x in ("PO", "PO BOX")):app/states/texas/validators/texas_expenses.py (1)
235-247: Minor: Incorrect field name in error message.Line 243 references
'column': 'filerName'in the error dict, but this validator is forexpendDescrandexpendDtfields. The error message should reference the actual field being validated.@field_validator('expendDescr', 'expendDt', mode='before') @classmethod - def validate_required_fields(cls, v): + def validate_required_fields(cls, v, info): if v == "" or v is None: raise PydanticCustomError( 'missing_required_value', "Field is required", { - 'column': 'filerName', + 'column': info.field_name, 'value': v } ) return v
♻️ Duplicate comments (27)
app/states/texas/validators/texas_personname.py (1)
51-55:field_pfxvalidator still appears to be a copy‑paste error.
field_pfxis populated byset_type_prefixfrom a key containingNamePrefixCd, but itsBeforeValidatorchecks for'NameSuffixCd', which doesn’t match the field’s purpose and duplicates the suffix logic. This is almost certainly unintended and can also interfere if a value is ever passed in.Recommend removing the
BeforeValidatorfromfield_pfx(or changing it to the correct key if you truly want validation there), e.g.:- field_pfx: Optional[str] = Field( - BeforeValidator(funcs.check_contains_factory('NameSuffixCd')), - description="Field prefix" - ) + field_pfx: Optional[str] = Field( + default=None, + description="Field prefix", + )app/states/texas/validators/texas_flags.py (2)
9-10: Annotate__table_args__withClassVar.Mutable class attributes should be annotated with
typing.ClassVarto indicate they are class-level configuration, not instance fields.-from typing import Optional +from typing import Optional, ClassVar class TECFlagsBase(TECBaseModel): __tablename__ = "tx_flags" - __table_args__ = {"schema": "texas"} + __table_args__: ClassVar[dict] = {"schema": "texas"}
`12-14`: **Type mismatch: `BeforeValidator` expects `str`, but field is `Optional[bool]`.**

The `check_contains_factory` returns a validator operating on `str` values, but `primaryUsaPhoneFlag` is typed as `Optional[bool]`. This will cause a runtime error when the validator attempts string operations on a boolean.

Either remove the `BeforeValidator` if the field should be boolean, or change the type to `Optional[str]` if it holds string values like "Y"/"N":

```diff
-    primaryUsaPhoneFlag: Optional[bool] = Field(
-        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag')),
-        description="Y indicates that the phone is a USA")
+    primaryUsaPhoneFlag: Optional[str] = Field(
+        default=None,
+        description="Y indicates that the phone is a USA")
```

app/states/texas/validators/texas_settings.py (3)
`1-5`: **Remove unused `Field` import from `dataclasses` and unused `BaseModel` import.**

Line 1 imports `Field` from `dataclasses`, which is immediately shadowed by the `Field` import from `sqlmodel` on line 5. The `dataclasses.Field` is never used. Additionally, `BaseModel` from `pydantic` appears unused.

```diff
-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import ConfigDict, model_validator, BeforeValidator
 from sqlmodel import SQLModel, Field
```
`15-20`: **Duplicate of `check_contains_factory` in `validator_functions.py`.**

This function is identical to the one in `app/funcs/validator_functions.py` (lines 165-170). Import it from there instead of redefining it to maintain DRY principles.

```diff
-def check_contains_factory(match_string: str):
-    def check_contains(value: str) -> str:
-        if value and match_string not in value:
-            raise ValueError(f"Value must contain '{match_string}'")
-        return value
-    return check_contains
+from app.funcs.validator_functions import check_contains_factory
```
`44-50`: **Remove `BeforeValidator` from `receivedDt` field - type mismatch will cause runtime error.**

The `validate_dates` model validator (`mode='before'`) converts `receivedDt` from string to `date` object before field-level validators run. When `check_contains_factory` subsequently executes, it receives a `date` object but expects a string, causing a `TypeError`. The substring-check logic is semantically incorrect for date fields.

```diff
-    receivedDt: Optional[date] = Field(BeforeValidator(check_contains_factory('receivedDt')), description="Date report received by TEC")
+    receivedDt: Optional[date] = Field(default=None, description="Date report received by TEC")
```

app/main.py (2)
`71-73`: **`doug_deason` is created but not included in `donation_list`.**

Line 71 creates `doug_deason` filtering for DOUG DEASON contributions, but line 73 only concatenates `tim_dunn`, `farris_wilks`, and `don_dyer`.

Apply this diff to include `doug_deason`:

```diff
-donation_list = pd.concat([tim_dunn, farris_wilks, don_dyer])
+donation_list = pd.concat([tim_dunn, farris_wilks, don_dyer, doug_deason])
```
`83-348`: **Large block of commented-out code should be removed.**

This extensive commented-out code clutters the file and reduces readability. Rely on version control history for reference instead.
app/states/ohio/ohio_downloader.py (1)
`21-29`: **Incomplete implementation with unused parameters.**

The `overwrite` and `read_from_temp` parameters are declared but never used. The trailing `...` on line 29 is unnecessary since there's preceding code. Either implement the logic using these parameters or remove them if not needed.

app/states/texas/validators/texas_address.py (2)
`32-32`: **Type annotation mismatch with default value.**

`standardized` has `default=None` but is typed as `str` (non-optional). This will cause validation errors when the field is None.

```diff
-    standardized: str = Field(default=None, description="Address standardized")
+    standardized: Optional[str] = Field(default=None, description="Address standardized")
```
`145-151`: **Nullable primary key fields violate database integrity.**

`TECPersonAddressLinkModel` uses `Optional` fields with `primary_key=True`. NULL values in primary key columns prevent SQL from enforcing uniqueness, potentially allowing duplicate rows.

Consider creating separate link tables with non-nullable primary keys, or add a required discriminator column with validation ensuring exactly one optional field is set per row.
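The duplicate-row hazard is easy to make concrete with SQLite, used below purely as a stdlib illustration (the project targets PostgreSQL, which at least rejects NULL in key columns): SQLite's long-documented quirk is that ordinary PRIMARY KEY columns admit NULLs, and NULL-bearing keys bypass the uniqueness check entirely. Table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person_address_link ("
    " person_id TEXT, address_id TEXT,"
    " PRIMARY KEY (person_id, address_id))"
)
# Both inserts succeed: a NULL in a key column means the composite key
# never collides with anything, so the "unique" constraint is moot.
conn.execute("INSERT INTO person_address_link VALUES (NULL, 'addr-1')")
conn.execute("INSERT INTO person_address_link VALUES (NULL, 'addr-1')")
duplicates = conn.execute(
    "SELECT COUNT(*) FROM person_address_link"
).fetchone()[0]
```

A non-nullable composite key (or a required discriminator column, as suggested above) closes this hole at the schema level rather than relying on application code.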
app/abcs/abc_download.py (3)
`20-21`: **Side effect at module import time.**

Starting `ProgressTracker` at module load (`progress.start()`) means the live display begins whenever this module is imported, even if no download is happening. This can cause unexpected console output and resource usage.

Consider lazy initialization:

```diff
 progress = ProgressTracker()
-progress.start()
```

Then call `progress.start()` explicitly in the `download()` method when needed.
`82-83`: **Progress task not completed for `.txt` files.**

When the file is a `.txt`, the function returns early on line 83 without calling `progress.update_task(_csv_task, "Complete")`. This leaves the task in "Started" status indefinitely.

```diff
 if file_name.suffix == '.txt':
+    progress.update_task(_csv_task, "Complete")
     return
```
`66-69`: **Broad exception catch may hide critical errors.**

Catching bare `Exception` and only logging with `ic()` can hide serious issues like disk full, permission errors, or corrupted archives.

```diff
 try:
     cls._process_csv(zip_ref, file, tmp)
-except Exception as e:
-    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+except (OSError, IOError, pl.exceptions.ComputeError) as e:
+    ic(f"Zip File Extraction Error on {file.filename.upper()}: {e}")
+    raise
```

app/states/texas/texas_downloader.py (4)
`36-36`: **Consuming the iterator prematurely causes deletion to fail.**

`_existing_files` is an iterator from `itertools.chain`. It's consumed when checking `_safe_to_delete`, but the actual deletion at line 83 happens after the download completes. By that time, the iterator is exhausted and no files get deleted.

```diff
-_existing_files = itertools.chain(*map(lambda x: tmp.glob(x), ["*.csv", "*.parquet", "*.txt"]))
+_existing_files = list(tmp.glob("*.csv")) + list(tmp.glob("*.parquet")) + list(tmp.glob("*.txt"))
```
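The underlying footgun generalizes: `itertools.chain` yields a one-shot iterator, so any early pass (a truthiness check, a `len`-style scan) empties it for every later consumer, whereas materializing to a `list` supports repeated iteration. A minimal demonstration:

```python
import itertools

chained = itertools.chain([1, 2], [3])
assert list(chained) == [1, 2, 3]   # first pass consumes the iterator
assert list(chained) == []          # second pass finds nothing left

materialized = list(itertools.chain([1, 2], [3]))
assert list(materialized) == [1, 2, 3]
assert list(materialized) == [1, 2, 3]  # lists survive repeated passes
```

This is why the suggested fix builds a plain list up front: the safety check and the later deletion loop can then both see the same files.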
`218-227`: **`_data` is reassigned to empty dict, discarding previously collected files.**

Lines 218-221 populate `_data` with file mappings, but line 223 immediately reassigns `_data = {}`, discarding all the collected data. The subsequent loop iterates over the now-empty dictionary, making the entire parquet reading logic a no-op.

```diff
 _data = {}
 for file in parquet_files:
     _type = file.stem.rsplit('_', 1)[0]
     _data.setdefault(_type, []).append(file)

-_data = {}
 for k, v in _data.items():
```
`143-149`: **Second file is skipped during consolidation due to iterator consumption.**

The loop at lines 143-149 iterates `v[1:]` to collect columns, but then line 157 iterates `_files` (the iterator). Since `next(_files)` already consumed the first file at line 132, and `v[1:]` doesn't consume `_files`, the iterator `_files` still points at `v[1]`. However, when you iterate `_files` at line 157, it starts from `v[1]`, meaning files are read correctly BUT the second file (`v[1]`) is read twice - once for column discovery and once for processing.

Use the list directly instead of an iterator:

```diff
-_files = iter(v)
-# Start with scanning the first file instead of empty DataFrame
-_first_file = next(_files)
 df = (
-    pl.read_parquet(_first_file)
+    pl.read_parquet(v[0])
     .with_columns(
-        pl.lit(_first_file.stem)
+        pl.lit(v[0].stem)
         .alias('file_origin')
     )
 )
 ...
-for _file in _files:
+for _file in v[1:]:
```

Also applies to: 157-157
`32-33`: **Unused parameter `read_from_temp` and boolean positional arguments.**

`read_from_temp` is declared but never used. Additionally, boolean positional arguments can be confusing at call sites.

Either implement the logic using `read_from_temp` or remove it, and consider keyword-only arguments:

```diff
 @classmethod
 def download(
     cls,
-    read_from_temp: bool = True,
-    headless: bool = True
+    *,
+    headless: bool = True
 ) -> Self:
```

app/states/texas/models.py (3)
`1-9`: **Critical: Remove dataclasses Field import and clean up unused imports.**

Line 1 imports `Field` from `dataclasses`, but Line 5 redefines it by importing from `sqlmodel`. The dataclasses import is incorrect and unused. Additionally, several imports appear unused: `ConfigDict`, `model_validator`, `BaseModel` from pydantic (Line 4), `SQLModel` from sqlmodel (Line 5), `tx_funcs` (Line 7), and `RecordKeyGenerator` (Line 9).

Apply this diff to fix the import issues:

```diff
-from dataclasses import Field
 from typing import Optional
 from datetime import date
-from pydantic import ConfigDict, model_validator, BaseModel, BeforeValidator
+from pydantic import BeforeValidator
-from sqlmodel import SQLModel, Field
+from sqlmodel import Field

 from .validators.texas_settings import TECSettings
-import app.states.texas.funcs.tx_validation_funcs as tx_funcs
 import app.funcs.validator_functions as funcs
-from app.funcs.record_keygen import RecordKeyGenerator
```
`15-17`: **Critical: Incorrect BeforeValidator usage with Field().**

`BeforeValidator` must be used with `Annotated` type hints in Pydantic v2, not as a positional argument to `Field()`. This will cause runtime errors.

Apply this pattern to fix:

```diff
+from typing import Annotated
 ...
-    primaryUsaPhoneFlag: Optional[bool] = Field(
-        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag')),
-        description="Y indicates that the phone is a USA")
+    primaryUsaPhoneFlag: Annotated[
+        Optional[bool],
+        BeforeValidator(funcs.check_contains_factory('PrimaryUsaPhoneFlag'))
+    ] = Field(default=None, description="Y indicates that the phone is a USA")
```

Apply the same fix to Lines 59-62 in TECBaseModel for `recordType`, `formTypeCd`, `schedFormTypeCd`, and `receivedDt`.
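For reference, the working Pydantic v2 shape pairs `BeforeValidator` with `Annotated` so the coercion runs before type validation. A self-contained sketch; the field name is borrowed from the review, but the Y/N-to-bool coercion is an assumed semantic rather than the project's actual validator:

```python
from typing import Annotated, Optional

from pydantic import BaseModel, BeforeValidator


def yn_to_bool(v):
    # Coerce TEC-style "Y"/"N" markers into booleans before the
    # Optional[bool] type validation runs.
    if isinstance(v, str):
        return v.strip().upper() == "Y"
    return v


class Flags(BaseModel):
    primaryUsaPhoneFlag: Annotated[
        Optional[bool], BeforeValidator(yn_to_bool)
    ] = None


assert Flags(primaryUsaPhoneFlag="Y").primaryUsaPhoneFlag is True
assert Flags(primaryUsaPhoneFlag="N").primaryUsaPhoneFlag is False
assert Flags().primaryUsaPhoneFlag is None
```

Passing `BeforeValidator` positionally to `Field()` instead would be silently treated as the field's default value, which is why the original pattern fails at runtime.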
`58-64`: **Major: Potential duplicate TECBaseModel definition.**

According to past review comments, `TECBaseModel` appears to be already defined in `app/states/texas/validators/texas_settings.py`. The AI summary also references `TECBaseModel` being introduced in texas_settings.py as part of this PR. This could create code duplication and inconsistency.

Verify whether `TECBaseModel` already exists in texas_settings.py:

```shell
#!/bin/bash
# Check for TECBaseModel definitions across the codebase
rg -n "class TECBaseModel" --type py -C3
```

app/states/texas/validators/texas_expenses.py (3)
`71-135`: **Remove commented-out code.**

This large block of commented payee field definitions (65 lines) should be removed rather than left in the codebase. Git history preserves the old implementation if needed.
`159-173`: **Critical: Field extraction incompatible with BeforeValidator checks in TECPersonNameBase.**

The `validate_payee` validator extracts payee fields by key prefix (`k.startswith("payee")`), creating a dictionary like `{'payeeNameFirst': 'John', 'payeeNameLast': 'Doe'}`. It then passes this to `TECPersonNameBase(**_payee)`. However, `TECPersonNameBase` fields use `BeforeValidator(funcs.check_contains_factory(...))`, which expect the VALUE to contain substrings like `'NameFirst'`. After extraction, `_payee['payeeNameFirst'] = "John"`, but the validator expects `"John"` to contain `'NameFirst'`, which will fail.

The `check_contains_factory` validators are designed to validate raw input keys/metadata, not cleaned extracted values. The `validate_name` method in `TECPersonNameBase` (shown in relevant snippets) extracts values by finding keys containing substrings (e.g., `'NameFirst' in key`), which works with raw input but conflicts with the BeforeValidator checks.

Consider one of these solutions:

**Solution 1 (Recommended)**: Pass raw values dict directly and let `TECPersonNameBase` handle extraction:

```diff
 @model_validator(mode='before')
 @classmethod
 def validate_payee(cls, values):
-    _payee = {k: v for k, v in values.items() if k.startswith("payee")}
     if not _payee:
         raise PydanticCustomError(
             'missing_required_value',
             "Payee data is required",
             {
                 'column': 'payee',
                 'value': _payee
             }
         )
-    values['payee'] = TECPersonNameBase(**_payee)
+    # Pass all values; TECPersonNameBase will extract payee fields via validate_name
+    values['payee'] = TECPersonNameBase(**values)
     return values
```

**Solution 2**: Remove BeforeValidator constraints from `TECPersonNameBase` fields that are populated via `validate_name`.
`174-233`: **Remove commented-out code.**

This 60-line block of commented validation logic should be removed. Git history preserves it if needed.
app/states/texas/validators/texas_filers.py (3)
`110-110`: **Critical: Broken back_populates references on relationships.**

Lines 145-147 in `TECFilerName` define relationships with `back_populates="campaign"` pointing to `TECTreasurer`, `TECAssistantTreasurer`, and `TECCampaignChair`. However, the corresponding `campaign` relationships are commented out (Lines 110, 119, 124). This will cause SQLAlchemy relationship configuration errors at runtime.

Either uncomment the `campaign` relationships in the target classes:

```diff
 class TECTreasurer(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="treasurer", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="treasurer", link_model=TECFilerIDNameLink)

 class TECAssistantTreasurer(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="asstTreasurer", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="asstTreasurer", link_model=TECFilerIDNameLink)

 class TECCampaignChair(TECPersonNameBase, table=True):
     ...
-    # campaign: "TECFilerName" = Relationship(back_populates="chair", link_model=TECFilerIDNameLink)
+    campaign: "TECFilerName" = Relationship(back_populates="chair", link_model=TECFilerIDNameLink)
```

Or remove the `back_populates` parameters from Lines 145-147.

Also applies to: 119-119, 124-124
`150-154`: **Critical: Type mismatch - assigning TECPersonName object to str field.**

The `validate_name` validator assigns a `TECPersonName` object to `values['filerName']` (Line 153), but `filerName` is defined as `str = Field(...)` on Line 135. This will cause validation errors or type mismatches at runtime.

Consider one of these fixes:

**Solution 1**: Change `filerName` field type to `TECPersonName`:

```diff
-    filerName: str = Field(..., description="Filer name")
+    filerName: TECPersonName = Field(..., description="Filer name", sa_type=JSON)
```

**Solution 2**: Extract string representation before assignment:

```diff
 @model_validator(mode='before')
 @classmethod
 def validate_name(cls, values):
-    values['filerName'] = TECPersonName(**values)
+    person_name = TECPersonName(**values)
+    values['filerName'] = person_name.nameFull or f"{person_name.first} {person_name.last}"
     return values
```

**Solution 3**: Store in a separate field:

```diff
+    filer_person_name: TECPersonName = Field(default=None, description="Filer person name", sa_type=JSON)
 ...
 @model_validator(mode='before')
 @classmethod
 def validate_name(cls, values):
-    values['filerName'] = TECPersonName(**values)
+    values['filer_person_name'] = TECPersonName(**values)
     return values
```
`60-65`: **Remove hundreds of lines of commented-out code.**

This file contains massive blocks of commented-out code:
- Lines 60-65: Old link model
- Lines 89-96: Old link model
- Lines 186-729: 500+ lines of commented class definitions and fields
These should be removed entirely. Git history preserves the old implementations if needed. Keeping this much commented code severely harms readability and maintainability.
Also applies to: 89-96, 186-729
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (30)
- `app/abcs/__init__.py` (1 hunks)
- `app/abcs/abc_category.py` (2 hunks)
- `app/abcs/abc_db_loader.py` (1 hunks)
- `app/abcs/abc_download.py` (2 hunks)
- `app/abcs/abc_state_config.py` (2 hunks)
- `app/abcs/abc_validation.py` (1 hunks)
- `app/funcs/db_loader.py` (1 hunks)
- `app/main.py` (3 hunks)
- `app/states/ohio/ohio_downloader.py` (1 hunks)
- `app/states/oklahoma/funcs/ok_validation_funcs.py` (1 hunks)
- `app/states/oklahoma/oklahoma.py` (1 hunks)
- `app/states/oklahoma/validators/ok_contribution.py` (1 hunks)
- `app/states/oklahoma/validators/ok_expenditure.py` (1 hunks)
- `app/states/oklahoma/validators/ok_lobbyist_expenditure.py` (1 hunks)
- `app/states/oklahoma/validators/ok_settings.py` (1 hunks)
- `app/states/texas/__init__.py` (2 hunks)
- `app/states/texas/funcs/tx_validation_funcs.py` (1 hunks)
- `app/states/texas/models.py` (1 hunks)
- `app/states/texas/texas_downloader.py` (1 hunks)
- `app/states/texas/validators/direct_expenditures.py` (1 hunks)
- `app/states/texas/validators/texas_address.py` (3 hunks)
- `app/states/texas/validators/texas_contributions.py` (1 hunks)
- `app/states/texas/validators/texas_debtdata.py` (1 hunks)
- `app/states/texas/validators/texas_expenses.py` (4 hunks)
- `app/states/texas/validators/texas_filers.py` (5 hunks)
- `app/states/texas/validators/texas_flags.py` (1 hunks)
- `app/states/texas/validators/texas_personname.py` (2 hunks)
- `app/states/texas/validators/texas_settings.py` (2 hunks)
- `app/tests/test_ingest_hypothesis.py` (1 hunks)
- `app/tests/test_validation_class.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (16)
app/states/ohio/ohio_downloader.py (3)

- `app/abcs/abc_download.py` (4): `FileDownloaderABC` (24-142), `download` (109-110), `consolidate_files` (114-115), `read` (119-120)
- `app/abcs/abc_state_config.py` (4): `StateConfig` (140-180), `CategoryTypes` (110-136), `CategoryConfig` (65-106), `CSVReaderConfig` (40-42)
- `app/abcs/abc_category.py` (1): `read` (103-125)

app/funcs/db_loader.py (3)

- `app/abcs/abc_category.py` (1): `logger` (99-101)
- `app/abcs/abc_validation.py` (1): `logger` (32-34)
- `app/logger.py` (1): `Logger` (10-71)

app/abcs/abc_validation.py (4)

- `app/abcs/abc_validation_errors.py` (1): `ValidationErrorList` (29-95)
- `app/funcs/validator_functions.py` (1): `create_record_id` (16-27)
- `app/abcs/abc_category.py` (1): `logger` (99-101)
- `app/logger.py` (1): `Logger` (10-71)

app/abcs/abc_state_config.py (1)

- `app/funcs/toml_reader.py` (1): `read_toml` (6-8)

app/states/texas/texas_downloader.py (5)

- `app/funcs/csv_reader.py` (3): `FileReader` (54-144), `read_parquet` (108-132), `read_folder` (82-104)
- `app/abcs/abc_download.py` (6): `FileDownloaderABC` (24-142), `download` (109-110), `not_headless` (36-39), `extract_zipfile` (62-70), `consolidate_files` (114-115), `read` (119-120)
- `app/abcs/abc_state_config.py` (4): `StateConfig` (140-180), `CategoryTypes` (110-136), `CategoryConfig` (65-106), `TEMP_FOLDER` (149-150)
- `app/states/ohio/ohio_downloader.py` (3): `download` (21-29), `consolidate_files` (31-32), `read` (34-35)
- `app/live_display.py` (2): `add_task` (34-44), `update_task` (46-52)

app/tests/test_validation_class.py (1)

- `app/abcs/abc_validation.py` (3): `StateFileValidation` (24-85), `validate_record` (36-47), `validate` (49-55)

app/abcs/abc_db_loader.py (3)

- `app/abcs/abc_category.py` (1): `logger` (99-101)
- `app/abcs/abc_validation.py` (1): `logger` (32-34)
- `app/logger.py` (1): `Logger` (10-71)

app/states/texas/validators/direct_expenditures.py (1)

- `app/states/texas/validators/texas_settings.py` (1): `TECSettings` (22-41)

app/states/texas/__init__.py (4)

- `app/logger.py` (1): `Logger` (10-71)
- `app/states/texas/texas_downloader.py` (1): `TECDownloader` (22-279)
- `app/states/texas/texas_search.py` (1): `TexasSearch` (165-294)
- `app/states/texas/validators/texas_filers.py` (1): `TECFilerName` (126-184)

app/states/texas/validators/texas_flags.py (3)

- `app/states/texas/models.py` (1): `TECBaseModel` (58-64)
- `app/states/texas/validators/texas_settings.py` (2): `TECBaseModel` (44-57), `check_contains_factory` (15-20)
- `app/funcs/validator_functions.py` (1): `check_contains_factory` (166-171)

app/abcs/__init__.py (6)

- `app/abcs/abc_state_config.py` (4): `CSVReaderConfig` (40-42), `StateConfig` (140-180), `CategoryConfig` (65-106), `CategoryTypes` (110-136)
- `app/abcs/abc_category.py` (1): `StateCategoryClass` (32-169)
- `app/abcs/abc_validation.py` (1): `StateFileValidation` (24-85)
- `app/abcs/abc_validation_errors.py` (1): `ValidationErrorList` (29-95)
- `app/abcs/abc_db_loader.py` (1): `DBLoaderClass` (13-162)
- `app/abcs/abc_download.py` (1): `FileDownloaderABC` (24-142)

app/states/texas/validators/texas_address.py (3)

- `app/states/texas/models.py` (1): `TECBaseModel` (58-64)
- `app/states/texas/validators/texas_settings.py` (1): `TECBaseModel` (44-57)
- `app/funcs/record_keygen.py` (1): `RecordKeyGenerator` (10-56)

app/states/texas/validators/texas_expenses.py (2)

- `app/states/texas/validators/texas_personname.py` (1): `TECPersonNameBase` (15-107)
- `app/states/texas/validators/texas_settings.py` (1): `TECSettings` (22-41)

app/main.py (2)

- `app/states/texas/texas_search.py` (1): `TexasSearch` (165-294)
- `app/states/texas/texas_downloader.py` (2): `download` (30-98), `dataframes` (277-279)

app/states/texas/validators/texas_personname.py (4)

- `app/states/texas/models.py` (1): `TECBaseModel` (58-64)
- `app/states/texas/validators/texas_settings.py` (2): `TECBaseModel` (44-57), `check_contains_factory` (15-20)
- `app/states/texas/validators/texas_address.py` (1): `TECAddressBase` (20-142)
- `app/funcs/validator_functions.py` (1): `check_contains_factory` (166-171)

app/states/texas/validators/texas_settings.py (3)

- `app/funcs/record_keygen.py` (1): `RecordKeyGenerator` (10-56)
- `app/funcs/validator_functions.py` (2): `check_contains_factory` (166-171), `check_contains` (167-170)
- `app/states/texas/models.py` (1): `TECBaseModel` (58-64)
🪛 Ruff (0.14.7)
app/states/ohio/ohio_downloader.py
21-21: Boolean-typed positional argument in function definition
(FBT001)
21-21: Unused method argument: overwrite
(ARG002)
21-21: Boolean-typed positional argument in function definition
(FBT001)
21-21: Unused method argument: read_from_temp
(ARG002)
29-29: Unnecessary ... literal
Remove unnecessary ...
(PIE790)
app/states/texas/texas_downloader.py
25-25: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
32-32: Boolean-typed positional argument in function definition
(FBT001)
32-32: Boolean default positional argument in function definition
(FBT002)
32-32: Unused class method argument: read_from_temp
(ARG003)
33-33: Boolean-typed positional argument in function definition
(FBT001)
33-33: Boolean default positional argument in function definition
(FBT002)
36-36: Unnecessary map() usage (rewrite using a generator expression)
Replace map() with a generator expression
(C417)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
83-83: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
87-87: Unnecessary generator (rewrite as a list comprehension)
Rewrite as a list comprehension
(C400)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
95-95: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
113-113: Avoid specifying long messages outside the exception class
(TRY003)
120-120: Missing return type annotation for classmethod consolidate_files
(ANN206)
147-147: Do not catch blind exception: Exception
(BLE001)
178-178: Do not catch blind exception: Exception
(BLE001)
201-201: datetime.datetime.now() called without a tz argument
(DTZ005)
204-204: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
209-209: Boolean-typed positional argument in function definition
(FBT001)
209-209: Boolean default positional argument in function definition
(FBT002)
217-217: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Do not catch blind exception: Exception
(BLE001)
264-264: Do not catch blind exception: Exception
(BLE001)
269-269: Do not catch blind exception: Exception
(BLE001)
277-277: Missing return type annotation for classmethod dataframes
(ANN206)
app/abcs/abc_download.py
6-6: Import from collections.abc instead: Generator
Import from collections.abc
(UP035)
6-6: typing.Dict is deprecated, use dict instead
(UP035)
30-30: Missing return type annotation for special method __post_init__
(ANN204)
36-36: Missing return type annotation for classmethod not_headless
(ANN206)
62-62: Missing return type annotation for classmethod extract_zipfile
Add return type annotation: None
(ANN206)
68-68: Do not catch blind exception: Exception
(BLE001)
73-73: Missing return type annotation for classmethod _process_csv
Add return type annotation: None
(ANN206)
85-85: datetime.datetime.now() called without a tz argument
(DTZ005)
109-109: Boolean-typed positional argument in function definition
(FBT001)
109-109: Boolean-typed positional argument in function definition
(FBT001)
114-114: Missing return type annotation for classmethod consolidate_files
(ANN206)
119-119: Missing return type annotation for classmethod read
(ANN206)
app/states/texas/validators/texas_flags.py
10-10: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/tests/test_ingest_hypothesis.py
7-7: typing.Dict is deprecated, use dict instead
(UP035)
7-7: typing.List is deprecated, use list instead
(UP035)
7-7: typing.Tuple is deprecated, use tuple instead
(UP035)
122-122: datetime.datetime.fromtimestamp() called without a tz argument
(DTZ006)
144-144: datetime.datetime.fromtimestamp() called without a tz argument
(DTZ006)
174-174: Unused lambda argument: self
(ARG005)
206-206: Loop control variable files not used within loop body
(B007)
app/states/texas/validators/texas_address.py
147-147: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/states/texas/models.py
5-5: Redefinition of unused Field from line 1
(F811)
13-13: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
app/states/texas/validators/texas_expenses.py
161-161: Missing return type annotation for classmethod validate_payee
(ANN206)
app/states/texas/validators/texas_filers.py
23-23: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
30-30: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
46-46: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
53-53: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
68-68: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
84-84: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
84-84: Missing type annotation for **data
(ANN003)
99-99: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
114-114: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
123-123: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
128-128: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
152-152: Missing return type annotation for classmethod validate_name
(ANN206)
158-158: Missing return type annotation for classmethod validate_addresses
(ANN206)
167-167: Missing return type annotation for classmethod validate_treasurer
(ANN206)
174-174: Missing return type annotation for classmethod validate_assistant_treasurer
(ANN206)
181-181: Missing return type annotation for classmethod validate_chair
(ANN206)
app/states/texas/validators/texas_settings.py
5-5: Redefinition of unused Field from line 1
(F811)
18-18: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (25)
app/abcs/abc_db_loader.py (1)
`4-4`: **LGTM: Import path standardization.**

The update to absolute import (`from app.logger import Logger`) improves consistency across the codebase and aligns with the project-wide refactor to app-level imports.

app/states/oklahoma/validators/ok_lobbyist_expenditure.py (1)
`7-7`: **LGTM: Import path standardization.**

The import path update to `app.funcs.validator_functions` aligns with the project-wide migration to absolute imports and is consistent with other Oklahoma validator modules.

app/states/texas/funcs/tx_validation_funcs.py (1)

`2-2`: **LGTM: Import path standardization.**

The update to `app.funcs.validator_functions` maintains consistency with the project-wide refactor to absolute imports.

app/abcs/abc_state_config.py (2)

`15-15`: **LGTM: Import path standardization.**

The import path update to `app.funcs` aligns with the project-wide migration to absolute imports.

`154-158`: **LGTM: Formatting improvement.**

The multi-line formatting of the return statement improves readability without changing functionality.
app/states/texas/validators/texas_contributions.py (1)
`7-7`: **LGTM: Import path standardization.**

The import path update to `app.states.texas.funcs.tx_validation_funcs` maintains consistency with the project-wide refactor to absolute imports.

app/states/oklahoma/validators/ok_settings.py (1)

`3-3`: **LGTM: Import path standardization.**

The import path update to `app.funcs.validator_functions` aligns with other Oklahoma validator modules and the project-wide migration to absolute imports.

app/funcs/db_loader.py (1)

`3-3`: **LGTM: Import path standardization.**

The update to absolute import (`from app.logger import Logger`) maintains consistency with the project-wide refactor to app-level imports.

app/states/oklahoma/validators/ok_contribution.py (1)

`7-7`: **LGTM: Import path standardization.**

The import path update to `app.funcs.validator_functions` completes the consistent migration to absolute imports across Oklahoma validator modules.

app/states/oklahoma/validators/ok_expenditure.py (1)
`6-6`: **Import path refactor looks correct and consistent.**

Switching to `app.funcs.validator_functions` keeps existing usages (`validate_date`, `person_name_parser`) intact and aligns with the new app-scoped import style.

app/states/oklahoma/funcs/ok_validation_funcs.py (1)

`1-1`: **OK to shift validator_functions to app-scoped import.**

This keeps `ok_date_validation` behavior the same while matching the new `app.funcs` namespace.

app/states/texas/validators/texas_debtdata.py (1)

`7-8`: **Texas validator import realignment looks good.**

Using `app.funcs.validator_functions` and `app.states.texas.funcs.tx_validation_funcs` keeps the existing DebtData validators wired through the new app-level paths.

app/abcs/abc_category.py (1)

`6-20`: **App-scoped ABC imports wired correctly.**

The shift to `app.abcs.*`, `app.logger`, and `app.funcs` keeps existing usages (`StateFileValidation`, `DBLoaderClass`, `FileReader`, logging) consistent with the new package layout.

app/states/oklahoma/oklahoma.py (1)

`5-9`: **Oklahoma configuration imports correctly re-namespaced.**

Using `app.funcs`, `app.logger`, `app.abcs` and a relative `.validators` keeps the Oklahoma configuration self-contained under `app.states.oklahoma` and consistent with the rest of the project.

app/tests/test_ingest_hypothesis.py (1)
`26-145`: **GenericFileReader CSV/Parquet property-based tests look solid.**

The schema builder + `_expected_records_from_schema` pattern, together with Hypothesis strategies, gives good coverage of type conversion and metadata (`file_origin`, `download_date`) for both CSV and Parquet inputs. The assertions align well with the expected schema behavior.

app/abcs/__init__.py (1)

`2-7`: **Import path updates and new `progress` export look good.**

The import paths are consistently updated to use the `app.abcs.*` prefix, aligning with the project-wide module structure changes. The addition of `progress` to the public exports makes the `ProgressTracker` instance accessible to downstream modules.

app/abcs/abc_validation.py (1)

`9-11`: **Import path updates are consistent.**

The import paths are correctly updated to use the `app.*` prefix, consistent with the project-wide module structure changes.

app/states/texas/__init__.py (2)

`2-7`: **Import path updates look good.**

The import paths are correctly updated to use `app.*` prefixes and relative imports where appropriate. The addition of the `TexasSearch` import aligns with the module's expanded functionality.
app.*prefixes and relative imports where appropriate. The addition ofTexasSearchimport aligns with the module's expanded functionality.
32-34: Unable to locate the file or validators referenced in the review comment.After comprehensive searching, I could not find the file
app/states/texas/__init__.pyor any references toTECFiler/TECFilerNamevalidators in the repository. The verification cannot be completed because:
- The file path
app/states/texas/__init__.pydoes not appear to exist in the repository- No search results returned any definitions or usages of
TECFilerorTECFilerNamevalidators- The validators module cannot be located to confirm the validator structure
Recommendation: Verify that the file path is correct and the validators module exists in the codebase before proceeding with this review comment. If this is a newly added file or the path has changed, please provide the correct location.
app/states/texas/validators/texas_flags.py (1)
`15-53`: **Remaining boolean flag fields look correct.**

The other flag fields are properly defined as `Optional[bool]` with `default=None` and descriptive documentation. The file is missing a trailing newline at the end.

app/tests/test_validation_class.py (3)

`7-10`: **MockModel definition is appropriate for testing.**

The `MockModel` correctly extends `SQLModel` with an optional `id` field and a required `field`, suitable for testing the validation logic.

`14-22`: **Test correctly validates single record handling.**

The test properly instantiates `StateFileValidation` with the new `validator_to_use` parameter and correctly asserts the tuple structure of passed/failed results.

`25-36`: **Test correctly handles generator-based validation.**

The test properly iterates over the generator returned by `validate()` and validates each `(status, record)` tuple, aligning with the updated API.

app/states/ohio/ohio_downloader.py (1)
18-35: Method signatures don't match the ABC abstract methods. The ABC `FileDownloaderABC` defines `download`, `consolidate_files`, and `read` as `@classmethod`, but `OhioDownloader` implements them as instance methods. This violates the abstract interface contract and will cause issues when called as classmethods. Apply this diff to align with the ABC:

 class OhioDownloader(FileDownloaderABC):
     config: StateConfig = OHIO_CONFIGURATION

-    def download(self, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
+    @classmethod
+    def download(cls, overwrite: bool, read_from_temp: bool) -> FileDownloaderABC:
-        _driver = self.driver.create_driver()
-        _wait = self.driver
+        _driver = cls.driver.create_driver()
+        _wait = cls.driver
         _driver.get(OHIO_ENTRY_URL)
         _wait.wait_until_clickable(By.LINK_TEXT, "FTP Site")
         _ftp_site_url = _driver.find_element(By.LINK_TEXT, "FTP Site").get_attribute("href")
         _driver.get(_ftp_site_url)
-        ...
-    def consolidate_files(self):
+    @classmethod
+    def consolidate_files(cls):
         ...
-    def read(self):
+    @classmethod
+    def read(cls):
         ...

Likely an incorrect or invalid review comment.
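As background for the ABC contract discussed above, here is a minimal stdlib sketch (hypothetical class names, not the project's actual ABC) of declaring an abstract classmethod and overriding it:

```python
from abc import ABC, abstractmethod


class DownloaderABC(ABC):
    """Hypothetical stand-in for an ABC that declares classmethod hooks."""

    @classmethod
    @abstractmethod
    def download(cls) -> str:
        """Subclasses are expected to override this as a classmethod."""
        raise NotImplementedError


class StateDownloader(DownloaderABC):
    @classmethod
    def download(cls) -> str:
        # cls is the concrete subclass, so no instance is needed.
        return f"downloading via {cls.__name__}"


print(StateDownloader.download())  # → downloading via StateDownloader
```

Note that Python's abc machinery only checks that the name is defined on the subclass; it does not verify the override is itself a classmethod, which is how an instance-method override can slip through until the method is invoked on the class.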
app/states/texas/validators/texas_filers.py (1)
165-184: Code snippet does not exist in the repository. The review comment references lines 165-184 in `app/states/texas/validators/texas_filers.py`, but the file contains only 149 lines. The validators `validate_treasurer`, `validate_assistant_treasurer`, and `validate_chair` with the shown implementation do not exist in the codebase. Additionally, the referenced classes `TECTreasurer`, `TECAssistantTreasurer`, and `TECCampaignChair` are not defined or imported anywhere in the repository.
Likely an incorrect or invalid review comment.
donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna('') \
    .where(donation_campaign_expenses['payeeNameOrganization'].notna(),
           (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
            donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip())
🧹 Nitpick | 🔵 Trivial
Chained .where() with .fillna() is confusing and potentially incorrect.
The logic here is convoluted. You use .fillna('') then .where() with the same notna() condition, which is redundant since fillna('') already handles NaN values. The intent seems to be: use organization name if present, otherwise concatenate first + last name.
Consider using a clearer approach:
-donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna('') \
- .where(donation_campaign_expenses['payeeNameOrganization'].notna(),
- (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
- donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip())
+donation_campaign_expenses['new_PayeeName'] = donation_campaign_expenses['payeeNameOrganization'].fillna(
+ (donation_campaign_expenses['payeeNameFirst'].fillna('') + ' ' +
+ donation_campaign_expenses['payeeNameLast'].fillna('')).str.strip()
+)🤖 Prompt for AI Agents
In app/main.py around lines 77 to 80, the chained use of .fillna('') then
.where(...notna()) is redundant and confusing; replace it with a clear
conditional that picks payeeNameOrganization when it is non-empty/non-null,
otherwise concatenates payeeNameFirst + ' ' + payeeNameLast (ensuring first/last
are filled with '' before concatenation and the result is stripped). Implement
this by using a single np.where (or Series.where with a boolean mask) that
checks for organization presence (notna and not empty) and otherwise uses the
concatenated first+last name after fillna and strip; remove the redundant
fillna->where chaining.
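A minimal runnable sketch of the suggested pattern (toy data, not the project's DataFrame), relying on the fact that `Series.fillna` accepts another Series and fills each NaN from the aligned value:

```python
import pandas as pd

df = pd.DataFrame({
    "payeeNameOrganization": ["Acme Corp", None, None],
    "payeeNameFirst": [None, "Jane", None],
    "payeeNameLast": [None, "Doe", None],
})

# Fallback name: first + last, with missing parts treated as empty strings.
fallback = (df["payeeNameFirst"].fillna("") + " " +
            df["payeeNameLast"].fillna("")).str.strip()

# fillna with a Series keeps non-null organization names and fills the rest.
df["new_PayeeName"] = df["payeeNameOrganization"].fillna(fallback)

print(df["new_PayeeName"].tolist())  # → ['Acme Corp', 'Jane Doe', '']
```

This expresses the "organization if present, else first + last" intent in one step, without the redundant `.where()` chain.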
class TECFlags(TECSettings, table=True):
    __tablename__ = "tx_flags"
    __table_args__ = {"schema": "texas"}
🧹 Nitpick | 🔵 Trivial
Minor: Annotate table_args with ClassVar.
SQLAlchemy's __table_args__ is a class-level configuration attribute and should be annotated with typing.ClassVar to indicate it's not an instance attribute.
+from typing import ClassVar
...
- __table_args__ = {"schema": "texas"}
+ __table_args__: ClassVar[dict] = {"schema": "texas"}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-    __table_args__ = {"schema": "texas"}
+    from typing import ClassVar
+    __table_args__: ClassVar[dict] = {"schema": "texas"}
🧰 Tools
🪛 Ruff (0.14.7)
13-13: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
🤖 Prompt for AI Agents
In app/states/texas/models.py around line 13, __table_args__ is defined as a
class-level SQLAlchemy config but lacks a typing.ClassVar annotation; update the
declaration to use typing.ClassVar (e.g., ClassVar[dict] or
ClassVar[Mapping[str, Any]]) and add the corresponding import from typing at the
top of the file so the attribute is clearly marked as a class variable rather
than an instance attribute.
@classmethod
def dataframes(cls):
    _files = cls._create_file_type_dict()
    return {k: pl.scan_parquet(v) for k, v in _files.items()}
🧩 Analysis chain
🌐 Web query:
Does polars scan_parquet accept a list of files
💡 Result:
Short answer: No — polars.scan_parquet does not accept a Python list of file paths. It expects a single path/string (a file, directory, or a glob pattern) and can read multiple files via a directory or glob, or by using read_parquet which accepts multiple paths. [1][2]
References
- scan_parquet signature and "Lazily read ... file or multiple files via glob patterns." [1]
- Detailed scan_parquet params and behavior (source is a path/str; use glob or directory to read multiple files). [2]
🏁 Script executed:
# Find the file and examine context around line 279
cd app/states/texas && head -290 texas_downloader.py | tail -20Repository: Abstract-Data/campaignfinance-2023
Length of output: 892
🏁 Script executed:
# Look at the _files variable and understand its structure
cd app/states/texas && grep -n "_files" texas_downloader.py | head -20Repository: Abstract-Data/campaignfinance-2023
Length of output: 1034
🏁 Script executed:
cd app/states/texas && grep -n "_create_file_type_dict" texas_downloader.py -A 30 | head -50Repository: Abstract-Data/campaignfinance-2023
Length of output: 2444
pl.scan_parquet does not accept a list of paths—it only accepts a single path, directory, or glob pattern.
The _create_file_type_dict() method returns values as lists of file paths (line 114: _file_type_dict.setdefault(parts, []).append(file)), but pl.scan_parquet(v) at line 279 will fail when v is a list. Use a glob pattern or convert to a string path, or iterate through files individually like the consolidate_files() method does (lines 133–146).
🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around line 279, pl.scan_parquet is
being called with v which is a list of file paths but scan_parquet accepts a
single path/directory/glob; replace the single call with logic that handles
lists: either convert the list to an appropriate glob/string path if files share
a directory/pattern, or iterate the list and call pl.scan_parquet for each file
and then combine the resulting LazyFrames (mirroring the consolidate_files()
approach at lines 133–146) so the function returns a mapping of keys to a single
concatenated LazyFrame per key rather than attempting to pass a Python list into
pl.scan_parquet.
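The "glob or iterate" fix can be sketched with stdlib `pathlib` alone (hypothetical file layout; the polars calls appear only in comments, since the point here is the grouping and the two repair options):

```python
import tempfile
from pathlib import Path

# Hypothetical layout: expenses_01.parquet, expenses_02.parquet, filers_01.parquet
tmp = Path(tempfile.mkdtemp())
for name in ("expenses_01", "expenses_02", "filers_01"):
    (tmp / f"{name}.parquet").touch()

# Group files by type prefix, mirroring what _create_file_type_dict reportedly does.
file_type_dict: dict[str, list[Path]] = {}
for file in sorted(tmp.glob("*.parquet")):
    parts = file.stem.rsplit("_", 1)[0]
    file_type_dict.setdefault(parts, []).append(file)

# Option A: one glob pattern per type, usable as pl.scan_parquet(pattern).
patterns = {k: str(tmp / f"{k}_*.parquet") for k in file_type_dict}

# Option B: iterate the list and combine per-file scans, e.g.
#   pl.concat([pl.scan_parquet(f) for f in files])

print(sorted(file_type_dict))            # → ['expenses', 'filers']
print(len(file_type_dict["expenses"]))   # → 2
```

Either option hands `scan_parquet` a single source per call, which is what its signature requires.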
-class TECExpense(TECSettings, table=True):
+class TECExpense(TECSettings):
🧩 Analysis chain
🏁 Script executed:
# Get context around the class declaration
cat -n app/states/texas/validators/texas_expenses.py | head -30Repository: Abstract-Data/campaignfinance-2023
Length of output: 1413
🏁 Script executed:
# Search for TECExpense usage in the codebase
rg "TECExpense" --type py -B2 -A2Repository: Abstract-Data/campaignfinance-2023
Length of output: 1646
🏁 Script executed:
# Check what TECSettings looks like
rg "class TECSettings" --type py -B2 -A10Repository: Abstract-Data/campaignfinance-2023
Length of output: 1046
🏁 Script executed:
# Check if tx_expenses table is referenced anywhere else
rg "tx_expenses" --type pyRepository: Abstract-Data/campaignfinance-2023
Length of output: 157
🏁 Script executed:
# Check other validators to see if they have table=True
rg "class TEC\w+\(TECSettings.*\)" --type pyRepository: Abstract-Data/campaignfinance-2023
Length of output: 826
Remove dead code: __tablename__ and __table_args__ have no effect without table=True.
The TECExpense class lost its table=True parameter (line 18), which means it is no longer a SQLModel table. However, lines 19-20 still define __tablename__ and __table_args__, which are only meaningful for table models in SQLModel and now serve no purpose. Either restore table=True if TECExpense should remain an ORM table, or remove these two attributes to clean up dead code.
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_expenses.py around line 18, the TECExpense
class no longer declares table=True so its following __tablename__ and
__table_args__ are dead; either restore table=True on the TECExpense class
definition if it is intended to be a SQLModel ORM table, or remove the
__tablename__ and __table_args__ attributes entirely to eliminate dead code;
make the change consistent with other models (if you restore table=True, ensure
imports and any DB metadata usage remain correct; if you remove them, run
tests/lint to confirm no references rely on those attributes).
# clear_blank_strings = model_validator(mode='before')(funcs.clear_blank_strings)
# check_dates = model_validator(mode='before')(tx_funcs.validate_dates)
# check_zipcodes = model_validator(mode='before')(tx_funcs.check_zipcodes)
# address_formatting = model_validator(mode='before')(tx_funcs.address_formatting)
# phone_number_validation = model_validator(mode='before')(tx_funcs.phone_number_validation)
🧩 Analysis chain
🏁 Script executed:
rg -n "clear_blank_strings|check_dates|check_zipcodes|phone_number_validation|address_formatting" app/states/texas/validators/texas_settings.py -A2Repository: Abstract-Data/campaignfinance-2023
Length of output: 534
🏁 Script executed:
rg -n "class TECExpense" app/states/texas/validators/texas_expenses.py -A5Repository: Abstract-Data/campaignfinance-2023
Length of output: 674
Commented-out validators are redundant and may mask inheritance issues.
These validators are inherited from TECSettings (via class inheritance on line 18), so commenting them out here is redundant. However, the situation is more nuanced:
- `clear_blank_strings` and `check_dates` are active in TECSettings (lines 32-33), so commenting them in TECExpense serves no purpose and could confuse maintainers.
- `phone_number_validation` in TECExpense corresponds to `check_phone_numbers` in TECSettings (line 35, active), with a naming mismatch that adds confusion.
- `check_zipcodes` and `address_formatting` are also commented in TECSettings (lines 34, 36), making these redundant comments match the parent class state.
If duplicate execution is a concern, remove these comments entirely and rely on inheritance. If they're disabled intentionally, document why in a comment explaining the specific reason (e.g., "disabled due to performance concerns").
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_expenses.py around lines 153 to 157,
commented-out model validators duplicate or conflict with validators inherited
from TECSettings and create confusion (and a naming mismatch for phone
validator). Remove these commented validator lines so the class relies on
inherited validators, or if you intentionally want to disable or rename any
validator, add a concise inline comment explaining the reason (e.g., "disabled
due to performance" or "renamed to check_phone_numbers in parent"); ensure names
match the parent (phone_number_validation -> check_phone_numbers) if you
reintroduce them.
__tablename__ = "tx_person_names"
__table_args__ = {"schema": "texas"}
addresses: list["TECAddress"] = Relationship(back_populates="person_name", link_model=TECPersonAddressLinkModel)
# filer_name: list["TECFilerName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink)
Major: Commented-out filer_name relationship may cause issues.
Line 48 has a commented-out relationship: # filer_name: list["TECFilerName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink). However, Line 144 in TECFilerName references this with filer_name: Optional["TECPersonName"] = Relationship(back_populates="filer_name", link_model=TECFilerIDNameLink). This asymmetry will cause SQLAlchemy relationship configuration errors.
Either uncomment and fix Line 48, or remove/update the relationship on Line 144.
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around line 48, the filer_name
relationship is commented out which creates an asymmetric relationship because
TECFilerName (line 144) declares filer_name with back_populates and
link_model=TECFilerIDNameLink; restore symmetry by uncommenting and correcting
the relationship on line 48 so the attribute name and back_populates value
exactly match the counterpart on TECFilerName, use the proper typing (e.g.,
list["TECFilerName"] or Optional depending on cardinality) and include
link_model=TECFilerIDNameLink and back_populates="filer_name"; alternatively, if
you intend a different cardinality, update TECFilerName’s back_populates to
match the relationship name used here and ensure forward refs/imports are
present.
def __init__(self, **data):
    super().__init__(**data)
    self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)
🧹 Nitpick | 🔵 Trivial
Add type annotations to init method.
The __init__ method is missing return type annotation (-> None) and parameter type annotation for **data.
- def __init__(self, **data):
+ def __init__(self, **data: Any) -> None:
super().__init__(**data)
         self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)

Add to imports:
from typing import Optional
+from typing import Any📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-    def __init__(self, **data):
+    def __init__(self, **data: Any) -> None:
         super().__init__(**data)
         self.id = self.generate_key(self.ctaSeekOfficeCd, self.ctaSeekOfficeDistrict, self.ctaSeekOfficePlace)
🧰 Tools
🪛 Ruff (0.14.7)
84-84: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
84-84: Missing type annotation for **data
(ANN003)
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around lines 84 to 86, the
__init__ method lacks type annotations; change its signature to include
parameter and return types (e.g., def __init__(self, **data: Any) -> None:) and
update imports to include Any from typing (add "from typing import Any"). Ensure
the rest of the method remains unchanged and type-checks under these
annotations.
filer_names: Optional["TECFilerID"] = Relationship(back_populates="other_names", link_model=TECFilerIDNameLink)

@model_validator(mode='before')
@classmethod
def validate_name(cls, values):
🧹 Nitpick | 🔵 Trivial
Add return type annotations to validators.
All the model_validator classmethods are missing return type annotations. While this is a minor issue, adding -> dict or -> Any would improve code clarity and satisfy type checkers.
@model_validator(mode='before')
@classmethod
- def validate_name(cls, values):
+ def validate_name(cls, values: dict) -> dict:
...Apply the same pattern to all validators in this file.
Also applies to: 158-158, 167-167, 174-174, 181-181
🧰 Tools
🪛 Ruff (0.14.7)
152-152: Missing return type annotation for classmethod validate_name
(ANN206)
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_filers.py around lines 152, 158, 167, 174
and 181, the model_validator classmethods (e.g., validate_name and the other
validators) lack return type annotations; update each validator signature to
include an explicit return type (-> dict or -> Any) consistent across the file,
import Any from typing if using Any, and run type checks to ensure signatures
match their actual return values.
# flags: Optional[TECFlags] = Field(default=None, description="Flags")
#
# def __init__(self, **data):
#     _flags = TECFlags(**data)
#     if len(_flags.model_dump(exclude_none=True).values()) > 0:
#         self.flags = _flags
#     super().__init__(**data)
(no newline at end of file)
🧹 Nitpick | 🔵 Trivial
Remove or address commented-out code.
The commented-out flags field and __init__ method represent incomplete functionality. Either implement this feature or remove the dead code to avoid confusion.
🤖 Prompt for AI Agents
In app/states/texas/validators/texas_settings.py around lines 51-57, remove the
commented-out flags field and __init__ or implement them properly; either delete
the dead code to avoid confusion, or restore and implement the flags feature by
adding a typed Optional[TECFlags] Field and an initializer that constructs
TECFlags from incoming data only when flag values are present (using Pydantic v2
model validation/creation patterns and model_dump/exclude_none to detect
values), then call super().__init__(**data); ensure imports and typing match the
project Pydantic version.
import datetime as dt
from decimal import Decimal
from pathlib import Path
from typing import Dict, List, Tuple
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
# First, let's check GenericFileReader's handling of download_date and timezone
rg -n "download_date|fromtimestamp" app/ingest/file_reader.py -A 3 -B 1Repository: Abstract-Data/campaignfinance-2023
Length of output: 143
🏁 Script executed:
# Now let's examine the test file at the specific lines mentioned
cat -n app/tests/test_ingest_hypothesis.py | head -20Repository: Abstract-Data/campaignfinance-2023
Length of output: 819
🏁 Script executed:
# Check lines 122-123, 144-145, 174-175 in the test file
sed -n '120,125p; 142,147p; 172,177p' app/tests/test_ingest_hypothesis.pyRepository: Abstract-Data/campaignfinance-2023
Length of output: 1026
🏁 Script executed:
# Also check the Python version requirement
fd -t f "setup.py|setup.cfg|pyproject.toml|.python-version" -x head -20 {} \;Repository: Abstract-Data/campaignfinance-2023
Length of output: 576
🏁 Script executed:
# Search for GenericFileReader implementation across all files
find . -name "*.py" -type f -exec grep -l "class GenericFileReader\|download_date" {} \;Repository: Abstract-Data/campaignfinance-2023
Length of output: 854
🏁 Script executed:
# Get more context around line 174
sed -n '168,180p' app/tests/test_ingest_hypothesis.pyRepository: Abstract-Data/campaignfinance-2023
Length of output: 684
🏁 Script executed:
# Check if there are any datetime timezone usages in GenericFileReader or related files
find . -name "*.py" -type f -exec grep -l "timezone\|UTC\|utc" {} \; | grep -E "(file_reader|ingest)" | head -5Repository: Abstract-Data/campaignfinance-2023
Length of output: 60
🏁 Script executed:
# Try to find if GenericFileReader is defined in app directory structure
find ./app -name "*.py" -type f | xargs grep -l "class GenericFileReader"Repository: Abstract-Data/campaignfinance-2023
Length of output: 62
Address minor Ruff hints: built‑in generics, tz‑aware datetime, and unused lambda arg.
- Line 7: Switch to built-in generics for annotations (UP035): `dict[str, object]`, `list[str]`, `tuple[...]` instead of `Dict`, `List`, `Tuple`.
- Lines 122 & 144: Add timezone information to `dt.datetime.fromtimestamp()` (DTZ006), e.g., `dt.datetime.fromtimestamp(file_path.stat().st_mtime, tz=dt.timezone.utc)` to make tests explicit with tz-aware datetimes rather than naive local times.
- Line 174: Change the lambda parameter from `self` to `_` (ARG005): `property(lambda _: tmp_path)`, since the parameter is unused.
These are optional cleanups; functionality is otherwise fine.
🧰 Tools
🪛 Ruff (0.14.7)
7-7: typing.Dict is deprecated, use dict instead
(UP035)
7-7: typing.List is deprecated, use list instead
(UP035)
7-7: typing.Tuple is deprecated, use tuple instead
(UP035)
🤖 Prompt for AI Agents
In app/tests/test_ingest_hypothesis.py around lines 7, 122, 144 and 174, address
Ruff hints by switching typing imports to built-in generics on line 7 (use
dict[str, object], list[str], tuple[...] instead of Dict, List, Tuple), make
timestamps tz-aware on lines 122 and 144 by passing a timezone (e.g.,
tz=dt.timezone.utc) to dt.datetime.fromtimestamp(...), and rename the unused
lambda parameter on line 174 from self to _ (property(lambda _: tmp_path)) to
indicate it's unused.
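The tz-aware pattern the DTZ006 hint asks for can be sketched with the stdlib alone (a throwaway temp file stands in for the test fixture's path):

```python
import datetime as dt
import tempfile
from pathlib import Path

# A throwaway file stands in for the fixture's file_path.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    file_path = Path(fh.name)

mtime = file_path.stat().st_mtime

naive = dt.datetime.fromtimestamp(mtime)                      # local time, tzinfo is None
aware = dt.datetime.fromtimestamp(mtime, tz=dt.timezone.utc)  # explicit UTC

print(naive.tzinfo, aware.tzinfo)
```

The naive variant silently depends on the machine's local timezone; the aware variant pins the interpretation to UTC, which keeps mtime comparisons stable across environments.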
Actionable comments posted: 3
♻️ Duplicate comments (10)
app/states/texas/texas_downloader.py (7)
32-33: Unused parameter flagged in previous review. The `read_from_temp` parameter is never used in the method body and should be removed as indicated in the previous review.

36-36: Iterator consumption issue flagged in previous review. The iterator `_existing_files` will be exhausted when checking `_safe_to_delete` and cannot be reused for deletion later, as detailed in the previous review.

80-84: Deletion logic issue flagged in previous review. The `_existing_files` iterator is exhausted by this point and the deletion will be a no-op, as explained in the previous review comment.

143-180: File reading inefficiency flagged in previous review. The iterator consumption pattern causes files to be read multiple times, as detailed in the previous review.

201-202: Timezone-aware datetime recommended in previous review. Using `datetime.now()` without a timezone argument was flagged in the previous review.

276-279: `pl.scan_parquet` does not accept a list, as flagged in previous review. The `_create_file_type_dict()` method returns lists of file paths, but `pl.scan_parquet(v)` at line 279 expects a single path, directory, or glob pattern, not a list. This was detailed in the previous review.

208-274: Critical bug: `_data` dictionary is cleared before use. Line 218 initializes `_data = {}`, lines 219-221 populate it with file mappings, but line 223 immediately reassigns `_data = {}`, discarding all collected data. The subsequent loop at line 224 iterates over an empty dictionary, making the entire `read()` method non-functional. Apply this diff to remove the redundant reinitialization:

     _data = {}
     for file in parquet_files:
         _type = file.stem.rsplit('_', 1)[0]
         _data.setdefault(_type, []).append(file)
-    _data = {}
     for k, v in _data.items():

app/tests/test_ingest_hypothesis.py (3)
7-7: Deprecated typing imports flagged in previous review. The previous review recommended switching to built-in generics (`dict`, `list`, `tuple`) instead of `typing.Dict`, `typing.List`, `typing.Tuple`.

122-122: Timezone-aware datetime recommended in previous review. The previous review flagged `dt.datetime.fromtimestamp()` calls without a `tz` argument at lines 122 and 144. Also applies to: 144-144

174-174: Unused lambda parameter flagged in previous review. The previous review noted that the lambda parameter should be renamed from `self` to `_` to indicate it's unused.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- `app/states/texas/texas_downloader.py` (1 hunks)
- `app/tests/test_ingest_hypothesis.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
app/states/texas/texas_downloader.py (3)
app/funcs/csv_reader.py (3)
`FileReader` (54-144), `read_parquet` (108-132), `read_folder` (82-104)

app/abcs/abc_download.py (6)

`FileDownloaderABC` (24-142), `download` (109-110), `not_headless` (36-39), `extract_zipfile` (62-70), `consolidate_files` (114-115), `read` (119-120)

app/live_display.py (2)

`add_task` (34-44), `update_task` (46-52)

app/tests/test_ingest_hypothesis.py (4)

app/abcs/abc_state_config.py (1)

`StateConfig` (140-180)

app/states/texas/texas_downloader.py (1)

`consolidate_files` (120-206)

app/abcs/abc_download.py (1)

`consolidate_files` (114-115)

app/states/ohio/ohio_downloader.py (1)

`consolidate_files` (31-32)
🪛 Ruff (0.14.7)
app/states/texas/texas_downloader.py
25-25: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
32-32: Boolean-typed positional argument in function definition
(FBT001)
32-32: Boolean default positional argument in function definition
(FBT002)
32-32: Unused class method argument: read_from_temp
(ARG003)
33-33: Boolean-typed positional argument in function definition
(FBT001)
33-33: Boolean default positional argument in function definition
(FBT002)
36-36: Unnecessary map() usage (rewrite using a generator expression)
Replace map() with a generator expression
(C417)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
83-83: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
87-87: Unnecessary generator (rewrite as a list comprehension)
Rewrite as a list comprehension
(C400)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
95-95: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
113-113: Avoid specifying long messages outside the exception class
(TRY003)
120-120: Missing return type annotation for classmethod consolidate_files
(ANN206)
147-147: Do not catch blind exception: Exception
(BLE001)
178-178: Do not catch blind exception: Exception
(BLE001)
201-201: datetime.datetime.now() called without a tz argument
(DTZ005)
204-204: Unnecessary map() usage (rewrite using a list comprehension)
Replace map() with a list comprehension
(C417)
209-209: Boolean-typed positional argument in function definition
(FBT001)
209-209: Boolean default positional argument in function definition
(FBT002)
217-217: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Do not catch blind exception: Exception
(BLE001)
264-264: Do not catch blind exception: Exception
(BLE001)
269-269: Do not catch blind exception: Exception
(BLE001)
277-277: Missing return type annotation for classmethod dataframes
(ANN206)
app/tests/test_ingest_hypothesis.py
7-7: typing.Dict is deprecated, use dict instead
(UP035)
7-7: typing.List is deprecated, use list instead
(UP035)
7-7: typing.Tuple is deprecated, use tuple instead
(UP035)
122-122: datetime.datetime.fromtimestamp() called without a tz argument
(DTZ006)
144-144: datetime.datetime.fromtimestamp() called without a tz argument
(DTZ006)
174-174: Unused lambda argument: self
(ARG005)
211-211: Loop control variable files not used within loop body
(B007)
🔇 Additional comments (1)
app/tests/test_ingest_hypothesis.py (1)
1-224: LGTM: Comprehensive hypothesis-driven tests for ingestion and consolidation. The test suite effectively validates:
- GenericFileReader for both CSV and Parquet formats with schema mapping and type conversions
- Metadata augmentation (file_origin, download_date)
- Texas consolidation workflow with dynamic schema alignment and file lifecycle management
The property-based testing approach with Hypothesis provides strong coverage of edge cases and various input combinations.
if not headless or not cls.driver.headless:
    cls.not_headless()
🧩 Analysis chain
🏁 Script executed:
cd /repo && cat -n app/states/texas/texas_downloader.py | head -80Repository: Abstract-Data/campaignfinance-2023
Length of output: 133
🏁 Script executed:
cd /repo && rg -n "not_headless" --type=py -A 3 -B 3Repository: Abstract-Data/campaignfinance-2023
Length of output: 133
🏁 Script executed:
cd /repo && rg -n "def not_headless|driver\.headless\s*=" --type=py -B 2 -A 5Repository: Abstract-Data/campaignfinance-2023
Length of output: 133
🏁 Script executed:
pwd && git ls-files | head -20Repository: Abstract-Data/campaignfinance-2023
Length of output: 529
🏁 Script executed:
find . -name "texas_downloader.py" -type f 2>/dev/nullRepository: Abstract-Data/campaignfinance-2023
Length of output: 116
🏁 Script executed:
cat -n app/states/texas/texas_downloader.py | head -80Repository: Abstract-Data/campaignfinance-2023
Length of output: 3225
🏁 Script executed:
rg -n "not_headless|def not_headless" --type=py -B 3 -A 5Repository: Abstract-Data/campaignfinance-2023
Length of output: 1239
🏁 Script executed:
rg -n "driver\.headless" --type=py -B 2 -A 2Repository: Abstract-Data/campaignfinance-2023
Length of output: 647
🏁 Script executed:
rg -n "class CreateWebDriver|def __init__" app/abcs/abc_download.py -B 2 -A 10 | head -50Repository: Abstract-Data/campaignfinance-2023
Length of output: 60
🏁 Script executed:
rg -n "CreateWebDriver" --type=py -B 2 -A 8 | grep -A 8 "class CreateWebDriver"Repository: Abstract-Data/campaignfinance-2023
Length of output: 60
🏁 Script executed:
grep -r "class CreateWebDriver" --include="*.py"Repository: Abstract-Data/campaignfinance-2023
Length of output: 60
🏁 Script executed:
cat -n app/abcs/abc_download.py | head -50Repository: Abstract-Data/campaignfinance-2023
Length of output: 2099
🏁 Script executed:
rg -n "self.driver|cls.driver" app/abcs/abc_download.py -B 2 -A 2 | head -40Repository: Abstract-Data/campaignfinance-2023
Length of output: 222
🏁 Script executed:
rg -n "FileDownloaderABC.driver.*=" --type=py -B 2 -A 2Repository: Abstract-Data/campaignfinance-2023
Length of output: 407
Simplify the headless mode logic.
The conditional logic is unnecessarily complex:
if not headless or not cls.driver.headless:
    cls.not_headless()

When `headless=True`, this evaluates to `if False or not cls.driver.headless:`, which redundantly calls `not_headless()` only if the driver is already in non-headless mode. When `headless=False`, the first condition is always True, making the second condition irrelevant. Simplify to directly handle the user's requested mode:
-if not headless or not cls.driver.headless:
- cls.not_headless()
+if not headless:
+ cls.not_headless()🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around lines 39-40, the conditional "if
not headless or not cls.driver.headless: cls.not_headless()" is overly complex;
change it to directly act on the user's requested mode by calling
cls.not_headless() only when headless is False (i.e., use "if not headless:
cls.not_headless()") and remove the redundant driver.headless check so the code
clearly follows the requested headless parameter.
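The difference between the two conditions can be checked exhaustively with a tiny stdlib sketch (plain functions standing in for the classmethod's logic):

```python
from itertools import product


def original(headless: bool, driver_headless: bool) -> bool:
    return not headless or not driver_headless


def simplified(headless: bool, driver_headless: bool) -> bool:
    return not headless


# Enumerate all four input combinations.
for headless, driver_headless in product([True, False], repeat=2):
    print(headless, driver_headless,
          original(headless, driver_headless),
          simplified(headless, driver_headless))

# The two conditions differ only at headless=True, driver_headless=False,
# where the original would redundantly re-enter non-headless mode.
```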
if not parquet:
    _reader = FileReader()
    cls.data = _reader.read_folder(cls.folder, file_counts=cls.config.FILE_COUNTS)
cls.folder may not exist as a class attribute.
The read classmethod references cls.folder at line 213, but folder is defined as an instance attribute (field(init=False)) in the base class FileDownloaderABC. Accessing it as a class attribute will fail with an AttributeError.
Replace with the class-level cls.config.TEMP_FOLDER:
 if not parquet:
     _reader = FileReader()
-    cls.data = _reader.read_folder(cls.folder, file_counts=cls.config.FILE_COUNTS)
+    cls.data = _reader.read_folder(cls.config.TEMP_FOLDER, file_counts=cls.config.FILE_COUNTS)
 else:

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In app/states/texas/texas_downloader.py around line 213, the classmethod uses
cls.folder which is an instance-only attribute and will raise AttributeError;
replace the reference with the class-level temp folder (cls.config.TEMP_FOLDER)
so the classmethod uses a valid class attribute, i.e., call _reader.read_folder
using cls.config.TEMP_FOLDER and keep passing cls.config.FILE_COUNTS.
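The underlying Python behavior is easy to reproduce with a plain dataclass. A minimal sketch, assuming `folder` is declared roughly as described (`field(init=False)`, assigned in `__post_init__`); the class below is a stand-in, not the real `FileDownloaderABC`:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class DownloaderSketch:
    # No default value, so dataclasses removes the class-level attribute entirely;
    # `folder` exists only on instances after __post_init__ runs.
    folder: Path = field(init=False)

    def __post_init__(self) -> None:
        self.folder = Path("tmp")

    @classmethod
    def read(cls):
        return cls.folder  # raises AttributeError when called on the class

raised = False
try:
    DownloaderSketch.read()
except AttributeError:
    raised = True

assert raised                                    # class access fails
assert DownloaderSketch().folder == Path("tmp")  # instance access works
```

This is why the classmethod must read a true class-level attribute such as `cls.config.TEMP_FOLDER` instead.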
for category, files in category_files.items():
    consolidated = list(tmp_path.glob(f"{category}_*.parquet"))
    assert len(consolidated) == 1, f"Expected a single consolidated file for {category}"
🧹 Nitpick | 🔵 Trivial
Unused loop variable files.
The loop unpacks category_files.items() into (category, files) but only category is used in the loop body. The files variable can be replaced with _ to indicate it's intentionally unused.
Apply this diff:
-for category, files in category_files.items():
+for category, _ in category_files.items():
     consolidated = list(tmp_path.glob(f"{category}_*.parquet"))
     assert len(consolidated) == 1, f"Expected a single consolidated file for {category}"
🧰 Tools
🪛 Ruff (0.14.7)
211-211: Loop control variable files not used within loop body
(B007)
🤖 Prompt for AI Agents
In app/tests/test_ingest_hypothesis.py around lines 211 to 213, the loop unpacks
category_files.items() into (category, files) but never uses files; replace the
unused variable with an underscore (e.g., for category, _ in
category_files.items():) to indicate it is intentionally unused and silence
linters.
- Add direct lookup for unified field names before fuzzy matching
- Skip fuzzy matching on fields with unified prefixes (person_, address_, etc.)
- Make transaction fields optional in field library (required=False)

This fixes the issue where campaign names like 'LYTAL LYTAL LYTAL' appeared instead of 'JAMES H. LYTAL' due to fuzzy matching incorrectly matching person_last_name when looking for person_first_name.
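The lookup order described in this commit can be sketched with stdlib `difflib`; the prefix tuple and field names below are illustrative, not the project's actual field library API:

```python
from difflib import get_close_matches
from typing import Optional

UNIFIED_PREFIXES = ("person_", "address_", "committee_", "transaction_")

def resolve_field(name: str, known_fields: set) -> Optional[str]:
    # 1. Direct lookup: an exact unified name never goes through fuzzy matching.
    if name in known_fields:
        return name
    # 2. Skip fuzzy matching for unified-prefixed names, so a request for
    #    person_first_name can never be "corrected" to person_last_name.
    if name.startswith(UNIFIED_PREFIXES):
        return None
    # 3. Fuzzy matching only for raw, state-specific column names.
    matches = get_close_matches(name, sorted(known_fields), n=1, cutoff=0.8)
    return matches[0] if matches else None

fields = {"person_first_name", "person_last_name", "committee_name"}
assert resolve_field("person_first_name", fields) == "person_first_name"  # direct hit
assert resolve_field("person_middle_name", fields) is None                # prefixed: no fuzzing
assert resolve_field("comittee_name", fields) == "committee_name"         # raw typo: fuzzy is fine
```

Without step 2, the fuzzy matcher would happily map one unified person field onto another, which is exactly the 'LYTAL LYTAL LYTAL' failure mode.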
Actionable comments posted: 7
♻️ Duplicate comments (6)
app/states/unified_sqlmodels.py (6)
129-129: datetime.utcnow() is deprecated in Python 3.12+. This issue was previously flagged. Consider using datetime.now(timezone.utc) for forward compatibility.
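A minimal illustration of the suggested replacement, including a zero-argument wrapper suitable for a `default_factory` (the wrapper name is ours, not the project's):

```python
from datetime import datetime, timezone

def utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)

naive = datetime.utcnow()  # deprecated in 3.12+: no tzinfo attached
aware = utc_now()

assert naive.tzinfo is None  # the old call silently drops the zone
assert aware.tzinfo is timezone.utc

# In a SQLModel field definition this would read:
#   created_at: datetime = Field(default_factory=utc_now)
```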
413-414: Duplicate field definition: state_id is defined twice. This bug was previously flagged. Remove Line 414 to fix.

853-853: Unused parameter entity_role. This was previously flagged. Either remove the parameter or use it to differentiate address field lookups by role (e.g., contributor vs payee address fields).

1001-1002: Broad exception handling returns False silently. This was previously flagged. Consider catching specific exceptions or logging unexpected errors.

1017-1018: Broad exception handling returns None silently. This was previously flagged. Consider logging unexpected exceptions at debug level.
1191-1196: SQL injection risk and incorrect session.exec usage. The SQL injection risk was previously flagged. Additionally, Line 1194 uses session.exec(text(query), params), but SQLModel's exec() method signature doesn't accept positional parameters this way. This will likely raise a TypeError or ignore the parameters. Use SQLModel's query builder as suggested in the previous review, or fix the raw SQL approach:

-        if len(conditions) >= 2:
-            query = f"SELECT * FROM unified_addresses WHERE {' AND '.join(conditions)} LIMIT 1"
-            result = session.exec(text(query), params).first()
-            if result:
-                return UnifiedAddress(**dict(result))
+        if len(params) >= 2:
+            stmt = select(UnifiedAddress)
+            if params.get("street_1"):
+                stmt = stmt.where(UnifiedAddress.street_1 == params["street_1"])
+            if params.get("city"):
+                stmt = stmt.where(UnifiedAddress.city == params["city"])
+            if params.get("state"):
+                stmt = stmt.where(UnifiedAddress.state == params["state"])
+            if params.get("zip_code"):
+                stmt = stmt.where(UnifiedAddress.zip_code == params["zip_code"])
+            return session.exec(stmt).first()
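If the raw-SQL path is kept instead, the safe shape is to hard-code the column names and bind only the values. A stdlib `sqlite3` sketch of that pattern (the table and columns mimic `unified_addresses` but are created locally for illustration):

```python
import sqlite3
from typing import Optional

def find_address(conn, street_1=None, city=None, state=None, zip_code=None) -> Optional[tuple]:
    """Column names are fixed strings; user-supplied values are always bound."""
    filters = {"street_1": street_1, "city": city, "state": state, "zip_code": zip_code}
    conditions, params = [], []
    for column, value in filters.items():
        if value is not None:
            conditions.append(f"{column} = ?")  # only trusted column names enter the SQL text
            params.append(value)
    if len(conditions) < 2:
        return None
    query = f"SELECT * FROM unified_addresses WHERE {' AND '.join(conditions)} LIMIT 1"
    return conn.execute(query, params).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unified_addresses (street_1 TEXT, city TEXT, state TEXT, zip_code TEXT)")
conn.execute("INSERT INTO unified_addresses VALUES ('100 Congress Ave', 'Austin', 'TX', '78701')")

row = find_address(conn, city="Austin", state="TX")
assert row == ("100 Congress Ave", "Austin", "TX", "78701")
# An injection attempt is bound as a literal value, not executed as SQL:
assert find_address(conn, city="x' OR '1'='1", state="TX") is None
```

The key property: nothing the caller supplies is ever interpolated into the query string, which is what S608 flags in the original code.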
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
app/states/unified_field_library.py (1 hunks)
app/states/unified_sqlmodels.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/states/unified_sqlmodels.py (2)
app/states/unified_field_library.py (1)
get_state_mappings (380-382)

app/states/unified_database.py (1)
get_session(77-79)
🪛 Ruff (0.14.7)
app/states/unified_field_library.py
8-8: typing.Dict is deprecated, use dict instead
(UP035)
8-8: typing.List is deprecated, use list instead
(UP035)
8-8: typing.Set is deprecated, use set instead
(UP035)
8-8: typing.Tuple is deprecated, use tuple instead
(UP035)
108-108: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
115-115: Missing return type annotation for private function _initialize_unified_fields
Add return type annotation: None
(ANN202)
296-296: Missing return type annotation for private function _initialize_state_mappings
Add return type annotation: None
(ANN202)
465-465: Unnecessary mode argument
Remove mode argument
(UP015)
app/states/unified_sqlmodels.py
8-8: typing.List is deprecated, use list instead
(UP035)
8-8: typing.Dict is deprecated, use dict instead
(UP035)
136-136: Unnecessary UTF-8 encoding argument to encode
Remove unnecessary encoding argument
(UP012)
164-164: Missing return type annotation for special method __post_init__
(ANN204)
227-227: Missing return type annotation for special method __post_init__
(ANN204)
288-288: Missing return type annotation for special method __post_init__
(ANN204)
414-414: Class field state_id is defined multiple times
Remove duplicate field definition for state_id
(PIE794)
747-747: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
853-853: Unused method argument: entity_role
(ARG002)
953-953: Dynamically typed expressions (typing.Any) are disallowed in _get_field_value
(ANN401)
1000-1000: Consider moving this statement to an else block
(TRY300)
1001-1001: Do not catch blind exception: Exception
(BLE001)
1017-1017: Do not catch blind exception: Exception
(BLE001)
1046-1046: Do not catch blind exception: Exception
(BLE001)
1104-1104: Do not catch blind exception: Exception
(BLE001)
1193-1193: Possible SQL injection vector through string-based query construction
(S608)
1199-1199: Do not catch blind exception: Exception
(BLE001)
1211-1211: Do not catch blind exception: Exception
(BLE001)
1214-1214: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1228-1228: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1238-1238: Naive datetime constructed using datetime.datetime.strptime() without %z
(DTZ007)
1244-1244: Consider moving this statement to an else block
(TRY300)
1248-1248: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
1303-1303: Dynamically typed expressions (typing.Any) are disallowed in obj
(ANN401)
1303-1303: Dynamically typed expressions (typing.Any) are disallowed in _json_default
(ANN401)
1314-1314: Missing return type annotation for special method __init__
Add return type annotation: None
(ANN204)
🔇 Additional comments (13)
app/states/unified_field_library.py (3)

15-77: LGTM! The enum definitions are comprehensive and well-organized with clear semantic groupings for campaign finance data categories and field types.

80-99: LGTM! The dataclass definitions are well-designed with appropriate defaults. The confidence field in StateFieldMapping is a thoughtful addition for handling approximate field mappings.

296-374: LGTM! The state-specific field mappings are comprehensive and well-structured. The varying confidence scores (e.g., 0.9 for receivedDt → filed_date, 0.8 for Purpose → description) appropriately indicate mapping uncertainty.

app/states/unified_sqlmodels.py (10)

23-99: LGTM! The enum definitions are comprehensive and well-organized, covering all necessary transaction types, person types, roles, and association types for campaign finance data modeling.

140-189: LGTM! The UnifiedAddress model is well-structured with appropriate field types and normalization in __post_init__. Note: the datetime.utcnow deprecation applies to multiple models throughout this file (Lines 157-158, 214-215, 278-279, etc.); consider a bulk update if addressing the deprecation.

438-546: LGTM! The version/history tables follow a consistent pattern for audit trails with JSON snapshots. The UnifiedCommitteePerson junction table properly models committee-person relationships with role information and change tracking.

548-622: LGTM! The entity association and campaign models are well-designed with proper relationship definitions. The UnifiedEntityAssociation pattern effectively models various relationship types (treasurer-of, vendor-for, etc.) between entities.

625-688: LGTM! The UnifiedContribution and UnifiedLoan models properly normalize specialized transaction data with appropriate entity relationships and unique constraints on transaction_id.

741-786: LGTM! The UnifiedSQLModelBuilder initialization and build_transaction method are well-implemented. Storing raw data as JSON provides valuable debugging/audit capability.

788-851: LGTM! The build_person method has good logic for detecting person types, including handling special placeholder cases like "NON-ITEMIZED CONTRIBUTOR" and "ANONYMOUS".

885-951: LGTM! The build_committee method has sensible fallback logic for handling committees without names, using candidate name or filer_id as alternatives.

1264-1301: LGTM! The _determine_transaction_type method has good logic for inferring transaction types from both explicit type fields and field name patterns, with appropriate fallbacks.

1309-1477: LGTM! The UnifiedSQLDataProcessor effectively orchestrates the building of complete transaction graphs with related entities (persons, committees, campaigns) and specialized financial records (contributions, loans). The process_record method properly creates all necessary junction table entries.
across different states, enabling consistent data processing and analysis.
"""

from typing import Dict, List, Set, Optional, Any, Tuple
🧹 Nitpick | 🔵 Trivial
Consider using built-in generic types instead of typing imports.
Python 3.9+ supports dict, list, set, and tuple as generic types directly, making typing.Dict, typing.List, etc. unnecessary.
-from typing import Dict, List, Set, Optional, Any, Tuple
+from typing import Optional, Any

Then update type hints throughout the file (e.g., Dict[str, FieldDefinition] → dict[str, FieldDefinition]).
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.7)
8-8: typing.Dict is deprecated, use dict instead
(UP035)
8-8: typing.List is deprecated, use list instead
(UP035)
8-8: typing.Set is deprecated, use set instead
(UP035)
8-8: typing.Tuple is deprecated, use tuple instead
(UP035)
# Committee fields
self.unified_fields.update({
    "committee_name": FieldDefinition(
        name="committee_name",
        category=FieldCategory.COMMITTEE_NAME,
        field_type=FieldType.STRING,
        description="Name of the political committee",
        examples=["filerName", "Committee Name"],
        validation_rules={"max_length": 200}
    ),
    "committee_type": FieldDefinition(
        name="committee_type",
        category=FieldCategory.COMMITTEE_TYPE,
        field_type=FieldType.CODE,
        description="Type of political committee",
        examples=["filerTypeCd", "Committee Type"],
        validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
    ),
})
Missing committee_filer_id unified field definition.
The state mappings reference committee_filer_id (Lines 332, 369), but this unified field is not defined in _initialize_unified_fields. This will cause get_unified_field("committee_filer_id") to return None and may break field lookups.
Add the missing field definition:
     "committee_type": FieldDefinition(
         name="committee_type",
         category=FieldCategory.COMMITTEE_TYPE,
         field_type=FieldType.CODE,
         description="Type of political committee",
         examples=["filerTypeCd", "Committee Type"],
         validation_rules={"enum_values": ["candidate", "pac", "party", "other"]}
     ),
+    "committee_filer_id": FieldDefinition(
+        name="committee_filer_id",
+        category=FieldCategory.COMMITTEE_IDENTIFICATION,
+        field_type=FieldType.IDENTIFIER,
+        description="Unique identifier for the committee from the state filing system",
+        examples=["filerIdent", "Org ID"],
+        validation_rules={"required": False}
+    ),
 })
🤖 Prompt for AI Agents
In app/states/unified_field_library.py around lines 250 to 268, add a unified
field definition for "committee_filer_id" because state mappings reference it
but it's missing; create a FieldDefinition with name="committee_filer_id", an
appropriate category (e.g., FieldCategory.COMMITTEE_ID or COMMITTEE_FILER_ID),
field_type=FieldType.STRING, a short description like "Unique filer identifier
for the committee", useful examples (e.g., ["filerId", "committeeFilerId"]), and
validation_rules such as {"max_length": 100} (and an optional pattern if your
IDs follow a specific regex); insert this into the
self.unified_fields.update({...}) block alongside the other committee fields so
get_unified_field("committee_filer_id") returns the new definition.
def import_mappings(self, file_path: Path):
    """Import mappings from a JSON file"""
    with open(file_path, 'r') as f:
        import_data = json.load(f)
🧹 Nitpick | 🔵 Trivial
Minor: Unnecessary mode argument.
The 'r' mode is the default for open() and can be omitted.
- with open(file_path, 'r') as f:
+ with open(file_path) as f:
      import_data = json.load(f)
🧰 Tools
🪛 Ruff (0.14.7)
465-465: Unnecessary mode argument
Remove mode argument
(UP015)
🤖 Prompt for AI Agents
In app/states/unified_field_library.py around lines 463 to 466, the
open(file_path, 'r') call uses the explicit 'r' mode which is unnecessary;
remove the mode argument (use open(file_path) or preferably file_path.open()) or
even simpler use file_path.read_text() and then json.loads to read the JSON, so
the function reads the file without the redundant 'r' argument.
"""Normalized file origin references for ingested data."""
__tablename__ = "file_origins"

id: str = Field(default=None, primary_key=True, max_length=64)
🧩 Analysis chain
🏁 Script executed:
# Search for FileOrigin instantiation patterns
rg -n "FileOrigin\(" --type py -A 3 -B 1

Repository: Abstract-Data/campaignfinance-2023
Length of output: 750

🏁 Script executed:
# Look for the build_key method implementation
rg -n "def build_key" --type py -A 10

Repository: Abstract-Data/campaignfinance-2023
Length of output: 928

🏁 Script executed:
# Search for how FileOrigin instances are created in tests or usage
rg -n "FileOrigin\s*=" --type py -A 2

Repository: Abstract-Data/campaignfinance-2023
Length of output: 60
FileOrigin.id has default=None but is a primary key—remove the default or use default_factory.
All instantiations in the codebase explicitly set id via the build_key() result (e.g., FileOrigin(id=key, ...)), so the default is never actually used. However, defining default=None on a primary key field is poor practice and confusing. Either remove the default entirely or use default_factory to compute the key automatically if instances are created without an id.
🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around line 126, the FileOrigin.id field is
declared as a primary_key with default=None which is confusing and poor
practice; remove the default=None so callers must supply an id, or replace it
with a default_factory that generates the key (e.g., call the existing build_key
function) if you want automatic ids—pick one approach, update the Field
declaration accordingly, and ensure any tests/instantiations that relied on
implicit defaults still pass.
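The two options can be contrasted with a plain dataclass analog; `build_key` below is a hypothetical stand-in for the model's real key builder, and `FileOriginSketch` is not the project's class:

```python
import hashlib
from dataclasses import dataclass, field

def build_key(seed: str = "unknown") -> str:
    """Hypothetical stand-in: a deterministic 64-char hex key."""
    return hashlib.sha256(seed.encode()).hexdigest()

@dataclass
class FileOriginSketch:
    path: str = "unknown"
    # Option B from the comment: compute a key automatically when none is given.
    # (Option A would be to drop the default entirely and force callers to pass id.)
    id: str = field(default_factory=build_key)

explicit = FileOriginSketch(path="tmp/texas/x.parquet", id=build_key("tmp/texas/x.parquet"))
fallback = FileOriginSketch()  # id is a real key, never a confusing None

assert len(fallback.id) == 64
assert explicit.id != fallback.id
```

Either option removes the ambiguity of a primary key that claims to default to None while every call site supplies a value.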
updated_at: datetime = Field(default_factory=datetime.utcnow)

# Relationships
address: Optional[UnifiedAddress] = Relationship()
Missing back_populates on UnifiedCommittee.address relationship.
Unlike other address relationships (e.g., UnifiedPerson.address, UnifiedEntity.address), this relationship doesn't specify back_populates. This inconsistency may cause bidirectional relationship sync issues.
- address: Optional[UnifiedAddress] = Relationship()
+ address: Optional[UnifiedAddress] = Relationship(back_populates="committees")

Note: You'll also need to add a corresponding relationship on UnifiedAddress:
# In UnifiedAddress class
committees: List["UnifiedCommittee"] = Relationship(back_populates="address")

🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around line 282, the UnifiedCommittee.address
Relationship is missing back_populates which breaks bidirectional sync; update
UnifiedCommittee.address to include back_populates="committees" and add the
corresponding Relationship on the UnifiedAddress model (e.g., committees:
List["UnifiedCommittee"] = Relationship(back_populates="address") ) so both
sides reference each other consistently.
# Database indexes for performance
class UnifiedTransactionIndexes:
    """Database indexes for the unified transaction system"""

    # Transaction indexes
    idx_transactions_state = Index("idx_transactions_state", UnifiedTransaction.state_id)
    idx_transactions_type = Index("idx_transactions_type", UnifiedTransaction.transaction_type)
    idx_transactions_date = Index("idx_transactions_date", UnifiedTransaction.transaction_date)
    idx_transactions_amount = Index("idx_transactions_amount", UnifiedTransaction.amount)
    idx_transactions_committee = Index("idx_transactions_committee", UnifiedTransaction.committee_id)
    idx_transactions_id = Index("idx_transactions_id", UnifiedTransaction.transaction_id)
    idx_transactions_file_origin = Index("idx_transactions_file_origin", UnifiedTransaction.file_origin_id)

    # Person indexes
    idx_persons_name = Index("idx_persons_name", UnifiedPerson.last_name, UnifiedPerson.first_name)
    idx_persons_organization = Index("idx_persons_organization", UnifiedPerson.organization)
    idx_persons_type = Index("idx_persons_type", UnifiedPerson.person_type)

    # Transaction-Person relationship indexes
    idx_transaction_persons_role = Index("idx_transaction_persons_role", UnifiedTransactionPerson.role)
    idx_transaction_persons_transaction = Index("idx_transaction_persons_transaction", UnifiedTransactionPerson.transaction_id)
    idx_transaction_persons_person = Index("idx_transaction_persons_person", UnifiedTransactionPerson.person_id)

    # Address indexes
    idx_addresses_state = Index("idx_addresses_state", UnifiedAddress.state)
    idx_addresses_city = Index("idx_addresses_city", UnifiedAddress.city)

    # Committee indexes
    idx_committees_name = Index("idx_committees_name", UnifiedCommittee.name)
    idx_committees_type = Index("idx_committees_type", UnifiedCommittee.committee_type)
    # Note: filer_id is already indexed as primary key

    # Entity indexes
    idx_entities_type = Index("idx_entities_type", UnifiedEntity.entity_type)
    idx_entities_name = Index("idx_entities_name", UnifiedEntity.normalized_name)

    # Campaign indexes
    idx_campaigns_year = Index("idx_campaigns_year", UnifiedCampaign.election_year)
    idx_campaigns_office = Index("idx_campaigns_office", UnifiedCampaign.office_sought)
    idx_campaigns_name = Index("idx_campaigns_name", UnifiedCampaign.normalized_name)
    idx_campaign_entity_role = Index("idx_campaign_entity_role", UnifiedCampaignEntity.role)

    # Contribution indexes
    idx_contributions_date = Index("idx_contributions_date", UnifiedContribution.receipt_date)
    idx_contributions_amount = Index("idx_contributions_amount", UnifiedContribution.amount)

    # Loan indexes
    idx_loans_date = Index("idx_loans_date", UnifiedLoan.loan_date)
    idx_loans_due_date = Index("idx_loans_due_date", UnifiedLoan.due_date)
🧩 Analysis chain
🏁 Script executed:
# Search for references to UnifiedTransactionIndexes and its indexes
rg -n "UnifiedTransactionIndexes|idx_transactions_|idx_persons_|idx_addresses_|idx_committees_" --type py -B 2 -A 2

Repository: Abstract-Data/campaignfinance-2023
Length of output: 2773

🏁 Script executed:
# Check if indexes are used in __table_args__ of model classes
rg -n "__table_args__" --type py -A 5

Repository: Abstract-Data/campaignfinance-2023
Length of output: 17028

🏁 Script executed:
# Look for migration files or initialization code that might apply these indexes
fd -e py -e sql | xargs grep -l "idx_" 2>/dev/null | head -20

Repository: Abstract-Data/campaignfinance-2023
Length of output: 109

🏁 Script executed:
# Check the import statements and usage of Index from SQLAlchemy
rg -n "from sqlalchemy|from sqlmodel" app/states/unified_sqlmodels.py -A 3

Repository: Abstract-Data/campaignfinance-2023
Length of output: 498

🏁 Script executed:
# Search for Index usage within __table_args__
rg -n "__table_args__.*Index|Index.*__table_args__" --type py -B 5 -A 5

Repository: Abstract-Data/campaignfinance-2023
Length of output: 60

🏁 Script executed:
# Check if there are any migration files
fd -e py -path "*/migrations/*" | head -20

Repository: Abstract-Data/campaignfinance-2023
Length of output: 246

🏁 Script executed:
# Look for alembic or db initialization files
fd -e py | xargs grep -l "alembic\|migrate\|create_all\|metadata" 2>/dev/null | head -10

Repository: Abstract-Data/campaignfinance-2023
Length of output: 302

🏁 Script executed:
# Check unified_database.py for index-related code
cat -n app/states/unified_database.py | head -100

Repository: Abstract-Data/campaignfinance-2023
Length of output: 4198

🏁 Script executed:
# Check recreate_tables.py
cat -n recreate_tables.py

Repository: Abstract-Data/campaignfinance-2023
Length of output: 5343

🏁 Script executed:
# Search for any reference to UnifiedTransactionIndexes specifically
rg -n "UnifiedTransactionIndexes" --type py

Repository: Abstract-Data/campaignfinance-2023
Length of output: 146

🏁 Script executed:
# Check if any model class has __table_args__ with Index to confirm pattern
rg -n "class Unified" app/states/unified_sqlmodels.py -A 10 | head -50

Repository: Abstract-Data/campaignfinance-2023
Length of output: 2934

🏁 Script executed:
# Verify the specific model definitions that need indexes
rg -n "class UnifiedTransaction\(|class UnifiedPerson\(|class UnifiedAddress\(" app/states/unified_sqlmodels.py -A 8

Repository: Abstract-Data/campaignfinance-2023
Length of output: 1434
Index definitions in UnifiedTransactionIndexes won't be automatically applied to the database.
The UnifiedTransactionIndexes class defines indexes as class attributes, but SQLModel/SQLAlchemy doesn't automatically apply indexes defined this way. These Index objects need to be registered with each model's __table_args__. Currently, the models use individual index=True flags on fields but don't include composite or additional indexes via __table_args__.
Move index definitions to the respective model classes. For example, on UnifiedTransaction:
class UnifiedTransaction(SQLModel, table=True):
    __tablename__ = "unified_transactions"
    __table_args__ = (
        Index("idx_transactions_state", "state_id"),
        Index("idx_transactions_type", "transaction_type"),
        Index("idx_transactions_date", "transaction_date"),
        # ... other indexes
    )

Repeat for UnifiedPerson, UnifiedAddress, UnifiedCommittee, and other models with indexes defined in UnifiedTransactionIndexes.
def _parse_amount(self, value: Any) -> Optional[Decimal]:
    """Parse an amount value to Decimal"""
    if value is None:
        return None

    try:
        # Remove currency symbols and commas
        if isinstance(value, str):
            value = re.sub(r'[^\d.-]', '', value)

        return Decimal(str(value))
    except (ValueError, TypeError):
        return None
Amount parsing may produce invalid results.
The regex on Line 1222 removes all characters except digits, ., and -, but doesn't validate the result. Inputs like "$1.2.3" or "--100" would produce invalid decimals that could raise InvalidOperation when converting to Decimal.
Consider adding validation after stripping:
 def _parse_amount(self, value: Any) -> Optional[Decimal]:
     """Parse an amount value to Decimal"""
     if value is None:
         return None
     try:
         # Remove currency symbols and commas
         if isinstance(value, str):
             value = re.sub(r'[^\d.-]', '', value)
+            # Validate the cleaned string has valid decimal format
+            if not re.match(r'^-?\d*\.?\d+$', value):
+                return None
         return Decimal(str(value))
-    except (ValueError, TypeError):
+    except (ValueError, TypeError, InvalidOperation):
         return None

Note: Import InvalidOperation from decimal module.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.7)
1214-1214: Dynamically typed expressions (typing.Any) are disallowed in value
(ANN401)
🤖 Prompt for AI Agents
In app/states/unified_sqlmodels.py around lines 1214 to 1226, the current
_parse_amount strips non-numeric characters but doesn't validate the cleaned
string so inputs like "$1.2.3" or "--100" can produce InvalidOperation; after
stripping, validate the result with a regex that only allows an optional single
leading '-' and at most one decimal point (e.g. ^-?\d+(\.\d+)?$), ensure the
cleaned string is non-empty, then convert to Decimal while catching
decimal.InvalidOperation (import InvalidOperation) and return None on any
validation or conversion failure.
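The suggested validation can be exercised stand-alone; this sketch mirrors the diff above as a free function rather than the class's real method:

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

def parse_amount(value: Any) -> Optional[Decimal]:
    """Strip currency formatting, validate the remainder, then convert."""
    if value is None:
        return None
    try:
        if isinstance(value, str):
            value = re.sub(r'[^\d.-]', '', value)
            # Reject leftovers like '1.2.3' or '--100' before Decimal sees them
            if not re.match(r'^-?\d*\.?\d+$', value):
                return None
        return Decimal(str(value))
    except (ValueError, TypeError, InvalidOperation):
        return None

assert parse_amount("$1,250.50") == Decimal("1250.50")
assert parse_amount("-$42") == Decimal("-42")
assert parse_amount("$1.2.3") is None   # would raise InvalidOperation unvalidated
assert parse_amount("--100") is None
assert parse_amount(None) is None
```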
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/ci.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[warning] 3-3: truthy value should be one of [false, true]
(truthy)
[error] 35-35: too many blank lines (1 > 0)
(empty-lines)
🔇 Additional comments (3)
.github/workflows/ci.yml (3)
17-17: Verify Python 3.12 dependency compatibility. The workflow specifies Python 3.12. Given the extensive dependency updates in this PR (Parquet support, Polars, Rich, SQLModel, etc.), please verify that all project dependencies are compatible with Python 3.12. If you haven't tested against 3.12 locally, consider doing so before merging.

24-25: Confirm CODECOV_TOKEN secret is configured. The workflow uploads coverage to Codecov using ${{ secrets.CODECOV_TOKEN }}. Ensure this secret is already configured in the repository settings; otherwise, the upload step will fail silently or error out at runtime.

1-35: LGTM! A solid CI foundation. The workflow provides a clean, modern CI setup that aligns well with the PR's test infrastructure additions:
- Modern Python 3.12 with uv package manager
- Pytest with coverage reporting
- Codecov integration for tracking coverage over time
- Triggers on both push and PR events for comprehensive testing
Pending verification of Python 3.12 compatibility and Codecov secret setup, this is ready to go.
on:
  push:
  pull_request:
🧹 Nitpick | 🔵 Trivial
Consider explicit trigger configuration for YAML linting.
The push: and pull_request: events without explicit configuration are valid GitHub Actions syntax and will trigger on all push/PR events. However, YAMLlint prefers explicit values. This is a minor style note—the current syntax works fine, but if you want to satisfy stricter linting, consider using explicit empty dicts or null:
```diff
 on:
-  push:
-  pull_request:
+  push: {}
+  pull_request: {}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
on:
  push: {}
  pull_request: {}
```
🧰 Tools
🪛 YAMLlint (1.37.1)
[warning] 3-3: truthy value should be one of [false, true]
(truthy)
🤖 Prompt for AI Agents
.github/workflows/ci.yml lines 3-5: the workflow triggers use bare "push:" and
"pull_request:" which are valid but fail stricter YAML lint rules; change each
to an explicit empty mapping or null (e.g., "push: {}" and "pull_request: {}" or
"push: null" and "pull_request: null") so the YAML linter accepts the syntax
while preserving the same trigger behavior.
```yaml
with:
  token: ${{ secrets.CODECOV_TOKEN }}
  slug: Abstract-Data/campaignfinance-2023
```
🧹 Nitpick | 🔵 Trivial
Remove trailing blank line.
YAMLlint flags the trailing blank line after line 34. Apply this cleanup:
```diff
   slug: Abstract-Data/campaignfinance-2023
-
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
slug: Abstract-Data/campaignfinance-2023
```
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 35-35: too many blank lines (1 > 0)
(empty-lines)
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around line 35, there is a trailing blank line
flagged by YAMLlint; remove the empty line at the end of the file so the file
ends on line 34 without an extra newline-only line, then save the file (ensure
the file still ends with a single newline character if your editor enforces it).
❌ 1 test failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Prevent Tests Dashboard
```yaml
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Install dependencies
        run: |
          uv pip install --system .
          uv pip install --system pytest pytest-cov

      - name: Run tests with coverage
        run: pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy

      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          slug: Abstract-Data/campaignfinance-2023

      - name: Upload test results to Codecov
        if: ${{ !cancelled() }}
        uses: codecov/test-results-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
```
Check warning (Code scanning / CodeQL): Workflow does not contain permissions (Medium)

Copilot Autofix
The recommended fix is to explicitly set a permissions block for the workflow or for the individual job(s). In this case, since there’s only one job and the workflow’s steps do not require write access to repository contents or other privileged scopes, the minimal permission should be set: contents: read. This should be added at the root level of the workflow file (just below name: and before on:), to apply to all jobs. No other imports, method changes, or additional code definitions are needed; only a single YAML key addition to .github/workflows/ci.yml.
```diff
@@ -1,4 +1,6 @@
 name: CI
+permissions:
+  contents: read
 
 on:
   push:
```
This is just the start of a new PR after I messed up the last one.
Summary by CodeRabbit
New Features
Documentation
Refactor
Chores
Tests