feat: add Claude guidelines for scraper creation and reviews #1808
Luis-manzur wants to merge 1 commit into main
Conversation
grossir left a comment:
- Has inexact info (e.g. "using WebDriven")
- Is repetitive (way too much text on usage of titlecase, which is very straightforward)
- Has redundant recommendations (e.g. creating a session, which is already created by AbstractSite)
- Is missing important heuristics (testing backscrapers, actually counting the number of results and comparing against what was parsed, watching for silently skipped rows and bad XPaths)

For the heuristics, check the PRs reviewed and find the usual pain points.

I would recommend discussing ideas in the issue before implementation. Otherwise the PR itself becomes the discussion, which works too, but only after trying to get a more polished starting point.
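The "count the results and compare with what was parsed" heuristic above could look something like the sketch below. This is a hypothetical illustration, not juriscraper code: `parse_rows` and its inputs are invented names, and a real scraper would be walking lxml rows rather than dicts. The point is simply to fail loudly instead of skipping bad rows silently.

```python
# Hypothetical sketch of the review heuristic: compare the number of results
# the page reports against the number of cases actually parsed, and raise on
# unparseable rows instead of silently `continue`-ing past them.

def parse_rows(rows, expected_count):
    """Parse result rows, failing loudly if any row is dropped."""
    cases = []
    for row in rows:
        name = row.get("name")
        date = row.get("date")
        if not name or not date:
            # A silent `continue` here would hide data loss from reviewers
            raise ValueError(f"Unparseable row: {row!r}")
        cases.append({"name": name, "date": date})
    if len(cases) != expected_count:
        raise ValueError(
            f"Expected {expected_count} results, parsed {len(cases)}"
        )
    return cases
```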
> **Why OpinionSiteLinear:**
> - Modern, maintainable architecture
> - Handles any data source (JSON APIs, HTML, XML)
> - Built-in pagination support
> - Built-in pagination support

?
> ## Choosing the Right Base Class
>
> **⚠️ CRITICAL: All new scrapers MUST use `OpinionSiteLinear` or `OpinionSiteLinearWebDriven`.**
We don't use OpinionSiteLinearWebDriven because CL doesn't support it?
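For reference, the `OpinionSiteLinear` pattern the quoted guideline mandates can be sketched with a self-contained mock. Everything below is a hypothetical stand-in, not juriscraper's real API: `FakeOpinionSiteLinear`, `ExampleScraper`, and the canned case data are invented to show the shape, where the subclass only fills `self.cases` with plain-string dicts and the base class handles the rest.

```python
# Simplified, self-contained mock of the OpinionSiteLinear pattern.
# The real base class lives in juriscraper; these names only illustrate it.

class FakeOpinionSiteLinear:
    """Hypothetical stand-in for juriscraper's OpinionSiteLinear."""

    def __init__(self):
        self.cases = []  # subclasses append one dict of strings per opinion

    def _process_html(self):
        raise NotImplementedError


class ExampleScraper(FakeOpinionSiteLinear):
    def _process_html(self):
        # A real scraper would walk self.html with XPath; this is canned data.
        self.cases.append({
            "name": "Smith v. Jones",
            "date": "01/15/2024",
            "url": "https://example.com/opinion.pdf",
        })


site = ExampleScraper()
site._process_html()
```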
> - [ ] Complete docstring with CourtID, Court Short Name, Author, Reviewer, History
> - [ ] Clear comments for complex logic or non-obvious code
> - [ ] Type hints for methods in new scrapers
Also missing: any PR should edit CHANGES.md, citing the original issue number, too.
> from juriscraper.lib.string_utils import convert_date_string
>
> # Auto-detect format
> case_dict["date"] = convert_date_string("01/15/2024")
This is automatically done by OpinionSiteLinear.
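To illustrate what "auto-detect format" means here, below is a minimal stdlib-only sketch of format-guessing date parsing. It is not juriscraper's `convert_date_string` (which, as the comment notes, the scraper doesn't need to call itself anyway); `parse_date_guess` and its format list are assumptions for illustration only.

```python
# Illustrative-only mini version of "auto-detect" date parsing: try a few
# common US court-site formats in order and return the first that fits.
from datetime import date, datetime


def parse_date_guess(text: str) -> date:
    """Hypothetical helper: guess the format of a date string."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError(f"Unrecognized date format: {text!r}")
```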
> ## Advanced Topics
>
> ### PDF Content Extraction
> Some scrapers need to extract metadata from downloaded PDFs:
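In juriscraper this kind of post-download metadata extraction is typically done by giving the scraper an `extract_from_text` method that receives the text already extracted from the document. The body below is a hypothetical sketch (the regex and field names are invented, and real implementations vary by court), meant only to show the general shape: plain text in, metadata dict out.

```python
# Hypothetical sketch of extracting metadata from already-extracted PDF text.
# The regex and the "docket" key are illustrative assumptions, not a real
# court's format.
import re


def extract_from_text(scraped_text: str) -> dict:
    """Pull a docket number out of a document's text, if present."""
    metadata = {}
    match = re.search(r"No\.\s*([\d-]+)", scraped_text)
    if match:
        metadata["docket"] = match.group(1)
    return metadata
```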
> self.court_id = self.__module__
>
> # Session persists across requests
> self.request["session"] = requests.Session()
This is already present? See juriscraper/juriscraper/AbstractSite.py, line 64 (commit 8640947).
It may be worth considering some of the findings in this paper.
Related issue: add new Claude development guidelines #1793