feat(saved-jobs): add saved/bookmarked jobs scraping with pagination and progress#167
Conversation
…hub-actions-1755279694708 Add Claude Code GitHub Workflow
…gure chore: Configure Renovate
…l-sh-setup-uv-7.x chore(deps): update astral-sh/setup-uv action to v7
…hub-actions-1766618312657 Add Claude Code GitHub Workflow
…sh-setup-bun-2.x chore(deps): update oven-sh/setup-bun action to v2
…ns-checkout-6.x chore(deps): update actions/checkout action to v6
…n-3.x chore(deps): update python docker tag to v3.14
Python 3.14 is too new and key dependencies lack support:
- pydantic-core: PyO3 doesn't support Python 3.14 yet
- lxml: no pre-built wheels for Python 3.14

Python 3.13 is still modern and has full ecosystem support.
Add ToolAnnotations to all 6 tools with appropriate hints:
- get_person_profile: readOnly, openWorld (LinkedIn API)
- get_company_profile: readOnly, openWorld (LinkedIn API)
- get_job_details: readOnly, openWorld (LinkedIn API)
- search_jobs: readOnly, openWorld (LinkedIn API)
- get_recommended_jobs: readOnly, openWorld (LinkedIn API)
- close_session: not readOnly, not openWorld (local session mgmt)

Tool annotations help LLM clients understand tool behavior and make better decisions about tool selection and user confirmations.
…aniel#65)

## Summary

Add `ToolAnnotations` to all 6 tools to help LLM clients understand tool behavior and make better decisions about tool selection and user confirmations.

### Changes

- Added annotations to all 6 tools across 4 files:
  - `linkedin_mcp_server/tools/person.py`
  - `linkedin_mcp_server/tools/company.py`
  - `linkedin_mcp_server/tools/job.py`
  - `linkedin_mcp_server/server.py`

### Tool Annotations Added

| Tool | title | readOnlyHint | destructiveHint | openWorldHint |
|------|-------|--------------|-----------------|---------------|
| get_person_profile | Get Person Profile | ✅ | ❌ | ✅ |
| get_company_profile | Get Company Profile | ✅ | ❌ | ✅ |
| get_job_details | Get Job Details | ✅ | ❌ | ✅ |
| search_jobs | Search Jobs | ✅ | ❌ | ✅ |
| get_recommended_jobs | Get Recommended Jobs | ✅ | ❌ | ✅ |
| close_session | Close Session | ❌ | ❌ | ❌ |

### Annotation Rationale

- **readOnlyHint=true**: 5 tools are read-only data retrieval from LinkedIn
- **openWorldHint=true**: 5 tools access the external LinkedIn API
- **close_session**: local session management (not read-only, not external)
- **destructiveHint=false**: no tools delete or destroy any resources

### Why This Matters

Tool annotations are part of the MCP specification and help AI clients:

- Display appropriate confirmation dialogs for destructive operations
- Make better decisions about autonomous tool execution
- Show users accurate information about what tools do

### Testing

- ✅ Python import test passes
- ✅ All 6 tools verified
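For reference, the registration pattern this PR applies looks roughly like the following sketch. It assumes FastMCP's `annotations` parameter on the tool decorator and the `ToolAnnotations` model from the MCP SDK; the tool body and exact signature are illustrative, not the project's source.

```python
# Minimal sketch, assuming FastMCP accepts an `annotations` argument on
# @mcp.tool (the real registrations live in linkedin_mcp_server/tools/).
from fastmcp import FastMCP
from mcp.types import ToolAnnotations

mcp = FastMCP("linkedin-mcp-server")


@mcp.tool(
    annotations=ToolAnnotations(
        title="Get Person Profile",
        readOnlyHint=True,      # retrieves data, never mutates
        destructiveHint=False,  # nothing is deleted or destroyed
        openWorldHint=True,     # talks to the external LinkedIn API
    )
)
async def get_person_profile(linkedin_username: str) -> dict:
    """Fetch a LinkedIn profile (body elided in this sketch)."""
    ...
```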
Replace non-existent main.py with module execution (`-m linkedin_mcp_server`) in VS Code task configurations.

> [!NOTE]
> Align VS Code tasks with the module-based entry point.
>
> - Replace `uv run main.py` with `uv run -m linkedin_mcp_server` across the debug, standard run, and HTTP MCP server tasks
> - Update each task's `label` and `detail` to reflect server execution; preserve flags like `--debug`, `--no-headless`, `--no-lazy-init`, and `--transport streamable-http`
> - Config-only change in `.vscode/tasks.json`
The CLI uses --log-level {DEBUG,INFO,WARNING,ERROR}, not --debug.
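As a hedged illustration of the flag shape this fix refers to (the project's actual CLI wiring may differ), the verbosity switch is a choice-valued `--log-level` rather than a boolean `--debug`:

```python
# Illustrative sketch only: a choice-valued --log-level flag.
import argparse
import logging

parser = argparse.ArgumentParser(prog="linkedin-mcp-server")
parser.add_argument(
    "--log-level",
    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    default="WARNING",
    help="Set the logging verbosity",
)
args = parser.parse_args()
logging.basicConfig(level=getattr(logging, args.log_level))
```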
Upgrade fastmcp from >=2.10.1 to >=2.14.0 to fix the 307 Temporary Redirect issue when using the streamable-http transport. The fix was merged in FastMCP PRs #896 and #998, which changed the default paths to include trailing slashes and removed the automatic path manipulation that caused redirect loops with Starlette's Mount routing. This also upgrades mcp from 1.10.1 to 1.25.0, which includes related fixes confirmed by users in modelcontextprotocol/python-sdk#1168.

Resolves: stickerdaniel#54
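A minimal sketch of the affected setup, assuming FastMCP's `run()` transport option and default host/port names; with >=2.14 the default streamable-http path carries the trailing slash, so clients no longer bounce through the redirect:

```python
# Sketch only: assumes FastMCP >= 2.14 and its run() keyword arguments.
from fastmcp import FastMCP

mcp = FastMCP("linkedin-mcp-server")

if __name__ == "__main__":
    # Post-upgrade, the /mcp/ endpoint is served with a trailing slash,
    # avoiding the 307 Temporary Redirect loop described above.
    mcp.run(transport="streamable-http", host="127.0.0.1", port=8000)
```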
Add fakeredis and docket loggers to noise reduction to prevent DEBUG log pollution from FastMCP's internal task queue.
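The pattern is a one-liner per logger; a sketch (logger names taken from the commit message, exact module paths may differ):

```python
import logging

# Silence DEBUG chatter from FastMCP's internal task-queue dependencies.
for noisy_logger in ("fakeredis", "docket"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```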
…ump_version_to_4.1.0 ci(release): fix workflow blocked by branch protection
…ump_version_to_4.1.1 chore: bump version to 4.1.1
Bump version to 4.1.2 to trigger release workflow test.
…orting

- Fix wait_for_function positional-arg bug (the `arg=` keyword is required)
- Switch pagination from the broken "Next" button to numbered page buttons (`button[aria-label="Page N"]`), which reliably trigger content updates
- Replace arbitrary asyncio.sleep() calls with DOM-based waiting via wait_for_function to detect new job links
- Embed a job IDs summary in the section text so LLMs always surface them
- Add an on_progress callback for per-page progress reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
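A hedged sketch of the first two fixes combined, using the Playwright-style async API that Patchright mirrors (function and variable names here are illustrative, not the project's):

```python
# Illustrative only: click a numbered page button, then wait until the DOM
# shows a job link we haven't seen. Note that `arg=` must be passed as a
# keyword: the positional call was the bug this commit fixes.
from patchright.async_api import Page


async def goto_page(page: Page, page_num: int, prev_ids: list[str]) -> None:
    await page.locator(f'button[aria-label="Page {page_num}"]').click()
    await page.wait_for_function(
        """(prev) => {
            const seen = new Set(prev);
            for (const a of document.querySelectorAll('a[href*="/jobs/view/"]')) {
                const m = a.href.match(/\\/jobs\\/view\\/(\\d+)/);
                if (m && !seen.has(m[1])) return true;  // a new job appeared
            }
            return false;
        }""",
        arg=prev_ids,
        timeout=15_000,  # ms; matches the 15s wait used in this PR
    )
```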
Detect total pages from pagination buttons on the page instead of using max_pages (10), so progress reports reflect reality (1/2, 2/2 instead of 1/10, 2/10). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kups, and add tests

Address review findings: cap total_pages with max_pages to fix misleading progress percentages, add _NAV_DELAY between page clicks for rate-limit safety, convert JS prevIds.includes() to Set.has() for O(1) lookups, guard against division by zero in _report, fix docstring inaccuracies, and add 5 targeted tests covering progress callbacks, graceful stop on timeout, the max_pages cap, and session-expired error handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
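Two of those fixes are small enough to sketch inline (the helper names here are hypothetical, not the actual source):

```python
# Illustrative helpers mirroring the review fixes described above.
def cap_total_pages(detected_pages: int, max_pages: int) -> int:
    """Cap the detected page count so progress never reports past the cap."""
    return min(max(detected_pages, 1), max_pages)


def progress_percent(page_num: int, total_pages: int) -> int:
    """Guard against division by zero when no pagination buttons exist."""
    if total_pages <= 0:
        return 0
    return min(99, round(page_num / total_pages * 100))
```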
Address Greptile review: use a Set for O(1) dedup in _EXTRACT_JOB_IDS_JS, expose the max_pages parameter on the get_saved_jobs MCP tool, and document the new tool in AGENTS.md, README.md, and docs/docker-hub.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
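For context, the Set-based extraction pattern looks roughly like this hypothetical reconstruction (the real `_EXTRACT_JOB_IDS_JS` lives in `linkedin_mcp_server/scraping/extractor.py`):

```python
# Hypothetical reconstruction, not the actual constant from the PR.
_EXTRACT_JOB_IDS_JS = """
() => {
    const ids = new Set();  // Set gives O(1) membership vs Array.includes()
    for (const a of document.querySelectorAll('a[href*="/jobs/view/"]')) {
        const m = a.href.match(/\\/jobs\\/view\\/(\\d+)/);
        if (m) ids.add(m[1]);
    }
    return [...ids];
}
"""
```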
Hey there @stickerdaniel - hope you're doing well. Is there anything I can do to help get this merged in, sir? Thanks in advance. Let me know.
Greptile Summary

This PR adds a `get_saved_jobs` tool that scrapes saved/bookmarked jobs with pagination and progress reporting.

Confidence Score: 5/5

Safe to merge — all previously raised concerns are resolved and no new P0/P1 issues found. All prior review threads (Set deduplication in `_EXTRACT_JOB_IDS_JS`, the exposed `max_pages` parameter, documentation) are resolved. No files require special attention.
| Filename | Overview |
|---|---|
| linkedin_mcp_server/scraping/extractor.py | Adds scrape_saved_jobs method with SPA pagination via button clicks, Set-based deduplication, progress callbacks, and graceful timeout handling. |
| linkedin_mcp_server/tools/job.py | Adds get_saved_jobs MCP tool with max_pages parameter exposed and progress reporting via _report callback; correctly caps progress at 99% until final 100% signal. |
| tests/test_scraping.py | Adds 5 well-structured tests for scrape_saved_jobs covering single-page, multi-page pagination, timeout, max-pages cap, and empty cases. |
| tests/test_tools.py | Adds test_get_saved_jobs and test_get_saved_jobs_error covering the tool-level success path and session-expired error handling. |
Sequence Diagram

```mermaid
sequenceDiagram
participant Tool as get_saved_jobs (tool)
participant Extractor as LinkedInExtractor
participant Page as Patchright Page
participant LI as LinkedIn (jobs-tracker)
Tool->>Extractor: scrape_saved_jobs(max_pages, on_progress)
Extractor->>Page: goto(jobs-tracker/)
Page->>LI: HTTP GET /jobs-tracker/
LI-->>Page: SPA HTML
Extractor->>Page: evaluate(_EXTRACT_JOB_IDS_JS)
Page-->>Extractor: page 1 job IDs
Extractor->>Page: locator('button[aria-label^=Page]').count()
Page-->>Extractor: total_pages
Extractor->>Tool: on_progress(1, total_pages, ...)
loop for each page 2..total_pages while button exists
Extractor->>Page: locator('button[aria-label=Page N]').click()
Extractor->>Page: wait_for_function(new IDs appear, timeout=15s)
Page-->>Extractor: new IDs detected (or TimeoutError - break)
Extractor->>Page: scroll_to_bottom()
Extractor->>Page: evaluate(_EXTRACT_MAIN_TEXT_JS)
Page-->>Extractor: page text
Extractor->>Page: evaluate(_EXTRACT_JOB_IDS_JS)
Page-->>Extractor: all visible IDs (deduped with prev_ids)
Extractor->>Tool: on_progress(page_num, total_pages, ...)
end
Extractor-->>Tool: url, sections, job_ids, pages_visited, sections_requested
Tool->>Tool: ctx.report_progress(100, 100, Complete)
Tool-->>MCP Client: result dict
```
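In code, the loop in the diagram might look like the following sketch. The method and helper names are assumptions drawn from the diagram and review summaries, not the actual extractor source; `_EXTRACT_JOB_IDS_JS` is the extraction snippet sketched earlier, and `_NEW_IDS_APPEARED_JS` is a hypothetical predicate that returns true once an unseen job link exists.

```python
# Illustrative sketch of the flow above, NOT the project's implementation.
_NEW_IDS_APPEARED_JS = "prev => true"  # real predicate elided; see earlier sketch


async def scrape_saved_jobs(page, max_pages: int = 10, on_progress=None) -> dict:
    await page.goto("https://www.linkedin.com/jobs-tracker/")
    all_job_ids = list(await page.evaluate(_EXTRACT_JOB_IDS_JS))
    detected = await page.locator('button[aria-label^="Page"]').count()
    total_pages = min(max(detected, 1), max_pages)
    if on_progress:
        on_progress(1, total_pages, "Scraped page 1")

    pages_visited = 1
    for page_num in range(2, total_pages + 1):
        button = page.locator(f'button[aria-label="Page {page_num}"]')
        if await button.count() == 0:
            break  # pagination ended earlier than the button count suggested
        await button.click()
        try:
            # Wait until at least one job link we haven't recorded appears.
            await page.wait_for_function(
                _NEW_IDS_APPEARED_JS, arg=all_job_ids, timeout=15_000
            )
        except Exception:
            break  # timeout: stop gracefully and return partial results
        seen = set(all_job_ids)  # O(1) duplicate lookups
        new_ids = [i for i in await page.evaluate(_EXTRACT_JOB_IDS_JS) if i not in seen]
        if not new_ids:
            break
        all_job_ids.extend(new_ids)
        pages_visited = page_num
        if on_progress:
            on_progress(page_num, total_pages, f"Scraped page {page_num}")

    return {"job_ids": all_job_ids, "pages_visited": pages_visited}
```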
Prompt To Fix All With AI
This is a comment left during a code review.
Path: linkedin_mcp_server/scraping/extractor.py
Line: 446
Comment:
**Loop upper bound ignores `total_pages`**
`total_pages` is already capped at `max_pages` via `min(…, max_pages)`, so the loop can use `total_pages + 1` as its upper bound instead of `max_pages + 1`. Both produce the same result (the button check stops the loop early either way), but using `total_pages` makes the relationship between the detected page count and the iteration range explicit and avoids iterating past the last real page.
```suggestion
for page_num in range(2, total_pages + 1):
```
How can I resolve this? If you propose a fix, please make it concise.
replaced by #338
Thanks for your work here - useful tool. Appreciate your efforts. I wanted the ability to read out my saved jobs, so I added it. It will handle multiple pages.
Let me know if this is aligned with what you'd like to include, and of any changes you think are needed.
Summary
- `scrape_saved_jobs` added to `LinkedInExtractor` — scrapes the LinkedIn jobs tracker page, extracts job IDs from link hrefs, and paginates through results using numbered page buttons
- `get_saved_jobs` MCP tool with progress reporting via an `on_progress` callback
- Cap `total_pages` with `max_pages` for accurate progress percentages
- `Set` for O(1) job ID deduplication in the DOM polling function

Test plan
- `test_scrape_saved_jobs_single_page` — single page with progress callback
- `test_scrape_saved_jobs_paginates` — multi-page with progress and ID collection
- `test_scrape_saved_jobs_timeout_stops_gracefully` — timeout returns partial results
- `test_scrape_saved_jobs_stops_at_max_pages_despite_more_buttons` — respects the max_pages cap
- `test_scrape_saved_jobs_empty` — empty results
- `test_get_saved_jobs` — tool-level success path
- `test_get_saved_jobs_error` — session-expired error handling

Greptile Summary
Adds a `get_saved_jobs` MCP tool to scrape saved/bookmarked jobs from LinkedIn's job tracker with pagination and progress reporting.

Key Changes:
- Extracts job IDs from link hrefs (`/jobs/view/<id>/`)
- `Set` for O(1) job ID lookups in both JavaScript extraction and Python filtering
- `on_progress` callback with accurate page counts capped by `max_pages`

Implementation Quality:
- `max_pages` parameter (default 10) for user control
- Results returned as both a `job_ids` list and formatted text

Previous Review Items Addressed:
- Set deduplication in `_EXTRACT_JOB_IDS_JS` (lines 389-390)
- `max_pages` parameter exposed in the tool signature (line 75)

Confidence Score: 5/5
Important Files Changed
- `linkedin_mcp_server/scraping/extractor.py` — `scrape_saved_jobs` method with robust pagination logic, Set-based O(1) deduplication, proper error handling, and progress callbacks
- `linkedin_mcp_server/tools/job.py` — `get_saved_jobs` MCP tool with exposed `max_pages` parameter, progress reporting, and consistent error handling
- `tests/test_tools.py` — covers the `get_saved_jobs` success path and error handling

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([Start]) --> Navigate[Navigate to jobs-tracker]
    Navigate --> ExtractPage1[Extract page 1 text and IDs]
    ExtractPage1 --> CountButtons[Count pagination buttons]
    CountButtons --> CalcTotal[Calculate total_pages cap]
    CalcTotal --> ReportP1[Report progress page 1]
    ReportP1 --> CheckMore{More pages?}
    CheckMore -->|Yes| CheckButton{Button exists?}
    CheckButton -->|No| Append[Append ID summary]
    CheckButton -->|Yes| ClickButton[Click page button]
    ClickButton --> WaitDelay[Wait nav delay]
    WaitDelay --> WaitNewIDs{Wait for new IDs}
    WaitNewIDs -->|Timeout| Append
    WaitNewIDs -->|Success| Scroll[Scroll to bottom]
    Scroll --> ExtractText[Extract page text]
    ExtractText --> ExtractIDs[Extract job IDs]
    ExtractIDs --> FilterDups[Filter duplicates]
    FilterDups --> CheckNewIDs{New IDs?}
    CheckNewIDs -->|No| Append
    CheckNewIDs -->|Yes| AddIDs[Add to all_job_ids]
    AddIDs --> ReportProgress[Report progress]
    ReportProgress --> CheckMore
    CheckMore -->|No| Append
    Append --> BuildSections[Build sections dict]
    BuildSections --> Return([Return result])
```

Last reviewed commit: 5e68717
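As a closing usage note, calling the new tool from a client might look like this sketch, assuming the fastmcp `Client` API and a locally running streamable-http server (the URL and port are placeholders):

```python
# Hypothetical client-side call; assumes fastmcp's Client API.
import asyncio

from fastmcp import Client


async def main() -> None:
    # Assumes the server is running with --transport streamable-http.
    async with Client("http://127.0.0.1:8000/mcp/") as client:
        result = await client.call_tool("get_saved_jobs", {"max_pages": 3})
        print(result)


asyncio.run(main())
```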