Add Snowflake Cortex Hacker News connector to AI tutorials#555
Conversation
Adds a new connector example that syncs top stories from the Hacker News API and enriches them with AI-powered sentiment analysis and topic classification using the Snowflake Cortex REST API during ingestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new AI tutorial connector example that syncs Hacker News top stories and optionally enriches them with sentiment + topic classification via Snowflake Cortex during ingestion, plus links it from the repo’s main README.
Changes:
- Added a new tutorial connector (`snowflake-cortex-hacker-news`) with incremental sync, batch processing, and optional Cortex enrichment.
- Added tutorial documentation (setup, configuration, cost notes, behavior overview).
- Updated the root `README.md` to include the new tutorial entry.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| README.md | Adds a link/description for the new Snowflake Cortex + Hacker News tutorial. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/README.md | New tutorial README describing features, configuration, and behavior. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/connector.py | New SDK v2+ connector implementation: HN fetch + optional Cortex enrichment + batching/checkpointing. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/configuration.json | New example configuration file for running the tutorial. |
- `cortex_timeout` (optional): Timeout in seconds for Cortex API calls; defaults to 30.
- `max_enrichments` (optional): Maximum number of AI enrichments per sync, for cost control; defaults to 50.
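As a sketch, the two optional settings above can be read from the connector's configuration dict with their documented defaults. The helper name is hypothetical; SDK configuration values typically arrive as strings, hence the `int()` casts:

```python
def read_cortex_settings(configuration: dict) -> tuple[int, int]:
    """Parse the optional Cortex settings, falling back to the documented defaults."""
    cortex_timeout = int(configuration.get("cortex_timeout", 30))
    max_enrichments = int(configuration.get("max_enrichments", 50))
    return cortex_timeout, max_enrichments
```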
> Note: Ensure that the `configuration.json` file is not checked into version control to protect sensitive information.
The README note says to not check configuration.json into version control, but this example directory includes a committed configuration.json. Consider aligning with the repo README template wording: committed examples use placeholder values, and production repositories should not commit real credentials/config.
Suggested change (replacing the note above):

> Note: This example repository includes a `configuration.json` file checked into version control, but it contains only placeholder values and no real credentials. In your own production repositories, do not commit `configuration.json` files containing real secrets or environment-specific configuration; instead, use environment variables or a secrets manager to store sensitive values.
Fixed in commit 075de23f — updated the README note to clarify that the committed configuration.json contains only placeholder values, and that production repositories should not commit real credentials.
```python
if enrichment.get("cortex_sentiment") is not None:
    enriched_count += 1
```
enriched_count is used to enforce max_enrichments, but it only increments when cortex_sentiment is not None. If sentiment parsing fails (or only classification succeeds), the connector will continue making Cortex calls beyond the intended limit, defeating cost control. Increment the counter based on enrichment attempts per story (or on success of either enrichment), and ideally count each Cortex call if that’s what you’re trying to cap.
Suggested change (replacing the two lines above):

```python
# Count each enrichment attempt toward the max_enrichments limit,
# regardless of which specific enrichment fields are present.
enriched_count += 1
```
Fixed in commit 075de23f — enriched_count now increments on each enrichment attempt regardless of which specific enrichment fields succeed. This ensures the max_enrichments cost control limit is enforced based on Cortex API calls made, not just successful sentiment parses.
Also updated the README Data handling section in commit bb7b654b to document that max_enrichments counts each enrichment attempt regardless of which specific fields succeed.
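As a sketch, the attempt-based cap described in this thread can be expressed as follows. The helper name and signature are hypothetical, not the connector's actual code; the point is that the counter moves on every call, not only on a successful sentiment parse:

```python
def enrich_batch(titles, enrich_fn, max_enrichments):
    """Apply enrich_fn to each title, stopping new calls once
    max_enrichments attempts have been made, successful or not."""
    enriched_count = 0
    results = []
    for title in titles:
        enrichment = {}
        if enriched_count < max_enrichments:
            enrichment = enrich_fn(title)
            # Count the attempt itself, even if cortex_sentiment came back None.
            enriched_count += 1
        results.append(enrichment)
    return results, enriched_count
```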
```python
synced_count = 0
highest_synced_id = state.get("last_synced_id", 0)

for story_id in story_ids:
    story_data = fetch_story(session, story_id)
    if not story_data:
        continue

    # Enrich with Cortex if enabled and within enrichment limit
    title = story_data.get("title", "")
    if is_cortex_enabled and title and enriched_count < max_enrichments:
        enrichment = enrich_story(session, configuration, title)
        story_data.update(enrichment)
        if enrichment.get("cortex_sentiment") is not None:
            enriched_count += 1

    # Flatten nested data structures for Fivetran compatibility
    flattened = flatten_dict(story_data)

    # The 'upsert' operation is used to insert or update data in the destination table.
    # The first argument is the name of the destination table.
    # The second argument is a dictionary containing the record to be upserted.
    op.upsert(table="stories_enriched", data=flattened)

    synced_count += 1
    highest_synced_id = max(highest_synced_id, story_id)
```
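The `flatten_dict` helper called above is not shown in this diff; a minimal sketch of the usual recursive flattening, assuming underscore-joined keys (the connector's actual implementation may differ):

```python
def flatten_dict(data: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dictionaries into a single level, joining keys with sep."""
    items = {}
    for key, value in data.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the accumulated key prefix.
            items.update(flatten_dict(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items
```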
State advancement can skip stories permanently: highest_synced_id only tracks the max ID among successfully upserted stories, and update() uses it as a "cursor" (sid > last_synced_id). If a lower-ID story in the batch fails to fetch/upsert but a higher-ID story succeeds, the state jumps past the failed ID and that story will never be retried. To avoid data loss, process IDs in ascending order and only advance state to the last contiguously processed ID (or don’t advance past failures).
Fixed in commit 075de23f — story IDs are now sorted ascending before processing (`sorted([sid for sid in all_story_ids if sid > last_synced_id])`). This ensures contiguous state advancement so that if a lower-ID story fails to fetch, the cursor does not skip past it to a higher-ID success.
Also updated the README Data handling section in commit bb7b654b to document the ascending sort order and contiguous state advancement behavior.
@kellykohlleffel The ascending sort only helps if combined with stopping at the first failure — i.e., break out of the loop when fetch_story returns None.
That way highest_synced_id never advances past failing lower story_id, and it gets retried next sync.
@fivetran-anushkaparashar Great catch — you're absolutely right. Fixed in commit cff5845e.
The fix distinguishes between fetch failures (which now halt the batch) and legitimate skips (which still advance the cursor):
`fetch_story()` changes:
- Fetch failures now propagate `RuntimeError` from `fetch_data_with_retry` instead of being swallowed.
- Legitimate skips (deleted items and non-story types like comments/jobs/polls) still return `None`, since the cursor may safely advance past them; they will never become valid stories.

`process_batch()` changes:
- Catches `RuntimeError`, logs the halt, and returns immediately so `highest_synced_id` never advances past the failed `story_id`. The failed story is then retried on the next sync.
- On `None` returns (legitimate skips), `highest_synced_id` is still advanced via `max()` so the cursor can move past deleted items.

Why distinguish: a simple `break` on any `None` would halt the sync on every comment/job/poll item, potentially leaving the connector permanently stuck if the next-up `story_id` happens to be a non-story type. The distinction ensures the cursor advances through legitimate skips while halting only on actual fetch failures.
Also updated the README Data handling section to document the halt-on-failure semantics.
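The failure/skip distinction described above can be sketched as follows. The function names come from this thread (`fetch_story`, `process_batch`), but the bodies are an illustrative reconstruction, not the connector's actual code; the HTTP layer is abstracted as a `fetch_fn` parameter standing in for `fetch_data_with_retry`:

```python
def fetch_story(fetch_fn, story_id):
    """Return a story dict, None for a legitimate skip, or raise RuntimeError
    on a fetch failure (propagated from the retry helper)."""
    item = fetch_fn(story_id)  # stands in for fetch_data_with_retry; may raise
    if item is None or item.get("deleted") or item.get("type") != "story":
        # Deleted items and non-story types (comments/jobs/polls) will never
        # become valid stories, so the cursor may safely pass them.
        return None
    return item


def process_batch(fetch_fn, story_ids, highest_synced_id):
    """Advance the cursor contiguously; halt without advancing on a failure."""
    for story_id in sorted(story_ids):  # ascending order is required
        try:
            story = fetch_story(fetch_fn, story_id)
        except RuntimeError:
            # Halt: never advance past a failed lower ID; it is retried next sync.
            return highest_synced_id
        if story is not None:
            pass  # op.upsert(table="stories_enriched", data=flatten_dict(story))
        # Legitimate skips still advance the cursor.
        highest_synced_id = max(highest_synced_id, story_id)
    return highest_synced_id
```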
`fivetran debug` results (after fix)
Sync: SUCCEEDED
Records: 5 upserts (5 stories, 3 enriched with Cortex)
Checkpoint: `{"last_synced_id": 47607468}`
| Table | Records | Checkpoints |
|---|---|---|
| stories_enriched | 5 | 1 |
Raw fivetran debug terminal output
```
07-Apr 11:33:34.156 INFO ⚡ sdk Debugging connector at: hacker_news_plus_snowflake_cortex_v2
07-Apr 11:33:34.164 INFO ⚡ sdk Running connector tester...
07-Apr 11:33:34.676 INFO ⚡ debugger Version: 2.25.1230.001
07-Apr 11:33:35.746 INFO ⚡ debugger Previous state: {}
07-Apr 11:33:36.834 INFO ⚡ sdk Initiating the 'schema' method call...
07-Apr 11:33:36.843 INFO ⚡ debugger [SchemaChange]: tester.stories_enriched
07-Apr 11:33:36.847 INFO ⚡ sdk Initiating the 'update' method call...
07-Apr 11:33:36.847 WARNING Example: all_things_ai/tutorials : snowflake-cortex-hacker-news
07-Apr 11:33:36.847 INFO Cortex enrichment ENABLED: model=claude-sonnet-4-6, max_enrichments=3
07-Apr 11:33:36.847 INFO Last synced story ID: 0
07-Apr 11:33:36.847 INFO Fetching top stories list from Hacker News
07-Apr 11:33:37.064 INFO Retrieved 500 story IDs from HN API
07-Apr 11:33:37.065 INFO Syncing 5 new stories (filtered from 500 new)
07-Apr 11:33:37.065 INFO Processing batch 1/1 (5 stories)
07-Apr 11:33:56.260 INFO Checkpointed at story ID 47607468
07-Apr 11:33:56.261 INFO Sync complete: 5 stories synced, 3 enriched with Cortex
07-Apr 11:33:56.315 INFO ⚡ debugger [CreateTable]: tester.stories_enriched
07-Apr 11:33:56.340 INFO ⚡ debugger Checkpoint: {"last_synced_id": 47607468}
07-Apr 11:33:56.340 INFO ⚡ debugger SYNC PROGRESS:
Operation | Calls
----------------+------------
Upserts | 5
Updates | 0
Deletes | 0
Truncates | 0
SchemaChanges | 1
Checkpoints | 1
07-Apr 11:33:56.340 INFO ⚡ debugger Sync SUCCEEDED
```
…ctor
- Fix enriched_count to increment on attempt, not only on sentiment success
- Sort story IDs ascending for contiguous state advancement
- Update README configuration note for clarity on placeholder vs production usage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All Copilot review threads have been replied to individually. Summary of the resulting test run:
| Table | Records |
|---|---|
| stories_enriched | 5 |
Raw fivetran debug terminal output
```
01-Apr 07:28:58.107 WARNING sdk requirements.txt file not found in your project folder.
01-Apr 07:28:59.365 INFO sdk Debugging connector at: /Users/kelly.kohlleffel/Documents/GitHub/fivetran_connector_sdk/all_things_ai/tutorials/snowflake-cortex-hacker-news
01-Apr 07:28:59.374 INFO sdk Running connector tester...
01-Apr 07:29:00.943 INFO debugger Version: 2.25.1230.001
01-Apr 07:29:01.916 INFO debugger Previous state: {}
01-Apr 07:29:02.998 INFO sdk Initiating the 'schema' method call...
01-Apr 07:29:03.007 INFO debugger [SchemaChange]: tester.stories_enriched
01-Apr 07:29:03.011 INFO sdk Initiating the 'update' method call...
01-Apr 07:29:03.011 WARNING Example: all_things_ai/tutorials : snowflake-cortex-hacker-news
01-Apr 07:29:03.011 INFO Cortex enrichment ENABLED: model=claude-sonnet-4-6, max_enrichments=3
01-Apr 07:29:03.011 INFO Last synced story ID: 0
01-Apr 07:29:03.011 INFO Fetching top stories list from Hacker News
01-Apr 07:29:03.181 INFO Retrieved 500 story IDs from HN API
01-Apr 07:29:03.182 INFO Syncing 5 new stories (filtered from 500 new)
01-Apr 07:29:03.182 INFO Processing batch 1/1 (5 stories)
01-Apr 07:29:32.104 INFO Checkpointed at story ID 47540833
01-Apr 07:29:32.104 INFO Sync complete: 5 stories synced, 3 enriched with Cortex
01-Apr 07:29:32.181 INFO debugger [CreateTable]: tester.stories_enriched
01-Apr 07:29:32.209 INFO debugger Checkpoint: {"last_synced_id": 47540833}
01-Apr 07:29:32.210 INFO debugger SYNC PROGRESS:
Operation       | Calls
----------------+------------
Upserts         | 5
Updates         | 0
Deletes         | 0
Truncates       | 0
SchemaChanges   | 1
Checkpoints     | 1
01-Apr 07:29:32.210 INFO debugger Sync SUCCEEDED
```
All review feedback has been addressed. Ready for re-review.
- Document ascending sort of story IDs for contiguous state advancement
- Clarify that max_enrichments counts attempts, not just successes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fivetran-anushkaparashar
left a comment
I have a question regarding `highest_synced_id = max(highest_synced_id, story_id)`. Please check!
…iled lower-ID stories

Address PR fivetran#555 review feedback from @fivetran-anushkaparashar: ascending sort order alone was not sufficient to guarantee contiguous state advancement. The loop must also halt on fetch failure so highest_synced_id never advances past a failed lower story_id.

Changes:
- fetch_story now raises RuntimeError on fetch failure (previously returned None for both failures and legitimate skips, which made them indistinguishable)
- fetch_story still returns None for legitimate skips (deleted items, non-story item types like comments/jobs/polls) since the cursor may safely advance past them; they will never become valid stories
- process_batch now catches RuntimeError, logs the failure, and returns immediately without advancing state past the failed story_id
- process_batch advances highest_synced_id on legitimate None skips so the cursor can move past deleted items and non-story types
- README Data handling section updated to document the halt-on-failure semantics and the distinction between failures and legitimate skips

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@fivetran-anushkaparashar let me know if my update took care of the concern on `highest_synced_id = max(highest_synced_id, story_id)` - many thanks!
@fivetran/tech-writers Friendly ping — this PR has been approved by @fivetran-anushkaparashar and is waiting on your review. Happy to address any feedback. Thanks!
Summary
- New connector under `all_things_ai/tutorials/snowflake-cortex-hacker-news/` with `op.upsert()`/`op.checkpoint()` calls, `validate_configuration()`, standard boilerplate comments, and the official README template
- Updated root `README.md` with a new entry in the AI tutorials section

Test plan
- `fivetran debug --configuration configuration.json` runs successfully (see results below)
- `fivetran deploy` completed and data verified in Snowflake destination
- `black --check --line-length 99 connector.py` passes
- `flake8` passes with the official repo `.flake8` config
- `configuration.json` uses angle-bracket placeholders
- No `requirements.txt` file (not needed per oura-ring PR review feedback)

Test results
SDK Version: 2.25.1230.001
Sync: SUCCEEDED
Records: 5 upserts (5 stories synced, 3 enriched with Cortex using claude-sonnet-4-6, max_enrichments=3)
Generated with Claude Code