
Add Snowflake Cortex Hacker News connector to AI tutorials#555

Open
kellykohlleffel wants to merge 4 commits into fivetran:main from
kellykohlleffel:feature/all_things_ai/tutorials/snowflake-cortex-hacker-news

Conversation

@kellykohlleffel
Contributor

Summary

  • Adds a new connector example to all_things_ai/tutorials/snowflake-cortex-hacker-news/
  • Syncs top stories from the free Hacker News API and enriches them with AI-powered sentiment analysis and topic classification via Snowflake Cortex REST API during ingestion
  • Demonstrates configurable model selection (Claude, Mistral, Llama), cost-controlled enrichment limits, and incremental sync with batch checkpointing
  • Follows all SDK v2+ patterns: direct op.upsert()/op.checkpoint() calls, validate_configuration(), standard boilerplate comments, and official README template
  • Updates main README.md with new entry in the AI tutorials section

Test plan

  • fivetran debug --configuration configuration.json runs successfully (see results below)
  • fivetran deploy completed and data verified in Snowflake destination
  • black --check --line-length 99 connector.py passes
  • flake8 passes with official repo .flake8 config
  • All credentials scrubbed from configuration.json (angle-bracket placeholders)
  • No requirements.txt file (not needed per oura-ring PR review feedback)
  • README follows official template with all required sections

Test results

SDK Version: 2.25.1230.001
Sync: SUCCEEDED
Records: 5 upserts (5 stories synced, 3 enriched with Cortex using claude-sonnet-4-6, max_enrichments=3)

| Table | Records | Enriched |
|---|---|---|
| stories_enriched | 5 | 3 |
Raw fivetran debug terminal output
```
31-Mar 09:34:12.029 WARNING sdk `requirements.txt` file not found in your project folder.
31-Mar 09:34:13.409 INFO sdk Debugging connector at: /Users/kelly.kohlleffel/Documents/GitHub/fivetran_connector_sdk_personal/examples/quick_start_examples/hacker_news_plus_snowflake_cortex_v2
31-Mar 09:34:13.420 INFO sdk Running connector tester...
31-Mar 09:34:13.780 INFO debugger Version: 2.25.1230.001
31-Mar 09:34:13.790 INFO debugger Destination schema: .../files/warehouse.db/tester
31-Mar 09:34:15.147 INFO debugger Previous state:
{}
Mar 31, 2026 9:34:16 AM com.fivetran.partner_sdk.client.connector.PartnerSdkConnectorClient schema
INFO: Fetching schema from partner
31-Mar 09:34:16.241 INFO sdk Initiating the 'schema' method call...
31-Mar 09:34:16.251 INFO debugger [SchemaChange]: tester.stories_enriched
31-Mar 09:34:16.255 INFO sdk Initiating the 'update' method call...
31-Mar 09:34:16.256 WARNING Example: all_things_ai/tutorials : snowflake-cortex-hacker-news
31-Mar 09:34:16.256 INFO Cortex enrichment ENABLED: model=claude-sonnet-4-6, max_enrichments=3
31-Mar 09:34:16.256 INFO Last synced story ID: 0
31-Mar 09:34:16.256 INFO Fetching top stories list from Hacker News
31-Mar 09:34:16.444 INFO Retrieved 500 story IDs from HN API
31-Mar 09:34:16.444 INFO Syncing 5 new stories (filtered from 500 new)
31-Mar 09:34:16.444 INFO Processing batch 1/1 (5 stories)
31-Mar 09:34:44.538 INFO Checkpointed at story ID 47586614
31-Mar 09:34:44.538 INFO Sync complete: 5 stories synced, 3 enriched with Cortex
31-Mar 09:34:44.591 INFO debugger [CreateTable]: tester.stories_enriched
31-Mar 09:34:44.616 INFO debugger Checkpoint: {"last_synced_id": 47586614}
31-Mar 09:34:44.617 INFO debugger SYNC PROGRESS:
Operation       | Calls
----------------+------------
Upserts         | 5
Updates         | 0
Deletes         | 0
Truncates       | 0
SchemaChanges   | 1
Checkpoints     | 1
31-Mar 09:34:44.617 INFO debugger Sync SUCCEEDED
```

Generated with Claude Code

Adds a new connector example that syncs top stories from the Hacker News API
and enriches them with AI-powered sentiment analysis and topic classification
using the Snowflake Cortex REST API during ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds a new AI tutorial connector example that syncs Hacker News top stories and optionally enriches them with sentiment + topic classification via Snowflake Cortex during ingestion, plus links it from the repo’s main README.

Changes:

  • Added a new tutorial connector (snowflake-cortex-hacker-news) with incremental sync, batch processing, and optional Cortex enrichment.
  • Added tutorial documentation (setup, configuration, cost notes, behavior overview).
  • Updated the root README.md to include the new tutorial entry.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| README.md | Adds a link/description for the new Snowflake Cortex + Hacker News tutorial. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/README.md | New tutorial README describing features, configuration, and behavior. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/connector.py | New SDK v2+ connector implementation: HN fetch + optional Cortex enrichment + batching/checkpointing. |
| all_things_ai/tutorials/snowflake-cortex-hacker-news/configuration.json | New example configuration file for running the tutorial. |

Comment thread: `all_things_ai/tutorials/snowflake-cortex-hacker-news/README.md`
- `cortex_timeout` (optional): Timeout in seconds for Cortex API calls, defaults to 30
- `max_enrichments` (optional): Maximum number of AI enrichments per sync for cost control, defaults to 50

Note: Ensure that the `configuration.json` file is not checked into version control to protect sensitive information.

Copilot AI Apr 1, 2026


The README note says to not check configuration.json into version control, but this example directory includes a committed configuration.json. Consider aligning with the repo README template wording: committed examples use placeholder values, and production repositories should not commit real credentials/config.

Suggested change:

```diff
-Note: Ensure that the `configuration.json` file is not checked into version control to protect sensitive information.
+Note: This example repository includes a `configuration.json` file checked into version control, but it contains only placeholder values and no real credentials. In your own production repositories, do not commit `configuration.json` files containing real secrets or environment-specific configuration; instead, use environment variables or a secrets manager to store sensitive values.
```

Contributor Author


Fixed in commit 075de23f — updated the README note to clarify that the committed configuration.json contains only placeholder values, and that production repositories should not commit real credentials.

Comment on lines +487 to +488

```python
if enrichment.get("cortex_sentiment") is not None:
    enriched_count += 1
```

Copilot AI Apr 1, 2026


enriched_count is used to enforce max_enrichments, but it only increments when cortex_sentiment is not None. If sentiment parsing fails (or only classification succeeds), the connector will continue making Cortex calls beyond the intended limit, defeating cost control. Increment the counter based on enrichment attempts per story (or on success of either enrichment), and ideally count each Cortex call if that’s what you’re trying to cap.

Suggested change:

```diff
-if enrichment.get("cortex_sentiment") is not None:
-    enriched_count += 1
+# Count each enrichment attempt toward the max_enrichments limit,
+# regardless of which specific enrichment fields are present.
+enriched_count += 1
```

Contributor Author


Fixed in commit 075de23f: `enriched_count` now increments on each enrichment attempt regardless of which specific enrichment fields succeed. This ensures the `max_enrichments` cost control limit is enforced based on Cortex API calls made, not just successful sentiment parses.
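As a rough sketch of the attempt-based counting (function and variable names are illustrative, not the connector's exact code):

```python
def enrich_batch(stories, enrich_fn, max_enrichments):
    """Apply enrich_fn to stories until the attempt budget is spent.

    Sketch only: enrich_fn stands in for the Cortex call and may return
    partial results (e.g. a topic but no parsed sentiment).
    """
    enriched_count = 0
    for story in stories:
        if enriched_count < max_enrichments:
            # Count the attempt up front so partial or failed enrichments
            # still consume the max_enrichments cost-control budget.
            enriched_count += 1
            story.update(enrich_fn(story.get("title", "")))
    return enriched_count
```

With this shape, a Cortex call that returns only a topic (sentiment parse failed) still counts against the cap.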

Contributor Author


Also updated the README Data handling section in commit bb7b654b to document that max_enrichments counts each enrichment attempt regardless of which specific fields succeed.

Comment on lines +474 to +500

```python
synced_count = 0
highest_synced_id = state.get("last_synced_id", 0)

for story_id in story_ids:
    story_data = fetch_story(session, story_id)
    if not story_data:
        continue

    # Enrich with Cortex if enabled and within enrichment limit
    title = story_data.get("title", "")
    if is_cortex_enabled and title and enriched_count < max_enrichments:
        enrichment = enrich_story(session, configuration, title)
        story_data.update(enrichment)
        if enrichment.get("cortex_sentiment") is not None:
            enriched_count += 1

    # Flatten nested data structures for Fivetran compatibility
    flattened = flatten_dict(story_data)

    # The 'upsert' operation is used to insert or update data in the destination table.
    # The first argument is the name of the destination table.
    # The second argument is a dictionary containing the record to be upserted.
    op.upsert(table="stories_enriched", data=flattened)

    synced_count += 1
    highest_synced_id = max(highest_synced_id, story_id)
```


Copilot AI Apr 1, 2026


State advancement can skip stories permanently: highest_synced_id only tracks the max ID among successfully upserted stories, and update() uses it as a "cursor" (sid > last_synced_id). If a lower-ID story in the batch fails to fetch/upsert but a higher-ID story succeeds, the state jumps past the failed ID and that story will never be retried. To avoid data loss, process IDs in ascending order and only advance state to the last contiguously processed ID (or don’t advance past failures).
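A toy illustration of that failure mode (the IDs and the simulated failure are made up):

```python
# Two stories newer than the cursor; the lower ID fails to fetch.
story_ids = [47586614, 47586600]  # HN returns IDs roughly newest-first
failed = {47586600}               # simulated fetch failure

highest_synced_id = 0
for sid in story_ids:
    if sid in failed:
        continue  # mirrors `if not story_data: continue`
    highest_synced_id = max(highest_synced_id, sid)

# The cursor is now past the failed story, so it is never retried.
print(highest_synced_id)  # 47586614
```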

Contributor Author


Fixed in commit 075de23f — story IDs are now sorted ascending before processing (sorted([sid for sid in all_story_ids if sid > last_synced_id])). This ensures contiguous state advancement so that if a lower-ID story fails to fetch, the cursor does not skip past it to a higher-ID success.
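The filter-and-sort step looks roughly like this (sample IDs invented for illustration):

```python
state = {"last_synced_id": 47586600}
all_story_ids = [47586620, 47586590, 47586610, 47586605]  # as returned by the HN API

last_synced_id = state.get("last_synced_id", 0)
# Keep only unseen stories and process them lowest-ID first, so the cursor
# advances contiguously instead of jumping to whichever story succeeds first.
new_story_ids = sorted(sid for sid in all_story_ids if sid > last_synced_id)
print(new_story_ids)  # [47586605, 47586610, 47586620]
```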

Contributor Author


Also updated the README Data handling section in commit bb7b654b to document the ascending sort order and contiguous state advancement behavior.

Contributor


@kellykohlleffel The ascending sort only helps if combined with stopping at the first failure — i.e., break out of the loop when fetch_story returns None.
That way highest_synced_id never advances past failing lower story_id, and it gets retried next sync.

Contributor Author


@fivetran-anushkaparashar Great catch — you're absolutely right. Fixed in commit cff5845e.

The fix distinguishes between fetch failures (which now halt the batch) and legitimate skips (which still advance the cursor):

fetch_story() changes:

  • Fetch failures now propagate RuntimeError from fetch_data_with_retry instead of being swallowed
  • Legitimate skips (deleted items, non-story types like comments/jobs/polls) still return None since the cursor may safely advance past them — they will never become valid stories

process_batch() changes:

  • Catches RuntimeError, logs the halt, and returns immediately so highest_synced_id never advances past the failed story_id. The failed story is then retried on the next sync.
  • On None returns (legitimate skips), highest_synced_id is still advanced via max() so the cursor can move past deleted items.

Why distinguish: A simple break on any None would halt the sync on every comment/job/poll item, potentially leaving the connector permanently stuck if the next-up story_id happens to be a non-story type. The distinction ensures the cursor advances through legitimate skips while halting only on actual fetch failures.

Also updated the README Data handling section to document the halt-on-failure semantics.
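The halt-vs-skip behavior described above can be sketched as follows (simplified signatures; the real connector passes a session and configuration, and `items` here is a stand-in for the HN items endpoint):

```python
def fetch_story(story_id, items):
    """Return story data, None for a legitimate skip, or raise on failure."""
    item = items.get(story_id)
    if item is None:
        # Network/API failure: propagate instead of returning None.
        raise RuntimeError(f"fetch failed for story {story_id}")
    if item.get("deleted") or item.get("type") != "story":
        return None  # deleted item or comment/job/poll: safe to skip forever
    return item


def process_batch(story_ids, items, last_synced_id):
    """Advance the cursor contiguously; halt without advancing on failure."""
    highest = last_synced_id
    for sid in sorted(i for i in story_ids if i > last_synced_id):
        try:
            story = fetch_story(sid, items)
        except RuntimeError:
            # Halt the batch: `highest` never advanced past the failed ID,
            # so the story is retried on the next sync.
            return highest
        highest = sid  # advance past both upserted stories and legitimate skips
        if story is None:
            continue  # legitimate skip (deleted / non-story item)
        # op.upsert(...) would go here in the real connector
    return highest
```

With this split, a batch containing a comment item still moves the cursor forward, while a transient fetch error freezes the cursor just before the failing ID.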

`fivetran debug` results (after fix)

Sync: SUCCEEDED
Records: 5 upserts (5 stories, 3 enriched with Cortex)
Checkpoint: `{"last_synced_id": 47607468}`

| Table | Records | Checkpoints |
|---|---|---|
| stories_enriched | 5 | 1 |
Raw fivetran debug terminal output

```
07-Apr 11:33:34.156 INFO ⚡ sdk Debugging connector at: hacker_news_plus_snowflake_cortex_v2
07-Apr 11:33:34.164 INFO ⚡ sdk Running connector tester...
07-Apr 11:33:34.676 INFO ⚡ debugger Version: 2.25.1230.001
07-Apr 11:33:35.746 INFO ⚡ debugger Previous state: {}
07-Apr 11:33:36.834 INFO ⚡ sdk Initiating the 'schema' method call...
07-Apr 11:33:36.843 INFO ⚡ debugger [SchemaChange]: tester.stories_enriched
07-Apr 11:33:36.847 INFO ⚡ sdk Initiating the 'update' method call...
07-Apr 11:33:36.847 WARNING Example: all_things_ai/tutorials : snowflake-cortex-hacker-news
07-Apr 11:33:36.847 INFO Cortex enrichment ENABLED: model=claude-sonnet-4-6, max_enrichments=3
07-Apr 11:33:36.847 INFO Last synced story ID: 0
07-Apr 11:33:36.847 INFO Fetching top stories list from Hacker News
07-Apr 11:33:37.064 INFO Retrieved 500 story IDs from HN API
07-Apr 11:33:37.065 INFO Syncing 5 new stories (filtered from 500 new)
07-Apr 11:33:37.065 INFO Processing batch 1/1 (5 stories)
07-Apr 11:33:56.260 INFO Checkpointed at story ID 47607468
07-Apr 11:33:56.261 INFO Sync complete: 5 stories synced, 3 enriched with Cortex
07-Apr 11:33:56.315 INFO ⚡ debugger [CreateTable]: tester.stories_enriched
07-Apr 11:33:56.340 INFO ⚡ debugger Checkpoint: {"last_synced_id": 47607468}
07-Apr 11:33:56.340 INFO ⚡ debugger SYNC PROGRESS:
Operation | Calls
----------------+------------
Upserts | 5
Updates | 0
Deletes | 0
Truncates | 0
SchemaChanges | 1
Checkpoints | 1
07-Apr 11:33:56.340 INFO ⚡ debugger Sync SUCCEEDED
```

…ctor

- Fix enriched_count to increment on attempt, not only on sentiment success
- Sort story IDs ascending for contiguous state advancement
- Update README configuration note for clarity on placeholder vs production usage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kellykohlleffel
Contributor Author

All Copilot review threads have been replied to individually. Here is the summary:

Changes in commit 075de23f

Accepted (3 items):

  1. enriched_count on attempt — Counter now increments on each enrichment attempt, not only when sentiment parsing succeeds. This ensures max_enrichments cost control is enforced based on Cortex API calls made.
  2. Contiguous state advancement — Story IDs are now sorted ascending before processing. This prevents the cursor from skipping past failed lower-ID stories when a higher-ID story succeeds.
  3. README note wording — Updated the configuration.json note to clarify that the committed file contains only placeholder values, and production repos should not commit real credentials.

Pushed back with precedent (2 items):

  1. configuration.json all-placeholders — The approved snowflake-cortex-livestock-weather-intelligence connector in this same directory uses literal defaults for non-secret values ("true", "60", "5") alongside angle-bracket placeholders for secrets. Our connector follows the same convention. Happy to change if human reviewers prefer otherwise.
  2. README config example — Same reasoning as above.

Test results

SDK Version: 2.25.1230.001
Sync: SUCCEEDED
Records: 5 upserts (5 stories, 3 enriched with Cortex via claude-sonnet-4-6)

| Table | Records |
|---|---|
| stories_enriched | 5 |
Raw fivetran debug terminal output
```
01-Apr 07:28:58.107 WARNING sdk requirements.txt file not found in your project folder.
01-Apr 07:28:59.365 INFO sdk Debugging connector at: /Users/kelly.kohlleffel/Documents/GitHub/fivetran_connector_sdk/all_things_ai/tutorials/snowflake-cortex-hacker-news
01-Apr 07:28:59.374 INFO sdk Running connector tester...
01-Apr 07:29:00.943 INFO debugger Version: 2.25.1230.001
01-Apr 07:29:01.916 INFO debugger Previous state: {}
01-Apr 07:29:02.998 INFO sdk Initiating the 'schema' method call...
01-Apr 07:29:03.007 INFO debugger [SchemaChange]: tester.stories_enriched
01-Apr 07:29:03.011 INFO sdk Initiating the 'update' method call...
01-Apr 07:29:03.011 WARNING Example: all_things_ai/tutorials : snowflake-cortex-hacker-news
01-Apr 07:29:03.011 INFO Cortex enrichment ENABLED: model=claude-sonnet-4-6, max_enrichments=3
01-Apr 07:29:03.011 INFO Last synced story ID: 0
01-Apr 07:29:03.011 INFO Fetching top stories list from Hacker News
01-Apr 07:29:03.181 INFO Retrieved 500 story IDs from HN API
01-Apr 07:29:03.182 INFO Syncing 5 new stories (filtered from 500 new)
01-Apr 07:29:03.182 INFO Processing batch 1/1 (5 stories)
01-Apr 07:29:32.104 INFO Checkpointed at story ID 47540833
01-Apr 07:29:32.104 INFO Sync complete: 5 stories synced, 3 enriched with Cortex
01-Apr 07:29:32.181 INFO debugger [CreateTable]: tester.stories_enriched
01-Apr 07:29:32.209 INFO debugger Checkpoint: {"last_synced_id": 47540833}
01-Apr 07:29:32.210 INFO debugger SYNC PROGRESS:
Operation       | Calls
----------------+------------
Upserts         | 5
Updates         | 0
Deletes         | 0
Truncates       | 0
SchemaChanges   | 1
Checkpoints     | 1
01-Apr 07:29:32.210 INFO debugger Sync SUCCEEDED
```

All review feedback has been addressed. Ready for re-review.

- Document ascending sort of story IDs for contiguous state advancement
- Clarify that max_enrichments counts attempts, not just successes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@fivetran-anushkaparashar left a comment


I have a question regarding `highest_synced_id = max(highest_synced_id, story_id)`. Please check!

…iled lower-ID stories

Address PR fivetran#555 review feedback from @fivetran-anushkaparashar: ascending sort
order alone was not sufficient to guarantee contiguous state advancement. The
loop must also halt on fetch failure so highest_synced_id never advances past
a failed lower story_id.

Changes:
- fetch_story now raises RuntimeError on fetch failure (previously returned
  None for both failures and legitimate skips, which made them indistinguishable)
- fetch_story still returns None for legitimate skips (deleted items, non-story
  item types like comments/jobs/polls) since the cursor may safely advance past
  them - they will never become valid stories
- process_batch now catches RuntimeError, logs the failure, and returns
  immediately without advancing state past the failed story_id
- process_batch advances highest_synced_id on legitimate None skips so the
  cursor can move past deleted items and non-story types
- README Data handling section updated to document the halt-on-failure
  semantics and the distinction between failures and legitimate skips

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kellykohlleffel
Contributor Author

@fivetran-anushkaparashar let me know if my update took care of the concern on `highest_synced_id = max(highest_synced_id, story_id)` - many thanks!

Contributor

@fivetran-anushkaparashar left a comment


LGTM!

@kellykohlleffel
Contributor Author

@fivetran/tech-writers Friendly ping — this PR has been approved by @fivetran-anushkaparashar and is waiting on your review. Happy to address any feedback. Thanks!
