Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,5 @@ output/
node_modules/
.DS_Store
*.log
output/
*.db
69 changes: 66 additions & 3 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ This file is the entry point for any AI agent using this repository. Read it bef
3. **User prompts** — non-technical users paste these into their AI to drive the whole migration
4. **A living playbook** — this repo improves itself through AI-contributed discoveries

Currently supports: **Wix** and **Squarespace**. Webflow and Shopify are planned.
Currently supports: **Wix**, **Squarespace**, and **Instagram**. Webflow and Shopify are planned.

## If you're helping a user migrate from Wix

Expand Down Expand Up @@ -136,6 +136,47 @@ Generate a redirect map (old paths → new WP paths) for the user to configure i

---

## If you're helping a user migrate from Instagram

Instagram requires an authenticated browser session — there's no public API or sitemap to crawl.

### Step 1 — Launch browser with CDP

```bash
pkill -9 -f "Google Chrome"
sleep 3
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/.data-liberation/cdp-profile/chrome" \
--restore-last-session
```

### Step 2 — Discover all posts

```bash
node scripts/instagram/discover.js USERNAME --cdp-port 9222
```

Scrolls the profile and intercepts GraphQL responses to build a complete post inventory. Use `--delay 3000` for large profiles.

### Step 3 — Extract content and media

```bash
node scripts/instagram/extract.js USERNAME --cdp-port 9222
```

For each post, navigates to the individual post URL, captures metadata, and downloads full-resolution images. For carousel posts, uses `?img_index=N` to access each slide directly.

### Step 4 — Import to WordPress.com

```bash
node scripts/instagram/import.js --site <wordpress-site> --user <wpcom-user> --token <app-password>
```

Creates published posts with correct dates, featured images, gallery blocks for carousels, and source links back to Instagram.

---

## Using Claude in Chrome MCP

If the user has the Chrome DevTools MCP set up (`npx chrome-devtools-mcp@latest`), you can drive extraction directly from the browser without running scripts:
Expand Down Expand Up @@ -183,6 +224,17 @@ This approach works for any JavaScript-heavy platform, not just Wix.
| Admin UI noise in extracted content | Smart fallback heuristics filter admin shell text, sidebar artifacts |
| Products/commerce | Extract metadata but skip import (WooCommerce out of scope) |

### Instagram

| Problem | Solution |
|---|---|
| No public API or data export | Intercept GraphQL responses via CDP browser session |
| Authentication required | Connect to user's logged-in browser via `--cdp-port` |
| Carousel slides lazy-load | Use `?img_index=N` URL parameter to load each slide directly |
| CDN image URLs expire | Download media immediately during extraction |
| Rate limiting on scroll | Add `--delay 3000` for profiles with 200+ posts |
| Two API response formats | Handle both `edge_owner_to_timeline_media` and `xdt_api__v1__feed` shapes |

---

## How to contribute improvements back
Expand Down Expand Up @@ -224,15 +276,20 @@ data-liberation-agent/
├── package.json
├── prompts/
│ ├── wix.md ← what users paste into their AI for a Wix migration
│ └── squarespace.md ← what users paste into their AI for a Squarespace migration
│ ├── squarespace.md ← what users paste into their AI for a Squarespace migration
│ └── instagram.md ← what users paste into their AI for an Instagram migration
├── scripts/
│ ├── wix/
│ │ ├── discover.js ← inventory the Wix site (sitemap + categorization)
│ │ └── extract.js ← extract all content via network interception
│ ├── squarespace/
│ │ ├── discover.js ← inventory via admin CDP or public JSON API
│ │ ├── extract.js ← extract content via admin API interception + DOM fallback
│ │ └── import.js ← publish to WordPress.com via REST API
│ │ └── import.js ← publish to WordPress.com via XML-RPC
│ ├── instagram/
│ │ ├── discover.js ← inventory an Instagram profile via GraphQL interception
│ │ ├── extract.js ← extract posts, carousel slides, and media
│ │ └── import.js ← publish to WordPress.com via XML-RPC
│ └── import.js ← publish to WordPress.com via REST API (Wix)
├── examples/
│ ├── wix-api-blog-post.json ← example of Wix internal API response
Expand All @@ -257,6 +314,12 @@ data-liberation-agent/
- **Password-protected pages**: Admin extraction may fail without credentials.
- **Content is HTML, not blocks**: Imported as custom HTML. Block conversion is planned but not yet implemented.

### Instagram
- **Stories and Reels**: Not yet supported — different GraphQL queries and video handling required.
- **Comments**: Post comment text is not extracted (only comment counts).
- **Private profiles**: Only works for the user's own profile or profiles they follow.
- **Very large profiles (1000+ posts)**: May hit rate limiting. Use `--delay 4000` or higher.

### General
- Import creates everything as **drafts** — the user must review and publish manually.
- WordPress.com Application Passwords must be created by the user (AI can't do this step).
Expand Down
28 changes: 28 additions & 0 deletions DISCOVERIES.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,34 @@ AI agents: when you contribute an improvement, add an entry here. See [CONTRIBUT

---

## 2026-04-03 — Instagram data extraction via CDP and GraphQL interception

**Found by:** Claude + human contributor (live testing against a real 308-post Instagram profile)
**During:** Building Instagram support for the data-liberation-agent
**Type:** API endpoint | content type | architecture

### What I found

Instagram is a React app that communicates via GraphQL queries to `https://www.instagram.com/graphql/query/`. By connecting to an authenticated browser via CDP and intercepting responses during profile scroll, we capture structured JSON for every post. Key discoveries:

1. **Carousel slide direct access via `?img_index=N`**: Individual carousel slides can be loaded by appending `?img_index=1`, `?img_index=2`, etc. to the post URL. This is significantly more reliable than clicking through carousel arrows in the DOM.

2. **Carousel DOM has 3 `<li>` elements**: Instagram keeps previous, current, and next slides in the DOM simultaneously. Deduplication by Instagram media ID (the numeric prefix in CDN URLs like `/12345_67890.jpg`) is required to avoid capturing the same image from adjacent preloaded slides.

3. **Scroll-based pagination is more reliable than direct GraphQL**: Making direct `fetch()` calls to the GraphQL endpoint triggers rate limiting. Scrolling the profile with 2-3 second delays lets Instagram's own IntersectionObserver trigger pagination naturally.

4. **WordPress.com REST API doesn't support writes with app passwords**: Returns 401 for POST operations. XML-RPC (`wp.uploadFile`, `wp.newPost`) works correctly. The `post_date` must be sent as a `<string>` in `"YYYY-MM-DD HH:MM:SS"` format — WordPress ignores `<dateTime.iso8601>` typed values.

### How it works

Three-step pipeline: discover (scroll + intercept GraphQL) → extract (visit each post, use `?img_index=N` for carousels, download media) → import (XML-RPC `wp.uploadFile` for media, `wp.newPost` with `post_thumbnail` for featured images and gallery blocks for carousels).

### Why it's better than the previous approach

Instagram's built-in data export takes days, provides lower-resolution images, and has no location data. The CDP approach captures everything in real-time at full resolution with complete metadata.

---

## 2026-04-02 — Squarespace admin extraction via CDP

**Found by:** Claude + human contributor (live testing against a Squarespace site)
Expand Down
31 changes: 31 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ This repo gives people a prompt they can paste into any AI assistant (Claude, Ch
|---|---|---|
| **Wix** | Ready | [`prompts/wix.md`](./prompts/wix.md) |
| **Squarespace** | Ready | [`prompts/squarespace.md`](./prompts/squarespace.md) |
| **Instagram** | Ready | [`prompts/instagram.md`](./prompts/instagram.md) |
| Webflow | Planned | — |
| Shopify (blog/pages) | Planned | — |

Expand Down Expand Up @@ -56,6 +57,26 @@ node scripts/squarespace/import.js --site your-wp-site \
--username your-user --token YOUR_APP_PASSWORD
```

## Quick start (Instagram)

```bash
# 1. Install dependencies
npm install

# 2. Launch Chrome with remote debugging (Instagram requires an authenticated session)
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
--remote-debugging-port=9222 --user-data-dir="$HOME/.data-liberation/cdp-profile/chrome"

# 3. Log into Instagram in the browser, then discover all posts
node scripts/instagram/discover.js YOUR_USERNAME --cdp-port 9222

# 4. Extract content and download all media
node scripts/instagram/extract.js YOUR_USERNAME --cdp-port 9222

# 5. Import to WordPress.com
node scripts/instagram/import.js --site your-wp-site --user your-user --token YOUR_APP_PASSWORD
```

Or skip all of that and **paste the prompt into your AI assistant** — it will handle everything.

## For AI agents
Expand Down Expand Up @@ -85,6 +106,16 @@ This means the playbook gets smarter with every migration.
- [ ] Block conversion (`core/paragraph`, `core/image`, etc.)
- [ ] Product/commerce migration

### Instagram
- [x] Profile discovery via GraphQL interception
- [x] Post extraction with full metadata (captions, dates, locations, hashtags)
- [x] Carousel slide extraction via `?img_index=N`
- [x] Gallery block output for carousel posts
- [x] Media download (photos and videos)
- [x] WordPress.com XML-RPC import with featured images
- [ ] Stories and Reels extraction
- [ ] Comment extraction

### General
- [x] WordPress.com REST API import script
- [ ] WordPress Studio local-first workflow
Expand Down
33 changes: 27 additions & 6 deletions cli.js
Original file line number Diff line number Diff line change
Expand Up @@ -267,6 +267,7 @@ function getBrowserUserAgent(browser) {

function detectPlatform(url) {
const lower = url.toLowerCase();
if (lower.includes('instagram.com')) return 'instagram';
if (lower.includes('wix.com') || lower.includes('wixsite.com')) return 'wix';
if (lower.includes('squarespace.com')) return 'squarespace';
if (lower.includes('webflow.io') || lower.includes('webflow.com')) return 'webflow';
Expand Down Expand Up @@ -405,6 +406,7 @@ async function main() {
heading('Login Detection');

const platforms = [
{ name: 'Instagram', domain: '.instagram.com' },
{ name: 'Wix', domain: '.wix.com' },
{ name: 'Squarespace', domain: '.squarespace.com' },
{ name: 'Webflow', domain: '.webflow.com' },
Expand Down Expand Up @@ -455,12 +457,13 @@ async function main() {
ok(`Detected platform: ${BOLD}${detectedPlatform}${RESET}`);
} else {
const platChoice = await askChoice('Which platform is this site on?', [
{ label: 'Instagram', value: 'instagram' },
{ label: 'Wix', value: 'wix' },
{ label: 'Squarespace', value: 'squarespace' },
{ label: 'Webflow', value: 'webflow' },
{ label: 'Other / not sure', value: 'unknown' },
]);
// Use first choice as default if detection fails
detectedPlatform = platChoice.value;
}

const activePlatform = detectedPlatform !== 'unknown' ? detectedPlatform : 'wix';
Expand Down Expand Up @@ -521,13 +524,31 @@ async function main() {
// ── Step 4: Run discovery ──

heading('Step 1: Discovering Site Content');
log(`Scanning ${siteUrl} for all pages, posts, and media...\n`);

mkdirSync('output', { recursive: true });

const uaArgs = userAgent ? ['--user-agent', userAgent] : [];
// Instagram uses the browser's own UA via CDP — user-agent flag is only for Wix/other platforms
const uaArgs = (userAgent && activePlatform !== 'instagram') ? ['--user-agent', userAgent] : [];
const cdpArgs = cdpPort ? ['--cdp-port', String(cdpPort)] : [];
const discoverResult = await runScript(`scripts/${activePlatform}/discover.js`, [siteUrl, ...uaArgs, ...cdpArgs]);

// Instagram uses a username, not a site URL
let discoverTarget = siteUrl;
if (activePlatform === 'instagram') {
// Extract username from URL or use as-is
const igMatch = siteUrl.match(/instagram\.com\/([^/?]+)/);
discoverTarget = igMatch ? igMatch[1] : siteUrl.replace(/^https?:\/\//, '').replace(/\/$/, '');
log(`Discovering posts for @${discoverTarget}...\n`);
if (!cdpPort) {
fail('Instagram requires a CDP connection to an authenticated browser.');
fail('Launch Chrome with: google-chrome --remote-debugging-port=9222');
rl.close();
return;
}
} else {
log(`Scanning ${siteUrl} for all pages, posts, and media...\n`);
}

const discoverResult = await runScript(`scripts/${activePlatform}/discover.js`, [discoverTarget, ...uaArgs, ...cdpArgs]);
if (discoverResult.code !== 0) {
fail('Discovery failed. See output above.');
const retry = await ask('Try again? (y/n)');
Expand Down Expand Up @@ -558,10 +579,10 @@ async function main() {
// ── Step 5: Extract content ──

heading('Step 2: Extracting Content');
log(`Extracting all pages, posts, and media from ${siteUrl}...\n`);
log(`Extracting all content from ${activePlatform === 'instagram' ? '@' + discoverTarget : siteUrl}...\n`);

const extractResult = await runScript(`scripts/${activePlatform}/extract.js`, [
siteUrl,
activePlatform === 'instagram' ? discoverTarget : siteUrl,
'--url-list', 'output/inventory.json',
...uaArgs,
...cdpArgs
Expand Down
77 changes: 77 additions & 0 deletions prompts/instagram.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# Instagram to WordPress.com Migration Prompt

Copy everything below this line and paste it into your AI assistant (Claude, ChatGPT, Gemini, etc.).

---

I want to migrate my Instagram photos and posts to WordPress.com. My Instagram username is: **[PASTE YOUR USERNAME HERE]**

I have (or will create) a WordPress.com account. Please help me migrate using the playbook at https://github.com/Automattic/data-liberation-agent — read AGENTS.md first for full instructions.

**Important**: Instagram requires an authenticated browser session. I'll need to have Chrome (or another Chromium browser) open and logged into Instagram before we start.

Here's what I need you to do:

## Step 1: Set up browser access

Help me launch Chrome with remote debugging enabled so the migration scripts can connect:

1. Quit Chrome completely
2. Relaunch with: `"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --remote-debugging-port=9222 --user-data-dir="$HOME/.data-liberation/cdp-profile/chrome" --restore-last-session`
3. Log into Instagram in the browser window that opens
4. Confirm the connection works

## Step 2: Discover all my posts

```bash
node scripts/instagram/discover.js MY_USERNAME --cdp-port 9222
```

This connects to my browser, navigates to my profile, and intercepts Instagram's internal GraphQL API responses as it scrolls through all my posts. It captures:
- Post metadata (captions, dates, locations, hashtags, tagged users)
- Post types (photos, videos, carousels with slide counts)
- Image and video URLs
- Profile information

Show me the inventory summary and wait for my approval before proceeding.

**If it stalls or gets rate limited**: Add `--delay 3000` for a gentler pace.

## Step 3: Extract full content and download media

```bash
node scripts/instagram/extract.js MY_USERNAME --cdp-port 9222
```

This visits each post individually to get:
- Full-resolution images (not thumbnails)
- All carousel slides (uses `?img_index=N` to access each slide directly)
- Video URLs
- Tagged users, location details, accessibility captions

All media is downloaded locally — Instagram CDN URLs expire, so this must happen promptly after discovery.

## Step 4: Import to WordPress.com

```bash
node scripts/import.js --site my-wp-site.wordpress.com --token MY_APP_PASSWORD
```

This creates WordPress posts from the extracted data:
- Each Instagram post becomes a WordPress post (as draft)
- Images uploaded to the media library
- Captions become post content with @mentions and #hashtags linked
- Original post date preserved
- Instagram shortcode and URL stored as post meta

**For a custom post type** (e.g. a "photo" CPT): add `--post-type photo`

## Step 5: Verify

When done:
- Show me how many posts were imported vs. discovered
- Flag any posts with missing images or import errors
- Check that carousel posts have all their slides
- List the date range covered (oldest → newest)

Work methodically — do one step at a time, show me progress, and wait for my go-ahead before moving to the next step.
Loading