sufyman/website-profile-extractor

website-profile-extractor

Fetch any website, hand it to an LLM with a JSON schema, get back structured profile data. BYO LLM (Gemini, OpenAI, Anthropic — or any function that returns JSON).

Lead enrichment without Clearbit. Profile completion without a vendor.

npm install website-profile-extractor

Why

LLMs are great at extracting structured fields from messy HTML. But every team writes the same boilerplate: fetch, strip tags, build a prompt from a schema, parse fences, merge with existing data, decide which fields to skip.

This package gives you extractProfile({ url, schema, provider }) and provider adapters for the major APIs. The schema is JSON-schema-ish; the result is typed.

Quick start

import { extractProfile } from "website-profile-extractor";
import { geminiProvider } from "website-profile-extractor/providers/gemini";

const provider = geminiProvider({ apiKey: process.env.GEMINI_API_KEY! });

const result = await extractProfile({
  url: "https://acme-kennels.de",
  provider,
  schema: {
    kennel_name: { type: "string", description: "Business name" },
    phone: { type: "string", description: "Primary phone number", strict: true },
    email: { type: "string" },
    address: { type: "string" },
    breeds: { type: "array", items: { type: "string" }, description: "Dog breeds raised" },
    club_memberships: { type: "array", items: { type: "string" } },
  },
});

console.log(result.data);
// {
//   kennel_name: "Acme Kennels",
//   phone: "+49 123 456 789",
//   email: "hi@acme.de",
//   address: "Musterstraße 1, 10115 Berlin",
//   breeds: ["Labrador Retriever", "Golden Retriever"],
//   club_memberships: ["VDH", "DRC"]
// }

Provider adapters

import { geminiProvider } from "website-profile-extractor/providers/gemini";
import { openAIProvider } from "website-profile-extractor/providers/openai";
import { anthropicProvider } from "website-profile-extractor/providers/anthropic";

const a = geminiProvider({ apiKey, model: "gemini-1.5-flash" });
const b = openAIProvider({ apiKey, model: "gpt-4o-mini" });
const c = anthropicProvider({ apiKey, model: "claude-haiku-4-5-20251001" });

A provider is any object with a generate(prompt) method that returns Promise<{ json: string }>. Write your own for local models, Ollama, Vertex, Bedrock, etc.
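As an illustration of a hand-written provider, here is a minimal sketch that targets a local Ollama server. The Provider type is inferred from the one-line description above (treating prompt as a plain string is an assumption), and ollamaProvider, OllamaOptions, and fetchImpl are hypothetical names that are not part of this package. The request uses Ollama's /api/generate endpoint with streaming turned off.

```typescript
// Assumed provider shape, inferred from the README's description.
// Treating `prompt` as a plain string is an assumption.
type Provider = {
  generate(prompt: string): Promise<{ json: string }>;
};

// Minimal fetch signature so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical options for this sketch; not part of the package.
type OllamaOptions = {
  model: string;
  baseUrl?: string;      // defaults to Ollama's usual local address
  fetchImpl?: FetchLike; // defaults to the global fetch
};

function ollamaProvider(opts: OllamaOptions): Provider {
  const baseUrl = opts.baseUrl ?? "http://localhost:11434";
  const doFetch: FetchLike = opts.fetchImpl ?? fetch;
  return {
    async generate(prompt: string) {
      const res = await doFetch(`${baseUrl}/api/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        // stream: false yields a single JSON body; format: "json" requests JSON output.
        body: JSON.stringify({ model: opts.model, prompt, stream: false, format: "json" }),
      });
      if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
      const body = (await res.json()) as { response: string };
      // Ollama returns the generated text in the `response` field.
      return { json: body.response };
    },
  };
}
```

The returned object could then be passed as the provider option of extractProfile, assuming the provider contract is as described above.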

Filling missing fields only

Pass an existing object and the extractor will only fill empty/null/missing keys. This makes it safe to re-run on a record without overwriting human edits.

const result = await extractProfile({
  url,
  schema,
  provider,
  existing: profile,  // already-known fields
});

// result.filled  → keys that were extracted from the page
// result.skipped → keys that were already populated or unknown
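The fill-only-missing behaviour can be pictured with a small merge function. This is a sketch of the idea, not the package's code: fillMissing and isEmpty are hypothetical names, and the definition of "empty" (undefined, null, blank string, empty array) is an assumption that may differ from the real rule.

```typescript
type Profile = Record<string, unknown>;

// Assumption: a value counts as empty if it is undefined, null,
// a blank string, or an empty array. false and 0 count as populated.
function isEmpty(value: unknown): boolean {
  if (value === undefined || value === null) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  return false;
}

// Hypothetical helper: copies extracted values into `existing`
// only where `existing` has nothing and the extraction found something.
function fillMissing(existing: Profile, extracted: Profile, keys: string[]) {
  const data: Profile = { ...existing };
  const filled: string[] = [];
  const skipped: string[] = [];
  for (const key of keys) {
    if (isEmpty(existing[key]) && !isEmpty(extracted[key])) {
      data[key] = extracted[key];
      filled.push(key);
    } else {
      skipped.push(key);
    }
  }
  return { data, filled, skipped };
}
```

In this sketch a key lands in skipped either because it already had a value or because the extraction produced nothing for it, so existing human edits are never overwritten.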

Schema fields

type FieldSchema = {
  type: "string" | "number" | "boolean" | "array" | "object";
  description?: string;        // becomes part of the prompt
  items?: FieldSchema;         // for arrays
  properties?: Record<string, FieldSchema>; // for objects
  enum?: Array<string | number>;
  strict?: boolean;            // refuse to invent values
};
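
To make the role of description, enum, and strict more concrete, here is one way a schema of this shape could be rendered into prompt text. This is only a sketch: describeField and buildPrompt are hypothetical helpers, the wording of the prompt is invented for illustration, and the package's actual prompt may look quite different. FieldSchema is repeated so the snippet runs on its own.

```typescript
type FieldSchema = {
  type: "string" | "number" | "boolean" | "array" | "object";
  description?: string;
  items?: FieldSchema;
  properties?: Record<string, FieldSchema>;
  enum?: Array<string | number>;
  strict?: boolean;
};

// Hypothetical helper: renders one field (and any nested properties) as prompt lines.
function describeField(name: string, field: FieldSchema, indent = ""): string[] {
  const typeLabel =
    field.type === "array" && field.items ? `array of ${field.items.type}` : field.type;
  const notes: string[] = [];
  if (field.description) notes.push(field.description);
  if (field.enum) notes.push(`one of: ${field.enum.join(", ")}`);
  if (field.strict) notes.push("use null unless the value appears on the page");
  const head = `${indent}- ${name} (${typeLabel})`;
  const lines = [notes.length > 0 ? `${head}: ${notes.join("; ")}` : head];
  for (const [child, childSchema] of Object.entries(field.properties ?? {})) {
    lines.push(...describeField(child, childSchema, indent + "  "));
  }
  return lines;
}

// Hypothetical helper: assembles the full prompt from the schema and the page text.
function buildPrompt(schema: Record<string, FieldSchema>, pageText: string): string {
  const fieldLines: string[] = [];
  for (const [name, field] of Object.entries(schema)) {
    fieldLines.push(...describeField(name, field));
  }
  return [
    "Extract the following fields from the page text.",
    "Respond with a single JSON object and nothing else.",
    "",
    ...fieldLines,
    "",
    "Page text:",
    pageText,
  ].join("\n");
}
```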

The package builds a single, well-structured prompt from your schema, asks for JSON, and parses the response, tolerating Markdown json code fences, surrounding prose, etc.
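
For readers writing their own provider, tolerant parsing of a model response might look roughly like the following. parseJsonResponse is a hypothetical name and this is a simplified sketch, not the package's parser: it only handles a top-level JSON object, not a top-level array.

```typescript
// Hypothetical helper: pulls a JSON object out of a chatty LLM response.
function parseJsonResponse(raw: string): unknown {
  // 1. Prefer the contents of a Markdown code fence (with or without a json tag).
  const fence = raw.match(/```(?:json)?\s*([\s\S]*?)```/i);
  const candidate = fence ? fence[1] : raw;

  // 2. Take the span from the first "{" to the last "}" to drop surrounding prose.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end < start) {
    throw new Error("No JSON object found in response");
  }

  // 3. JSON.parse throws a SyntaxError if the span is still not valid JSON.
  return JSON.parse(candidate.slice(start, end + 1));
}
```

Throwing on failure, rather than returning an empty object, keeps a malformed response from being mistaken for a page with no extractable data.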

Options

extractProfile({
  url,                  // page to fetch
  schema,               // see above
  provider,             // LLM
  existing?,            // already-known fields to skip
  instructions?,        // appended to system prompt
  maxHtmlChars?,        // truncate cleaned text (default 60,000)
  userAgent?,           // override UA header
  fetch?,               // inject fetch impl
});
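
To show what "truncate cleaned text" could mean in practice, here is a rough sketch of turning fetched HTML into plain text and capping its length. htmlToText is a hypothetical helper, and the regex-based stripping is an assumption made for brevity; the package may well use a proper HTML parser. Only the 60,000-character default comes from the option list above.

```typescript
// Hypothetical helper: strips markup with regexes and caps the result length.
// Regexes are a simplification; they do not handle every HTML edge case.
function htmlToText(html: string, maxChars = 60000): string {
  const text = html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1\s*>/gi, " ") // drop non-content blocks
    .replace(/<!--[\s\S]*?-->/g, " ") // drop comments
    .replace(/<[^>]+>/g, " ")         // drop remaining tags
    .replace(/&nbsp;/g, " ")          // decode two common entities only
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")             // collapse whitespace
    .trim();
  // Truncate after cleaning so the budget is spent on visible text.
  return text.slice(0, maxChars);
}
```

Truncating after cleaning rather than before means scripts and styles do not eat into the character budget sent to the model.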

License

MIT

