Skip to content

[FEATURE] PgVectorProvider for PostgreSQL with pgvector Extension #28

@sajeerzeji

Description

@sajeerzeji

Is your feature request related to a problem? Please describe.

Currently, toolpack-knowledge only supports two providers:

  • MemoryProvider - In-memory, no persistence, limited to small datasets
  • PersistentKnowledgeProvider - SQLite-backed, single-process, not suitable for distributed systems

For production deployments that require:

  • Multi-instance horizontal scaling
  • Shared knowledge base across multiple agents
  • High-performance vector search at scale
  • Integration with existing PostgreSQL infrastructure

There is no suitable provider. PostgreSQL with the pgvector extension is the most widely deployed production-ready vector database and should be supported.

Describe the solution you'd like

Implement PgVectorProvider that uses PostgreSQL with the pgvector extension for vector storage and retrieval.

Implementation Details

1. Create PgVectorProvider Class

File: packages/toolpack-knowledge/src/providers/pgvector.ts

Implement a provider that:

  • Uses pg (node-postgres) for database connectivity
  • Supports both connection string and individual connection parameters
  • Implements connection pooling for multi-instance deployments
  • Configures pool size, idle timeout, and connection timeout
  • Handles pool errors gracefully

Configuration Options:

  • connectionString - PostgreSQL connection string
  • host, port, database, user, password - Individual connection params
  • ssl - SSL configuration (boolean or object)
  • table - Table name (default: 'knowledge_chunks')
  • schema - Schema name (default: 'public')
  • poolSize - Connection pool size (default: 10)
  • idleTimeoutMs - Idle connection timeout (default: 30000)
  • connectionTimeoutMs - Connection timeout (default: 2000)

2. Database Initialization

Automatic Setup:

  • Check for pgvector extension, throw helpful error if missing
  • Create schema if it doesn't exist (for non-public schemas)
  • Create table with columns: id (TEXT PRIMARY KEY), content (TEXT), metadata (JSONB), vector (vector type), created_at, updated_at
  • Create GIN index on metadata for efficient filtering
  • Create HNSW index on vector column for fast similarity search
  • Create updated_at trigger for automatic timestamp updates

Table Schema:

  • Use JSONB for flexible metadata storage
  • Use pgvector's vector type for embeddings
  • Add timestamps for auditing
  • Support upsert with ON CONFLICT DO UPDATE

3. Core Methods Implementation

validateDimensions():

  • Query existing vectors to check dimensions
  • Compare with embedder dimensions
  • Throw DimensionMismatchError if mismatch detected
  • Create HNSW vector index after dimensions are known
  • Use cosine distance operator for similarity

add():

  • Use transactions for batch operations
  • Support upsert (INSERT ... ON CONFLICT DO UPDATE)
  • Convert vector arrays to pgvector format
  • Serialize metadata as JSONB
  • Handle errors and rollback on failure

query():

  • Use cosine distance operator (<=>) for similarity search
  • Calculate similarity score as 1 - distance
  • Support metadata filtering with WHERE clauses
  • Build dynamic SQL for complex filters ($in, $gt, $lt operators)
  • Order by distance and limit results
  • Filter by threshold score
  • Optionally include metadata and vectors

keywordQuery():

  • Use PostgreSQL full-text search (tsvector/tsquery)
  • Rank results using ts_rank
  • Support same filtering as vector query
  • Enable hybrid search capabilities

delete():

  • Delete by array of IDs using ANY operator
  • Handle empty arrays gracefully

clear():

  • TRUNCATE table for fast deletion
  • Reset stored dimensions

getAllChunks():

  • Fetch all chunks for migration/export
  • Parse pgvector format back to number arrays

close():

  • End connection pool
  • Clean up resources

4. Metadata Filtering

Implement buildFilterConditions() helper to convert MetadataFilter to SQL:

  • Simple equality: metadata->>'key' = value
  • $in operator: metadata->>'key' = ANY($1)
  • $gt operator: (metadata->>'key')::numeric > $1
  • $lt operator: (metadata->>'key')::numeric < $1
  • Combine multiple conditions with AND

5. Export from Index

File: packages/toolpack-knowledge/src/index.ts

Add exports for PgVectorProvider and PgVectorProviderOptions.

6. Add Dependencies

File: packages/toolpack-knowledge/package.json

Add dependencies:

  • pg - PostgreSQL client for Node.js
  • @types/pg - TypeScript types

7. Testing

File: packages/toolpack-knowledge/src/__tests__/pgvector-provider.test.ts

Create comprehensive tests (requires running PostgreSQL instance):

  • Add and retrieve chunks - Basic CRUD operations
  • Metadata filtering - Test equality, $in, $gt, $lt operators
  • Keyword search - Test full-text search functionality
  • Dimension validation - Test mismatch detection
  • Delete operations - Test chunk deletion
  • Upsert behavior - Test updating existing chunks
  • Connection pooling - Test concurrent operations
  • Error handling - Test failure scenarios
  • Schema support - Test non-public schemas

Use environment variable DATABASE_URL for test database connection.
Skip tests if database not available.

8. Documentation

File: packages/toolpack-knowledge/README.md

Add section on PostgreSQL setup:

  • Installation of pgvector extension
  • Connection configuration examples
  • Production deployment recommendations
  • Performance tuning tips
  • Multi-tenancy patterns

9. Migration Guide

File: packages/toolpack-knowledge/docs/pgvector-migration.md

Provide migration paths:

  • From MemoryProvider (simple config change)
  • From PersistentKnowledgeProvider (export/import chunks)
  • Schema migration scripts
  • Performance optimization guide

Acceptance Criteria

  • PgVectorProvider implements all KnowledgeProvider methods
  • Automatic table and index creation
  • pgvector extension check on initialization
  • HNSW index for efficient vector search
  • GIN index for metadata filtering
  • Connection pooling support
  • SSL connection support
  • Schema support for multi-tenancy
  • Metadata filtering with $in, $gt, $lt operators
  • Keyword search using PostgreSQL full-text search
  • Dimension validation
  • Upsert support (ON CONFLICT DO UPDATE)
  • Transaction support for batch operations
  • Comprehensive tests (requires test database)
  • Documentation with setup instructions
  • Migration guide from other providers

Describe alternatives you've considered

  1. Supabase client: Use Supabase's client library - rejected because it's vendor-specific and adds unnecessary abstraction
  2. TypeORM/Prisma: Use an ORM - rejected to avoid heavy dependencies and maintain direct SQL control
  3. Raw SQL only: No abstraction layer - rejected because we need consistent interface with other providers

Additional context

  • PostgreSQL with pgvector is the most widely deployed production vector database
  • HNSW index provides excellent performance for large datasets (millions of vectors)
  • Native PostgreSQL features (JSONB, full-text search) enable powerful filtering
  • Connection pooling is critical for multi-instance deployments
  • Schema support enables multi-tenancy in a single database

Dependencies:

  • pg - PostgreSQL client for Node.js
  • @types/pg - TypeScript types
  • Requires PostgreSQL 12+ with pgvector extension installed

Related Issues:

  • Complements existing MemoryProvider and PersistentKnowledgeProvider
  • Enables production-scale deployments
  • Required for multi-instance agent architectures

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestmedium-priorityMedium priority issuestoolpack-knowledgeIssues related to toolpack-knowledge package

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions