OxidizePdf.NET

.NET bindings for oxidize-pdf - Fast, memory-safe PDF text extraction optimized for RAG/LLM pipelines with intelligent chunking.

Features

🚀 High Performance - Native Rust speed (3,000-4,000 pages/second)
🧠 AI/RAG Optimized - Intelligent text chunking with sentence boundaries
🛡️ Memory Safe - Zero-copy FFI with automatic resource management
🌍 Cross-Platform - Linux, Windows, macOS (x64)
📦 Zero Dependencies - Self-contained native binaries in NuGet package
🔍 Metadata Rich - Page numbers, confidence scores, bounding boxes

Installation

dotnet add package OxidizePdf.NET

Quick Start

Basic Text Extraction

using OxidizePdf.NET;

// Extract all text from PDF
using var extractor = new PdfExtractor();
byte[] pdfBytes = File.ReadAllBytes("document.pdf");

string text = await extractor.ExtractTextAsync(pdfBytes);
Console.WriteLine(text);

AI/RAG Integration with KernelMemory

using OxidizePdf.NET;
using Microsoft.KernelMemory;

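using var extractor = new PdfExtractor();
// Load the PDF to index (the path is illustrative)
byte[] pdfBytes = await File.ReadAllBytesAsync("report.pdf");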
var memory = new KernelMemoryBuilder().Build();

// Extract chunks optimized for embeddings
var chunks = await extractor.ExtractChunksAsync(
    pdfBytes,
    new ChunkOptions
    {
        MaxChunkSize = 512,                // Token limit for embedding model
        Overlap = 50,                      // Context overlap between chunks
        PreserveSentenceBoundaries = true, // No mid-sentence cuts
        IncludeMetadata = true             // Page numbers, confidence scores
    }
);

// Store in vector database
foreach (var chunk in chunks)
{
    await memory.ImportTextAsync(
        text: chunk.Text,
        documentId: $"doc_{chunk.PageNumber}_{chunk.Index}",
        tags: new Dictionary<string, object>
        {
            ["source"] = "SharePoint/Documents/report.pdf",
            ["page"] = chunk.PageNumber,
            ["confidence"] = chunk.Confidence
        }
    );
}

SharePoint Crawler Example

using OxidizePdf.NET;
using Microsoft.Graph;

using var extractor = new PdfExtractor();
var graphClient = new GraphServiceClient(...);

// Crawl SharePoint document library
var driveItems = await graphClient.Sites["root"]
    .Drives["Documents"]
    .Root
    .Children
    .Request()
    .Filter("endsWith(name,'.pdf')")
    .GetAsync();

foreach (var item in driveItems)
{
    await using var stream = await graphClient.Sites["root"]
        .Drives["Documents"]
        .Items[item.Id]
        .Content
        .Request()
        .GetAsync();

    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);

    var chunks = await extractor.ExtractChunksAsync(ms.ToArray());

    // Process chunks for embeddings...
}

Performance

Based on oxidize-pdf v1.6.4 benchmarks:

Text Extraction: 3,000-4,000 pages/second
Chunking: 0.62ms for 100 pages
Memory Overhead: <1MB per document
PDF Parsing: 98.8% success rate on 759 real-world PDFs
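
As a rough planning aid, the quoted throughput range can be turned into a time estimate for a corpus of a given size. The sketch below is plain arithmetic and not part of OxidizePdf.NET; `ThroughputEstimate` and `ForPages` are hypothetical names, and the defaults simply restate the 3,000-4,000 pages/second figure above. Real runs through the .NET bindings will also pay for file I/O and downstream processing, so treat the result as a lower bound.

```csharp
using System;

// Hypothetical helper (not part of the library): converts the benchmark range
// quoted above into best-case and worst-case extraction times.
public static class ThroughputEstimate
{
    public static (TimeSpan Best, TimeSpan Worst) ForPages(
        long pages,
        double minPagesPerSecond = 3000,   // lower end of the quoted range
        double maxPagesPerSecond = 4000)   // upper end of the quoted range
    {
        if (pages < 0) throw new ArgumentOutOfRangeException(nameof(pages));
        if (minPagesPerSecond <= 0 || maxPagesPerSecond < minPagesPerSecond)
            throw new ArgumentOutOfRangeException(nameof(minPagesPerSecond));

        // time = pages / rate; the faster rate gives the best case.
        return (TimeSpan.FromSeconds(pages / maxPagesPerSecond),
                TimeSpan.FromSeconds(pages / minPagesPerSecond));
    }
}
```

For example, `ThroughputEstimate.ForPages(120_000)` yields a best case of 30 seconds and a worst case of 40 seconds of pure extraction time.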

Supported Platforms

Platform	Runtime Identifier	Status
Linux x64	`linux-x64`	✅ Supported
Windows x64	`win-x64`	✅ Supported
macOS x64	`osx-x64`	✅ Supported

Native binaries are automatically included in the NuGet package.
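
Because only the three x64 runtime identifiers above are listed, an application may want to fail early with a clear message on other platforms (for example ARM64) instead of hitting a native library load error later. The following is a minimal sketch using only `System.Runtime.InteropServices`; `PlatformCheck` and its members are hypothetical names, not part of OxidizePdf.NET, and the check does not distinguish glibc from musl-based Linux distributions.

```csharp
#nullable enable
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of the library): reports whether the current
// process matches one of the runtime identifiers in the table above.
public static class PlatformCheck
{
    // Runtime identifiers copied from the Supported Platforms table.
    private static readonly string[] SupportedRids = { "linux-x64", "win-x64", "osx-x64" };

    // Builds a portable RID such as "linux-x64" from an OS and architecture,
    // or returns null when the OS is not Linux, Windows, or macOS.
    public static string? ToPortableRid(OSPlatform os, Architecture arch)
    {
        string? osPart =
            os == OSPlatform.Linux ? "linux" :
            os == OSPlatform.Windows ? "win" :
            os == OSPlatform.OSX ? "osx" : null;
        return osPart is null ? null : $"{osPart}-{arch.ToString().ToLowerInvariant()}";
    }

    // True only for the RIDs listed in the table.
    public static bool IsSupported(string? rid) =>
        rid is not null && Array.IndexOf(SupportedRids, rid) >= 0;

    // Portable RID of the running process, or null on an unrecognized OS.
    public static string? CurrentRid()
    {
        foreach (var os in new[] { OSPlatform.Linux, OSPlatform.Windows, OSPlatform.OSX })
        {
            if (RuntimeInformation.IsOSPlatform(os))
                return ToPortableRid(os, RuntimeInformation.ProcessArchitecture);
        }
        return null;
    }
}
```

A caller could run `PlatformCheck.IsSupported(PlatformCheck.CurrentRid())` at startup and throw a `PlatformNotSupportedException` when it returns false.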

Architecture

native/ - Rust FFI layer (cdylib)
dotnet/ - C# wrapper with P/Invoke
examples/ - Integration examples (KernelMemory, BasicUsage)

See ARCHITECTURE.md for detailed design decisions.

API Reference

PdfExtractor

public class PdfExtractor : IDisposable
{
    // Extract plain text from PDF
    public Task<string> ExtractTextAsync(byte[] pdfBytes);

    // Extract text chunks optimized for RAG/LLM
    public Task<DocumentChunks> ExtractChunksAsync(
        byte[] pdfBytes,
        ChunkOptions options = null
    );

    // Extract metadata (page count, title, author)
    public Task<PdfMetadata> ExtractMetadataAsync(byte[] pdfBytes);
}

ChunkOptions

public class ChunkOptions
{
    public int MaxChunkSize { get; set; } = 512;          // Max tokens per chunk
    public int Overlap { get; set; } = 50;                // Overlap between chunks
    public bool PreserveSentenceBoundaries { get; set; } = true;
    public bool IncludeMetadata { get; set; } = true;
}
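
To make the interaction between `MaxChunkSize`, `Overlap`, and `PreserveSentenceBoundaries` more concrete, here is a small self-contained sketch of sentence-aware chunking with overlap. It is an illustration of the general technique only, not the algorithm oxidize-pdf uses: `ChunkingSketch` is a hypothetical name, tokens are approximated as whitespace-separated words, and sentences are detected with a naive regular expression.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative only: greedy sentence packing with word-level overlap.
// Assumes one "token" is one whitespace-separated word.
public static class ChunkingSketch
{
    public static List<string> Chunk(string text, int maxChunkSize = 512, int overlap = 50)
    {
        if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlap < 0 || overlap >= maxChunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

        // Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
            .Where(s => s.Length > 0)
            .Select(s => s.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries))
            .ToList();

        var chunks = new List<string>();
        var current = new List<string>(); // words of the chunk being built
        int fresh = 0;                    // words in `current` that are not carried-over overlap

        foreach (var sentence in sentences)
        {
            // Close the chunk when the next whole sentence would not fit.
            if (fresh > 0 && current.Count + sentence.Length > maxChunkSize)
            {
                chunks.Add(string.Join(" ", current));
                // Seed the next chunk with the last `overlap` words for context.
                current = current.Skip(Math.Max(0, current.Count - overlap)).ToList();
                fresh = 0;
            }
            // Sentences are never split, so a chunk can exceed maxChunkSize
            // when a single sentence plus the carried overlap is too long.
            current.AddRange(sentence);
            fresh += sentence.Length;
        }

        if (fresh > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

With `maxChunkSize: 6` and `overlap: 2`, the input `"One two three. Four five six. Seven eight nine."` produces two chunks: `"One two three. Four five six."` and `"five six. Seven eight nine."`. The second chunk starts with the two carried-over words, which shows that overlap is what gives neighbouring chunks shared context.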

DocumentChunk

public class DocumentChunk
{
    public int Index { get; set; }             // Chunk index in document
    public int PageNumber { get; set; }        // Source page number
    public string Text { get; set; }           // Chunk text content
    public double Confidence { get; set; }     // Extraction confidence (0.0-1.0)
    public BoundingBox BoundingBox { get; set; } // Optional spatial info
}
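
One possible use of the `Confidence` value is to drop weakly extracted chunks before sending them to an embedding model. The sketch below is an assumption about how a caller might do this, not library functionality: `ChunkStandIn` is a hypothetical record that mirrors the property names documented above so the example compiles without the package, and the 0.8 threshold is an arbitrary illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in mirroring the documented DocumentChunk properties,
// so this sketch compiles without referencing OxidizePdf.NET.
public sealed record ChunkStandIn(int Index, int PageNumber, string Text, double Confidence);

public static class ChunkFilter
{
    // Keeps non-empty chunks at or above the confidence threshold,
    // ordered by page and then by chunk index.
    public static List<ChunkStandIn> KeepConfident(
        IEnumerable<ChunkStandIn> chunks,
        double minConfidence = 0.8) // arbitrary example threshold
    {
        return chunks
            .Where(c => c.Confidence >= minConfidence && !string.IsNullOrWhiteSpace(c.Text))
            .OrderBy(c => c.PageNumber)
            .ThenBy(c => c.Index)
            .ToList();
    }
}
```

With the real `DocumentChunk` type, the same `Where`/`OrderBy` pipeline could be applied directly to the chunks returned by `ExtractChunksAsync`, assuming that result is enumerable as the Quick Start examples suggest.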

Requirements

.NET 8.0+ (tested on .NET 8, 9)
Native Runtime: Automatically included in NuGet package

Note: .NET 6 support was dropped in v0.2.0 because .NET 6 reached end of support in November 2024. Use v0.1.0 if you still require .NET 6 compatibility.

Building from Source

# Clone repository
git clone https://github.com/bzsanti/oxidize-pdf-dotnet.git
cd oxidize-pdf-dotnet

# Build native library
cd native
cargo build --release

# Build .NET wrapper
cd ../dotnet
dotnet build

# Run tests
dotnet test

Examples

See examples/ directory:

BasicUsage/ - Simple text extraction
KernelMemory/ - Full SharePoint crawler with RAG pipeline

License

This project is licensed under the MIT License - see LICENSE file.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Acknowledgments

Built on top of oxidize-pdf by Santiago Fernández Muñoz.
