Skip to content

IcyDesert/bktree

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bktree

A Go implementation of the Burkhard-Keller Tree (BK-Tree) for fast approximate string matching in metric spaces.

Features

  • Generic distance function — plug in any metric (Levenshtein, Hamming, or your own)
  • Single tree + ForestBKTree for general use; Forest partitions by string length for natural Hamming support and Levenshtein length-based pruning
  • Exists shortcutExists(word, maxDist) stops at the first match, faster than Query when you only need a boolean
  • Sorted resultsQuery returns results ordered by distance (then word)
  • Unicode-aware — Levenshtein distance operates on runes, not raw bytes
  • Zero dependencies — standard library only

Install

go get github.com/IcyDesert/bktree

Quick Start

Single Tree with Levenshtein

package main

import (
    "fmt"
    "github.com/IcyDesert/bktree"
)

func main() {
    tree := bktree.New(bktree.Levenshtein)

    words := []string{"book", "books", "cake", "boo", "cape"}
    for _, w := range words {
        tree.Add(w)
    }

    // Find all words within edit distance 2 of "boak"
    results := tree.Query("boak", 2)
    for _, r := range results {
        fmt.Printf("%s (distance: %d)\n", r.Word, r.Distance)
    }
    // Output:
    // book (distance: 1)
    // boo (distance: 2)
    // books (distance: 2)

    // Just check if anything is close enough
    if tree.Exists("hallo", 1) {
        fmt.Println("found a near-match")
    }
}

Forest with Hamming

Forest groups words by length into separate trees. This makes Hamming distance trivial — only the same-length tree is ever queried.

forest := bktree.NewForest(bktree.Hamming)

words := []string{"000", "001", "010", "111"}
for _, w := range words {
    forest.Add(w)
}

results := forest.Query("001", 1)
for _, r := range results {
    fmt.Printf("%s (hamming: %d)\n", r.Word, r.Distance)
}
// Output:
// 001 (hamming: 0)
// 000 (hamming: 1)

Custom Distance Function

// Jaccard distance on character sets (example)
func jaccard(a, b string) int {
    // ... compute distance ...
    return distance
}

tree := bktree.New(jaccard)
tree.Add("foo")
tree.Add("bar")

results := tree.Query("baz", 2)

Persistence

Trees can be saved to disk with gob encoding. Only the topology is persisted — the distance function is not saved. You must provide the same function when loading.

import "os"

// Build
forest := bktree.NewForest(bktree.Levenshtein)
forest.Add("book")
forest.Add("books")

// Save
f, _ := os.Create(bktree.DefaultFilename("index.gob", bktree.Levenshtein))
// Creates "index_levenshtein.gob"
forest.Save(f)
f.Close()

// Load (elsewhere, later)
f, _ = os.Open("index_levenshtein.gob")
defer f.Close()

loaded, err := bktree.LoadForest(f, bktree.Levenshtein)
if err != nil {
    log.Fatal(err)
}

// Use exactly like before
results := loaded.Query("boak", 2)

BKTree uses the same pattern with Save and Load.

How Forest Pruning Works

For Levenshtein and other reasonable metrics, if dist(a, b) <= d then |len(a) - len(b)| <= d. Forest.Query exploits this to skip entire length-buckets, reducing the search space.

For Hamming, the same logic collapses to checking only the exact-length bucket.

About

A Go implementation of the Burkhard-Keller Tree for fast approximate string matching in metric spaces.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Languages