textscore

A Go package for calculating text similarity scores using multiple metrics. textscore provides flexible text normalization options and supports various similarity algorithms including Levenshtein distance, Jaccard similarity, Dice coefficient, and a hybrid approach.

Installation

go get github.com/network-plane/textscore

Quick Start

package main

import (
    "fmt"
    "github.com/network-plane/textscore"
)

func main() {
    opts := textscore.Options{
        Normalize:     true,
        CaseFold:      true,
        StripPunct:    true,
        CollapseSpace: true,
    }
    
    score := textscore.Similarity("Hello World", "hello world", textscore.MetricJaccard, opts)
    fmt.Printf("Similarity: %.2f\n", score) // Output: Similarity: 1.00
}

Features

  • Multiple Similarity Metrics: Levenshtein, Jaccard, Dice, and Hybrid
  • Flexible Normalization: Case folding, punctuation stripping, whitespace collapsing, and company suffix removal
  • Distance Calculation: Integer Levenshtein distance with optional maximum distance filtering
  • Ranking: Sort candidates by similarity score
  • Unicode Support: Full support for Unicode characters and runes

API Reference

Types

Metric

Identifies the scoring metric to use.

type Metric string

Available metrics:

  • MetricLevenshtein: Levenshtein distance-based similarity
  • MetricJaccard: Jaccard similarity coefficient (token-based)
  • MetricDice: Dice coefficient (token-based)
  • MetricHybrid: Weighted combination of multiple metrics

Options

Controls normalization and metric-specific behavior.

type Options struct {
    Normalize         bool               // Enable normalization pipeline
    CaseFold          bool               // Convert to lowercase
    StripPunct        bool               // Replace punctuation with spaces
    CollapseSpace     bool               // Collapse multiple spaces to single space
    DropCompanySuffix bool               // Remove company suffixes (Inc., Ltd., etc.)
    MaxDistance       int                // Maximum Levenshtein distance (0 = no limit)
    NgramSize         int                // Reserved for future n-gram support
    Weights           map[Metric]float64 // Weights for hybrid metric
}

Note: Normalization options (CaseFold, StripPunct, CollapseSpace, DropCompanySuffix) only take effect when Normalize is true.

Scored

A value associated with a similarity score.

type Scored struct {
    Value string  // The candidate string
    Score float64 // Similarity score in [0, 1]
}

Functions

Distance

Returns an integer Levenshtein distance and whether it is within MaxDistance.

func Distance(a, b string, opts Options) (int, bool)

Parameters:

  • a, b: Strings to compare
  • opts: Options controlling normalization and distance limits

Returns:

  • int: Levenshtein distance between the normalized strings
  • bool: true if distance is within MaxDistance (or MaxDistance is 0), false otherwise

Example:

opts := textscore.Options{
    Normalize:   true,
    CaseFold:    true,
    MaxDistance: 5,
}
dist, within := textscore.Distance("Hello", "Hallo", opts)
fmt.Printf("Distance: %d, Within limit: %v\n", dist, within)
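
The rule for the boolean result can be sketched in isolation. This is a minimal illustration, not the package's code: withinLimit is a hypothetical helper, and treating the limit as inclusive (a distance equal to MaxDistance still counts as within) is an assumption.

```go
package main

import "fmt"

// withinLimit mirrors the documented rule: a maxDistance of 0 means "no limit".
// Hypothetical helper; the inclusive comparison (dist <= maxDistance) is an assumption.
func withinLimit(dist, maxDistance int) bool {
	return maxDistance == 0 || dist <= maxDistance
}

func main() {
	// "Hello" and "Hallo" differ by one substitution, so the distance is 1.
	fmt.Println(withinLimit(1, 5)) // true
	fmt.Println(withinLimit(3, 2)) // false
	fmt.Println(withinLimit(3, 0)) // true: 0 disables the limit
}
```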

Similarity

Returns a similarity score in the range [0, 1], where 1.0 indicates identical strings and 0.0 indicates no similarity.

func Similarity(a, b string, metric Metric, opts Options) float64

Parameters:

  • a, b: Strings to compare
  • metric: The similarity metric to use
  • opts: Options controlling normalization and metric behavior

Returns:

  • float64: Similarity score in [0, 1]

Example:

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

// Using Jaccard similarity
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricJaccard, opts)

// Using hybrid metric with custom weights
opts.Weights = map[textscore.Metric]float64{
    textscore.MetricJaccard:     0.5,
    textscore.MetricDice:        0.3,
    textscore.MetricLevenshtein: 0.2,
}
score = textscore.Similarity("Hello World", "hello world", textscore.MetricHybrid, opts)

Rank

Returns candidates sorted by descending similarity to the query. Candidates with zero similarity are excluded.

func Rank(query string, candidates []string, metric Metric, opts Options) []Scored

Parameters:

  • query: The query string to match against
  • candidates: Slice of candidate strings to rank
  • metric: The similarity metric to use
  • opts: Options controlling normalization and metric behavior

Returns:

  • []Scored: Sorted slice of scored candidates (highest score first). Ties are broken alphabetically.

Example:

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

candidates := []string{
    "Apple Inc.",
    "Microsoft Corporation",
    "Google LLC",
    "Amazon.com Inc.",
}

results := textscore.Rank("apple", candidates, textscore.MetricJaccard, opts)
for _, result := range results {
    fmt.Printf("%s: %.3f\n", result.Value, result.Score)
}
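
The ordering rules described above (descending score, alphabetical tie-break, zero scores dropped) can be sketched with the standard library. This is an illustration under assumptions, not the package's implementation: scoredItem and rankScored are hypothetical names, and comparing Value byte-wise for the tie-break is a guess at what "alphabetically" means.

```go
package main

import (
	"fmt"
	"sort"
)

// scoredItem is a local stand-in for the package's Scored type.
type scoredItem struct {
	Value string
	Score float64
}

// rankScored drops zero-score items, then sorts by descending score.
// Ties are broken by ascending Value (plain string comparison, an assumption).
func rankScored(items []scoredItem) []scoredItem {
	kept := make([]scoredItem, 0, len(items))
	for _, it := range items {
		if it.Score > 0 {
			kept = append(kept, it)
		}
	}
	sort.Slice(kept, func(i, j int) bool {
		if kept[i].Score != kept[j].Score {
			return kept[i].Score > kept[j].Score
		}
		return kept[i].Value < kept[j].Value
	})
	return kept
}

func main() {
	items := []scoredItem{{"b", 0.5}, {"a", 0.5}, {"c", 0}, {"d", 0.9}}
	for _, it := range rankScored(items) {
		fmt.Printf("%s: %.1f\n", it.Value, it.Score)
	}
}
```

Here "d" comes first, "a" and "b" follow in alphabetical order because their scores tie, and "c" is dropped.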

Similarity Metrics

Levenshtein Similarity

Based on Levenshtein (edit) distance. Calculates the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

Formula: 1 - (distance / max(len(a), len(b)))

Best for: Detecting typos, character-level differences, short strings

Example:

score := textscore.Similarity("kitten", "sitting", textscore.MetricLevenshtein, opts)
// By the formula above: distance 3, longer length 7, so 1 - 3/7 ≈ 0.57
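
To make the formula concrete, here is a standalone sketch of a two-row Levenshtein distance and the similarity derived from it. It is not the package's source: levenshtein, levenshteinSimilarity, and min3 are hypothetical names, lengths are counted in runes, and returning 1.0 for two empty strings is an assumption.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
// Only two rows of the dynamic-programming table are kept in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // cost of deleting the first i runes of a
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// levenshteinSimilarity applies 1 - distance / max(len(a), len(b)).
// Two empty strings are treated as identical (an assumption).
func levenshteinSimilarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1.0
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))                     // 3
	fmt.Printf("%.3f\n", levenshteinSimilarity("kitten", "sitting")) // 0.571
}
```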

Jaccard Similarity

Token-based similarity using the Jaccard coefficient. Compares sets of tokens (words) between strings.

Formula: |A ∩ B| / |A ∪ B|

Best for: Comparing documents, finding shared words, order-independent matching

Example:

score := textscore.Similarity("apple banana", "banana apple", textscore.MetricJaccard, opts)
// Returns 1.0 (same tokens, different order)
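
The set arithmetic behind the formula can be sketched as follows. This is an illustration only: tokenSet and jaccard are hypothetical helpers, tokens are split on whitespace with strings.Fields, and returning 0 when both inputs have no tokens is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns the set of distinct tokens.
func tokenSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, tok := range strings.Fields(s) {
		set[tok] = true
	}
	return set
}

// jaccard computes |A ∩ B| / |A ∪ B| over whitespace-separated tokens.
// An empty union yields 0 (an assumption about the edge case).
func jaccard(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	inter := 0
	for tok := range sa {
		if sb[tok] {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	fmt.Printf("%.2f\n", jaccard("apple banana", "banana apple")) // 1.00
	fmt.Printf("%.2f\n", jaccard("apple banana", "apple orange")) // 0.33
}
```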

Dice Coefficient

Token-based similarity using the Sørensen-Dice coefficient. Similar to Jaccard but gives more weight to common tokens.

Formula: 2 * |A ∩ B| / (|A| + |B|)

Best for: The same use cases as Jaccard, but more forgiving: for any pair of strings, the Dice score is at least as high as the Jaccard score

Example:

score := textscore.Similarity("apple banana", "apple orange", textscore.MetricDice, opts)
// By the formula above: 1 shared token, 2 tokens per side, so 2 * 1 / (2 + 2) = 0.5
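
A standalone sketch of the Dice formula over token sets is shown below. It is not the package's code: tokenSet and dice are hypothetical helpers, tokens are split on whitespace, and returning 0 when both inputs have no tokens is an assumption. For "apple banana" and "apple orange" the formula gives 2 * 1 / (2 + 2) = 0.5.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns the set of distinct tokens.
func tokenSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, tok := range strings.Fields(s) {
		set[tok] = true
	}
	return set
}

// dice computes 2 * |A ∩ B| / (|A| + |B|) over whitespace-separated tokens.
// Two empty token sets yield 0 (an assumption about the edge case).
func dice(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa)+len(sb) == 0 {
		return 0
	}
	inter := 0
	for tok := range sa {
		if sb[tok] {
			inter++
		}
	}
	return 2 * float64(inter) / float64(len(sa)+len(sb))
}

func main() {
	fmt.Printf("%.2f\n", dice("apple banana", "apple orange")) // 0.50
	fmt.Printf("%.2f\n", dice("apple banana", "banana apple")) // 1.00
}
```

Dice and Jaccard are related by D = 2J / (1 + J), which is why the same pair scores 1/3 under Jaccard and 1/2 under Dice.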

Hybrid Metric

Weighted combination of multiple metrics. Default weights are:

  • Jaccard: 0.4
  • Dice: 0.3
  • Levenshtein: 0.3

Custom weights can be provided via Options.Weights.

Best for: General-purpose matching when you want to balance multiple approaches

Example:

opts := textscore.Options{
    Normalize: true,
    CaseFold:  true,
    Weights: map[textscore.Metric]float64{
        textscore.MetricJaccard:     0.6,
        textscore.MetricLevenshtein: 0.4,
    },
}
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricHybrid, opts)
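
One plausible way to combine per-metric scores is a weighted average, sketched below. How the package actually combines them is not stated here, so everything in this block is an assumption: the metric type and its string values, the combine helper, skipping non-positive weights, and dividing by the sum of the weights so the result stays in [0, 1] even when the weights do not add up to 1.

```go
package main

import "fmt"

// metric is a local stand-in for the package's Metric type.
// The string values below are placeholders, not the package's constants.
type metric string

const (
	metricLevenshtein metric = "levenshtein"
	metricJaccard     metric = "jaccard"
	metricDice        metric = "dice"
)

// combine returns the weighted average of the per-metric scores.
// Metrics with non-positive weight are skipped; a zero weight sum yields 0.
func combine(scores, weights map[metric]float64) float64 {
	var total, weightSum float64
	for m, w := range weights {
		if w <= 0 {
			continue
		}
		total += w * scores[m]
		weightSum += w
	}
	if weightSum == 0 {
		return 0
	}
	return total / weightSum
}

func main() {
	// The default weights listed above.
	weights := map[metric]float64{
		metricJaccard:     0.4,
		metricDice:        0.3,
		metricLevenshtein: 0.3,
	}
	// Example per-metric scores, chosen for illustration.
	scores := map[metric]float64{
		metricJaccard:     1.0,
		metricDice:        1.0,
		metricLevenshtein: 0.5,
	}
	fmt.Printf("%.2f\n", combine(scores, weights)) // 0.85
}
```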

Normalization Options

Normalization is applied in the following order when Normalize is true:

  1. Case Folding (CaseFold): Converts strings to lowercase
  2. Punctuation Stripping (StripPunct): Replaces punctuation and symbols with spaces
  3. Space Collapsing (CollapseSpace): Collapses multiple whitespace characters to single spaces
  4. Company Suffix Removal (DropCompanySuffix): Removes common company suffixes
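
The first three steps can be sketched with the standard library, in the order listed above. This is an illustration rather than the package's code: normOptions and normalize are hypothetical names, and classifying "punctuation and symbols" with unicode.IsPunct and unicode.IsSymbol is an assumption. Suffix removal is left out of this sketch.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normOptions is a local stand-in for the relevant fields of Options.
type normOptions struct {
	CaseFold, StripPunct, CollapseSpace bool
}

// normalize applies case folding, punctuation stripping, and space
// collapsing, in that order.
func normalize(s string, o normOptions) string {
	if o.CaseFold {
		s = strings.ToLower(s)
	}
	if o.StripPunct {
		// Replace each punctuation or symbol rune with a space.
		s = strings.Map(func(r rune) rune {
			if unicode.IsPunct(r) || unicode.IsSymbol(r) {
				return ' '
			}
			return r
		}, s)
	}
	if o.CollapseSpace {
		// Fields splits on runs of whitespace, which also trims both ends.
		s = strings.Join(strings.Fields(s), " ")
	}
	return s
}

func main() {
	all := normOptions{CaseFold: true, StripPunct: true, CollapseSpace: true}
	fmt.Printf("%q\n", normalize("Hello,   World!", all)) // "hello world"
	fmt.Printf("%q\n", normalize("Amazon.com Inc.", all)) // "amazon com inc"
}
```

Stripping punctuation before collapsing spaces matters: the spaces left behind by removed punctuation are cleaned up by the final step.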

Company Suffixes

The following suffixes are removed when DropCompanySuffix is enabled:

  • inc, inc.
  • ltd, ltd.
  • llc
  • corp, corp.
  • corporation
  • gmbh
  • s.a., sa
  • bv
  • co, co.
  • company
  • srl
  • ag
  • plc

Example:

opts := textscore.Options{
    Normalize:         true,
    DropCompanySuffix: true,
}
// "Apple Inc." becomes "Apple"
// "Microsoft Corporation" becomes "Microsoft"
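
A suffix filter along these lines can be sketched as follows. The details are assumptions, not the package's behavior: dropCompanySuffix and companySuffixes are hypothetical names, only the final whitespace-separated token is examined, it is compared case-insensitively, at most one suffix is removed, and a single-token name is left alone.

```go
package main

import (
	"fmt"
	"strings"
)

// companySuffixes holds the suffixes listed above, in lowercase.
var companySuffixes = map[string]bool{
	"inc": true, "inc.": true, "ltd": true, "ltd.": true, "llc": true,
	"corp": true, "corp.": true, "corporation": true, "gmbh": true,
	"s.a.": true, "sa": true, "bv": true, "co": true, "co.": true,
	"company": true, "srl": true, "ag": true, "plc": true,
}

// dropCompanySuffix removes the last token if it is a known suffix.
// Rejoining the tokens also collapses runs of whitespace as a side effect.
func dropCompanySuffix(s string) string {
	tokens := strings.Fields(s)
	if len(tokens) < 2 {
		return s // nothing to strip from an empty or single-token name
	}
	last := strings.ToLower(tokens[len(tokens)-1])
	if companySuffixes[last] {
		tokens = tokens[:len(tokens)-1]
	}
	return strings.Join(tokens, " ")
}

func main() {
	fmt.Println(dropCompanySuffix("Apple Inc."))            // Apple
	fmt.Println(dropCompanySuffix("Microsoft Corporation")) // Microsoft
	fmt.Println(dropCompanySuffix("Company"))               // Company
}
```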

Use Cases

Fuzzy String Matching

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

query := "john smith"
candidates := []string{"John A. Smith", "John Smith", "Jane Doe"}

results := textscore.Rank(query, candidates, textscore.MetricHybrid, opts)
// "John Smith" normalizes to exactly the query, so it should rank first, ahead of "John A. Smith"

Company Name Matching

opts := textscore.Options{
    Normalize:         true,
    CaseFold:          true,
    StripPunct:        true,
    CollapseSpace:     true,
    DropCompanySuffix: true,
}

query := "Apple"
candidates := []string{"Apple Inc.", "Apple Corporation", "Microsoft Corp."}

results := textscore.Rank(query, candidates, textscore.MetricJaccard, opts)
// "Apple Inc." and "Apple Corporation" will rank highest

Typo Detection

opts := textscore.Options{
    Normalize:   true,
    CaseFold:    true,
    MaxDistance: 2, // Only consider strings within 2 edits
}

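dist, within := textscore.Distance("kitten", "sitting", opts)
// "kitten" -> "sitting" takes 3 edits, which exceeds MaxDistance, so within should be false here
if within {
    fmt.Printf("Close match (distance: %d)\n", dist)
} else {
    fmt.Printf("No match within %d edits\n", opts.MaxDistance)
}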

Performance Considerations

  • Levenshtein: O(n*m) time complexity where n and m are string lengths. Optimized for memory by using only two rows.
  • Token-based metrics (Jaccard, Dice): O(n+m) time complexity where n and m are token counts.
  • Hybrid: Cost is the sum of the costs of the metrics it evaluates, so the O(n*m) Levenshtein term dominates on long strings when Levenshtein is included.

For large candidate sets, consider:

  • Using MaxDistance to filter out poor matches early
  • Pre-normalizing candidate strings if they're reused
  • Using token-based metrics for longer strings

License

[Specify your license here]

Contributing

[Specify contribution guidelines here]
