textscore

A Go package for calculating text similarity scores using multiple metrics. textscore provides flexible text normalization options and supports various similarity algorithms including Levenshtein distance, Jaccard similarity, Dice coefficient, and a hybrid approach.

Installation

go get github.com/network-plane/textscore

Quick Start

package main

import (
    "fmt"
    "github.com/network-plane/textscore"
)

func main() {
    opts := textscore.Options{
        Normalize:     true,
        CaseFold:      true,
        StripPunct:    true,
        CollapseSpace: true,
    }
    
    score := textscore.Similarity("Hello World", "hello world", textscore.MetricJaccard, opts)
    fmt.Printf("Similarity: %.2f\n", score) // Output: Similarity: 1.00
}

Features

Multiple Similarity Metrics: Levenshtein, Jaccard, Dice, and Hybrid
Flexible Normalization: Case folding, punctuation stripping, whitespace collapsing, and company suffix removal
Distance Calculation: Integer Levenshtein distance with optional maximum distance filtering
Ranking: Sort candidates by similarity score
Unicode Support: Full support for Unicode characters and runes

API Reference

Types

`Metric`

Identifies the scoring metric to use.

type Metric string

Available metrics:

MetricLevenshtein: Levenshtein distance-based similarity
MetricJaccard: Jaccard similarity coefficient (token-based)
MetricDice: Dice coefficient (token-based)
MetricHybrid: Weighted combination of multiple metrics

`Options`

Controls normalization and metric-specific behavior.

type Options struct {
    Normalize         bool              // Enable normalization pipeline
    CaseFold          bool              // Convert to lowercase
    StripPunct        bool              // Replace punctuation with spaces
    CollapseSpace     bool              // Collapse multiple spaces to single space
    DropCompanySuffix bool              // Remove company suffixes (Inc., Ltd., etc.)
    MaxDistance       int               // Maximum Levenshtein distance (0 = no limit)
    NgramSize         int               // Reserved for future n-gram support
    Weights           map[Metric]float64 // Weights for hybrid metric
}

Note: Normalization options (CaseFold, StripPunct, CollapseSpace, DropCompanySuffix) only take effect when Normalize is true.

`Scored`

A value associated with a similarity score.

type Scored struct {
    Value string  // The candidate string
    Score float64 // Similarity score in [0, 1]
}

Functions

`Distance`

Returns an integer Levenshtein distance and whether it is within MaxDistance.

func Distance(a, b string, opts Options) (int, bool)

Parameters:

a, b: Strings to compare
opts: Options controlling normalization and distance limits

Returns:

int: Levenshtein distance between the normalized strings
bool: true if distance is within MaxDistance (or MaxDistance is 0), false otherwise

Example:

opts := textscore.Options{
    Normalize:   true,
    CaseFold:    true,
    MaxDistance: 5,
}
dist, within := textscore.Distance("Hello", "Hallo", opts)
fmt.Printf("Distance: %d, Within limit: %v\n", dist, within)

`Similarity`

Returns a similarity score in the range [0, 1], where 1.0 indicates identical strings and 0.0 indicates no similarity.

func Similarity(a, b string, metric Metric, opts Options) float64

Parameters:

a, b: Strings to compare
metric: The similarity metric to use
opts: Options controlling normalization and metric behavior

Returns:

float64: Similarity score in [0, 1]

Example:

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

// Using Jaccard similarity
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricJaccard, opts)

// Using hybrid metric with custom weights
opts.Weights = map[textscore.Metric]float64{
    textscore.MetricJaccard:     0.5,
    textscore.MetricDice:        0.3,
    textscore.MetricLevenshtein: 0.2,
}
score = textscore.Similarity("Hello World", "hello world", textscore.MetricHybrid, opts)

`Rank`

Returns candidates sorted by descending similarity to the query. Candidates with zero similarity are excluded.

func Rank(query string, candidates []string, metric Metric, opts Options) []Scored

Parameters:

query: The query string to match against
candidates: Slice of candidate strings to rank
metric: The similarity metric to use
opts: Options controlling normalization and metric behavior

Returns:

[]Scored: Sorted slice of scored candidates (highest score first). Ties are broken alphabetically.

Example:

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

candidates := []string{
    "Apple Inc.",
    "Microsoft Corporation",
    "Google LLC",
    "Amazon.com Inc.",
}

results := textscore.Rank("apple", candidates, textscore.MetricJaccard, opts)
for _, result := range results {
    fmt.Printf("%s: %.3f\n", result.Value, result.Score)
}
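
The ordering rules described for Rank (highest score first, ties broken alphabetically, zero scores dropped) can be sketched on their own. This is a minimal standalone illustration, not the package's code: the `scored` type and `rankScored` helper are hypothetical names that only mirror the documented behavior, and the scores are made-up inputs.

```go
package main

import (
	"fmt"
	"sort"
)

// scored mirrors the documented Scored type (hypothetical local copy).
type scored struct {
	Value string
	Score float64
}

// rankScored drops zero-score items, then sorts by descending score
// and breaks ties alphabetically by value.
func rankScored(items []scored) []scored {
	out := make([]scored, 0, len(items))
	for _, it := range items {
		if it.Score > 0 {
			out = append(out, it)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Value < out[j].Value
	})
	return out
}

func main() {
	items := []scored{
		{"beta", 0.5}, {"alpha", 0.5}, {"gamma", 0}, {"delta", 1.0},
	}
	for _, r := range rankScored(items) {
		fmt.Printf("%s: %.3f\n", r.Value, r.Score)
	}
}
```

Here "delta" comes first, "alpha" precedes "beta" because their scores tie, and "gamma" is left out.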

Similarity Metrics

Levenshtein Similarity

Based on Levenshtein (edit) distance. Calculates the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

Formula: 1 - (distance / max(len(a), len(b)))

Best for: Detecting typos, character-level differences, short strings

Example:

score := textscore.Similarity("kitten", "sitting", textscore.MetricLevenshtein, opts)
// Score reflects character-level differences
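
To make the formula concrete, here is a minimal standalone sketch of a two-row Levenshtein distance and the similarity derived from it. It is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, lengths are counted in runes, and two empty strings are treated as identical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
// Only two rows of the dynamic-programming table are kept in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// levSimilarity applies 1 - distance / max(len(a), len(b)).
func levSimilarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1 // assumption: two empty strings are identical
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Printf("%d %.3f\n", levenshtein("kitten", "sitting"), levSimilarity("kitten", "sitting"))
	// prints "3 0.571"
}
```

For "kitten" and "sitting" the distance is 3 and the longer string has 7 runes, so the formula gives 1 - 3/7.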

Jaccard Similarity

Token-based similarity using the Jaccard coefficient. Compares sets of tokens (words) between strings.

Formula: |A ∩ B| / |A ∪ B|

Best for: Comparing documents, finding shared words, order-independent matching

Example:

score := textscore.Similarity("apple banana", "banana apple", textscore.MetricJaccard, opts)
// Returns 1.0 (same tokens, different order)
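
The formula can be worked through with a short standalone sketch. It assumes whitespace tokenization and treats two empty inputs as identical; both choices, and the helper names, are assumptions of this example rather than confirmed details of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns its distinct tokens.
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(s) {
		set[tok] = struct{}{}
	}
	return set
}

// jaccard computes |A ∩ B| / |A ∪ B| over the token sets of a and b.
func jaccard(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa) == 0 && len(sb) == 0 {
		return 1 // assumption: two empty inputs count as identical
	}
	inter := 0
	for tok := range sa {
		if _, ok := sb[tok]; ok {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	return float64(inter) / float64(union)
}

func main() {
	fmt.Printf("%.2f\n", jaccard("apple banana", "banana apple")) // 1.00
	fmt.Printf("%.2f\n", jaccard("apple banana", "apple orange")) // 0.33
}
```

The second pair shares one token out of three distinct tokens, hence 1/3.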

Dice Coefficient

Token-based similarity using the Sørensen-Dice coefficient. Similar to Jaccard, but shared tokens are counted twice in the numerator, so the score is never lower than the Jaccard score for the same pair.

Formula: 2 * |A ∩ B| / (|A| + |B|)

Best for: Similar use cases to Jaccard, but slightly more forgiving

Example:

score := textscore.Similarity("apple banana", "apple orange", textscore.MetricDice, opts)
// Returns 0.5 (2 * 1 shared token / (2 + 2) tokens)
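
A minimal standalone sketch of the Dice formula follows. As with any token-based illustration here, whitespace tokenization and the handling of empty inputs are assumptions, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns its distinct tokens.
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(s) {
		set[tok] = struct{}{}
	}
	return set
}

// dice computes 2 * |A ∩ B| / (|A| + |B|) over the token sets of a and b.
func dice(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa)+len(sb) == 0 {
		return 1 // assumption: two empty inputs count as identical
	}
	inter := 0
	for tok := range sa {
		if _, ok := sb[tok]; ok {
			inter++
		}
	}
	return 2 * float64(inter) / float64(len(sa)+len(sb))
}

func main() {
	// Two shared tokens, three tokens on each side: 2*2 / (3+3).
	fmt.Printf("%.2f\n", dice("apple banana cherry", "apple banana date")) // 0.67
}
```

For the same pair, Jaccard gives 2 shared tokens out of 4 distinct ones, or 0.5, which shows how Dice is the more forgiving of the two.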

Hybrid Metric

Weighted combination of multiple metrics. Default weights are:

Jaccard: 0.4
Dice: 0.3
Levenshtein: 0.3

Custom weights can be provided via Options.Weights.

Best for: General-purpose matching when you want to balance multiple approaches

Example:

opts := textscore.Options{
    Normalize: true,
    CaseFold:  true,
    Weights: map[textscore.Metric]float64{
        textscore.MetricJaccard:     0.6,
        textscore.MetricLevenshtein: 0.4,
    },
}
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricHybrid, opts)
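
One plausible way to combine per-metric scores is a weighted average. The sketch below is standalone and hypothetical: dividing by the sum of the weights is an assumption (the source does not say how weights that do not sum to 1 are handled), metric names are plain strings, and the input scores are made up.

```go
package main

import "fmt"

// hybrid returns the weighted average of the given scores.
// Metrics with no score or a non-positive weight are ignored.
// Normalizing by the weight total is an assumption of this sketch.
func hybrid(scores, weights map[string]float64) float64 {
	var sum, total float64
	for name, w := range weights {
		s, ok := scores[name]
		if !ok || w <= 0 {
			continue
		}
		sum += w * s
		total += w
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	// Hypothetical per-metric scores for one pair of strings.
	scores := map[string]float64{"jaccard": 0.5, "dice": 0.8, "levenshtein": 0.9}

	defaults := map[string]float64{"jaccard": 0.4, "dice": 0.3, "levenshtein": 0.3}
	fmt.Printf("%.2f\n", hybrid(scores, defaults)) // 0.71

	custom := map[string]float64{"jaccard": 0.6, "levenshtein": 0.4}
	fmt.Printf("%.2f\n", hybrid(scores, custom)) // 0.66
}
```

With the default weights the result is 0.4*0.5 + 0.3*0.8 + 0.3*0.9 = 0.71; leaving Dice out of the custom weights simply removes it from the average.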

Normalization Options

Normalization is applied in the following order when Normalize is true:

Case Folding (CaseFold): Converts strings to lowercase
Punctuation Stripping (StripPunct): Replaces punctuation and symbols with spaces
Space Collapsing (CollapseSpace): Collapses multiple whitespace characters to single spaces
Company Suffix Removal (DropCompanySuffix): Removes common company suffixes
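
The first three steps can be sketched as a small standalone function. This is an illustration with all three steps switched on, not the package's code: the `normalize` name is hypothetical, and using `strings.Fields` to collapse whitespace also trims leading and trailing spaces, which may or may not match the package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize applies case folding, punctuation stripping, and space
// collapsing, in that order.
func normalize(s string) string {
	// 1. Case folding.
	s = strings.ToLower(s)
	// 2. Replace punctuation and symbols with spaces.
	s = strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return r
	}, s)
	// 3. Collapse runs of whitespace (this also trims both ends).
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", normalize("  Hello,   WORLD!  ")) // "hello world"
}
```

Stripping punctuation before collapsing matters: the spaces that replace punctuation are themselves collapsed in the next step.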

Company Suffixes

The following suffixes are removed when DropCompanySuffix is enabled:

inc, inc.
ltd, ltd.
llc
corp, corp.
corporation
gmbh
s.a., sa
bv
co, co.
company
srl
ag
plc

Example:

opts := textscore.Options{
    Normalize:         true,
    DropCompanySuffix: true,
}
// "Apple Inc." becomes "Apple"
// "Microsoft Corporation" becomes "Microsoft"
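
A standalone sketch of suffix removal, using the list above, might look like the following. Several details are assumptions of this example and not confirmed by the source: only the last token is checked, at most one suffix is removed, matching ignores case and periods, and a string consisting of a single token is left alone.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixes holds the documented suffixes with periods removed.
var suffixes = map[string]bool{
	"inc": true, "ltd": true, "llc": true, "corp": true,
	"corporation": true, "gmbh": true, "sa": true, "bv": true,
	"co": true, "company": true, "srl": true, "ag": true, "plc": true,
}

// dropSuffix removes one trailing company suffix from s, if present.
func dropSuffix(s string) string {
	fields := strings.Fields(s)
	if len(fields) < 2 {
		return s // assumption: never strip the only token
	}
	last := strings.ToLower(strings.ReplaceAll(fields[len(fields)-1], ".", ""))
	if suffixes[last] {
		return strings.Join(fields[:len(fields)-1], " ")
	}
	return s
}

func main() {
	fmt.Println(dropSuffix("Apple Inc."))            // Apple
	fmt.Println(dropSuffix("Microsoft Corporation")) // Microsoft
	fmt.Println(dropSuffix("Acme S.A."))             // Acme
}
```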

Use Cases

Fuzzy String Matching

opts := textscore.Options{
    Normalize:     true,
    CaseFold:      true,
    StripPunct:    true,
    CollapseSpace: true,
}

query := "john smith"
candidates := []string{"John A. Smith", "John Smith", "Jane Doe"}

results := textscore.Rank(query, candidates, textscore.MetricHybrid, opts)
// "John Smith" should rank first (identical to the query after normalization), ahead of "John A. Smith"

Company Name Matching

opts := textscore.Options{
    Normalize:         true,
    CaseFold:          true,
    StripPunct:        true,
    CollapseSpace:     true,
    DropCompanySuffix: true,
}

query := "Apple"
candidates := []string{"Apple Inc.", "Apple Corporation", "Microsoft Corp."}

results := textscore.Rank(query, candidates, textscore.MetricJaccard, opts)
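// "Apple Corporation" and "Apple Inc." both normalize to "apple" and score 1.0;
// "Microsoft Corp." shares no token with the query, so it scores 0 and is excluded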

Typo Detection

opts := textscore.Options{
    Normalize:   true,
    CaseFold:    true,
    MaxDistance: 2, // Only consider strings within 2 edits
}

// "kitten" -> "sitting" takes 3 edits, so within is false for MaxDistance 2
dist, within := textscore.Distance("kitten", "sitting", opts)
if within {
    fmt.Printf("Close match (distance: %d)\n", dist)
}

Performance Considerations

Levenshtein: O(n*m) time complexity where n and m are string lengths. Optimized for memory by using only two rows.
Token-based metrics (Jaccard, Dice): O(n+m) time complexity where n and m are token counts.
Hybrid: Combines multiple metrics, so performance depends on which metrics are included.

For large candidate sets, consider:

Using MaxDistance to filter out poor matches early
Pre-normalizing candidate strings if they're reused
Using token-based metrics for longer strings
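
The first two suggestions can be combined in caller code. The sketch below is a hypothetical pre-filter, not part of the package: it normalizes candidates once up front and skips any pair whose rune lengths differ by more than the limit, since such a pair needs at least that many insertions or deletions. A limit of 0 is treated as "no limit", matching the MaxDistance convention.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// lengthGapExceeds reports whether a and b differ in rune length by more
// than maxDist. Such pairs cannot be within maxDist edits of each other.
func lengthGapExceeds(a, b string, maxDist int) bool {
	gap := utf8.RuneCountInString(a) - utf8.RuneCountInString(b)
	if gap < 0 {
		gap = -gap
	}
	return maxDist > 0 && gap > maxDist
}

func main() {
	query := "kitten"
	candidates := []string{"Sitting", "KITTENS", "a much longer candidate"}

	// Normalize once so repeated queries do not redo the work.
	normalized := make([]string, len(candidates))
	for i, c := range candidates {
		normalized[i] = strings.ToLower(c)
	}

	for i, c := range normalized {
		if lengthGapExceeds(query, c, 2) {
			fmt.Println("skip:", candidates[i])
			continue
		}
		fmt.Println("compare:", candidates[i])
	}
}
```

Only the last candidate is skipped here; the other two would still go through the full distance calculation.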

License

[Specify your license here]

Contributing

[Specify contribution guidelines here]
