A Go package for calculating text similarity scores using multiple metrics. textscore provides flexible text normalization options and supports various similarity algorithms including Levenshtein distance, Jaccard similarity, Dice coefficient, and a hybrid approach.
go get github.com/network-plane/textscore

package main
import (
"fmt"
"github.com/network-plane/textscore"
)
func main() {
opts := textscore.Options{
Normalize: true,
CaseFold: true,
StripPunct: true,
CollapseSpace: true,
}
score := textscore.Similarity("Hello World", "hello world", textscore.MetricJaccard, opts)
fmt.Printf("Similarity: %.2f\n", score) // Output: Similarity: 1.00
}

- Multiple Similarity Metrics: Levenshtein, Jaccard, Dice, and Hybrid
- Flexible Normalization: Case folding, punctuation stripping, whitespace collapsing, and company suffix removal
- Distance Calculation: Integer Levenshtein distance with optional maximum distance filtering
- Ranking: Sort candidates by similarity score
- Unicode Support: Full support for Unicode characters and runes
Metric identifies the scoring metric to use.
type Metric string

Available metrics:
- MetricLevenshtein: Levenshtein distance-based similarity
- MetricJaccard: Jaccard similarity coefficient (token-based)
- MetricDice: Dice coefficient (token-based)
- MetricHybrid: Weighted combination of multiple metrics
Options controls normalization and metric-specific behavior.
type Options struct {
Normalize bool // Enable normalization pipeline
CaseFold bool // Convert to lowercase
StripPunct bool // Replace punctuation with spaces
CollapseSpace bool // Collapse multiple spaces to single space
DropCompanySuffix bool // Remove company suffixes (Inc., Ltd., etc.)
MaxDistance int // Maximum Levenshtein distance (0 = no limit)
NgramSize int // Reserved for future n-gram support
Weights map[Metric]float64 // Weights for hybrid metric
}

Note: Normalization options (CaseFold, StripPunct, CollapseSpace, DropCompanySuffix) only take effect when Normalize is true.
Scored pairs a candidate value with its similarity score.
type Scored struct {
Value string // The candidate string
Score float64 // Similarity score in [0, 1]
}

Distance returns the integer Levenshtein distance and reports whether it is within MaxDistance.
func Distance(a, b string, opts Options) (int, bool)

Parameters:
- a, b: Strings to compare
- opts: Options controlling normalization and distance limits

Returns:
- int: Levenshtein distance between the normalized strings
- bool: true if the distance is within MaxDistance (or MaxDistance is 0), false otherwise
Example:
opts := textscore.Options{
Normalize: true,
CaseFold: true,
MaxDistance: 5,
}
dist, within := textscore.Distance("Hello", "Hallo", opts)
fmt.Printf("Distance: %d, Within limit: %v\n", dist, within)

Similarity returns a similarity score in the range [0, 1], where 1.0 indicates strings that are identical after normalization and 0.0 indicates no similarity.
func Similarity(a, b string, metric Metric, opts Options) float64

Parameters:
- a, b: Strings to compare
- metric: The similarity metric to use
- opts: Options controlling normalization and metric behavior

Returns:
- float64: Similarity score in [0, 1]
Example:
opts := textscore.Options{
Normalize: true,
CaseFold: true,
StripPunct: true,
CollapseSpace: true,
}
// Using Jaccard similarity
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricJaccard, opts)
// Using hybrid metric with custom weights
opts.Weights = map[textscore.Metric]float64{
textscore.MetricJaccard: 0.5,
textscore.MetricDice: 0.3,
textscore.MetricLevenshtein: 0.2,
}
score = textscore.Similarity("Hello World", "hello world", textscore.MetricHybrid, opts)

Rank returns candidates sorted by descending similarity to the query. Candidates with zero similarity are excluded.
func Rank(query string, candidates []string, metric Metric, opts Options) []Scored

Parameters:
- query: The query string to match against
- candidates: Slice of candidate strings to rank
- metric: The similarity metric to use
- opts: Options controlling normalization and metric behavior

Returns:
- []Scored: Sorted slice of scored candidates (highest score first). Ties are broken alphabetically.
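
The ordering rules described for Rank (highest score first, alphabetical tie-break, zero scores dropped) can be sketched independently of the package. The `scored` type and `rank` helper below are hypothetical stand-ins written for illustration, not the library's implementation, and the scoring function is passed in as a parameter so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sort"
)

// scored mirrors the shape of the Scored struct described above (illustrative copy).
type scored struct {
	Value string
	Score float64
}

// rank scores each candidate with scoreFn, drops zero scores, and sorts by
// descending score with an alphabetical tie-break on Value.
func rank(candidates []string, scoreFn func(string) float64) []scored {
	var out []scored
	for _, c := range candidates {
		if s := scoreFn(c); s > 0 {
			out = append(out, scored{Value: c, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Value < out[j].Value
	})
	return out
}

func main() {
	// Fixed, made-up scores so the ordering is easy to follow.
	fixed := map[string]float64{"beta": 0.5, "alpha": 0.5, "gamma": 1, "delta": 0}
	lookup := func(c string) float64 { return fixed[c] }

	for _, r := range rank([]string{"beta", "alpha", "gamma", "delta"}, lookup) {
		fmt.Printf("%s: %.1f\n", r.Value, r.Score)
	}
	// Prints gamma: 1.0, then alpha: 0.5, then beta: 0.5; delta is dropped.
}
```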
Example:
opts := textscore.Options{
Normalize: true,
CaseFold: true,
StripPunct: true,
CollapseSpace: true,
}
candidates := []string{
"Apple Inc.",
"Microsoft Corporation",
"Google LLC",
"Amazon.com Inc.",
}
results := textscore.Rank("apple", candidates, textscore.MetricJaccard, opts)
for _, result := range results {
fmt.Printf("%s: %.3f\n", result.Value, result.Score)
}

MetricLevenshtein is based on Levenshtein (edit) distance: the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
Formula: 1 - (distance / max(len(a), len(b)))
Best for: Detecting typos, character-level differences, short strings
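
As a rough illustration of the formula above, the sketch below computes the edit distance with a two-row dynamic-programming table and converts it to a score. It is not the package's code: the helper names are hypothetical, and measuring length in runes rather than bytes is an assumption based on the stated Unicode support.

```go
package main

import "fmt"

// minInt returns the smallest of three ints.
func minInt(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, comparing runes and
// keeping only two rows of the table in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // cost of deleting the first i runes of a
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// levSimilarity applies 1 - distance/max(len(a), len(b)), with lengths in runes.
func levSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1 // assumption: two empty strings count as identical
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))             // 3
	fmt.Printf("%.3f\n", levSimilarity("kitten", "sitting")) // 0.571
}
```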
Example:
score := textscore.Similarity("kitten", "sitting", textscore.MetricLevenshtein, opts)
// By the formula above: distance 3 over a maximum length of 7 gives about 0.57

MetricJaccard is token-based similarity using the Jaccard coefficient. It compares the sets of tokens (words) in the two strings.
Formula: |A ∩ B| / |A ∪ B|
Best for: Comparing documents, finding shared words, order-independent matching
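
The set arithmetic behind the formula can be sketched as follows. This is a minimal illustration, assuming tokens are whitespace-separated words of already-normalized input; the package's actual tokenization is not documented here, and `tokenSet` and `jaccard` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet splits s on whitespace and returns the distinct tokens.
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(s) {
		set[tok] = struct{}{}
	}
	return set
}

// jaccard computes |A ∩ B| / |A ∪ B| over the token sets of a and b.
func jaccard(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa) == 0 && len(sb) == 0 {
		return 1 // assumption: two empty inputs count as identical
	}
	shared := 0
	for tok := range sa {
		if _, ok := sb[tok]; ok {
			shared++
		}
	}
	union := len(sa) + len(sb) - shared
	return float64(shared) / float64(union)
}

func main() {
	fmt.Printf("%.2f\n", jaccard("apple banana", "banana apple")) // 1.00
	fmt.Printf("%.2f\n", jaccard("apple banana", "apple orange")) // 0.33
}
```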
Example:
score := textscore.Similarity("apple banana", "banana apple", textscore.MetricJaccard, opts)
// Returns 1.0 (same tokens, different order)

MetricDice is token-based similarity using the Sørensen-Dice coefficient. It resembles Jaccard but counts the shared tokens twice, so overlap carries more weight.
Formula: 2 * |A ∩ B| / (|A| + |B|)
Best for: Similar use cases to Jaccard, but slightly more forgiving
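
To make "more forgiving" concrete, the sketch below evaluates the Dice and Jaccard formulas on the same set sizes. The helpers are hypothetical and take the sizes directly rather than tokenizing, so the arithmetic is the only thing on display.

```go
package main

import "fmt"

// dice computes 2*|A ∩ B| / (|A| + |B|) from the set sizes.
func dice(sizeA, sizeB, shared int) float64 {
	if sizeA+sizeB == 0 {
		return 1 // assumption: two empty token sets count as identical
	}
	return 2 * float64(shared) / float64(sizeA+sizeB)
}

// jaccard computes |A ∩ B| / |A ∪ B| from the same sizes, for comparison.
func jaccard(sizeA, sizeB, shared int) float64 {
	union := sizeA + sizeB - shared
	if union == 0 {
		return 1
	}
	return float64(shared) / float64(union)
}

func main() {
	// "apple banana" vs "apple orange": two tokens on each side, one shared.
	fmt.Printf("dice=%.2f jaccard=%.2f\n", dice(2, 2, 1), jaccard(2, 2, 1))
	// prints: dice=0.50 jaccard=0.33
}
```

For any partial overlap the Dice value is at least the Jaccard value, which is why it tends to score near-matches higher.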
Example:
score := textscore.Similarity("apple banana", "apple orange", textscore.MetricDice, opts)
// Returns 0.5 (1 shared token, 2 tokens per string: 2 * 1 / (2 + 2))

MetricHybrid is a weighted combination of the other metrics. Default weights are:
- Jaccard: 0.4
- Dice: 0.3
- Levenshtein: 0.3
Custom weights can be provided via Options.Weights.
Best for: General-purpose matching when you want to balance multiple approaches
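
One plausible way to combine the per-metric scores is a weighted average, sketched below. Dividing by the sum of the weights is an assumption made here so that the result stays in [0, 1] even when custom weights do not add up to 1; the package may combine scores differently. The `metric` type, its constants, and the input scores are all hypothetical and chosen for illustration.

```go
package main

import "fmt"

// metric is a hypothetical stand-in for the package's Metric type.
type metric string

const (
	mJaccard     metric = "jaccard"
	mDice        metric = "dice"
	mLevenshtein metric = "levenshtein"
)

// hybrid returns the weighted average of the given per-metric scores.
// Metrics without a weight are ignored.
func hybrid(scores, weights map[metric]float64) float64 {
	var sum, total float64
	for m, w := range weights {
		sum += w * scores[m]
		total += w
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	// Made-up per-metric scores for one pair of strings.
	scores := map[metric]float64{mJaccard: 0.5, mDice: 0.8, mLevenshtein: 0.6}

	defaults := map[metric]float64{mJaccard: 0.4, mDice: 0.3, mLevenshtein: 0.3}
	fmt.Printf("%.2f\n", hybrid(scores, defaults)) // 0.62

	custom := map[metric]float64{mJaccard: 0.6, mLevenshtein: 0.4}
	fmt.Printf("%.2f\n", hybrid(scores, custom)) // 0.54
}
```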
Example:
opts := textscore.Options{
Normalize: true,
CaseFold: true,
Weights: map[textscore.Metric]float64{
textscore.MetricJaccard: 0.6,
textscore.MetricLevenshtein: 0.4,
},
}
score := textscore.Similarity("Apple Inc.", "apple", textscore.MetricHybrid, opts)

Normalization is applied in the following order when Normalize is true:
- Case Folding (CaseFold): Converts strings to lowercase
- Punctuation Stripping (StripPunct): Replaces punctuation and symbols with spaces
- Space Collapsing (CollapseSpace): Collapses multiple whitespace characters to single spaces
- Company Suffix Removal (DropCompanySuffix): Removes common company suffixes
The following suffixes are removed when DropCompanySuffix is enabled:
- inc, inc.
- ltd, ltd.
- llc
- corp, corp.
- corporation
- gmbh
- s.a., sa
- bv
- co, co.
- company
- srl
- ag
- plc
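
The four steps can be sketched as a single function, in the order listed above. This is an approximation rather than the package's pipeline: it enables every step unconditionally, uses only a subset of the suffix list, and assumes that just one trailing suffix token is removed. `normalize` and `companySuffixes` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// companySuffixes is a subset of the suffix list above, without trailing dots
// because punctuation has already been stripped when the lookup happens.
var companySuffixes = map[string]bool{
	"inc": true, "ltd": true, "llc": true, "corp": true,
	"corporation": true, "gmbh": true, "co": true, "company": true, "plc": true,
}

// normalize applies all four steps in the documented order.
func normalize(s string) string {
	// 1. Case folding.
	s = strings.ToLower(s)

	// 2. Punctuation and symbols become spaces.
	s = strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return r
	}, s)

	// 3. Collapse runs of whitespace by splitting into fields.
	fields := strings.Fields(s)

	// 4. Drop one trailing company suffix (assumption: only the last token,
	// and never the only token).
	if n := len(fields); n > 1 && companySuffixes[fields[n-1]] {
		fields = fields[:n-1]
	}
	return strings.Join(fields, " ")
}

func main() {
	fmt.Printf("%q\n", normalize("Apple Inc."))            // "apple"
	fmt.Printf("%q\n", normalize("Microsoft Corporation")) // "microsoft"
	fmt.Printf("%q\n", normalize("  Hello,   World! "))    // "hello world"
}
```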
Example:
opts := textscore.Options{
Normalize: true,
DropCompanySuffix: true,
}
// "Apple Inc." becomes "Apple"
// "Microsoft Corporation" becomes "Microsoft"

opts := textscore.Options{
Normalize: true,
CaseFold: true,
StripPunct: true,
CollapseSpace: true,
}
query := "john smith"
candidates := []string{"John A. Smith", "John Smith", "Jane Doe"}
results := textscore.Rank(query, candidates, textscore.MetricHybrid, opts)
// "John Smith" matches the query exactly after normalization, so it should rank first, ahead of "John A. Smith"

opts := textscore.Options{
Normalize: true,
CaseFold: true,
StripPunct: true,
CollapseSpace: true,
DropCompanySuffix: true,
}
query := "Apple"
candidates := []string{"Apple Inc.", "Apple Corporation", "Microsoft Corp."}
results := textscore.Rank(query, candidates, textscore.MetricJaccard, opts)
// "Apple Inc." and "Apple Corporation" will rank highest

opts := textscore.Options{
Normalize: true,
CaseFold: true,
MaxDistance: 2, // Only consider strings within 2 edits
}
dist, within := textscore.Distance("kitten", "sitting", opts)
// "kitten" -> "sitting" takes 3 edits, so within is false with MaxDistance: 2
if within {
fmt.Printf("Close match (distance: %d)\n", dist)
}

- Levenshtein: O(n*m) time complexity where n and m are string lengths. Optimized for memory by using only two rows.
- Token-based metrics (Jaccard, Dice): O(n+m) time complexity where n and m are token counts.
- Hybrid: Combines multiple metrics, so performance depends on which metrics are included.
For large candidate sets, consider:
- Using MaxDistance to filter out poor matches early
- Pre-normalizing candidate strings if they're reused
- Using token-based metrics for longer strings
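
One cheap form of early filtering relies on a general property of edit distance: it can never be smaller than the difference in length between the two strings. The sketch below uses that bound to skip candidates before any O(n*m) work is done. Whether the package applies such a check internally is not stated here; `canBeWithin` is a hypothetical helper, and the check should run on strings that have already been normalized the same way the comparison will normalize them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// canBeWithin reports whether a and b could possibly be within maxDist edits.
// A false result is definitive; a true result still needs the full distance.
func canBeWithin(a, b string, maxDist int) bool {
	if maxDist == 0 {
		return true // 0 means no limit, mirroring the MaxDistance convention
	}
	diff := utf8.RuneCountInString(a) - utf8.RuneCountInString(b)
	if diff < 0 {
		diff = -diff
	}
	return diff <= maxDist
}

func main() {
	query := "kitten"
	for _, c := range []string{"sitting", "kitchen sink", "mitten"} {
		fmt.Printf("%q worth comparing: %v\n", c, canBeWithin(query, c, 2))
	}
	// "sitting" and "mitten" pass the check; "kitchen sink" is skipped.
}
```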
[Specify your license here]
[Specify contribution guidelines here]