Skip to content

Add top_features functions#289

Open
jamesalster wants to merge 7 commits intoJuliaText:masterfrom
jamesalster:add-top_features-functions
Open

Add top_features functions#289
jamesalster wants to merge 7 commits intoJuliaText:masterfrom
jamesalster:add-top_features-functions

Conversation

@jamesalster
Copy link
Copy Markdown

@jamesalster jamesalster commented Mar 7, 2026

Suggested addition of functions for top_features(), for Corpus, DocumentTermMatrix, Document and lexicons. I've provided docs and tests. It's my first time contributing to this package, so let me know if I've missed anything.

One question is if/how we sort terms with tied counts: I've gone for no sorting, using julia's default sort!(OrderedDict(...)) since it seemed simplest and I couldn't think of an obvious solution.

@jamesalster
Copy link
Copy Markdown
Author

I enforced alphabetic sorting for ties after the online tests were having problems reproducing the order (it was possibly run-or-machine specific)

src/dtm.jl Outdated
function top_features(D::DocumentTermMatrix)
counts = vec(sum(D.dtm; dims=1))
return sort!(sort!(OrderedDict(zip(D.terms, counts))); byvalue=true, rev=true) # double sort for key and value order
end No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computational complexity of this method is very high. Consider something like this one:

top_features(D::DocumentTermMatrix, n::Int) = first.(top_features(D, Val(n)))

function top_features(D::DocumentTermMatrix, ::Val{N}) where {N}
    counts = @view(sum(D.dtm; dims=1)[1, :])
    n = min(N, length(counts))
    idx = partialsortperm(counts, 1:n; rev=true)
    collect(zip(D.terms[idx], counts[idx]))
end

Here, we avoid exceeding the memory allocation and the need to sort everything, opting instead to take only n-elements.

using TextAnalysis
using WordTokenizers
using Serialization
using OrderedCollections: OrderedDict
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is needed.

@rssdev10
Copy link
Copy Markdown
Collaborator

I'm thinking about the name top_features. Don't you think top_terms or most_frequent_terms would be clearer?

@rssdev10
Copy link
Copy Markdown
Collaborator

The comment regarding sort!(sort!(OrderedDict(zip... is related to all 3 places with full sorting.

@jamesalster
Copy link
Copy Markdown
Author

jamesalster commented Mar 17, 2026

Thank you for the review @rssdev10 ! You're right, I forgot how inefficient the full sort!() could become.

Changes:

  • partialsortperm() implemented, with numeric sorting using alphabetic order to break ties, consistent with quanteda
  • Method without the n argument dropped, since on reflection it's unncessary
  • OrderedDict kept, since I think the order of the terms (keys) being strictly consistent is definitely expected user behaviour. If you feel strongly about the dependency, we could discard the count information and return a string vector?
  • Naming was taken from quanteda::topfeatures() but I've changed it to keep it consistent with TextAnalysis.

@rssdev10
Copy link
Copy Markdown
Collaborator

  • OrderedDict kept, since I think the order of the terms (keys) being strictly consistent is definitely expected user behaviour. If you feel strongly about the dependency, we could discard the count information and return a string vector?

I think it would be better not to add OrderedDict as a new package dependency. Users who need it can simply use top_terms(...) |> OrderedDict in their own code.

You could use either a Vector{Pair} or a Vector{Tuple} struct as the result instead. See allowable arguments types - https://github.com/JuliaCollections/OrderedCollections.jl/blob/master/src/ordered\_dict.jl\#L61-L68

Also, please add a new line to the end of all files you have modified. GitHub indicates the absence of a free line in places like this one - https://github.com/JuliaText/TextAnalysis.jl/pull/289/changes#diff-c6b4870a7fa201c05bd4ce8bf468b4f092f94fb6d18bc8a913e9e6009b342956R415

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants