Add top_features functions by jamesalster · Pull Request #289 · JuliaText/TextAnalysis.jl

jamesalster · 2026-03-07T11:20:35Z

This PR adds top_features() functions for Corpus, DocumentTermMatrix, Document and lexicons, with docs and tests. It's my first time contributing to this package, so let me know if I've missed anything.

One open question is if/how to sort terms with tied counts: I've gone for no explicit tie-breaking, using Julia's default sort!(OrderedDict(...)), since it seemed simplest and I couldn't think of an obvious solution.

jamesalster · 2026-03-07T12:36:32Z

I enforced alphabetic sorting for ties after the online tests had problems reproducing the order (it was possibly run- or machine-specific).

rssdev10 · 2026-03-13T15:31:27Z

src/dtm.jl

+function top_features(D::DocumentTermMatrix)
+    counts = vec(sum(D.dtm; dims=1))
+    return sort!(sort!(OrderedDict(zip(D.terms, counts))); byvalue=true, rev=true) # double sort for key and value order
+end


The computational complexity of this method is very high. Consider something like this instead:

top_features(D::DocumentTermMatrix, n::Int) = first.(top_features(D, Val(n)))

function top_features(D::DocumentTermMatrix, ::Val{N}) where {N}
    counts = @view(sum(D.dtm; dims=1)[1, :])
    n = min(N, length(counts))
    idx = partialsortperm(counts, 1:n; rev=true)
    collect(zip(D.terms[idx], counts[idx]))
end

This avoids the extra memory allocation and the need to sort everything: only the top n elements are selected and ordered.
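To illustrate the suggestion, here is a minimal sketch of a partial sort next to a full sort. The `terms` and `counts` vectors are made-up stand-ins for `D.terms` and the column sums of `D.dtm`, not data from the package, and the counts are deliberately distinct so that ties do not matter here.

```julia
# Hypothetical toy data standing in for D.terms and the column sums of D.dtm.
terms  = ["apple", "banana", "cherry", "date", "elder"]
counts = [3, 7, 1, 5, 2]

n = min(2, length(counts))

# partialsortperm returns the indices of only the n largest counts
# (in descending order because of rev=true), without ordering the rest.
idx = partialsortperm(counts, 1:n; rev=true)
top = collect(zip(terms[idx], counts[idx]))

# The same indices obtained through a full sort, for comparison.
full = sortperm(counts; rev=true)[1:n]
```

Both approaches pick the same entries; the partial version just does not have to order the remaining terms, which is the saving when the vocabulary is large and n is small.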

rssdev10 · 2026-03-13T15:32:10Z

test/runtests.jl

 using TextAnalysis
 using WordTokenizers
 using Serialization
+using OrderedCollections: OrderedDict


I'm not sure this is needed.

rssdev10 · 2026-03-13T15:36:12Z

I'm thinking about the name top_features. Don't you think top_terms or most_frequent_terms would be clearer?

rssdev10 · 2026-03-13T15:48:35Z

The comment regarding sort!(sort!(OrderedDict(zip... applies to all 3 places that use a full sort.

jamesalster · 2026-03-17T13:13:28Z

Thank you for the review @rssdev10 ! You're right, I forgot how inefficient the full sort!() could become.

Changes:

- partialsortperm() implemented, with numeric sorting using alphabetic order to break ties, consistent with quanteda
- Method without the n argument dropped, since on reflection it's unnecessary
- OrderedDict kept, since I think the order of the terms (keys) being strictly consistent is definitely expected user behaviour. If you feel strongly about the dependency, we could discard the count information and return a string vector?
- Naming was taken from quanteda::topfeatures() but I've changed it to keep it consistent with TextAnalysis.
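One way the tie-breaking rule described above could be combined with a partial sort is sketched below. This is an assumption about the approach, not the code from the PR: the helper name `top_terms_sketch` and the toy data are hypothetical. It sorts on `(-count, term)` tuples, which Julia compares lexicographically, so higher counts come first and equal counts fall back to alphabetical order.

```julia
# Hypothetical helper, not part of TextAnalysis.jl.
function top_terms_sketch(terms::Vector{String}, counts::Vector{Int}, n::Int)
    k = min(n, length(terms))
    # Negating the count makes an ascending sort put the largest count first;
    # the term is the second tuple element, so ties are broken alphabetically.
    sortkeys = [(-c, t) for (c, t) in zip(counts, terms)]
    idx = partialsortperm(sortkeys, 1:k)
    return [terms[i] => counts[i] for i in idx]
end

# Toy data with two pairs of tied counts.
result = top_terms_sketch(["pear", "fig", "apple", "kiwi"], [2, 5, 2, 5], 3)
```

Because the order depends only on the counts and the terms themselves, the result should not vary between runs or machines. The cost is one extra vector of sort keys.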

rssdev10 · 2026-03-23T15:58:36Z

> OrderedDict kept, since I think the order of the terms (keys) being strictly consistent is definitely expected user behaviour. If you feel strongly about the dependency, we could discard the count information and return a string vector?

I think it would be better not to add OrderedDict as a new package dependency. Users who need it can simply use top_terms(...) |> OrderedDict in their own code.

You could return either a Vector{Pair} or a Vector{Tuple} as the result instead. See the argument types that OrderedDict accepts - https://github.com/JuliaCollections/OrderedCollections.jl/blob/master/src/ordered_dict.jl#L61-L68

Also, please add a trailing newline at the end of every file you have modified. GitHub flags the missing newline in places like this one - https://github.com/JuliaText/TextAnalysis.jl/pull/289/changes#diff-c6b4870a7fa201c05bd4ce8bf468b4f092f94fb6d18bc8a913e9e6009b342956R415
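As a rough sketch of what a Vector{Pair} result gives the caller, using only Base: the `top` vector below is a made-up example of what top_terms(...) might return, not actual package output.

```julia
# Hypothetical result, already ranked by count with alphabetical tie-breaking.
top = ["fig" => 5, "kiwi" => 5, "apple" => 2]

# A vector keeps the ranking order, so the terms and counts can be split
# without losing it.
terms_only  = first.(top)
counts_only = last.(top)

# A plain Dict gives lookup by term, but does not promise any iteration order.
lookup = Dict(top)

# Assumption: with OrderedCollections loaded, `OrderedDict(top)` would give
# lookup by term while also keeping the ranking order.
```

The ordering lives in the vector itself, so the package would not need a new dependency to guarantee it.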

jamesalster added 4 commits March 7, 2026 11:17

Add top_features functions

aaea693

OrderedCollections version compatibility for Julia 1.6

be58387

enforce alphabetic sorting

dbe1f57

improve docs

7cf6d2b

rssdev10 reviewed Mar 13, 2026


jamesalster added 3 commits March 17, 2026 12:47

incorporate comments

a3bc21f

rename top_features to top_terms

7ba28a8

Fix dtm method for top_terms

fc76728

jamesalster wants to merge 7 commits into JuliaText:master from jamesalster:add-top_features-functions
