Skip to content

Rank function could be faster #1

@rikedyp

Description

@rikedyp

Edit: I double checked and found (0.5×⍳+⍸)⍨ is not the same as TamStat.Util.rank, but have updated text below to use APLCart's alternative that does return the same result for ties.

I'm reviewing the MannWhitney function for my own interest. I tried implementing the procedure on Wikipedia (Calculations, method 2) and got:

 MannWhitneyU{
     r(2÷+),
     r1+/r
     r1-0.5×(×1+)
 }

which I've verified with TamStat as obtaining the correct U statistic. Part of the code includes cardinal ranking with an averaged score for ties. From APLCart:

Rank2÷+

Comparing with TamStat.Util.rank

      C← 1 3 3 3 3 3 3 3 6
      ]runtime -c "(2÷⍨⍋∘⍋+⍒∘⍋∘⌽)C" "rank C"
                                                                          
  (2÷⍨⍋∘⍋+⍒∘⍋∘⌽)C → 1.0E¯6 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕   
  rank C          → 1.0E¯6 | +4% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 
      C← 1e4⍴0
      ]runtime -c "(2÷⍨⍋∘⍋+⍒∘⍋∘⌽)C" "rank C"
                                                                             
  (2÷⍨⍋∘⍋+⍒∘⍋∘⌽)C → 7.2E¯5 |     0% ⎕                                        
  rank C          → 5.0E¯3 | +6881% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

the APLCart-based formulation is much faster. It also now works with actually any array thanks to Total Array Ordering and use of grade instead of comparison functions.


Original translation with averaged rank for ties

 MannWhitneyU{
     r(0.5×+)ss[gs,]
     r1+/r[()g]
     r1-0.5×(×1+)
 }

which I've verified with TamStat as obtaining the correct U statistic. Part of the code includes cardinal ranking with an averaged score for ties. From APLCart:

Rank(0.5×+)

Use of interval-index (⍺⍸⍵) requires the argument is sorted ascending. Extra work is needed to return the ranks in corresponding values for unordered input:

Rank{(g)(0.5×+)[g]}

Comparing this with TamStat.Util.rank:

With integers:

      C←?1e4⍴9
      ]runtime -c "(⊂⍋g)⌷(0.5×⍳+⍸)⍨C[g←⍋C]" "rank C"
                                                                                      
  (⊂⍋g)⌷(0.5×⍳+⍸)⍨C[g←⍋C] → 6.3E¯5 |      0%                                          
  rank C                  → 6.8E¯3 | +10796% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

with floats:

      C←?1e4⍴0
      ]runtime -c "(⊂⍋g)⌷(0.5×⍳+⍸)⍨C[g←⍋C]" "rank C"
                                                                                     
  (⊂⍋g)⌷(0.5×⍳+⍸)⍨C[g←⍋C] → 1.5E¯3 |     0% ⎕⎕                                       
  rank C                  → 3.6E¯2 | +2270% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

the APLCart-based formulation is much faster. It also now works with actually any array thanks to Total Array Ordering and use of grade and set functions instead of comparison functions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions