-
Notifications
You must be signed in to change notification settings - Fork 7
Expand file tree
/
Copy pathsort5a.asm
More file actions
187 lines (184 loc) · 6.72 KB
/
sort5a.asm
File metadata and controls
187 lines (184 loc) · 6.72 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
;*******************************************************************
; sort5a.asm
; Sorting Networks
;
; Author: Kareem Omar
; kareem.omar@uah.edu
; https//github.com/komrad36
;
; Last updated Feb 15, 2017
;*******************************************************************
;
; SUMMARY: I present novel and state-of-the-art sorting of 4 int32_t
; and sorting of 6 int8_t, as examples, using SSE4, and some thoughts
; and notes on fast sorting of small fixed-size arrays.
;
; These are the fastest known approaches on modern CPUs.
; They are completely branchless and extremely fast.
; For example, 4 int32_t can be sorted in ~18 cycles on Skylake.
;
; These examples can be modified to sort signed or unsigned data as long
; as the total size of the data is <= 128 bits (16 bytes). Note that
; trying to use AVX2 to push that to 256 bits will NOT HELP
; because of the 3 cycle interlane latency that makes shuffling
; across the 128-bit lane boundary too expensive to be useful.
;
; In that case, or in the case that you can't support SSE4,
; efficient scalar code can also be generated that isn't too much
; slower, but it's not immediately evident what the optimal approach
; is in assembly, nor how to coerce C/C++ compilers to produce that
; assembly from C/C++ code. (I discuss the approach at the end of this
; documentation.)
;
; Not all compilers correctly generate optimal assembly for either the
; SSE or scalar code, so I provide assembly versions too. In fact, this
; is not a strong enough statement; many compilers FAIL MISERABLY
; at both the SSE and scalar cases for n >= 3. CL in particular
; (Microsoft Visual C++) just dies horribly in a fire, so the assembly
; snippets may not be a bad idea. Profile and/or check your disassembly.
;
; Note that these assembly snippets use the Windows x64 calling convention,
; but then proceed to clobber a bunch of registers they shouldn't. Normally
; they'd be inlined. Feel free to callee-save registers that your
; application cannot safely clobber.
;
; These code snippets work on the principle of sorting networks.
; Conventional sorting algorithms like quicksort and mergesort
; are not great at small array sizes. One notices in profiling that
; simpler sorts like insertion and selection tend to win. However,
; even these are not NEARLY as fast as they could be for
; fixed-size, small arrays.
;
; Available sorts and their function names:
;
; >>> SSE Assembly (fast as hell. FASTEST option on modern CPUs.
; these are written in MASM for Win64;
; but it's Intel syntax and you can make the small
; modifications required for other assemblers.)
; Sorting 4 int32_t | simdsort4a()
; Sorting 4 int32_t | simdsort4a_noconstants()
; Sorting 4 int32_t | simdsort4a_nofloat()
;
; >>> SSE C++ (make sure generated assembly matches):
; Sorting 4 int32_t | simdsort4()
; Sorting 6 int8_t | simdsort6()
;
; >>> Scalar Assembly (these are written in MASM for Win64;
; but it's Intel syntax and you can make the small
; modifications required for other assemblers.)
; Sorting 2 int32_t | sort2a()
; Sorting 3 int32_t | sort3a()
; Sorting 4 int32_t | sort4a()
; Sorting 5 int32_t | sort5a()
; Sorting 6 int32_t | sort6a()
;
; >>> Scalar C++ (make sure generated assembly matches)
; Sorting 2 int32_t | sort2()
; Sorting 6 int32_t | sort6()
;
;
; Okay, if you've made it this far, let's discuss
; fast conditional swap operations. Conditional swaps
; if the lower element is greater are the heart of sorting
; networks. Given two values,
; 'a', and 'b', leave them as they are if 'a' is less
; than 'b', i.e. if they are in sorted order. However,
; swap them if 'a' is greater than or equal to 'b'.
; Thus after such a conditional swap operation 'a' and 'b' are in sorted
; order no matter what order they came in as.
;
; A series of such operations can deterministically sort
; a fixed-size array. Typically one can optimize for depth
; (minimum number of operations given infinite parallel
; processing) or for size (minimum number of operations given
; no parallel processing). For n == 4 these two optimal solutions
; are actually given by the same network, with depth 3 and
; size 5.
;
; Scalar first: how do you efficiently conditional swap? Again, note that
; lots of compilers don't produce optimal assembly no matter
; what C++ you give them. But what is the optimal assembly?
; Well, on modern processors, the answer is conditional moves:
;
; ; inputs: eax, r9d
; ; scratch register: edx
; cmp eax, r9d
; mov edx, eax
; cmovg eax, r9d
; cmovg r9d, edx
; ; eax and r9d have been swapped if necessary such that eax is now <= r9d
;
; See the function 'sort6' in 'sorts.cpp' for an attempt at some C++ code
; that has a decent chance of compiling into conditional swaps that look like that.
; Again, they OFTEN DON'T, especially the CL compiler and g++. Use the assembly
; snippets instead, or at least profile and inspect your disassembly to be sure.
;
; The SSE sorts are rather more sophisticated, but the basic principle
; is to branchlessly generate shuffle index masks at runtime and then
; use them to permute the order of the data in parallel, performing
; one complete level of the sorting network at a time.
;
; I provide 3 versions of the assembly for sorting 4 int32_t. The fastest
; should be the plain 'simdsort4a'. It performs float reinterpretation
; and relies on some constant loads, but both of these are USUALLY
; better than the alternatives. However:
;
; Older CPUs may incur too much latency from the reinterpretation to be
; worth it. For such CPUs, try 'simdsort4a_nofloat.asm'.
;
; Applications that cannot afford to have these constants occupying L1
; may wish to try 'simdsort4a_noconstants.asm', which does not eat
; up any cache space with constants - everything is done with immediates
; and some fairly nasty bit twiddling.
;
.CODE
sort5a PROC
mov eax, [rcx]
mov edx, [rcx+4]
mov r9d, [rcx+8]
mov r11d, [rcx+12]
mov r8d, [rcx+16]
cmp eax, edx
mov r10d, eax
cmovg r10d, edx
cmovg edx, eax
cmp r9d, r11d
mov eax, r9d
cmovg eax, r11d
cmovg r11d, r9d
cmp edx, r11d
mov r9d, edx
cmovg r9d, r11d
cmovg r11d, edx
cmp eax, r8d
mov edx, eax
cmovg edx, r8d
cmovl eax, r8d
cmp r10d, edx
mov r8d, r10d
cmovg r8d, edx
cmovg edx, r10d
cmp r9d, eax
mov r12d, r9d
cmovg r12d, eax
cmovg eax, r9d
cmp r12d, edx
mov r9d, r12d
cmovg r9d, edx
cmovl r12d, edx
cmp r11d, eax
mov edx, r11d
cmovg edx, eax
cmovg eax, r11d
cmp r12d, edx
mov r11d, r12d
cmovg r11d, edx
cmovg edx, r12d
mov [rcx], r8d
mov [rcx+4], r9d
mov [rcx+8], r11d
mov [rcx+12], edx
mov [rcx+16], eax
ret
sort5a ENDP
END