SortingNetworks/sort5a.asm at master · komrad36/SortingNetworks · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
;*******************************************************************
;   sort5a.asm
;   Sorting Networks
;
;	Author: Kareem Omar
;	kareem.omar@uah.edu
;	https//github.com/komrad36
;
;	Last updated Feb 15, 2017
;*******************************************************************
;
; SUMMARY: I present novel and state-of-the-art sorting of 4 int32_t
; and sorting of 6 int8_t, as examples, using SSE4, and some thoughts
; and notes on fast sorting of small fixed-size arrays.
;
; These are the fastest known approaches on modern CPUs.
; They are completely branchless and extremely fast.
; For example, 4 int32_t can be sorted in ~18 cycles on Skylake.
;
; These examples can be modified to sort signed or unsigned data as long
; as the total size of the data is <= 128 bits (16 bytes). Note that
; trying to use AVX2 to push that to 256 bits will NOT HELP
; because of the 3 cycle interlane latency that makes shuffling
; across the 128-bit lane boundary too expensive to be useful.
;
; In that case, or in the case that you can't support SSE4,
; efficient scalar code can also be generated that isn't too much
; slower, but it's not immediately evident what the optimal approach
; is in assembly, nor how to coerce C/C++ compilers to produce that
; assembly from C/C++ code. (I discuss the approach at the end of this
; documentation.)
;
; Not all compilers correctly generate optimal assembly for either the
; SSE or scalar code, so I provide assembly versions too. In fact, this
; is not a strong enough statement; many compilers FAIL MISERABLY
; at both the SSE and scalar cases for n >= 3. CL in particular
; (Microsoft Visual C++) just dies horribly in a fire, so the assembly
; snippets may not be a bad idea. Profile and/or check your disassembly.
;
; Note that these assembly snippets use the Windows x64 calling convention,
; but then proceed to clobber a bunch of registers they shouldn't. Normally
; they'd be inlined. Feel free to callee-save registers that your
; application cannot safely clobber.
;
; These code snippets work on the principle of sorting networks.
; Conventional sorting algorithms like quicksort and mergesort
; are not great at small array sizes. One notices in profiling that
; simpler sorts like insertion and selection tend to win. However,
; even these are not NEARLY as fast as they could be for
; fixed-size, small arrays.
;
; Available sorts and their function names:
;
; >>> SSE Assembly (fast as hell. FASTEST option on modern CPUs.
;					 these are written in MASM for Win64;
;					 but it's Intel syntax and you can make the small
;					 modifications required for other assemblers.)
; Sorting 4 int32_t  |  simdsort4a()
; Sorting 4 int32_t  |  simdsort4a_noconstants()
; Sorting 4 int32_t  |  simdsort4a_nofloat()
;
; >>> SSE C++ (make sure generated assembly matches):
; Sorting 4 int32_t  |  simdsort4()
; Sorting 6 int8_t   |  simdsort6()
;
; >>> Scalar Assembly (these are written in MASM for Win64;
;						but it's Intel syntax and you can make the small
;						modifications required for other assemblers.)
; Sorting 2 int32_t  |  sort2a()
; Sorting 3 int32_t  |  sort3a()
; Sorting 4 int32_t  |  sort4a()
; Sorting 5 int32_t  |  sort5a()
; Sorting 6 int32_t  |  sort6a()
;
; >>> Scalar C++ (make sure generated assembly matches)
; Sorting 2 int32_t  |  sort2()
; Sorting 6 int32_t  |  sort6()
;
;
; Okay, if you've made it this far, let's discuss
; fast conditional swap operations. Conditional swaps
; if the lower element is greater are the heart of sorting
; networks. Given two values,
; 'a', and 'b', leave them as they are if 'a' is less
; than 'b', i.e. if they are in sorted order. However,
; swap them if 'a' is greater than or equal to 'b'.
; Thus after such a conditional swap operation 'a' and 'b' are in sorted
; order no matter what order they came in as.
;
; A series of such operations can deterministically sort
; a fixed-size array. Typically one can optimize for depth
; (minimum number of operations given infinite parallel
; processing) or for size (minimum number of operations given
; no parallel processing). For n == 4 these two optimal solutions
; are actually given by the same network, with depth 3 and
; size 5.
;
; Scalar first: how do you efficiently conditional swap? Again, note that
; lots of compilers don't produce optimal assembly no matter
; what C++ you give them. But what is the optimal assembly?
; Well, on modern processors, the answer is conditional moves:
;
;	; inputs: eax, r9d
;	; scratch register: edx
;	cmp     eax, r9d
;	mov     edx, eax
;	cmovg   eax, r9d
;	cmovg   r9d, edx
;	; eax and r9d have been swapped if necessary such that eax is now <= r9d
;
; See the function 'sort6' in 'sorts.cpp' for an attempt at some C++ code
; that has a decent chance of compiling into conditional swaps that look like that.
; Again, they OFTEN DON'T, especially the CL compiler and g++. Use the assembly
; snippets instead, or at least profile and inspect your disassembly to be sure.
;
; The SSE sorts are rather more sophisticated, but the basic principle
; is to branchlessly generate shuffle index masks at runtime and then
; use them to permute the order of the data in parallel, performing
; one complete level of the sorting network at a time.
;
; I provide 3 versions of the assembly for sorting 4 int32_t. The fastest
; should be the plain 'simdsort4a'. It performs float reinterpretation
; and relies on some constant loads, but both of these are USUALLY
; better than the alternatives. However:
;
; Older CPUs may incur too much latency from the reinterpretation to be
; worth it. For such CPUs, try 'simdsort4a_nofloat.asm'.
;
; Applications that cannot afford to have these constants occupying L1
; may wish to try 'simdsort4a_noconstants.asm', which does not eat
; up any cache space with constants - everything is done with immediates
; and some fairly nasty bit twiddling.
;

.CODE

sort5a PROC
	mov		eax, [rcx]
	mov		edx, [rcx+4]
	mov		r9d, [rcx+8]
	mov		r11d, [rcx+12]
	mov		r8d, [rcx+16]
	cmp		eax, edx
	mov		r10d, eax
	cmovg	r10d, edx
	cmovg	edx, eax
	cmp		r9d, r11d
	mov		eax, r9d
	cmovg	eax, r11d
	cmovg	r11d, r9d
	cmp		edx, r11d
	mov		r9d, edx
	cmovg	r9d, r11d
	cmovg	r11d, edx
	cmp		eax, r8d
	mov		edx, eax
	cmovg	edx, r8d
	cmovl	eax, r8d
	cmp		r10d, edx
	mov		r8d, r10d
	cmovg	r8d, edx
	cmovg	edx, r10d
	cmp		r9d, eax
	mov		r12d, r9d
	cmovg	r12d, eax
	cmovg	eax, r9d
	cmp		r12d, edx
	mov		r9d, r12d
	cmovg	r9d, edx
	cmovl	r12d, edx
	cmp		r11d, eax
	mov		edx, r11d
	cmovg	edx, eax
	cmovg	eax, r11d
	cmp		r12d, edx
	mov		r11d, r12d
	cmovg	r11d, edx
	cmovg	edx, r12d
	mov		[rcx], r8d
	mov		[rcx+4], r9d
	mov		[rcx+8], r11d
	mov		[rcx+12], edx
	mov		[rcx+16], eax
	ret
sort5a ENDP

END