Slightly better neon code. #3

Open
lemire wants to merge 1 commit into skeeto:master from lemire:master

@lemire

lemire commented Mar 26, 2018

I think that this will be faster. Feel free to drop this PR.

@skeeto
Owner

skeeto commented Mar 29, 2018 via email

@lemire
Author

lemire commented Mar 30, 2018

I can give you access to a server-class AMD Softiron 1000. I think that @vielmetti might also be able to get you access to server-class ARM hardware.

> I also get a warning about dereferencing a type-punned pointer (i.e.
> strict aliasing) in the return expression of is_zero(). This suggests to
> me that that particular lane access is not valid and the compiler isn't
> obligated to ensure it will work correctly.

This index access code is used in production all over (including at Google and Apple). But you can use vget_lane_u64(result,0) == 0 if you prefer.
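
For reference, the intrinsic-based variant can be sketched roughly as follows. It combines vget_lane_u64 with the saturating narrow that appears as uqxtn in the assembly further down. This is only a sketch under assumptions: the names is_zero_narrow and is_zero_words, the uint32x4_t argument type, and the scalar fallback (there only so the snippet builds on non-ARM machines) are illustrative and not taken from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: returns 1 if all 128 bits of v are zero. */
static int is_zero_narrow(uint32x4_t v)
{
    /* Saturating narrow 2x64 -> 2x32: a lane ends up zero only if it was zero. */
    uint32x2_t narrowed = vqmovn_u64(vreinterpretq_u64_u32(v));
    /* Read the packed 64 bits through the intrinsic, with no pointer casts. */
    return vget_lane_u64(vreinterpret_u64_u32(narrowed), 0) == 0;
}

/* Hypothetical wrapper: load four 32-bit words and test them. */
static int is_zero_words(const uint32_t w[4])
{
    assert(w != NULL);
    return is_zero_narrow(vld1q_u32(w));
}
#else
/* Portable stand-in (an assumption, not part of the PR) with the same result. */
static int is_zero_words(const uint32_t w[4])
{
    assert(w != NULL);
    return (w[0] | w[1] | w[2] | w[3]) == 0;
}
#endif
```

Going through vget_lane_u64 keeps the access inside the intrinsics API, so the strict-aliasing warning about a type-punned pointer should not apply to this form.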

> Before your change it takes 2.26s, and with your change it takes 2.40s.

Let us look at the assembly. For my proposal, we get

        uqxtn   v0.2s, v0.2d
        fmov    x0, d0
        cmp     x0, 0
        cset    w0, eq

Your approach compiles to this (GCC 6.3):

        umov    w0, v0.s[0]
        cbz     w0, .L4
.L6:
        mov     w0, 0
        ret
.L4:
        umov    w0, v0.s[1]
        cbnz    w0, .L6
        umov    w0, v0.s[2]
        cbnz    w0, .L6
        umov    w0, v0.s[3]
        cmp     w0, 0
        cset    w0, eq

The main question is how fast uqxtn is. This obviously depends on your specific ARM hardware. On my AMD Softiron, uqxtn has a throughput of 1 instruction per cycle, so it might be hard to beat.
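
To see why a single saturating narrow is enough for the zero test, here is a scalar model of what uqxtn does to each 64-bit lane. The helper names are hypothetical, and the code models the semantics only, not the timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one uqxtn lane: 64-bit -> 32-bit with unsigned saturation. */
static uint32_t sat_narrow_u64(uint64_t x)
{
    return x > UINT32_MAX ? UINT32_MAX : (uint32_t)x;
}

/* Model of the short sequence: narrow both 64-bit halves, pack them into
   one 64-bit value, and compare that value against zero. */
static int is_zero_model(uint64_t lo, uint64_t hi)
{
    uint64_t packed = (uint64_t)sat_narrow_u64(lo)
                    | ((uint64_t)sat_narrow_u64(hi) << 32);
    return packed == 0;
}

/* Model of the lane-by-lane sequence: up to four extractions and branches. */
static int is_zero_lanes_model(const uint32_t lane[4])
{
    assert(lane != NULL);
    for (int i = 0; i < 4; i++)
        if (lane[i] != 0)
            return 0;
    return 1;
}
```

The saturation matters: a plain truncating narrow would map a lane holding 1 << 32 to zero and report a false positive, whereas the saturating form maps it to 0xFFFFFFFF.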

I wrote a benchmark that tests specifically this zero-test function:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/extra/neon/iszero/iszero.c

Here are my results on a Softiron 1000 server. Your approach is clearly slower (roughly 990 versus 660 clock units per operation). Could you check what you get?

$ clang -O3 -o iszero iszero.c && ./iszero
density = 0.000001
rdtsc_overhead set to 0
run_is_zero(buffer,N)                                       	:  689.00000  (clock units)  per operation (best) 	759.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  997.00000  (clock units)  per operation (best) 	1002.00000  (clock units)  per operation (avg)

density = 0.000002
run_is_zero(buffer,N)                                       	:  664.00000  (clock units)  per operation (best) 	669.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  992.00000  (clock units)  per operation (best) 	995.00000  (clock units)  per operation (avg)

density = 0.000004
run_is_zero(buffer,N)                                       	:  661.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000008
run_is_zero(buffer,N)                                       	:  660.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000015
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000031
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000061
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000122
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000244
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	657.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000488
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	657.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000977
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.001953
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.003906
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.007812
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	992.00000  (clock units)  per operation (avg)

density = 0.015625
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	994.00000  (clock units)  per operation (avg)

density = 0.031250
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.062500
run_is_zero(buffer,N)                                       	:  660.00000  (clock units)  per operation (best) 	663.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.125000
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.250000
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.500000
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	990.00000  (clock units)  per operation (avg)

density = 1.000000
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)
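
For readers who want to reproduce the "best" and "avg" columns without the linked benchmark, a harness along these lines would do. It is a minimal sketch under assumptions: timing_t, count_fn, count_zero_blocks and measure are hypothetical names, and it uses the standard clock() function rather than the counter the linked benchmark reads, so the units will differ from the numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result type: fastest and mean time per call, in clock() ticks. */
typedef struct {
    double best;
    double avg;
} timing_t;

/* Function under test: counts the all-zero 4-word blocks in buf[0..n). */
typedef size_t (*count_fn)(const uint32_t *buf, size_t n);

/* Plain-C stand-in for the function being measured. */
static size_t count_zero_blocks(const uint32_t *buf, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i + 4 <= n; i += 4)
        zeros += (buf[i] | buf[i + 1] | buf[i + 2] | buf[i + 3]) == 0;
    return zeros;
}

/* Run fn `repeats` times; keep the minimum and the mean elapsed time. */
static timing_t measure(count_fn fn, const uint32_t *buf, size_t n,
                        int repeats, size_t *sink)
{
    assert(repeats > 0);
    timing_t t = { 0.0, 0.0 };
    double total = 0.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        *sink += fn(buf, n);  /* accumulate so the call is not optimized away */
        double elapsed = (double)(clock() - start);
        if (r == 0 || elapsed < t.best)
            t.best = elapsed;
        total += elapsed;
    }
    t.avg = total / repeats;
    return t;
}
```

Reporting the minimum alongside the mean helps separate the cost of the code itself from noise such as interrupts or frequency changes, which only ever add time.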

@lemire
Author

lemire commented Mar 30, 2018

I obviously have no vested interest in you merging this, but I do have an interest in figuring out what provides the best performance. So I'd be pleased if you could run my benchmark on your favorite system.

@vielmetti

Here are the results on the Packet Type 2A server (Cavium ThunderX, which is not known to have a particularly wonderful NEON implementation).

density = 0.000001 
rdtsc_overhead set to 2
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2231.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2095.00000  (clock units)  per operation (avg) 

density = 0.000002 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2224.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000004 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.000008 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2094.00000  (clock units)  per operation (avg) 

density = 0.000015 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2090.00000  (clock units)  per operation (avg) 

density = 0.000031 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000061 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.000122 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000244 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000488 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2082.00000  (clock units)  per operation (best) 	2091.00000  (clock units)  per operation (avg) 

density = 0.000977 
run_is_zero(buffer,N)                                       	:  2213.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.001953 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.003906 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2228.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.007812 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2094.00000  (clock units)  per operation (avg) 

density = 0.015625 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.031250 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.062500 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2222.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.125000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.250000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2088.00000  (clock units)  per operation (avg) 

density = 0.500000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 1.000000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

bogus 46313650 

@lemire
Author

lemire commented Mar 30, 2018

@vielmetti Interesting.
