-
Notifications
You must be signed in to change notification settings - Fork 349
Description
As everybody knows, not coherent cache is a serious complication we must deal with.
There are 2 ways of share data between cores:
- Using non-cached alias - read/write directly from L2 RAM
- Using a cached alias and a mechanism of writeback/invalidation to ensure coherent data between cores
Currently Zephyr uses solution 1, SOF uses solution 2
Shared memory in general has a lot of advantages - it is easy to use, the code is "obvious correct" instead of "correct". The structures may be used by several cores at the same time (standard multithread rules for sharing data apply of course)
The question is - how much it is slower?
I made some measurements on MTL RVP - access to the data using several methods:
- not shared data. 2 ways were measured:
- cached memory ensuring cache hit
- using coherent.h in not-shared mode
- shared data - 3 ways were measured:
- not cached memory alias
- using coherent.h in shared mode
- manual invalidation before operations / writeback after operations
all results in CPU cycles
NOT SHARED MEMORY
| ID | desc | Cached (hit) | coherent.h (not shared) |
|---|---|---|---|
| 1 | read single dword | 30 | 45 |
| 2 | read double dword | 31 | 44 |
| 3 | read cacheline (16 dword) | 80 | 98 |
| 4 | read 64dwords | 226 | 257 |
| 5 | read 1024dword | 3132 | 3449 |
| 6 | read twice 1024 | 6234 | 6852 |
| 7 | write single dword | 30 | 44 |
| 8 | write double dword | 31 | 43 |
| 9 | write cacheline (16dword) | 84 | 43 |
| 10 | write 64 dwords | 195 | 211 |
| 11 | wrtite 1024 dwords | 2632 | 2748 |
| 12 | twice write 1024 dwords | 5535 | 5749 |
no surprise here - using of direct access to memory is faster
SHARED MEMORY
| ID | desc | uncahced | cached (with inv/wb) | coherent.h (shared) |
|---|---|---|---|---|
| 1 | read single dword | 35 | 38 | 119 |
| 2 | read double dword | 35 | 39 | 119 |
| 3 | read cacheline (16 dword) | 115 | 87 | 175 |
| 4 | read 64dwords | 344 | 234 | 336 |
| 5 | read 1024dword | 5034 | 3177 | 3575 |
| 6 | read twice 1024 | 10036 | 6279 | 6972 |
| 7 | write single dword | 32 | 37 | 121 |
| 8 | write double dword | 32 | 40 | 119 |
| 9 | write cacheline (16dword) | 76 | 81 | 172 |
| 10 | write 64 dwords | 200 | 208 | 302 |
| 11 | wrtite 1024 dwords | 2636 | 2749 | 2955 |
!!!!!!!!!!!!!!!!!! USING OF UNSHARED MEMORY ALIAS IS FASTER THAN COHERENT.H UNLESS WE WANT TO ACCESS MORE THAN 64 DWORDS!!!!!!!!!!!!!!!!!!
"64 dwords" means - less than 64 accesses from acquire till release. Probably 99% (or even 100%) of all accesses to structures like comp_buffer are like this.
I don't mean access to the audio samples in buffers - only to control structures common usage of buffer is:
sink_c = buffer_acquire(sink);
min_free_frames = MIN(min_free_frames, audio_stream_get_free_frames(&sink_c->stream));
buffer_release(sink_c);
in fact the worse case - 1 dword access
CONCLUSIONS
My proposition is to stop using coherent.h for comp_buffer/audio_stream, instead create those structures in shared (non-cached) or not-shared(cached) memory
Good news is that we do know if the buffer will be shared or not at allocation time. In current code marking a buffer as shared is performed by calling coherent_shared_thread, in my solution it would be by choosing proper memory alias
- cross core safety - exactly same as in current solution. Currently if a buffer would be used on several cores without marking it as shared the system will crash. Same in my approach - cached must be used on single core exclusively
- performance - better (!!!!) than currently
- in case of not shared buffers the gain of performance is obvoius
- even in case of shared buffers - probably 99% (or even 100%) of accesses will be smaller than 64 dword operations.
- shared buffers are rare in the system - currently 100% of buffers are not shared. Shared buffers will be used on cross core connected pipelines and in cross-core DP module ONLY.
- so - we do gain performance in every case
- code is WAY simpler and WAY more easy to develop/debug. That means - less cache related defect in the future (and those defects are extremely hard to track)
Of course the above don't apply to audio samples in buffers - access to sample streams must be cached with proper inv/writebacks. BUT currently buffer_acquire and buffer_release DO NOT do any cache releated operations on audio buffers themselves. So change I'm proposing does not affect it at all.