Skip to content

[RFC] Direct use of cached / not-cached shared memory alias instead of coherent.h in case of buffers #8006

@marcinszkudlinski

Description

@marcinszkudlinski

As everybody knows, not coherent cache is a serious complication we must deal with.

There are 2 ways of share data between cores:

  1. Using non-cached alias - read/write directly from L2 RAM
  2. Using a cached alias and a mechanism of writeback/invalidation to ensure coherent data between cores

Currently Zephyr uses solution 1, SOF uses solution 2

Shared memory in general has a lot of advantages - it is easy to use, the code is "obvious correct" instead of "correct". The structures may be used by several cores at the same time (standard multithread rules for sharing data apply of course)

The question is - how much it is slower?

I made some measurements on MTL RVP - access to the data using several methods:

  1. not shared data. 2 ways were measured:
  • cached memory ensuring cache hit
  • using coherent.h in not-shared mode
  1. shared data - 3 ways were measured:
  • not cached memory alias
  • using coherent.h in shared mode
  • manual invalidation before operations / writeback after operations

all results in CPU cycles

NOT SHARED MEMORY

ID desc Cached (hit) coherent.h (not shared)
1 read single dword 30 45
2 read double dword 31 44
3 read cacheline (16 dword) 80 98
4 read 64dwords 226 257
5 read 1024dword 3132 3449
6 read twice 1024 6234 6852
7 write single dword 30 44
8 write double dword 31 43
9 write cacheline (16dword) 84 43
10 write 64 dwords 195 211
11 wrtite 1024 dwords 2632 2748
12 twice write 1024 dwords 5535 5749

no surprise here - using of direct access to memory is faster

SHARED MEMORY

ID desc uncahced cached (with inv/wb) coherent.h (shared)
1 read single dword 35 38 119
2 read double dword 35 39 119
3 read cacheline (16 dword) 115 87 175
4 read 64dwords 344 234 336
5 read 1024dword 5034 3177 3575
6 read twice 1024 10036 6279 6972
7 write single dword 32 37 121
8 write double dword 32 40 119
9 write cacheline (16dword) 76 81 172
10 write 64 dwords 200 208 302
11 wrtite 1024 dwords 2636 2749 2955

!!!!!!!!!!!!!!!!!! USING OF UNSHARED MEMORY ALIAS IS FASTER THAN COHERENT.H UNLESS WE WANT TO ACCESS MORE THAN 64 DWORDS!!!!!!!!!!!!!!!!!!

"64 dwords" means - less than 64 accesses from acquire till release. Probably 99% (or even 100%) of all accesses to structures like comp_buffer are like this.
I don't mean access to the audio samples in buffers - only to control structures common usage of buffer is:

sink_c = buffer_acquire(sink);
min_free_frames = MIN(min_free_frames, audio_stream_get_free_frames(&sink_c->stream));
buffer_release(sink_c);

in fact the worse case - 1 dword access

CONCLUSIONS

My proposition is to stop using coherent.h for comp_buffer/audio_stream, instead create those structures in shared (non-cached) or not-shared(cached) memory

Good news is that we do know if the buffer will be shared or not at allocation time. In current code marking a buffer as shared is performed by calling coherent_shared_thread, in my solution it would be by choosing proper memory alias

  1. cross core safety - exactly same as in current solution. Currently if a buffer would be used on several cores without marking it as shared the system will crash. Same in my approach - cached must be used on single core exclusively
  2. performance - better (!!!!) than currently
  • in case of not shared buffers the gain of performance is obvoius
  • even in case of shared buffers - probably 99% (or even 100%) of accesses will be smaller than 64 dword operations.
  • shared buffers are rare in the system - currently 100% of buffers are not shared. Shared buffers will be used on cross core connected pipelines and in cross-core DP module ONLY.
  • so - we do gain performance in every case
  1. code is WAY simpler and WAY more easy to develop/debug. That means - less cache related defect in the future (and those defects are extremely hard to track)

Of course the above don't apply to audio samples in buffers - access to sample streams must be cached with proper inv/writebacks. BUT currently buffer_acquire and buffer_release DO NOT do any cache releated operations on audio buffers themselves. So change I'm proposing does not affect it at all.

Metadata

Metadata

Assignees

No one assigned

    Labels

    MTLApplies to Meteor Lake platformdp_schedulerenhancementNew feature or requestperformanceIssues related to code performance and speed optimizations.pipeline2.0

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions