[RFC] Direct use of cached / not-cached shared memory alias instead of coherent.h in case of buffers

As everybody knows, not coherent cache is a serious complication we must deal with. 

There are 2 ways of share data between cores:
1) Using non-cached alias - read/write directly from L2 RAM 
2) Using a cached alias and a mechanism of writeback/invalidation to ensure coherent data between cores

Currently Zephyr uses solution 1, SOF uses solution 2

Shared memory in general has a lot of advantages - it is easy to use, the code is "obvious correct" instead of "correct". The structures may be used by several cores at the same time (standard multithread rules for sharing data apply of course) 

The question is - **how much it is slower?** 

I made some measurements on MTL RVP - access to the data using several methods:
1) not shared data. 2 ways were measured:
 - cached memory ensuring cache hit 
 - using coherent.h in not-shared mode

2) shared data - 3 ways were measured:
- not cached memory alias
- using coherent.h in shared mode
- manual invalidation before operations / writeback after operations

all results in CPU cycles

NOT SHARED MEMORY

|ID|desc|Cached (hit)|coherent.h (not shared)|
|--|--|--|--|
|1|read single dword|30|45|
|2|read double dword|31|44|
|3|read cacheline (16 dword)|80|98|
|4|read 64dwords|226|257|
|5|read 1024dword|3132|3449|
|6|read twice 1024|6234|6852|
|7|write single dword|30|44|
|8|write double dword|31|43|
|9|write cacheline (16dword)|84|43|
|10|write 64 dwords|195|211|
|11|wrtite 1024 dwords|2632|2748|
|12|twice write 1024 dwords|5535|5749|


no surprise here - using of direct access to memory is faster


SHARED MEMORY

|ID|desc|uncahced|cached (with inv/wb)|coherent.h (shared)|
|--|--|--|--|--|
|1|read single dword|35|38|119|
|2|read double dword|35|39|119|
|3|read cacheline (16 dword)|115|87|175|
|4|read 64dwords|344|234|336|
|5|read 1024dword|5034|3177|3575|
|6|read twice 1024|10036|6279|6972|
|7|write single dword|32|37|121|
|8|write double dword|32|40|119|
|9|write cacheline (16dword)|76|81|172|
|10|write 64 dwords|200|208|302|
|11|wrtite 1024 dwords|2636|2749|2955|


!!!!!!!!!!!!!!!!!! USING OF UNSHARED MEMORY ALIAS IS FASTER THAN COHERENT.H UNLESS WE WANT TO ACCESS MORE THAN 64 DWORDS!!!!!!!!!!!!!!!!!! 

"64 dwords" means - less than 64 accesses from acquire till release. Probably 99% (or even 100%) of all accesses to structures like comp_buffer are like this. 
**I don't mean access to the audio samples in buffers - only to control structures** common usage of buffer is:
```
sink_c = buffer_acquire(sink);
min_free_frames = MIN(min_free_frames, audio_stream_get_free_frames(&sink_c->stream));
buffer_release(sink_c);
```
in fact the worse case - 1 dword access

**CONCLUSIONS**

My proposition is to stop using coherent.h for comp_buffer/audio_stream, instead create those structures in shared (non-cached) or not-shared(cached) memory

Good news is that we do know if the buffer will be shared or not at allocation time. In current code marking a buffer as shared is performed by calling `coherent_shared_thread`, in my solution it would be by choosing proper memory alias

1) cross core safety - exactly same as in current solution. Currently if a buffer would be used on several cores without marking it as shared the system will crash. Same in my approach - cached must be used on single core exclusively
2) performance - better (!!!!) than currently 
 - in case of not shared buffers the gain of performance is obvoius
 - even in case of shared buffers - probably 99% (or even 100%) of accesses will be smaller than 64 dword operations. 
- shared buffers are rare in the system - currently 100% of buffers are not shared. Shared buffers will be used on cross core connected pipelines and in cross-core DP module ONLY. 
 - so - we do gain performance in every case
3) code is WAY simpler and WAY more easy to develop/debug. That means - less cache related defect in the future (and those defects are extremely hard to track)


Of course the above don't apply to audio samples in buffers - access to sample streams must be cached with proper inv/writebacks. BUT currently buffer_acquire and buffer_release DO NOT do any cache releated operations on audio buffers themselves. So change I'm proposing does not affect it at all. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Direct use of cached / not-cached shared memory alias instead of coherent.h in case of buffers #8006

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ID	desc	Cached (hit)	coherent.h (not shared)
1	read single dword	30	45
2	read double dword	31	44
3	read cacheline (16 dword)	80	98
4	read 64dwords	226	257
5	read 1024dword	3132	3449
6	read twice 1024	6234	6852
7	write single dword	30	44
8	write double dword	31	43
9	write cacheline (16dword)	84	43
10	write 64 dwords	195	211
11	wrtite 1024 dwords	2632	2748
12	twice write 1024 dwords	5535	5749

[RFC] Direct use of cached / not-cached shared memory alias instead of coherent.h in case of buffers #8006

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions