Skip to content

[Celestica][Ladakh800bcls] Fix PAI concurrency crash by passing in sync callbacks#889

Open
gang-tao wants to merge 1 commit into
facebook:mainfrom
gang-tao:ladakh800bcls_retimer_dev2
Open

[Celestica][Ladakh800bcls] Fix PAI concurrency crash by passing in sync callbacks#889
gang-tao wants to merge 1 commit into
facebook:mainfrom
gang-tao:ladakh800bcls_retimer_dev2

Conversation

@gang-tao
Copy link
Copy Markdown
Contributor

@gang-tao gang-tao commented Jan 31, 2026

Pre-submission checklist

  • [✓ ] I've ran the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • [✓ ] pre-commit run
    clang-format.............................................................Passed
    black................................................(no files to check)Skipped
    shellcheck...........................................(no files to check)Skipped
    shfmt................................................(no files to check)Skipped
    trim trailing whitespace.................................................Passed
    fix end of files.........................................................Passed
    check yaml...........................................(no files to check)Skipped
    check json...........................................(no files to check)Skipped
    check for merge conflicts................................................Passed
    ruff check...........................................(no files to check)Skipped

Summary

The Broadcom PAI implementation is not thread-safe, which can lead to crashes when multiple threads attempt to initialize PHYs concurrently.

Motivation

This change fixes the race condition by providing the PAI layer with a thread-safe locking mechanism from the agent. Two new SAI extension attributes, SyncLock and SyncUnlock, are introduced to pass pointers to lock and unlock functions. These functions wrap a global mutex, allowing the PAI layer to serialize access to its critical sections and prevent crashes.

The change is enabled only when SAI_BRCM_PAI_IMPL is defined, ensuring no impact on other SAI implementations.

Currently, both a global lock and per-XPHY locks have been implemented. The per-XPHY lock is enabled by default to ensure high concurrency. The global lock can be enabled by configuring enable_xphy_global_lock.

Test Plan

In addition to the initial testing, we have run qsfp_hw_test 100 consecutive times with a 100% pass rate. It is currently being used across multiple project testbeds, and no issues have been found.

build

targets fboss_hw_agent-sai_impl, multi_switch_agent_hw_test, sai_agent_hw_test-sai_impl, qsfp_hw_test, qsfp_service build passed. This ensures the definition SAI_BRCM_PAI_IMPL isolates SAI and PAI, preventing them from affecting each other.
note: need brcm_pai_extensions.h in PAI, saved in /var/FBOSS/pai_impl/include/sai/experimental/brcm_pai_extensions.h

run scripts as below

ulimit -n 65535
for ((i=0; i<10; i++) do
for filter in EmptyHwTest.CheckInit; do
  TS=$(date +%Y%m%d_%H%M%S)
  rm /dev/shm/fboss/qsfp_service/can_warm_boot
  LD_LIBRARY_PATH=/opt/fboss/lib/ /root/gangtao/img/qsfp_hw_test --port-manager-mode --qsfp-config /root/gangtao/cfg/qsfpfull2.conf --gtest_filter=${filter} --enable_sai_log DEBUG --sai_log /root/gangtao/log/sai_replayer_${TS}.log --enable_replayer=true --phy_config_file=/root/gangtao/cfg/retimer.conf --logging=DBG5  2>&1 | tee /root/gangtao/log/qsfp_hwtest_${filter}_${TS}.log
  echo log is /root/gangtao/log/qsfp_hwtest_${filter}_${TS}.log
done
done

result

-rw-r--r-- 1 root root           0 Jan 31 16:24 '=========2roundtest============'
-rw-r--r-- 1 root root     1172137 Jan 31 16:45  sai_replayer_20260131_163715.log
-rw-r--r-- 1 root root  2431063669 Jan 31 16:45  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_163715.log
-rw-r--r-- 1 root root     1172137 Jan 31 16:54  sai_replayer_20260131_164551.log
-rw-r--r-- 1 root root  2441761681 Jan 31 16:54  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_164551.log
-rw-r--r-- 1 root root     1172136 Jan 31 17:03  sai_replayer_20260131_165429.log
-rw-r--r-- 1 root root  2449867093 Jan 31 17:03  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_165429.log
-rw-r--r-- 1 root root     1172133 Jan 31 17:11  sai_replayer_20260131_170306.log
-rw-r--r-- 1 root root  2536055347 Jan 31 17:11  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_170306.log
-rw-r--r-- 1 root root     1172137 Jan 31 17:20  sai_replayer_20260131_171144.log
-rw-r--r-- 1 root root  2469150715 Jan 31 17:20  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_171144.log
-rw-r--r-- 1 root root     1172137 Jan 31 17:28  sai_replayer_20260131_172021.log
-rw-r--r-- 1 root root  2502823460 Jan 31 17:28  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_172021.log
-rw-r--r-- 1 root root     1172137 Jan 31 17:37  sai_replayer_20260131_172859.log
-rw-r--r-- 1 root root  2462251124 Jan 31 17:37  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_172859.log
-rw-r--r-- 1 root root     1172135 Jan 31 17:46  sai_replayer_20260131_173735.log
-rw-r--r-- 1 root root  2409597262 Jan 31 17:46  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_173735.log
-rw-r--r-- 1 root root     1172137 Jan 31 17:54  sai_replayer_20260131_174610.log
-rw-r--r-- 1 root root  2484741728 Jan 31 17:54  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_174610.log
-rw-r--r-- 1 root root     1172136 Jan 31 18:03  sai_replayer_20260131_175447.log
-rw-r--r-- 1 root root  2451277658 Jan 31 18:03  qsfp_hwtest_EmptyHwTest.CheckInit_20260131_175447.log
[root@localhost log]# ls -lrt ../log/ | sed -n '/2roundtest/,$p' | grep qsfp | awk '{print $9}' | xargs -n 1 grep '\[       OK \] EmptyHwTest.CheckInit'
[       OK ] EmptyHwTest.CheckInit (515956 ms)
[       OK ] EmptyHwTest.CheckInit (517313 ms)
[       OK ] EmptyHwTest.CheckInit (517547 ms)
[       OK ] EmptyHwTest.CheckInit (517416 ms)
[       OK ] EmptyHwTest.CheckInit (517482 ms)
[       OK ] EmptyHwTest.CheckInit (517300 ms)
[       OK ] EmptyHwTest.CheckInit (515789 ms)
[       OK ] EmptyHwTest.CheckInit (515057 ms)
[       OK ] EmptyHwTest.CheckInit (517277 ms)
[       OK ] EmptyHwTest.CheckInit (515753 ms)
[root@localhost log]# ls -lrt ../log/ | sed -n '/2roundtest/,$p' | grep qsfp | awk '{print $9}' | xargs -n 1 grep '\[       OK \] EmptyHwTest.CheckInit' | wc -l
10

@meta-cla meta-cla Bot added the CLA Signed label Jan 31, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Feb 2, 2026

@Protick666 has imported this pull request. If you are a Meta employee, you can view this in D92066754.

@Protick666
Copy link
Copy Markdown

Hello @gang-tao

Thanks for this change!

There is one issue: introducing a global mutex and passing it down to SAI makes all operations sequential—including switch initialization, firmware downloads, and port programming.

Sequential FW downloads are quite slow, which will become even more problematic as we move toward a 75% retimer design. This approach likely won't scale.

Could you explore the possibility of using a separate mutex per xphy? From my review of the PAI source code, this might be feasible.
Alternatively, do you see any critical paths that would require a single global mutex across all phys?

Let me know your thoughts!

@gang-tao gang-tao changed the title Ladakh800bcls: Fix PAI concurrency crash by passing in sync callbacks [Celestica][Ladakh800bcls] Fix PAI concurrency crash by passing in sync callbacks Feb 27, 2026
@gang-tao gang-tao force-pushed the ladakh800bcls_retimer_dev2 branch from fa0aeed to 190e09e Compare May 22, 2026 05:06
@gang-tao gang-tao requested review from a team as code owners May 22, 2026 05:06
@facebook-github-tools
Copy link
Copy Markdown

@gang-tao has updated the pull request. You must reimport the pull request before landing.

@gang-tao
Copy link
Copy Markdown
Contributor Author

gang-tao commented May 22, 2026

Hello @gang-tao

Thanks for this change!

There is one issue: introducing a global mutex and passing it down to SAI makes all operations sequential—including switch initialization, firmware downloads, and port programming.

Sequential FW downloads are quite slow, which will become even more problematic as we move toward a 75% retimer design. This approach likely won't scale.

Could you explore the possibility of using a separate mutex per xphy? From my review of the PAI source code, this might be feasible. Alternatively, do you see any critical paths that would require a single global mutex across all phys?

Let me know your thoughts!

Hi @Protick666
I have updated the implementation to support both the per-xphy lock and the global lock. it has been tested on PAI14.1+patch. Please let me know if you have any further comments.
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants