Conversation
    commands:
      - uv pip install vllm==0.8.5.post1
      - uv pip install setuptools
      - uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
(Minor) I guess a simple `pip install` will try to build the wheel? Isn't there a way to prevent that without hardcoding the wheel URL?
@peterschmidt85
The flash_attn URL has a specific ABI flag (`cxx11abiFALSE`). The ABI flag can be TRUE or FALSE, and the torch package installed by vllm==0.8.5.post1 needs FALSE.
To check which one we need, run
`python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"`, which prints either `True` or `False`.
As far as I remember, when I did `uv pip install flash_attn==2.7.4` I got an undefined symbol error:

    ImportError: /root/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
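One way to avoid hardcoding a single wheel URL would be to pick the wheel variant based on the detected ABI flag. A minimal sketch, assuming the wheel filenames on the flash-attention release page follow the pattern shown above (the helper function name is hypothetical):

```shell
# Hypothetical helper: print the flash-attn wheel URL matching the given ABI flag ("True" or "False").
pick_flash_attn_wheel() {
  abi="$1"
  base=https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1
  if [ "$abi" = "True" ]; then
    echo "$base/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
  else
    echo "$base/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"
  fi
}

# Usage (detection requires torch to already be installed):
# uv pip install "$(pick_flash_attn_wheel "$(python -c 'import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)')")"
```

This still pins the flash-attn version and CUDA/torch tags, so it only removes the ABI guesswork, not the version coupling.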
    trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0
    else
      # Training node - adjust world size and nodes count for training
      GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
We already have the built-in `DSTACK_GPUS_PER_NODE`, which is calculated the same way.
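For illustration, a small sketch (with hypothetical values standing in for the real dstack-provided environment) showing that the manual calculation matches the built-in variable:

```shell
# Hypothetical values for illustration: 2 nodes, 8 GPUs total.
DSTACK_GPUS_NUM=8
DSTACK_NODES_NUM=2
DSTACK_GPUS_PER_NODE=4   # dstack provides this built-in directly

# The manual calculation in the diff...
GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
# ...is equivalent to just using the built-in:
[ "$GPUS_PER_NODE" = "$DSTACK_GPUS_PER_NODE" ] && echo "equivalent"
```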
    - uv pip install .
    - |
      # Get the last IP from DSTACK_NODES_IPS for vLLM node
      VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
Shouldn't we move this under `if [ "$USE_VLLM" = "true" ]; then`?
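A sketch of the suggested restructuring, with example values standing in for the variables that the task environment would normally provide:

```shell
# Example values; in the real task these come from the environment / dstack.
USE_VLLM=${USE_VLLM:-true}
DSTACK_NODES_IPS=${DSTACK_NODES_IPS:-"10.0.0.1 10.0.0.2 10.0.0.3"}

if [ "$USE_VLLM" = "true" ]; then
  # Only compute the vLLM host when vLLM is actually enabled;
  # the last node in the list is reserved for the vLLM server.
  VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
fi
echo "$VLLM_HOST"
```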
    ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
    ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
    # Other nodes run training
    echo "Starting training with VLLM on $VLLM_HOST"
Do we need this echo? Just thinking of simplifying the configuration. Same for the echo above...
The echo is not necessary. We can remove it to simplify the configuration.
    shm_size: 128GB

    volumes:
      - /checkpoints:/checkpoints
(Minor) Just curious: given that vLLM may run on a random node, would checkpoint recovery just work?
@peterschmidt85 This is a very interesting question. I think it theoretically should not work if the nodes are shuffled, because the node on which vLLM runs is not recognized by `accelerate`: the vLLM node is not within accelerate's world size.
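To illustrate with hypothetical values: the training world size is derived from the adjusted node count in the diff above, so the vLLM node's GPUs never enter it, and its local `/checkpoints` volume is invisible to the training ranks.

```shell
# Hypothetical cluster: 3 nodes, 8 GPUs each; one node is reserved for vLLM.
DSTACK_NODES_NUM=3
GPUS_PER_NODE=8

ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))                   # 2 training nodes
ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))   # accelerate world size: 16, not 24
echo "$ADJUSTED_GPUS_TOTAL"
```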
This PR is stale because it has been open for 14 days with no activity.