### 🐛 Describe the bug

`vllm serve meta-llama/Llama-3.2-11B-Vision` fails on vllm 0.11.0 with the error `'MllamaProcessor' object has no attribute '_get_num_multimodal_tokens'`. It works on vllm 0.10.2.

Upon debugging, vLLM 0.11's Transformers backend expects the HF processor to implement a method called `_get_num_multimodal_tokens`, which is not implemented for Mllama in transformers 4.57.1.

### Run Log

```
INFO 10-20 14:22:27 [__init__.py:216] Automatically detected platform cuda.
INFO 10-20 14:22:29 [api_server.py:1839] vLLM API server version 0.11.0
INFO 10-20 14:22:29 [utils.py:233] non-default args: {'model_tag': 'meta-llama/Llama-3.2-11B-Vision', 'model': 'meta-llama/Llama-3.2-11B-Vision'}
INFO 10-20 14:22:30 [model.py:547] Resolved architecture: TransformersForMultimodalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-20 14:22:30 [model.py:1510] Using max model len 131072
INFO 10-20 14:22:30 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 10-20 14:22:31 [utils.py:184] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
INFO 10-20 14:22:34 [__init__.py:216] Automatically detected platform cuda.
INFO 10-20 14:22:36 [core.py:644] Waiting for init message from front-end.
INFO 10-20 14:22:36 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='meta-llama/Llama-3.2-11B-Vision', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-3.2-11B-Vision, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
WARNING 10-20 14:22:36 [utils.py:184] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 10-20 14:22:37 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-20 14:22:37 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 08868a34-7fd8-4913-9d34-ff76e6d38ce8)')' thrown while requesting HEAD https://huggingface.co/meta-llama/Llama-3.2-11B-Vision/resolve/main/preprocessor_config.json
Retrying in 1s [Retry 1/5].
ERROR 10-20 14:22:55 [core.py:708] EngineCore failed to start.
ERROR 10-20 14:22:55 [core.py:708] Traceback (most recent call last):
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
ERROR 10-20 14:22:55 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
ERROR 10-20 14:22:55 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 10-20 14:22:55 [core.py:708] self.model_executor = executor_class(vllm_config)
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
ERROR 10-20 14:22:55 [core.py:708] self._init_executor()
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 54, in _init_executor
ERROR 10-20 14:22:55 [core.py:708] self.collective_rpc("init_device")
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
ERROR 10-20 14:22:55 [core.py:708] return [run_method(self.driver_worker, method, args, kwargs)]
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
ERROR 10-20 14:22:55 [core.py:708] return func(*args, **kwargs)
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 259, in init_device
ERROR 10-20 14:22:55 [core.py:708] self.worker.init_device() # type: ignore
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 201, in init_device
ERROR 10-20 14:22:55 [core.py:708] self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 421, in __init__
ERROR 10-20 14:22:55 [core.py:708] self.mm_budget = MultiModalBudget(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 48, in __init__
ERROR 10-20 14:22:55 [core.py:708] .get_max_tokens_per_item_by_nonzero_modality(model_config,
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_nonzero_modality
ERROR 10-20 14:22:55 [core.py:708] max_tokens_per_item = self.get_max_tokens_per_item_by_modality(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 143, in get_max_tokens_per_item_by_modality
ERROR 10-20 14:22:55 [core.py:708] return profiler.get_mm_max_contiguous_tokens(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/profiling.py", line 282, in get_mm_max_contiguous_tokens
ERROR 10-20 14:22:55 [core.py:708] return self._get_mm_max_tokens(seq_len,
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/profiling.py", line 255, in _get_mm_max_tokens
ERROR 10-20 14:22:55 [core.py:708] max_tokens_per_item = self.processing_info.get_mm_max_tokens_per_item(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 226, in get_mm_max_tokens_per_item
ERROR 10-20 14:22:55 [core.py:708] return {"image": self.get_max_image_tokens()}
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 233, in get_max_image_tokens
ERROR 10-20 14:22:55 [core.py:708] mm_tokens = processor._get_num_multimodal_tokens(
ERROR 10-20 14:22:55 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-20 14:22:55 [core.py:708] AttributeError: 'MllamaProcessor' object has no attribute '_get_num_multimodal_tokens'
Process EngineCore_DP0:
Traceback (most recent call last):
File "anaconda3/envs/vllm_env/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "anaconda3/envs/vllm_env/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
raise e
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
self._init_executor()
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 54, in _init_executor
self.collective_rpc("init_device")
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
return [run_method(self.driver_worker, method, args, kwargs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 259, in init_device
self.worker.init_device() # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 201, in init_device
self.model_runner: GPUModelRunner = GPUModelRunner(
^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 421, in __init__
self.mm_budget = MultiModalBudget(
^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 48, in __init__
.get_max_tokens_per_item_by_nonzero_modality(model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_nonzero_modality
max_tokens_per_item = self.get_max_tokens_per_item_by_modality(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 143, in get_max_tokens_per_item_by_modality
return profiler.get_mm_max_contiguous_tokens(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/profiling.py", line 282, in get_mm_max_contiguous_tokens
return self._get_mm_max_tokens(seq_len,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/multimodal/profiling.py", line 255, in _get_mm_max_tokens
max_tokens_per_item = self.processing_info.get_mm_max_tokens_per_item(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 226, in get_mm_max_tokens_per_item
return {"image": self.get_max_image_tokens()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 233, in get_max_image_tokens
mm_tokens = processor._get_num_multimodal_tokens(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'MllamaProcessor' object has no attribute '_get_num_multimodal_tokens'
[rank0]:[W1020 14:22:56.374535045 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "anaconda3/envs/vllm_env/bin/vllm", line 7, in <module>
sys.exit(main())
^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 57, in cmd
uvloop.run(run_server(args))
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
return await main
^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
async with build_async_engine_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1572, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
return cls(
^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
super().__init__(
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda3/envs/vllm_env/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
wait_for_engine_startup(
File "anaconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

### Sanity Check
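As a sanity check, the missing attribute can be reproduced directly against the HF processor, without starting the vLLM server. The snippet below is an illustrative sketch (not from the original report), assuming `transformers==4.57.1` and authenticated access to the gated `meta-llama` repo:

```python
# Illustrative sanity check: the processor class that transformers 4.57.1
# resolves for this checkpoint has no `_get_num_multimodal_tokens`, which is
# the hook vLLM 0.11's Transformers backend calls when profiling multimodal
# token budgets (see `model_executor/models/transformers.py` in the traceback).
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

print(type(processor).__name__)                          # MllamaProcessor
print(hasattr(processor, "_get_num_multimodal_tokens"))  # False on 4.57.1
```

The failure appears to occur during engine-core initialization, before model loading, so downgrading with `pip install vllm==0.10.2` remains a working stopgap, per the version comparison above.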