Very slow inference speed #21

@najlasadek

Description:

1) I have been running inference on 10 rows (9 labeled examples and 1 target example) to test out the code, and it takes approximately 8 minutes to infer a single entry (one column of the target example). This seems quite slow given such a small number of examples (see the rough timing sketch after this list).

2) I have also modified the code to use multiple GPUs. In that case GPU utilization is inefficient, as you can see in the nvidia-smi output below the code.
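For reference, this is roughly how I time a single prediction (a minimal sketch; inference_model, target_example, labeled_examples, and full_data are built exactly as in the repro code below):

import time

start = time.perf_counter()
output = inference_model.predict(
    target_example=target_example,
    target_colname="Cover_Type",
    target_choices=full_data["Cover_Type"].unique().tolist(),
    labeled_examples=labeled_examples,
)
print(f"Single prediction took {time.perf_counter() - start:.1f} s")  # ~480 s here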

Reproduction steps:

1) When using a single GPU, I run the default inference notebook:
inference.ipynb

2) And here is the modified multi-GPU code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3,4,5,6,7"  # expose all 8 GPUs

import pandas as pd
import torch
from transformers import AutoTokenizer, LlamaForCausalLM, AutoConfig

from rtfm.configs import TrainConfig, TokenizerConfig, SerializerConfig
from rtfm.inference_utils import InferenceModel
from rtfm.serialization.serializers import get_serializer
from rtfm.tokenization.text import prepare_tokenizer

train_config = TrainConfig(model_name="models/tabula-8b", context_length=8192)

# If using a base llama model (not fine-tuned TabuLa),
# make sure to set add_serializer_tokens=False
# (because we do not want to use special tokens for
# the base model which is not trained on them).
tokenizer_config = TokenizerConfig()
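# For a base Llama model the flag mentioned above would be disabled; hedged
# example, assuming TokenizerConfig accepts it as a constructor argument:
# tokenizer_config = TokenizerConfig(add_serializer_tokens=False)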

# Load the configuration
config = AutoConfig.from_pretrained(train_config.model_name, use_auth_token='')

# Set torch_dtype to bfloat16, matching the TabuLa train/eval setup
config.torch_dtype = 'bfloat16'

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.empty_cache()

model = LlamaForCausalLM.from_pretrained(
    train_config.model_name,
    device_map="auto",    # shards layers across all visible GPUs
    config=config,
    load_in_8bit=True,    # bitsandbytes 8-bit quantization
)
torch.cuda.empty_cache()
tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)
serializer = get_serializer(SerializerConfig())
print(model)
tokenizer, model = prepare_tokenizer(
    model,
    tokenizer=tokenizer,
    pretrained_model_name_or_path=train_config.model_name,
    model_max_length=train_config.context_length,
    use_fast_tokenizer=tokenizer_config.use_fast_tokenizer,
    serializer_tokens_embed_fn=tokenizer_config.serializer_tokens_embed_fn,
    serializer_tokens=serializer.special_tokens
    if tokenizer_config.add_serializer_tokens
    else None,
)

inference_model = InferenceModel(model=model, tokenizer=tokenizer, serializer=serializer)

from ucimlrepo import fetch_ucirepo

# Fetch Covertype dataset
covertype = fetch_ucirepo(id=31)

# Combine features and targets into a DataFrame
full_data = covertype.data.features.copy()
full_data['Cover_Type'] = covertype.data.targets

# Sample labeled (in-context) examples
labeled_examples = full_data.sample(n=10).reset_index(drop=True)

# Display labeled examples
print("Labeled Examples:")
print(labeled_examples)



# Sample one target example to predict
target_example = full_data.sample(n=1).reset_index(drop=True)
output = inference_model.predict(
    target_example=target_example,
    target_colname="Cover_Type",
    target_choices=full_data["Cover_Type"].unique().tolist(),
    labeled_examples=labeled_examples,
)
print(f"Prediction for sample \n {target_example} \n is: {output}")

[nvidia-smi screenshot showing uneven utilization across the 8 GPUs]
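To quantify the imbalance in text form, something like this prints per-GPU memory use after the model is loaded (a sketch; compute utilization itself still needs nvidia-smi or pynvml):

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 2**30
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    print(f"GPU {i}: {alloc:.1f} / {total:.1f} GiB allocated")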

OS and hardware

- Server with 8x NVIDIA A40 GPUs
- Ubuntu 22.04.5 LTS
