Very slow inference speed #21

@najlasadek

Description:

1) I have been running inference on 10 rows (9 labeled examples and 1 target example) to test out the code, and it takes approximately 8 minutes to infer a single entry (one column of the target example). This seems quite slow given such a small number of examples (see the rough timing sketch after this list).

2) I have also modified the code to use multiple GPUs. In that case GPU utilization is inefficient, as you can see in the nvidia-smi output below the code.
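For reference, this is roughly how I time a single prediction (a minimal sketch; inference_model, target_example, labeled_examples, and full_data are built exactly as in the repro code below):

import time

start = time.perf_counter()
output = inference_model.predict(
    target_example=target_example,
    target_colname="Cover_Type",
    target_choices=full_data["Cover_Type"].unique().tolist(),
    labeled_examples=labeled_examples,
)
print(f"Single prediction took {time.perf_counter() - start:.1f} s")  # ~480 s here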

Reproduction steps:

1) When using a single GPU, I run the default inference notebook:
inference.ipynb

2) And here is the modified multi-GPU code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3,4,5,6,7"  # expose all 8 GPUs

import pandas as pd
import torch
from transformers import AutoTokenizer, LlamaForCausalLM, AutoConfig

from rtfm.configs import TrainConfig, TokenizerConfig, SerializerConfig
from rtfm.inference_utils import InferenceModel
from rtfm.serialization.serializers import get_serializer
from rtfm.tokenization.text import prepare_tokenizer

train_config = TrainConfig(model_name="models/tabula-8b", context_length=8192)

# If using a base llama model (not fine-tuned TabuLa),
# make sure to set add_serializer_tokens=False
# (because we do not want to use special tokens for
# the base model which is not trained on them).
tokenizer_config = TokenizerConfig()
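# For a base Llama model the flag mentioned above would be disabled; hedged
# example, assuming TokenizerConfig accepts it as a constructor argument:
# tokenizer_config = TokenizerConfig(add_serializer_tokens=False)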

# Load the configuration
config = AutoConfig.from_pretrained(train_config.model_name, use_auth_token='')

# Set torch_dtype to bfloat16, matching the TabuLa train/eval setup
config.torch_dtype = 'bfloat16'

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.empty_cache()

model = LlamaForCausalLM.from_pretrained(
    train_config.model_name,
    device_map="auto",    # shards layers across all visible GPUs
    config=config,
    load_in_8bit=True,    # bitsandbytes 8-bit quantization
)
torch.cuda.empty_cache()
tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)
serializer = get_serializer(SerializerConfig())
print(model)
tokenizer, model = prepare_tokenizer(
    model,
    tokenizer=tokenizer,
    pretrained_model_name_or_path=train_config.model_name,
    model_max_length=train_config.context_length,
    use_fast_tokenizer=tokenizer_config.use_fast_tokenizer,
    serializer_tokens_embed_fn=tokenizer_config.serializer_tokens_embed_fn,
    serializer_tokens=serializer.special_tokens
    if tokenizer_config.add_serializer_tokens
    else None,
)

inference_model = InferenceModel(model=model, tokenizer=tokenizer, serializer=serializer)

from ucimlrepo import fetch_ucirepo

# Fetch Covertype dataset
covertype = fetch_ucirepo(id=31)

# Combine features and targets into a DataFrame
full_data = covertype.data.features.copy()
full_data['Cover_Type'] = covertype.data.targets

# Sample labeled (in-context) examples
labeled_examples = full_data.sample(n=10).reset_index(drop=True)

# Display labeled examples
print("Labeled Examples:")
print(labeled_examples)



# Sample one target example to predict
target_example = full_data.sample(n=1).reset_index(drop=True)
output = inference_model.predict(
    target_example=target_example,
    target_colname="Cover_Type",
    target_choices=full_data["Cover_Type"].unique().tolist(),
    labeled_examples=labeled_examples,
)
print(f"Prediction for sample \n {target_example} \n is: {output}")

[nvidia-smi screenshot showing uneven utilization across the 8 GPUs]
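To quantify the imbalance in text form, something like this prints per-GPU memory use after the model is loaded (a sketch; compute utilization itself still needs nvidia-smi or pynvml):

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 2**30
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    print(f"GPU {i}: {alloc:.1f} / {total:.1f} GiB allocated")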

OS and hardware

- Server with 8x NVIDIA A40 GPUs
- Ubuntu 22.04.5 LTS
