Description:
1) I have been running inference on 10 rows (9 labeled examples and 1 target example) to test the code, and the process takes approximately 8 minutes to infer a single entry (one column of the target example). This seems quite slow given the small number of examples.
2) I have also modified the code to make use of multiple GPUs. In this case, GPU utilization is inefficient, as you can see in the picture below the code.
Reproduction steps:
1) When using 1 GPU, I run the default inference notebook:
inference.ipynb
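For reference, the single-GPU load looks roughly like this (a minimal sketch inferred from the multi-GPU code below, not the exact notebook contents; it skips device_map="auto" and 8-bit loading):
import torch
from transformers import AutoConfig, AutoTokenizer, LlamaForCausalLM
from rtfm.configs import TrainConfig

train_config = TrainConfig(model_name="models/tabula-8b", context_length=8192)
config = AutoConfig.from_pretrained(train_config.model_name)
config.torch_dtype = 'bfloat16'

# Single-GPU baseline: full bfloat16 weights on one device, no sharding or quantization.
model = LlamaForCausalLM.from_pretrained(
    train_config.model_name, config=config, torch_dtype=torch.bfloat16
).to("cuda:0")
tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)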
2) And here is the modified multi-GPU code:
import os
os.environ['CUDA_VISIBLE_DEVICES']="0,1,2,3,4,5,6,7"
import pandas as pd
import torch
from transformers import AutoTokenizer, LlamaForCausalLM, AutoConfig
from rtfm.configs import TrainConfig, TokenizerConfig, SerializerConfig
from rtfm.inference_utils import InferenceModel
from rtfm.serialization.serializers import get_serializer
from rtfm.tokenization.text import prepare_tokenizer
train_config = TrainConfig(model_name="models/tabula-8b", context_length=8192)
# If using a base llama model (not fine-tuned TabuLa),
# make sure to set add_serializer_tokens=False
# (because we do not want to use special tokens for
# the base model which is not trained on them).
tokenizer_config = TokenizerConfig()
# Load the configuration
config = AutoConfig.from_pretrained(train_config.model_name, use_auth_token='')
# Set the torch_dtype to bfloat16 which matches TabuLa train/eval setup
config.torch_dtype = 'bfloat16'
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.empty_cache()
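# device_map="auto" shards the model layers across all visible GPUs (via accelerate);
# load_in_8bit quantizes the weights with bitsandbytes.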
model = LlamaForCausalLM.from_pretrained(
    train_config.model_name, device_map="auto", config=config, load_in_8bit=True
)
torch.cuda.empty_cache()
tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)
serializer = get_serializer(SerializerConfig())
print(model)
tokenizer, model = prepare_tokenizer(
    model,
    tokenizer=tokenizer,
    pretrained_model_name_or_path=train_config.model_name,
    model_max_length=train_config.context_length,
    use_fast_tokenizer=tokenizer_config.use_fast_tokenizer,
    serializer_tokens_embed_fn=tokenizer_config.serializer_tokens_embed_fn,
    serializer_tokens=serializer.special_tokens if tokenizer_config.add_serializer_tokens else None,
)
inference_model = InferenceModel(model=model, tokenizer=tokenizer, serializer=serializer)
import pandas as pd
from ucimlrepo import fetch_ucirepo
# Fetch Covertype dataset
covertype = fetch_ucirepo(id=31)
# Combine features and targets into a DataFrame
full_data = covertype.data.features.copy()
full_data['Cover_Type'] = covertype.data.targets
labeled_examples = full_data.sample(n=10).reset_index(drop=True)
# Display labeled examples
print("Labeled Examples:")
print(labeled_examples)
target_example = full_data.sample(n=1).reset_index(drop=True)
output = inference_model.predict(
    target_example=target_example,
    target_colname="Cover_Type",
    target_choices=full_data["Cover_Type"].unique().tolist(),
    labeled_examples=labeled_examples,
)
print(f"Prediction for sample \n {target_example} \n is: {output}")

OS and hardware:
- Server with 8 NVIDIA A40 GPUs
- Ubuntu 22.04.5 LTS