In GraphStorm, there are two main methods to use language models (LMs) during graph construction: 1/ tokenizing text attributes into token IDs with a HuggingFace tokenizer; 2/ embedding text attributes into numerical features with a HuggingFace BERT model.
After graph construction, GraphStorm can use the precomputed embeddings directly as node features without retraining the BERT model used during construction, and this method supports real-time inference.
Converting tokens into embeddings during downstream training, with or without LM retraining, however, does not support real-time inference: in transform_fn, GraphStorm will raise KeyError: the key 'token_ids' is not found in node feature.
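The two methods above correspond to different feature transforms in the graph-construction (GConstruct) JSON config. The sketch below assumes GraphStorm's `tokenize_hf` and `bert_hf` transform names; the node type, file path, and column names (`paper`, `title`) are hypothetical placeholders for illustration:

```python
# Hedged sketch of a GraphStorm GConstruct config covering both methods.
# Transform names "tokenize_hf" and "bert_hf" follow GraphStorm's feature
# transforms; node type and column names are illustrative assumptions.
config = {
    "nodes": [
        {
            "node_type": "paper",
            "format": {"name": "parquet"},
            "files": ["nodes/paper.parquet"],
            "node_id_col": "id",
            "features": [
                # Method 1: tokenize text into token IDs. Downstream training
                # reads 'token_ids' from node features; real-time inference
                # is not supported with this path.
                {
                    "feature_col": "title",
                    "feature_name": "title_tokens",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 128,
                    },
                },
                # Method 2: embed text into fixed numerical features that can
                # be used directly, without retraining the BERT model.
                {
                    "feature_col": "title",
                    "feature_name": "title_emb",
                    "transform": {
                        "name": "bert_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 128,
                    },
                },
            ],
        }
    ]
}
```

Saved as JSON, a config like this would typically be passed to GraphStorm's graph-construction entry point; consult the GraphStorm documentation for the exact command-line invocation and supported transform fields.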