Official implementation for the paper *Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings*.
In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model's weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across different datasets, sequence lengths, and model scales. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight a noteworthy capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.
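The repository packages this procedure behind the `LLMWithMemory` class shown in the usage example below. As a rough, minimal sketch of the underlying idea (not the repository's actual implementation; the optimizer choice, gradient masking, and other training details here are assumptions), the memory-token optimization with Hugging Face Transformers could look like this:

```python
# Minimal sketch (assumed, not the repo's code): freeze the LLM and optimize only
# the embedding of a new <MEMORY> token so that prompting with it reproduces a
# fixed sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Register the memory token and grow the embedding matrix by one row.
tokenizer.add_special_tokens({"additional_special_tokens": ["<MEMORY>"]})
model.resize_token_embeddings(len(tokenizer))
memory_id = tokenizer.convert_tokens_to_ids("<MEMORY>")

# Freeze all model weights; only the input embedding matrix keeps a gradient,
# and every row of that gradient except the memory token's is zeroed below.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings().weight
emb.requires_grad = True

text = "The quick brown fox jumps over the lazy dog"
ids = tokenizer("<MEMORY>" + text, return_tensors="pt").input_ids
labels = ids.clone()
labels[labels == memory_id] = -100  # don't train the model to predict <MEMORY> itself

# Optimizer and schedule are assumptions; lr and epochs mirror the usage example below.
optimizer = torch.optim.SGD([emb], lr=5.0)
for _ in range(3000):
    optimizer.zero_grad()
    loss = model(input_ids=ids, labels=labels).loss
    loss.backward()
    row = emb.grad[memory_id].clone()  # keep only the memory token's gradient
    emb.grad.zero_()
    emb.grad[memory_id] = row
    optimizer.step()

# After training, prompting with the memory token alone should reconstruct `text`.
prompt = tokenizer("<MEMORY>", return_tensors="pt").input_ids
out = model.generate(prompt, max_new_tokens=ids.shape[1])
print(tokenizer.decode(out[0], skip_special_tokens=True))
```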
- Clone the repository:
```bash
git clone https://github.com/nsuruguay05/memory_token.git
cd memory_token
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Example usage of the `LLMWithMemory` class:

```python
from llm_memory import LLMWithMemory

# Initialize the model
llm = LLMWithMemory(model_id="meta-llama/Llama-3.1-8B-Instruct")
# Add a memory
text = "The quick brown fox jumps over the lazy dog"
memory = llm.add_memory(
    text=text,
    template="{memory_token}{text}",
    epochs=3000,
    lr=5.0
)
# Generate text using the memory
generated = llm.generate("<MEMORY>")
print(generated)
# Evaluate memory accuracy
accuracy = llm.evaluate()
print(f"Memory accuracy: {accuracy:.4f}")Run the evaluation script to test text reconstruction on a text corpus:
python evaluate.py --corpus datasets/raschka_chunks_100.csv --num_samples 20| Argument | Default | Description |
|---|---|---|
| `--output_name` | `"memories"` | Name for the output file |
| `--model_id` | `"meta-llama/Llama-3.1-8B-Instruct"` | HuggingFace model identifier |
| `--corpus` | `"datasets/raschka_chunks_100.csv"` | Path to the corpus CSV file |
| `--num_samples` | `20` | Number of samples to evaluate (`-1` for all) |
| `--output_dir` | `"."` | Directory to save output files |
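Any of the defaults above can be overridden on the command line. For example, a hypothetical run that evaluates every sample in the corpus and writes the results under a custom output name (the values here are only illustrative):

```bash
# Evaluate all samples (-1) and store results as "memories_full"
python evaluate.py \
  --corpus datasets/raschka_chunks_100.csv \
  --num_samples -1 \
  --output_name memories_full
```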
The following arguments control how each memory is trained:

| Argument | Default | Description |
|---|---|---|
| `--epochs` | `3000` | Number of training epochs for memory |
| `--lr` | `5.0` | Learning rate for memory training |
| `--template` | `"{memory_token}{text}"` | Template for memory formatting |
| `--seed` | `1234` | Random seed for reproducibility |