A scientific experiment to measure how meaning degrades ("drifts") when information is passed sequentially through a chain of different Large Language Models.
Just like the children's game of "Telephone," this project feeds the output of one AI model (e.g., GPT-4o) as the only input for the next model (e.g., Claude 3.5). We measure how well the core concept survives across 6 steps, including local models via Ollama.
The Chain:
GPT-4o → Claude 3.5 Sonnet → Gemini 1.5 → DeepSeek → Mixtral → Llama 3 (Local)
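The loop itself is simple: each model's output becomes the next model's entire input. A minimal sketch with a stub in place of real API calls (the actual project wraps the OpenAI, Anthropic, Google, and Ollama clients):

```python
def run_telephone(seed_text, models, call_model):
    """Pass seed_text through each model in sequence, keeping every hop."""
    transcript = [seed_text]
    current = seed_text
    for model in models:
        # The previous output is the ONLY input to the next model.
        current = call_model(model, current)
        transcript.append(current)
    return transcript

# Toy stand-in for a real API call: just tags the text with the model name.
chain = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro",
         "deepseek-chat", "mixtral-8x7b", "llama3:8b"]
fake_call = lambda model, text: f"{text} [via {model}]"
hops = run_telephone("Photosynthesis converts light into energy.", chain, fake_call)
```

The transcript holds the seed plus one entry per model, so drift can later be scored hop by hop.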
- 🔌 Universal Wrapper: A single Python function to handle API calls for OpenAI, Anthropic, Google, and Ollama.
- 🛡️ Strict System Prompts: Ensures models act as "repeaters" rather than conversational assistants.
- 📊 Hybrid Evaluation:
  - Quantitative: Cosine similarity scoring using `SentenceTransformers` embeddings.
  - Qualitative: GPT-4o acts as a "Judge" to score Concept Mutation and Hallucination.
- 📈 Visualization: Matplotlib charts correlating embedding distance with idea survival.
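The quantitative score is the cosine similarity between the seed text's embedding and each hop's embedding. A plain-Python sketch on toy vectors (the project itself uses `sentence-transformers` embeddings; the hard-coded vectors below stand in for `SentenceTransformer.encode()` output):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" illustrating progressive drift.
seed = [1.0, 0.0, 0.0]
hops = [
    [0.9, 0.1, 0.0],  # small drift
    [0.5, 0.5, 0.0],  # larger drift
    [0.0, 1.0, 0.0],  # concept lost (orthogonal to the seed)
]
drift_scores = [cosine_similarity(seed, h) for h in hops]
```

A monotonically falling score across hops is the quantitative signature of semantic drift.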
- Clone the repo

  ```bash
  git clone https://github.com/Sama-ndari/llm-semantic-drift-analysis.git
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Setup Environment

  Create a `.env` file with your keys:

  ```
  OPENAI_API_KEY=...
  ANTHROPIC_API_KEY=...
  GOOGLE_API_KEY=...
  DEEPSEEK_API_KEY=...
  GROQ_API_KEY=...
  ```
- Run the Notebook

  Launch `telephone_game.ipynb` and run all cells.
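Before running the cells, it is worth verifying that every key from `.env` is actually set. A small sanity-check helper (hypothetical, not part of the repo; the toy dict stands in for `os.environ`):

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY",
                 "DEEPSEEK_API_KEY", "GROQ_API_KEY"]

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Toy environment for illustration: only one key is set.
missing = missing_keys({"OPENAI_API_KEY": "sk-test"})
```

Calling `missing_keys()` with no argument checks the real environment, so a notebook cell can fail fast with a clear message instead of a mid-chain API error.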
Sample finding:

"Significant semantic drift was observed at Step 4 (DeepSeek), where the specific academic context was replaced by generalized advice."
- Languages: Python
- Models: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, DeepSeek, Mixtral (via Groq), Llama 3 8B (local)
- Libraries: `openai`, `anthropic`, `sentence-transformers`, `scikit-learn`, `pandas`, `matplotlib`
Created by Sama-ndari