Skip to content

Latest commit

 

History

History
110 lines (74 loc) · 2.38 KB

File metadata and controls

110 lines (74 loc) · 2.38 KB

Here’s an example walk-through of SyntaxLab’s mutation scoring workflow with realistic outputs and logs, modeled after MuTAP-style evaluation.

📦 Task

Generate a test suite for a simple Stack class implementation in Python.

✏️ LLM-Generated Code

class Stack: def init(self): self._items = []

def push(self, item):
    self._items.append(item)

def pop(self):
    return self._items.pop()

def peek(self):
    return self._items[-1]

def is_empty(self):
    return len(self._items) == 0

🧪 LLM-Generated Test Suite

import unittest

class TestStack(unittest.TestCase): def test_push(self): s = Stack() s.push(10) self.assertEqual(s.peek(), 10)

def test_pop(self):
    s = Stack()
    s.push(5)
    s.push(6)
    self.assertEqual(s.pop(), 6)

def test_is_empty(self):
    s = Stack()
    self.assertTrue(s.is_empty())

🔁 Mutation Injection Log

Mutation ID Operator Mutation Description M001 Invert Logic len(self._items) == 0 → != 0 M002 Remove Element .pop() → no operation M003 Return Constant return self._items[-1] → return 0 M004 Swap Append Order append(item) → insert(0, item) M005 Delete Push Functionality Remove push body

✅ Test Results (Mutation Execution)

Mutant ID Tests Run Mutant Killed Reason M001 ✅ All ✅ Yes is_empty() failed M002 ✅ All ✅ Yes pop() failed (empty list) M003 ✅ All ✅ Yes peek() returned wrong value M004 ✅ All ❌ No Still returned same top element M005 ✅ All ✅ Yes push() didn’t modify state

📊 Summary • Total Mutants: 5 • Killed: 4 • Survived: 1 • Mutation Score: \frac{4}{5} = 0.80 \quad (80%)

💡 Feedback Loop Triggered

Survivor: M004 (order mutation) → New prompt injected by SyntaxLab:

“Also add a test to ensure that the most recently pushed element is on top.”

🧪 Generated New Test (Refinement)

def test_order(self): s = Stack() s.push(1) s.push(2) self.assertEqual(s.pop(), 2)

→ Mutation M004 is now killed. ✅ Updated Mutation Score: 5/5 → 100%

🧠 Key Takeaways • Mutations that survive indicate missing edge case tests • Test-first prompt generation + mutation scoring = measurable quality • Refinement loop injects precise feedback-driven improvement • This workflow scales across models, languages, and regulatory environments