iccccccccccccc/skeval

 
 

skeval

Semantic Evaluation Layer for LLMs

skeval is a lightweight library designed to evaluate how well Large Language Models (LLMs) understand and generate different types of sentences—such as facts, emotions, opinions, and instructions.


🚀 Motivation

Most LLM evaluation focuses on:

  • Accuracy
  • BLEU / ROUGE scores
  • Reasoning benchmarks

But real-world language understanding also requires:

  • Distinguishing facts from opinions
  • Detecting emotions
  • Identifying intent and instruction

skeval fills this gap by providing a semantic classification and evaluation layer.


🧠 What It Does

  • Classifies sentences into categories:

    • Fact
    • Emotion
    • Opinion
    • Instruction
    • (extendable)
  • Evaluates LLM outputs based on:

    • Classification accuracy
    • Confusion between categories
    • Per-class metrics
  • Works with:

    • LLM outputs
    • Custom datasets
    • Benchmark pipelines
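The three evaluation outputs listed above (accuracy, confusion between categories, per-class metrics) can be sketched in plain Python. This is a minimal illustration, not skeval's actual implementation: the function name `evaluate_labels`, the row/column orientation of the matrix, and the extra `support` field are all assumptions made for the example.

```python
def evaluate_labels(y_true, y_pred):
    """Hypothetical helper: accuracy, confusion matrix, and per-class metrics."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Sort labels so the matrix layout is deterministic.
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}

    # Assumed orientation: rows are true labels, columns are predicted labels.
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1

    per_class = {}
    for label, i in index.items():
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
            "support": actual,  # number of true examples (assumed field)
        }

    correct = sum(matrix[i][i] for i in range(len(labels)))
    return {
        "accuracy": correct / len(y_true),
        "per_class": per_class,
        "confusion_matrix": matrix,
        "labels": labels,
    }


# One opinion sentence is misclassified as a fact.
y_true = ["fact", "emotion", "opinion", "instruction"]
y_pred = ["fact", "emotion", "fact", "instruction"]
result = evaluate_labels(y_true, y_pred)
print(result["accuracy"])
print(result["per_class"]["fact"])
```

In this toy run, "fact" keeps perfect recall but its precision drops, because an opinion was predicted as a fact. Per-class metrics surface that kind of category confusion, which a single accuracy number hides.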

📦 Features

  • Modular architecture (classifier, evaluator, metrics)
  • Custom evaluation metrics for semantic types
  • Compatible with LLM pipelines
  • Extensible label taxonomy
  • Clean CLI support (planned)

🏗️ Project Structure

skeval/
│
├── src/skeval/
│   ├── classifier/
│   ├── evaluator/
│   ├── metrics/
│   └── dataset/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── tests/
├── scripts/
├── docs/
└── notebooks/

⚙️ Installation

git clone https://github.com/skeval-ai/skeval.git
cd skeval
pip install -e .

🧪 Example Usage

from skeval.classifier import SentenceClassifier
from skeval.evaluator import Evaluator

# Toy training set: one example sentence per category.
sentences = [
    "Water boils at 100 degrees Celsius",
    "I feel sad today",
    "I think this movie is amazing",
    "Please close the door",
]
labels = ["fact", "emotion", "opinion", "instruction"]

# Train the classifier on the labeled sentences.
classifier = SentenceClassifier(embed_dim=64)
classifier.train(sentences, labels, epochs=20)

# Classify unseen sentences, again one per category.
predictions = classifier.predict([
    "The sky is blue",
    "I am so happy",
    "I believe dogs are better than cats",
    "Turn off the lights",
])

# Expected labels for the four sentences above, in the same order.
expected = ["fact", "emotion", "opinion", "instruction"]

# Compare the predictions against the expected labels.
evaluator = Evaluator()
results = evaluator.evaluate(predictions, expected)
print(results)

📊 Example Output

{
  "accuracy": 0.75,
  "per_class": {"fact": {"precision": 1.0, "recall": 1.0, "f1-score": 1.0, ...}, ...},
  "macro_avg": {"precision": ..., "recall": ..., "f1-score": ...},
  "weighted_avg": {"precision": ..., "recall": ..., "f1-score": ...},
  "confusion_matrix": [[...], ...],
  "labels": ["emotion", "fact", "instruction", "opinion"]
}
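One way to read a result of this shape is to look for the largest off-diagonal cell of the confusion matrix, which names the pair of categories most often mixed up. The sketch below is only illustrative: `most_confused` is a hypothetical helper, the matrix values are made up rather than real skeval output, and it assumes rows are true labels and columns are predicted labels, which the example output above does not state.

```python
def most_confused(labels, matrix):
    """Return (true_label, predicted_label, count) for the largest
    off-diagonal cell, or None if there are no misclassifications."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0 and (best is None or count > best[2]):
                best = (labels[i], labels[j], count)
    return best


# Made-up values in the shape of the example output (not real results).
# Assumed orientation: rows = true labels, columns = predicted labels.
results = {
    "labels": ["emotion", "fact", "instruction", "opinion"],
    "confusion_matrix": [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],  # one opinion predicted as a fact
    ],
}

pair = most_confused(results["labels"], results["confusion_matrix"])
print(pair)
```

If the library uses the opposite orientation, the first two elements of the returned tuple swap meaning, so check the documentation in docs/ before relying on this reading.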

📚 Documentation

Full documentation (Sphinx-based) is available in the docs/ directory.

To build locally:

cd docs
make html

🧠 Future Roadmap

  • Multi-label classification (mixed sentences)
  • Sarcasm detection
  • Benchmark dataset release
  • Integration with LLM evaluation tools
  • Command-line interface (CLI)

🤝 Contributing

Contributions are welcome!

Please read CONTRIBUTING.md before submitting a PR.


📄 License

This project is licensed under the MIT License.


⚠️ Disclaimer

This project is for research and educational purposes. It does not guarantee perfect semantic understanding and should not be used for critical decision-making systems without validation.


⭐ Acknowledgments

Inspired by the need for better semantic evaluation in modern LLM systems.


🔥 Tagline

“Not just what the model says—but what it means.”
