GraphMind uses pytest with pytest-asyncio for its test suite. The project has 85 unit tests across 10 test files plus 2 integration test files, covering all major subsystems.
```bash
# Unit tests only (fast, no external services needed)
make test
# or: pytest tests/unit -v --tb=short

# Integration tests (requires Docker services running)
make test-integration
# or: pytest tests/integration -v --tb=short -m integration

# All tests with coverage report
make test-all
# or: pytest -v --tb=short --cov=src/graphmind --cov-report=term-missing

# Run a specific test file
pytest tests/unit/test_config.py -v

# Run a specific test class
pytest tests/unit/test_agents.py::TestPlannerNode -v

# Run a specific test
pytest tests/unit/test_chunker.py::TestSemanticChunker::test_single_short_text -v
```

The suite is laid out as follows:

```
tests/
├── conftest.py                    # Shared fixtures (7 fixtures)
├── unit/                          # 85 tests across 10 files
│   ├── test_config.py             # Settings defaults and caching
│   ├── test_schemas.py            # All 13 Pydantic models
│   ├── test_chunker.py            # SemanticChunker edge cases
│   ├── test_loaders.py            # DocumentLoader, all 7 formats
│   ├── test_cost_tracker.py       # Cost recording and aggregation
│   ├── test_metrics.py            # MetricsCollector latency/p95/retry
│   ├── test_deepeval_suite.py     # LLM-as-judge evaluation
│   ├── test_hybrid_retriever.py   # RRF fusion formula
│   ├── test_agents.py             # Planner, synthesizer, evaluator, retry
│   └── test_crew.py               # CrewAI tools, agents, tasks
└── integration/
    ├── test_ingestion_pipeline.py # Load + chunk end-to-end
    └── test_eval_suite.py         # Benchmark from JSONL files
```
Unit test coverage by file:

| File | Tests | What It Validates |
|---|---|---|
| `test_config.py` | 9 | Settings defaults, `lru_cache` caching, nested config sections (LLM, retrieval, agents, ingestion) |
| `test_schemas.py` | 11 | All 13 Pydantic models: creation with defaults, auto-generated UUIDs, `EntityType` enum, `QueryRequest`/`QueryResponse`, `IngestRequest`/`IngestResponse`, `GraphStats`, `HealthResponse` |
| `test_chunker.py` | 7 | Empty text handling, single short text, long text splitting, metadata generation, sequential indices, unique chunk IDs, overlap correctness |
| `test_loaders.py` | 9 | MD/TXT/HTML/PY/TS/JS loading, file vs. content mode, code block wrapping, error handling for unsupported formats |
| `test_cost_tracker.py` | 6 | Recording cost entries, aggregation by provider, summary structure, provider grouping, empty tracker behavior |
| `test_metrics.py` | 7 | Average latency calculation, p95 percentile, retry rate tracking, history size limits, recent queries list |
| `test_deepeval_suite.py` | 7 | JSON parsing of LLM evaluator output, markdown fence stripping, fallback scoring on parse failure, threshold comparison, report generation |
| `test_hybrid_retriever.py` | 5 | RRF fusion formula correctness, overlap deduplication across vector+graph lists, empty input lists, score ordering, k parameter effect |
| `test_agents.py` | 14 | Planner decomposition, synthesizer generation, evaluator JSON parsing, evaluator fallback, retry logic, rewrite node, orchestrator graph building |
| `test_crew.py` | 10 | `HybridSearchTool`/`GraphExpansionTool`/`EvaluateAnswerTool` creation and execution, agent factory functions, task creation, context chain validation, error handling |
The integration tests:

| File | Tests | What It Validates |
|---|---|---|
| `test_ingestion_pipeline.py` | 3 | End-to-end load + chunk for markdown, long documents, and code files |
| `test_eval_suite.py` | 2 | Benchmark evaluation from JSONL dataset files, missing file error handling |
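A sketch of what one of these end-to-end cases might look like, assuming `DocumentLoader.load()` and `SemanticChunker.chunk()` entry points (the import paths and method names are illustrative; check the actual modules):

```python
import pytest

from graphmind.ingestion.loader import DocumentLoader    # assumed path
from graphmind.ingestion.chunker import SemanticChunker  # assumed path


@pytest.mark.integration
def test_markdown_load_and_chunk(tmp_path):
    # Write a small markdown document to a temp file
    doc = tmp_path / "sample.md"
    doc.write_text("# Title\n\nSome body text long enough to chunk.")

    # Load, then chunk; every chunk should carry source metadata
    document = DocumentLoader().load(doc)
    chunks = SemanticChunker().chunk(document)

    assert len(chunks) >= 1
    assert all(chunk.metadata for chunk in chunks)
```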
The shared `conftest.py` provides 7 fixtures available to all tests:

| Fixture | Scope | Description |
|---|---|---|
| `_clear_settings_cache` | autouse | Clears the `get_settings()` `lru_cache` before and after each test |
| `settings` | function | Fresh `Settings` instance with cache cleared |
| `mock_llm_response` | function | `MagicMock` with `.content = "Test LLM response"` |
| `mock_router` | function | Mocked `LLMRouter` with async (`ainvoke`) and sync (`invoke`) methods |
| `sample_chunks` | function | 2 `DocumentChunk` instances (LangGraph and Neo4j content) |
| `sample_entities` | function | 3 `Entity` instances (LangGraph/framework, Neo4j/technology, LangChain/framework) |
| `sample_relations` | function | 1 `Relation` (LangGraph extends LangChain) |
| `sample_retrieval_results` | function | 3 `RetrievalResult` instances with vector and graph sources |
The conftest also sets dummy API keys (`test-key`) via `os.environ.setdefault` so that tests can instantiate `Settings` without real credentials.
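A sketch of how such a conftest might be wired, based on the fixture table above (the env-var names, import path, and fixture bodies are illustrative assumptions, not copied from GraphMind):

```python
# conftest.py -- illustrative sketch, not the project's actual file
import os
from unittest.mock import MagicMock

import pytest

# Dummy credentials so Settings can be built without real keys.
# The variable names below are assumptions for illustration.
os.environ.setdefault("GROQ_API_KEY", "test-key")
os.environ.setdefault("GEMINI_API_KEY", "test-key")


@pytest.fixture(autouse=True)
def _clear_settings_cache():
    """Drop the lru_cache on get_settings() before and after each test."""
    from graphmind.config import get_settings  # assumed module path

    get_settings.cache_clear()
    yield
    get_settings.cache_clear()


@pytest.fixture
def mock_llm_response():
    """MagicMock whose .content mimics an LLM reply."""
    response = MagicMock()
    response.content = "Test LLM response"
    return response
```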
Run the full evaluation benchmark against the 10-question dataset:

```bash
# Via Make
make eval

# Via CLI
graphmind-eval

# With a custom dataset and threshold
graphmind-eval --dataset path/to/custom.jsonl --threshold 0.8
```

The benchmark evaluates each question/answer pair using LLM-as-judge scoring on three dimensions:
- Relevancy (40% weight): Does the answer address the question?
- Groundedness (40% weight): Is every claim supported by source documents?
- Completeness (20% weight): Does it cover all aspects?
The evaluation threshold is 0.7 (combined score). Results are saved to `eval/reports/latest_benchmark.json`.
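Concretely, the weighting works out as follows (the individual scores here are made-up examples):

```python
# Combined score = 0.4 * relevancy + 0.4 * groundedness + 0.2 * completeness
relevancy, groundedness, completeness = 0.9, 0.7, 0.5
combined = 0.4 * relevancy + 0.4 * groundedness + 0.2 * completeness
assert abs(combined - 0.74) < 1e-9  # 0.36 + 0.28 + 0.10 -> clears the 0.7 threshold
```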
The evaluation system supports two LLM judge implementations:
- `GroqEvalModel`: uses Groq for fast evaluation
- `GeminiEvalModel`: uses Gemini as a fallback
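One plausible shape for that fallback (the selection logic and import path below are assumptions for illustration; GraphMind's actual dispatch may differ):

```python
def make_eval_model():
    """Prefer the fast Groq judge; fall back to Gemini if Groq is unavailable.

    GroqEvalModel/GeminiEvalModel are the documented judge classes; the
    import paths and constructor signatures here are assumptions.
    """
    try:
        from graphmind.eval.models import GroqEvalModel  # assumed path

        return GroqEvalModel()
    except Exception:
        from graphmind.eval.models import GeminiEvalModel  # assumed path

        return GeminiEvalModel()
```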
```bash
# Check for issues (ruff check + format check)
make lint
# or: ruff check src/ tests/ && ruff format --check src/ tests/

# Auto-fix formatting and lint issues
make format
# or: ruff format src/ tests/ && ruff check --fix src/ tests/
```

Ruff is configured in `pyproject.toml` with:
- Target: Python 3.11
- Line length: 100
- Selected rules: E, F, I, N, UP, B, SIM
Type checking uses mypy:

```bash
mypy src/graphmind
```

It is configured with `disallow_untyped_defs = true` and `warn_return_any = true` in `pyproject.toml`.
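Pulled together, the relevant `pyproject.toml` entries would look roughly like this (reconstructed from the settings listed above; the exact layout in the repo may differ):

```toml
[tool.ruff]
target-version = "py311"
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I", "N", "UP", "B", "SIM"]

[tool.mypy]
disallow_untyped_defs = true
warn_return_any = true
```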
- Tests marked `@pytest.mark.integration` require Docker services running (Qdrant, Neo4j, Ollama)
- Tests marked `@pytest.mark.eval` run the evaluation benchmark (slow, requires LLM API keys)
- Unit tests run without any external dependencies -- all LLM calls and database access are mocked
- The `conftest.py` sets dummy API keys so `Settings` can be instantiated safely
- `asyncio_mode = "auto"` is set in `pyproject.toml`, so async tests do not need explicit markers (see the sketch below)
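For instance, with `asyncio_mode = "auto"` an async test needs no decorator at all. This illustrative test uses the `mock_router` fixture from the table above, assuming its `ainvoke` is an `AsyncMock`:

```python
async def test_router_ainvoke_returns_content(mock_router):
    # No @pytest.mark.asyncio needed: asyncio_mode = "auto" collects
    # any `async def test_*` automatically.
    mock_router.ainvoke.return_value = "hello"
    result = await mock_router.ainvoke("prompt")
    assert result == "hello"
```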
- Getting Started -- Installation including dev dependencies
- Architecture -- Understanding the components being tested
- Querying -- The pipeline that `test_agents` and `test_crew` validate
- Ingestion -- The pipeline that `test_loaders` and `test_chunker` validate