This script cleans a text dataset by removing all non-English lines from files in a given folder. It uses langdetect for language detection and saves only English content into a new output folder.
---
Install dependencies:
pip install langdetect tqdm
---
Set the folder paths in the script. For the training set:

input_folder = "data/task1_train_files_2025"
output_folder = "data/processed_train_langonly"

For the test set:

input_folder = "data/task1_test_files_2025"
output_folder = "data/processed_test_langonly"
---
Run:
Execute the preprocess.ipynb notebook.
A modular pipeline to process, clean, and embed legal case documents for downstream query–candidate pair feature construction. This project automates the preparation of legal case text data and generates dense embeddings at multiple granularities (sentence, paragraph, proposition) using transformer models.
project_root/
│
├── data/
│ ├── processed_train_langonly/
│ ├── processed_test_langonly/
│ ├── task1_train_labels_2025.json
│ ├── task1_test_no_labels_2025.json
│
├── embeddings_output/
│ ├── embeddings_sentences_en.npy
│ ├── embeddings_paragraphs_formatted.npy
│ ├── embeddings_propositions_en.npy
│ └── embeddings_final.pkl
│
├── file_code.py
├── model_code.py
├── pairs_code.py
└── new_main.ipynb
- Python 3.8+
- PyTorch
- Transformers
- Pandas, NumPy, tqdm
Install dependencies:
pip install torch transformers pandas numpy tqdm

Embedding Generation
Uses sentence-transformers/all-MiniLM-L6-v2 to compute embeddings for each text component.
| File | Description |
|---|---|
| embeddings_sentences_en.npy | Sentence-level embeddings |
| embeddings_paragraphs_formatted.npy | Paragraph-level embeddings |
| embeddings_propositions_en.npy | Proposition-level embeddings |
| embeddings_final.pkl | Combined DataFrame with all embeddings |
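The embedding computation can be sketched with Hugging Face transformers and mean pooling (the pooling scheme all-MiniLM-L6-v2 was trained with). The function names, batch size, and max_length here are illustrative assumptions, not the actual API of the project's code.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer


def mean_pool(last_hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = mask.unsqueeze(-1).float()          # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)   # avoid divide-by-zero
    return summed / counts


def embed_texts(texts, model_name="sentence-transformers/all-MiniLM-L6-v2",
                batch_size=64, device="cpu"):
    """Encode a list of strings into a (len(texts), hidden) numpy array."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tok(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=256, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state
            chunks.append(mean_pool(hidden, batch["attention_mask"]).cpu().numpy())
    return np.concatenate(chunks, axis=0)
```

The same `embed_texts` call can then be run per sentence, paragraph, or proposition list and saved with `np.save` to produce the `.npy` files listed above.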
This module retrieves top candidate legal cases for each query paragraph using BM25, then builds structured query–candidate pairs for downstream model training.

After preprocessing and embedding generation, this stage applies BM25 ranking to identify the most textually similar candidate cases for each paragraph (or proposition) of a query case. The output is a compact set of high-relevance candidates (e.g., top-50 per paragraph), which keeps fine-tuning and evaluation of retrieval or re-ranking models efficient. Finally, round-robin sampling across the per-paragraph rankings retains up to 100 unique candidates per query.
| File | Description |
|---|---|
| files.pkl (or files DataFrame) | Processed and cleaned legal cases |
| data/task1_train_labels_2025.json | Query–target training label map |
| data/task1_test_no_labels_2025.json | Test queries (unlabeled) |
| rank_bm25 library | For BM25-based text ranking |
Install required libraries:
pip install rank_bm25 nltk spacy pandas tqdm
python -m spacy download en_core_web_sm

| Output | Description |
|---|---|
| bm25_results_para_train.csv | Top-50 BM25 candidates per query paragraph (train set) |
| bm25_results_para_test.csv | Top-50 BM25 candidates per query paragraph (test set) |
| pairs_para_100_unique.pkl | Final dataset with up to 100 unique BM25 candidates per query and binary match labels |
This module generates feature vectors for query–candidate case pairs using both semantic embeddings and textual similarity metrics.
These features are later used to train supervised models (e.g., classifiers or rankers) to predict whether two cases are relevant (match=1) or irrelevant (match=0).
| File | Description |
|---|---|
| pairs_para_100_unique.pkl | Query–candidate pairs (BM25-generated) |
| embeddings_new_2_without_prop_afterpropparasent.pkl | Processed file-level embeddings and metadata |
Each step in pairs_code.py adds or computes a specific set of features:
| File | Description |
|---|---|
| features_para_100_unique.pkl | Final feature-enriched DataFrame (ready for ML model input) |
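A few representative pair features can be sketched as below. The feature names and the exact mix are assumptions for illustration; pairs_code.py defines the real feature set.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def jaccard(text_a: str, text_b: str) -> float:
    """Token-set overlap between two texts."""
    sa, sb = set(text_a.lower().split()), set(text_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pair_features(q_emb, c_emb, q_text, c_text):
    """One feature row for a query-candidate pair: semantic + lexical signals."""
    return {
        "cos_sim": cosine(q_emb, c_emb),
        "jaccard": jaccard(q_text, c_text),
        "len_ratio": min(len(q_text), len(c_text)) / max(len(q_text), len(c_text), 1),
    }
```

Applying `pair_features` row-wise over the BM25 pairs yields the feature DataFrame that is pickled for model input.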
This stage trains and evaluates the final retrieval model that predicts whether a query–candidate pair is relevant (match=1) or not.
- pickle/features_para_100_unique.pkl: feature-enriched pairs
- data/task1_train_labels_2025.json, data/task1_test_labels_2025.json: ground-truth labels
- model_code.py: model functions
- output/results_2.txt: final predictions
- experiments/train_prediction_bm25.txt
- experiments/test_prediction_bm25.txt
- output/results_2.txt (final model predictions)
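The train/predict shape of this stage can be sketched with a minimal logistic-regression classifier over the feature vectors. This is only an assumption about the model family (the section says "e.g., classifiers or rankers"); model_code.py may use a different architecture.

```python
import numpy as np


def train_logreg(X, y, lr=0.1, epochs=500):
    """Batch gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted match probability
        grad = p - y                             # gradient of log-loss wrt logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_match(X, w, b, threshold=0.5):
    """Binary match predictions (1 = relevant) from trained parameters."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (p >= threshold).astype(int)
```

In the real pipeline the per-pair probabilities, not just the thresholded labels, would be used to rank candidates before applying the Top-K cutoff.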
This section presents the retrieval performance of the trained model under different inference settings. Each configuration varies by inference type (1 or 2), case granularity (full vs. paragraph-wise), and ranking cutoff (Top-100 / Top-300).
| Split | Queries | Candidates |
|---|---|---|
| Train | 1,678 | 5,452 |
| Test | 400 | 1,759 |
| Setting | Description |
|---|---|
| Inference 1 / 2 | Two variants of inference strategy for candidate ranking |
| Full Cases | Model uses embeddings of entire cases |
| Paragraph-wise | Model aggregates similarity scores across paragraphs |
| Top-K | Evaluation performed for top-100 or top-300 ranked candidates |
| Inference | Granularity | Top-K | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| 1 | Full case | 300 | 0.305 | 0.336 | 0.320 |
| 2 | Full case | 300 | 0.324 | 0.317 | 0.320 |
| 1 | Paragraph-wise | 100 | 0.312 | 0.351 | 0.330 |
| 2 | Paragraph-wise | 100 | 0.325 | 0.345 | 0.335 |
| 1 | Paragraph-wise (top-100 corrected) | 100 | 0.324 | 0.366 | 0.344 |
| 2 | Paragraph-wise (top-100 corrected) | 100 | 0.347 | 0.349 | 0.348 |
| 1 | Full case | 100 | 0.303 | 0.313 | 0.308 |
| 2 | Full case | 100 | 0.327 | 0.295 | 0.310 |
- Inference-2 consistently outperforms Inference-1 across settings.
- Paragraph-wise evaluation yields better F1-scores than full-case inference.
- Best overall performance: 🔹 Inference-2 (Paragraph-wise, Top-100) → F1 = 0.348
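The precision/recall/F1 numbers in the table can be computed from ranked predictions roughly as follows. This sketch assumes micro-averaging over queries; the actual evaluation in model_code.py may average differently.

```python
def prf_at_k(predicted, gold, k):
    """Micro-averaged precision/recall/F1 over per-query top-k candidate sets.

    predicted: {query_id: ranked list of candidate ids}
    gold:      {query_id: set of relevant candidate ids}
    """
    tp = fp = fn = 0
    for q, ranked in predicted.items():
        top = set(ranked[:k])
        rel = set(gold.get(q, set()))
        tp += len(top & rel)   # relevant candidates retrieved
        fp += len(top - rel)   # retrieved but not relevant
        fn += len(rel - top)   # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Varying `k` between 100 and 300 reproduces the Top-K dimension of the results table.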