ClassifAI is a Python package that simplifies semantic search and Retrieval Augmented Generation (RAG) pipelines for classification tasks in the production of official statistics. It is designed to help data professionals build applications and pipelines to label new text samples to official statistical classifications, by leveraging (augmented) semantic search over a knowledgebase of previously coded examples.
The Office for National Statistics often needs to classify free-text survey responses or other data to standard statistical classifications. The most well-known examples include the Standard Industrial Classification (SIC), the Standard Occupational Classification (SOC), and the Classification Of Individual COnsumption according to Purpose (COICOP), as well as international equivalents such as ISCO and ISIC. The ClassifAI package has been designed specifically to help us build applications, pipelines and analyses for this kind of task.
Use cases:
- Web apps: ClassifAI can provide a production-ready, batteries-included REST interface to serve as a backend for web applications that need to code free text
- Pipelines: using either the Python SDK or the REST mode, ClassifAI can be a key ingredient in scheduled statistical processing pipelines
- Analysis: ClassifAI is simple enough to be used for one-off classification tasks. It requires no training or fine-tuning, and can give accurate results out-of-the-box, dependent only on the quality of the knowledgebase.
Key Features of the package include:
- Use included vectorisers (including Google Cloud, Huggingface and Ollama embedders) or implement your own
- Built in support for custom hook logic - choose from a library of pre-processing and post-processing functions that control the flow of data (spell checking, results deduplication, etc) or write your own hooks
- Deploy Easily with FastAPI - Deploy your semantic search classifier with FastAPI capabilities built into the package for easy REST API deployment
ClassifAI facilitates text classification using semantic search over a knowledgebase (possibly augmented by RAG). The knowledgebase is a collection of pieces of text (called "documents" even if they're very short) with a known classification label that serve as a reference for the classification task. They might be canonical definitions of each classification label, or examples of correct labelling, or a combination of those.
Semantic search uses a vectoriser (usually a language model) to convert each document into a vector (called an embedding). When a query (another piece of text) arrives from the user, the query is first embedded in the same way as a vector of numbers, then the query embedding is compared to each of the embeddings in the knowledgebase by calculating similarities between vectors. The top N closest knowledgebase entries are returned along with the matching labels, documents and distance scores for each one.
Retrieval-augmented generation (RAG) entails feeding search results into the prompt of a generative language model, with instructions for the model to further process the result. In the context of the package, RAG is constrained to operate on structured result sets and return result sets with the same structure.
The knowledgebase search approach to classification has advantages and disadvantages compared to machine learning (ML) approaches that train a model for the task. On balance, this approach is well-suited to statistical production use cases. To make a fair comparison between ML and knowledgebase search, it helps to think of the knowledgebase as training data, and compare the performance when the size of the training dataset is similar to the size of a knowledgebase (tens to low hundreds of thousands)
Advantages
- Explainability: it's relatively easy to understand why ClassifAI produced a result, because the matching knowledgebase entries are provided
- Agility: the quality of results can be tweaked by adding or removing a handful of knowledgebase entries, without expensive retraining
- Accountability: the knowledgebase can be maintained and developed by professional methodologists, separate from the developer team
- Time-to-prototype: Because no training or feature engineering is needed, it's possible to have a working prototype producing reasonable results in minutes
- Customisation: the method is easily customisable with the option to easily swap vectorisers, knowledgebases and pre- and post-processing hooks
- Accuracy: the knowledgebase approach can produce results with accuracy comparable to machine learning approaches in like-for-like comparisons
- Large, unbalanced label sets: ML approaches tend not to work well on label sets like SIC, SOC or COICOP where the set of classifications is large and unbalanced, and the training data may be sparse for some classifications. Knowledgebase search works well in these cases
Disadvantages
- Uncertainty quantification: the knowledgebase approach doesn't produce calibrated uncertainty measures out-of-the-box, although there are methods (such as conformal prediction) that show promise
- Scaling: the knowledgebase approach has been tested and works well on mid-scale training data and may not scale as well as ML approaches on big data
- Cost: Cloud-provided vectorisers have a per-request cost, although this is usually small and outweighed by the cost of developer FTEs
The comparison on other aspects, such as per-request speed or hardware requirements is less clear because no formal comparison has been undertaken. ClassifAI is in production in the ONS and appears to provide results quickly enough for users.
- Installation
- Example: Indexing and Searching a Document Collection
- Dev set-up and contributing
- Contact and Support
Install the package directly from GitHub in your Python environment
pip install "git+https://github.com/datasciencecampus/classifAI[huggingface]"ClassifAI supports statistical classification through searching a knowledgebase. The knowledgebase is a set of labelled examples that show how different texts (e.g. survey responses) map to statistical classifications. The size and quality of the knowledgebase dictates the quality of results you can expect from using the tool.
A vectoriser transforms a query text string into an embedding vector. You can choose from embedding models accessed via HuggingFace, Google or Ollama, or build your own vectoriser.
from classifai.vectorisers import HuggingFaceVectoriser
# Create a vectorizer model
vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")
# You can use the 'transform' method to call the vectoriser
vector = vectoriser.transform("Example text to vectorize")
print(vector.shape)You provide a knowledgebase of labelled examples (currently only allows data to be provided as a csv) to build a vector store
from classifai.indexers import VectorStore
vector_store = VectorStore(
file_name="occupations_knowledgebase.csv",
vectoriser=vectoriser,
batch_size=8,
output_dir="vector_store"
)The query is provided as a dataframe-like object to the search method, and the results are returned as a dataframe-like object.
from classifai.indexers.dataclasses import VectorStoreSearchInput
input_data = VectorStoreSearchInput({'id': [1], 'query':["construction worker scaffolder"]})
results = vector_store.search(input_data, n_results=5)
print(results)You can use ClassifAI as a local package, or deploy it as an API server, using FastAPI.
from classifai.servers import start_api
start_api(vector_stores=[vector_store], endpoint_names=["Occupations"], port=8000)Further guides and tutorials can be found in the DEMO folder of this repo. It currently includes the following notebooks:
- General workflow
- Custom vectorisers
- make your own custom vectoriser model that will interact with the core features of the package,
- Custom pre- and post-processing hooks
- Add your own custom 'hook' logic to the VectorStore search processes, allowing you to inject custom behaviour to your VectorStores.
- (more demos to come soon)
If you are a developer working on ClassifAI, this section describes how to set up the repo correctly on your local machine to start working on the codebase. 3rd party developers please also read CONTRIBUTING.md
NOTE: This section is for developers who are making alterations and changes to the codebase of this repo itself, not for developers who are using the features of the package.
To begin making changes to the codebase follow the following instructions on your development machine:
- Clone the repo:
git clone git@github.com:datasciencecampus/classifAI.git
cd classifAI- Set up pre-commit hooks:
make setup-git-hooks(or, if you don't have Docker available)
make setup-git-hooks-no-docker- Create / activate the virtual environment:
uv lock
uv syncAnd that's you good to go!
During development, you might want to run linters / code vulnerability scans; you can do so at any point via
make check-pythonIn the event that you come across any issues such as potential bugs in the code or something that may be unclear or not explained in the various DEMO content, please read CONTRIBUTING.md file for instructions. It contains sections on Getting Help, Asking Questions, Reporting Bugs, and other guides on working with the package.
