In this project, I built an agentic RAG app using Hugging Face's smolagents library. The main advantages of this library are that it is very easy to use and that it provides CodeAgents: agents that express and execute their actions as code instead of the usual dictionary-like outputs such as JSON. The advantages of using CodeAgents are summarized in this post by Hugging Face.
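To make the difference between JSON-style and code-style actions concrete, here is a minimal, self-contained sketch. It does not use smolagents and is not how smolagents executes code; the tools `search` and `count`, the canned corpus, and the helpers `run_json_action` and `run_code_action` are all hypothetical names made up for illustration.

```python
import json

# Two toy tools; their names and canned data are hypothetical.
def search(query: str) -> list[str]:
    """Pretend retrieval: return canned passages for a known query."""
    corpus = {
        "uv": ["uv is a Python package manager."],
        "ollama": ["Ollama serves local models.", "Its default port is 11434."],
    }
    return corpus.get(query, [])

def count(items: list[str]) -> int:
    """Return how many passages were found."""
    return len(items)

TOOLS = {"search": search, "count": count}

def run_json_action(action_text: str):
    """JSON-style action: one tool call per step, arguments given as a dictionary."""
    action = json.loads(action_text)
    return TOOLS[action["tool"]](**action["arguments"])

def run_code_action(code: str):
    """Code-style action: a single snippet can chain several tool calls.

    Emptying __builtins__ only keeps this toy small; it is NOT a secure sandbox.
    """
    namespace = {"__builtins__": {}, **TOOLS}
    exec(code, namespace)
    return namespace.get("result")

# One step, one tool call.
json_result = run_json_action('{"tool": "search", "arguments": {"query": "ollama"}}')

# One step, several composed tool calls.
code_result = run_code_action("result = count(search('uv')) + count(search('ollama'))")
```

In the JSON style, combining the two searches would take several separate steps; in the code style, the composition fits in one action.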
There are two available RAG configurations: single-agent and multi-agent. The single-agent configuration uses one agent that has access to a Retrieval tool. The multi-agent configuration uses three agents: a RAG agent, which has access to a Retrieval tool (essentially the same as the agent in the single-agent configuration); a Web Search agent, which searches the web for information when needed; and a Manager agent, which decides how to handle the task given by the user and delegates subtasks to the other two agents.
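The two configurations can be sketched with plain Python functions. This is only an illustration of the delegation pattern, not the app's code: in the real app the Manager is an LLM-driven agent that decides what to delegate, whereas here a fixed rule (try retrieval first, fall back to web search) stands in for that decision. The document list, the word-overlap matching, and all function names are assumptions.

```python
# Hypothetical in-memory "vector store" for the sketch.
DOCUMENTS = [
    "smolagents is a library for building agents.",
    "uv manages Python dependencies.",
]

def retrieval_tool(query: str) -> list[str]:
    """Toy retrieval: return documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [d for d in DOCUMENTS if words & set(d.lower().rstrip(".").split())]

def rag_agent(task: str) -> str:
    """Single-agent configuration: answer from the Retrieval tool only."""
    hits = retrieval_tool(task)
    return hits[0] if hits else ""

def web_search_agent(task: str) -> str:
    """Placeholder: a real agent would call a web search tool here."""
    return f"[web search results for: {task}]"

def manager_agent(task: str) -> str:
    """Multi-agent configuration: delegate to the RAG agent, then to web search."""
    answer = rag_agent(task)
    return answer or web_search_agent(task)

found_locally = manager_agent("smolagents library")
needs_web = manager_agent("weather tomorrow")
```

The single-agent path is just `rag_agent`; the multi-agent path adds a routing layer on top of it, which is where the extra complexity comes from.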
The rule of thumb is: always go for simplicity! If the single-agent configuration performs well enough, it is probably the best one to use. A multi-agent setup is advantageous on more complex tasks, but for a simple RAG system such as this one, it is probably overkill.
The app has a simple UI: a sidebar that contains configuration options for the RAG system and document processing, and a main section that contains the chat history with the agent. The app uses Hugging Face's Inference API by default, and the LLM behind the endpoint is the Qwen2.5-32B model, which is free! You can also use OpenAI; just make sure to enter your API key. You can also use Ollama, but this feature is only available when the app is deployed locally, which you can learn more about in this section.
To upload documents to the vector store, you can use a Hugging Face dataset or PDFs. In the Documents section of the sidebar, select the documents you wish to store.
A demo of the app is available here: https://agentic-rag-demo.streamlit.app/ but it is advisable to explore the app locally on a machine that has a GPU (NVIDIA with CUDA cores), because inference and generating embeddings from the documents are a lot faster locally. Also, when you run locally, you can follow the model's reasoning step by step in the console!
This repository uses uv as its Python dependency management tool. Install uv with:

    curl -LsSf https://astral.sh/uv/install.sh | sh

Initialize virtual env and activate:

    uv venv
    source .venv/bin/activate

Install dependencies with:

    uv sync

You have the option to utilize models running on your own machine using Ollama.
To install ollama, run:

    curl -fsSL https://ollama.com/install.sh | sh

You can run models with ollama using the commands:

    ollama run <model name>

For example, if you'd like to use Meta's Llama3.2 model, run:

    ollama run llama3.2

This command will fetch the model (in the case above, it will fetch Llama3.2 3B model) from Ollama's model hub, then run and serve the model. The default endpoint is http://localhost:11434
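To check that the served model is reachable, you can send a request to the default endpoint. The sketch below uses only the Python standard library and Ollama's `/api/generate` route with streaming turned off; it is independent of how the app itself talks to Ollama, and the helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default endpoint mentioned above

def build_generate_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the prompt to a locally running Ollama server and return its text reply."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the model being served, calling `generate("Say hello in one sentence.")` should return the model's reply as a string; if the server is not running, the call raises a connection error.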
Run streamlit run app.py to launch the app.
