Skip to content

Backend Inference Refactor#35

Draft
ShafathZ wants to merge 5 commits into
mainfrom
backend-inference-refactor
Draft

Backend Inference Refactor#35
ShafathZ wants to merge 5 commits into
mainfrom
backend-inference-refactor

Conversation

@ShafathZ
Copy link
Copy Markdown
Owner

Implements a new Object-Oriented and Scalable backend formation to allow for easier agentic framework implementation

  • Abstracts models into class-based system
  • Abstracts inference and model switching logic to InferenceManager
  • Implements downtime switching to prioritize models to use by FIRST ACTIVE
  • Added TODO for missing portions in pipeline

@ShafathZ ShafathZ requested a review from Suryanshg April 12, 2026 10:31
Copy link
Copy Markdown
Collaborator

@Suryanshg Suryanshg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments for now, will be doing another round of review soon

Comment thread backend/app.py
Comment thread backend/inference_manager.py
Comment thread backend/inference_manager.py
Comment thread backend/inference_manager.py Outdated
# Use the last message as the user message (it should always be a user message)
user_query = messages[-1]['content']
retrieved_docs: List[AniZenithVectorSearchResult] = self.db_client.perform_vector_search(user_query, limit=VECTOR_SEARCH_LIMIT)
print(f"Retrieved Docs: ({len(retrieved_docs)})")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
print(f"Retrieved Docs: ({len(retrieved_docs)})") --> print(f"Retrieved ({len(retrieved_docs)}) relevant docs")

Comment thread backend/inference_manager.py Outdated
print(f"Retrieved Docs: ({len(retrieved_docs)})")

# 2) Rerank results using the reranker based on document info and user query
with CHATBOT_PIPELINE_LATENCY_SUMMARY.labels(model=current_model.get_name(), stage="reranker").time():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultranit: use stage="reranking" as its a verb?

Comment thread backend/inference_manager.py Outdated

# 2) Rerank results using the reranker based on document info and user query
with CHATBOT_PIPELINE_LATENCY_SUMMARY.labels(model=current_model.get_name(), stage="reranker").time():
recommended_docs: List[AniZenithVectorSearchResult] = self.reranker.rerank(user_query, retrieved_docs, limit=RERANK_LIMIT)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename: recommended_docs --> reranked_docs

Comment thread backend/inference_manager.py Outdated

# Add base system prompt
lines.append(SYSTEM_PROMPT)
lines.append("Here are the recommendation system's top shows:\n")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can just add this directly to the System Prompt, instead of manually doing it here?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that, but we also might want to change how the system prompt is (for example adding more context strings), so I think it is good practice to keep it separated

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it should be added to some config system yes

Comment thread backend/inference_manager.py
Comment thread frontend/app.py Outdated
from fastapi.middleware.cors import CORSMiddleware
import logging
from prometheus.prometheus_middleware import PrometheusMiddleware, prometheus_router
from dotenv import load_dotenv
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this, as you might've added it for testing locally?
The docker compose yaml is already injecting env variables using a specific frontend.env file

Comment thread backend/models.py
Comment thread backend/models.py Outdated

def stream(self, messages: List[Dict[str, str]]):
self._usage_data = None
print("Starting tokenize")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either remove this print statement, or add this to the stream() definition of HFInferenceClientModel

Comment thread backend/models.py Outdated
thread.start()

# Accumulate usage
input_token_count = inputs['input_ids'].shape[-1]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its not too much trouble, can you also add a small inline comment to depict the shape of inputs['input_ids']? Something like:

# inputs['input_ids'] has shape (x, y, z)
input_token_count = inputs['input_ids'].shape[-1]

This improves readability of ML based code a lot (atleast for me)

Comment thread backend/models.py Outdated

def generate():
# Ensure no gradients
with torch.inference_mode():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the difference between this and with torch.no_grad():?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at docs to see recommended ways to make inference not using pipeline faster. This was one recommendation, but I believe it is not actually doing anything differently yes.

Comment thread backend/models.py

self._usage_data = None

def stream(self, messages: List[Dict[str, str]]):
Copy link
Copy Markdown
Collaborator

@Suryanshg Suryanshg Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this method is thread safe, as two concurrent requests can arrive anytime and the second request can overwrite self._usage_data (originally used for first request). You can have similar problems with self._thread_error variable.

I am thinking how to make it thread safe, but some basic ways are:

  1. Coding "pure functions" (if you don't know, you can look it up or we can discuss)
  2. Write non blocking code (which in some sense you already are doing it, but usage of threads in fast-api should be discouraged, as fast-api is async world and should use event loop based processing)
  3. Maybe just use regular variables instead of class variables and return the usage data value within a Tuple of something

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is not thread safe. There is a TODO in InferenceManager to add a blocking queue system. The idea is, these models only call one stream() job at once (or multiple if we have multiple models loaded in backend). I looked at some ways to do this, but it is not trivial, so we accept one request at a time for now

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: After discussion, this program requires significant work and needs an additional PR

Copy link
Copy Markdown
Collaborator

@Suryanshg Suryanshg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more comments related to concurrency issues identified with the code

Comment thread backend/models.py
Comment thread backend/models.py
Comment thread backend/models.py
# Accumulate usage
input_token_count = inputs['input_ids'].shape[-1]
output_token_count = 0
for text in streamer:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that the streamer always yields individual tokens, instead of decoded strings (can be multiple tokens at once). Or maybe it can yield empty strings or combine multiple tokens before yielding.

In all these cases, the token count won't exactly match the actual discrete outputs...

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, TODO: Add test case to prove in model test cases

Comment thread tests/test_chat_models.py
collected_result = ""
for result in chat_with_llm(messages=[{"role": "user", "content": TEST_USER_MESSAGE}],
use_local_model=use_local_model):
collected_result = result
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this does not call the local model, so test is invalid. Needs monkeypatch fix

@ShafathZ ShafathZ marked this pull request as draft April 17, 2026 22:39
@ShafathZ ShafathZ added the WIP PR is on hold for future Sprint label Apr 17, 2026
@ShafathZ ShafathZ self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

WIP PR is on hold for future Sprint

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants