Backend Inference Refactor by ShafathZ · Pull Request #35 · ShafathZ/AniZenithProject

ShafathZ · 2026-04-12T10:31:06Z

Implements a new object-oriented, scalable backend structure to make an agentic framework easier to implement

Abstracts models into a class-based system
Abstracts inference and model-switching logic into InferenceManager
Implements downtime switching that prioritizes which model to use by FIRST ACTIVE
Adds TODO markers for the missing portions of the pipeline

Suryanshg

General comments for now; I will do another round of review soon.

Suryanshg · 2026-04-13T21:25:53Z

+                # Use the last message as the user message (it should always be a user message)
+                user_query = messages[-1]['content']
+                retrieved_docs: List[AniZenithVectorSearchResult] = self.db_client.perform_vector_search(user_query, limit=VECTOR_SEARCH_LIMIT)
+                print(f"Retrieved Docs: ({len(retrieved_docs)})")


nit:
print(f"Retrieved Docs: ({len(retrieved_docs)})") --> print(f"Retrieved ({len(retrieved_docs)}) relevant docs")

Suryanshg · 2026-04-13T21:27:07Z

+                print(f"Retrieved Docs: ({len(retrieved_docs)})")
+
+            # 2) Rerank results using the reranker based on document info and user query
+            with CHATBOT_PIPELINE_LATENCY_SUMMARY.labels(model=current_model.get_name(), stage="reranker").time():


ultranit: use stage="reranking", as it's a verb?

Suryanshg · 2026-04-13T21:27:50Z

+
+            # 2) Rerank results using the reranker based on document info and user query
+            with CHATBOT_PIPELINE_LATENCY_SUMMARY.labels(model=current_model.get_name(), stage="reranker").time():
+                recommended_docs: List[AniZenithVectorSearchResult] = self.reranker.rerank(user_query, retrieved_docs, limit=RERANK_LIMIT)


rename: recommended_docs --> reranked_docs

Suryanshg · 2026-04-13T21:30:11Z

+
+        # Add base system prompt
+        lines.append(SYSTEM_PROMPT)
+        lines.append("Here are the recommendation system's top shows:\n")


Maybe you can just add this directly to the system prompt instead of appending it manually here?

I was thinking that, but we might also want to change how the system prompt is built (for example, by adding more context strings), so I think it is good practice to keep the two separate.

But yes, it should be moved into some config system.
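One way the "config system" idea could look, as a minimal sketch only. PromptConfig, build_system_message, and the placeholder system prompt text are hypothetical names and values for illustration; they are not part of this PR.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PromptConfig:
    # Placeholder text; the project's real SYSTEM_PROMPT is not shown in this thread.
    system_prompt: str = "You are an anime recommendation assistant."
    # Header taken from the diff above; kept separate so it can change independently.
    docs_header: str = "Here are the recommendation system's top shows:\n"
    # Optional extra context strings, inserted between the prompt and the header.
    extra_context: Tuple[str, ...] = ()


def build_system_message(config: PromptConfig, doc_lines: List[str]) -> str:
    """Assemble the system message from config pieces plus retrieved doc lines."""
    lines = [config.system_prompt, *config.extra_context, config.docs_header, *doc_lines]
    return "\n".join(lines)


default_message = build_system_message(PromptConfig(), ["1. Show A", "2. Show B"])
```

This keeps the base prompt and the recommendation header as separate fields, so either can change without touching the code that assembles the message.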

Suryanshg · 2026-04-13T21:34:13Z

 from fastapi.middleware.cors import CORSMiddleware
 import logging
 from prometheus.prometheus_middleware import PrometheusMiddleware, prometheus_router
+from dotenv import load_dotenv


Can you revert this? You might have added it for local testing.
The docker compose YAML already injects env variables using a specific frontend.env file.

Suryanshg · 2026-04-13T23:30:19Z

+
+    def stream(self, messages: List[Dict[str, str]]):
+        self._usage_data = None
+        print("Starting tokenize")


Either remove this print statement or add the same one to the stream() definition of HFInferenceClientModel.

Suryanshg · 2026-04-13T23:35:48Z

+        thread.start()
+
+        # Accumulate usage
+        input_token_count = inputs['input_ids'].shape[-1]


If it's not too much trouble, can you also add a small inline comment describing the shape of inputs['input_ids']? Something like:

# inputs['input_ids'] has shape (x, y, z)
input_token_count = inputs['input_ids'].shape[-1]

This improves the readability of ML-based code a lot (at least for me).

Suryanshg · 2026-04-13T23:38:27Z

+
+        def generate():
+            # Ensure no gradients
+            with torch.inference_mode():


What's the difference between this and with torch.no_grad():?

I was looking at the docs for recommended ways to speed up inference without using pipeline. This was one recommendation, but I believe it does not actually do anything different here.

Suryanshg · 2026-04-14T00:41:56Z

+
+        self._usage_data = None
+
+    def stream(self, messages: List[Dict[str, str]]):


I do not think this method is thread safe: two concurrent requests can arrive at any time, and the second request can overwrite self._usage_data while the first request is still using it. The self._thread_error variable can have the same problem.

I am still thinking about how to make it thread safe, but some basic options are:

Write "pure functions" (if you are not familiar with them, you can look it up or we can discuss)

Write non-blocking code (which you are already doing in some sense, but threads should be discouraged in FastAPI, which is async and built around event-loop-based processing)

Use local variables instead of instance attributes, and return the usage data as part of a tuple or similar

Yes, it is not thread safe. There is a TODO in InferenceManager to add a blocking queue system. The idea is that each model runs only one stream() job at a time (or several jobs run in parallel if multiple models are loaded in the backend). I looked at some ways to do this, but it is not trivial, so we accept one request at a time for now.

TODO: After discussion, this problem requires significant work and needs an additional PR.
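For the follow-up PR, the "local variables instead of instance attributes" option could be sketched roughly as below. This is only an illustration under assumptions: Usage, StreamResult, EchoModel, and run_concurrently are hypothetical stand-ins, and a word count substitutes for a real token count.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Usage:
    # Hypothetical usage record; the PR's own usage class may look different.
    input_tokens: int = 0
    output_chunks: int = 0


class StreamResult:
    """Holds the state of ONE stream() call, so requests never share it."""

    def __init__(self, chunks: Iterator[str], usage: Usage) -> None:
        self._chunks = chunks
        self.usage = usage

    def __iter__(self) -> Iterator[str]:
        for chunk in self._chunks:
            self.usage.output_chunks += 1
            yield chunk


class EchoModel:
    """Stand-in model: echoes the last message one word at a time."""

    def stream(self, messages: List[Dict[str, str]]) -> StreamResult:
        words = messages[-1]["content"].split()
        # Word count stands in for a real token count in this sketch.
        usage = Usage(input_tokens=len(words))  # local, not self._usage_data
        return StreamResult(iter(words), usage)


def run_concurrently(model: EchoModel, prompts: List[str]) -> list:
    """Run one stream() call per prompt on its own thread and collect results."""
    results: list = [None] * len(prompts)

    def worker(index: int, prompt: str) -> None:
        result = model.stream([{"role": "user", "content": prompt}])
        text = " ".join(result)  # consuming the stream updates result.usage
        results[index] = (text, result.usage)

    threads = [threading.Thread(target=worker, args=(i, p)) for i, p in enumerate(prompts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each call gets its own StreamResult, one request cannot overwrite another request's usage data. This does not replace the blocking queue TODO: it only removes the shared mutable state, not the need to limit how many generations run on a model at once.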

Suryanshg

Added some more comments related to concurrency issues identified with the code

Suryanshg · 2026-04-14T14:00:10Z

+        # Accumulate usage
+        input_token_count = inputs['input_ids'].shape[-1]
+        output_token_count = 0
+        for text in streamer:


Is it guaranteed that the streamer always yields individual tokens rather than decoded strings, which can span multiple tokens at once? It might also yield empty strings, or combine multiple tokens before yielding.

In all these cases, the token count won't exactly match the actual number of generated tokens...

As discussed, TODO: add a test case for this to the model test cases.
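A rough sketch of the kind of test case this could become. It assumes the streamer yields decoded text and may buffer several token pieces before yielding; fake_streamer and the sample pieces below are hypothetical and only imitate that behaviour, they are not the real streamer.

```python
from typing import Iterable, Iterator, List


def fake_streamer(token_pieces: List[str]) -> Iterator[str]:
    """Hypothetical streamer: buffers pieces and flushes only after whitespace."""
    buffer = ""
    for piece in token_pieces:
        buffer += piece
        if buffer.endswith(" "):
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush whatever is left at the end of generation


def count_chunks(chunks: Iterable[str]) -> int:
    """Count yielded chunks, which is what a `for text in streamer` loop counts."""
    return sum(1 for _ in chunks)


# Six decoded token pieces that the fake streamer groups into three chunks.
pieces = ["Stein", "s", ";", "Gate ", "is ", "great"]
token_count = len(pieces)
chunk_count = count_chunks(fake_streamer(pieces))
```

Here the loop counter would report 3 while 6 tokens were generated. If the real streamer behaves like this, the output count would need to come from the generated token ids rather than from the number of yielded strings.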

ShafathZ · 2026-04-14T19:07:43Z

    collected_result = ""
-    for result in chat_with_llm(messages=[{"role": "user", "content": TEST_USER_MESSAGE}],
-                                use_local_model=use_local_model):
-        collected_result = result


Currently this does not call the local model, so the test is invalid. It needs a monkeypatch fix.
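A possible shape for that fix, sketched with unittest.mock from the standard library (pytest's monkeypatch fixture could do the same job with setattr). The classes below, including this simplified InferenceManager and its pick_model method, are hypothetical stand-ins and not the PR's real code.

```python
from unittest import mock


class LocalModel:
    def stream(self, messages):
        yield "local reply"


class RemoteModel:
    def stream(self, messages):
        yield "remote reply"


class InferenceManager:
    """Hypothetical stand-in; the PR's real InferenceManager may differ."""

    def __init__(self):
        self.models = [RemoteModel(), LocalModel()]

    def pick_model(self):
        # By default the first model wins, so the local model is never used.
        return self.models[0]

    def chat(self, messages):
        return "".join(self.pick_model().stream(messages))


def test_chat_uses_local_model():
    manager = InferenceManager()
    local = manager.models[1]
    # Force model selection to the local model and spy on its stream() call.
    with mock.patch.object(manager, "pick_model", return_value=local), \
            mock.patch.object(local, "stream", wraps=local.stream) as spy:
        result = manager.chat([{"role": "user", "content": "hi"}])
    spy.assert_called_once()  # proves the local model actually ran
    return result
```

The spy assertion is the important part: it makes the test fail if the local model is silently skipped, which is the gap described above.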

ShafathZ added 3 commits April 12, 2026 04:07

Constructed new easily scalable and modifiable backend structure

e6ab551

Fixed backend local model code issues, working but slow on CPU

0610e9e

Improved scalability defining usage class, improved comments

5e9217e

ShafathZ requested a review from Suryanshg April 12, 2026 10:31

Suryanshg reviewed Apr 13, 2026


Suryanshg requested changes Apr 14, 2026


Addressed some PR comments

ba723d2

ShafathZ commented Apr 14, 2026


Addressed more PR comments

015f071

ShafathZ marked this pull request as draft April 17, 2026 22:39

ShafathZ added the WIP (PR is on hold for future Sprint) label Apr 17, 2026

ShafathZ self-assigned this Apr 17, 2026


2 participants
