Skip to content

feat(ai_provider): add retry handling, exponential backoff, and timeout resilience#245

Open
Harshit-Maurya838 wants to merge 3 commits into
imDarshanGK:mainfrom
Harshit-Maurya838:feature/improve-llm-timeout-retry-handling-203
Open

feat(ai_provider): add retry handling, exponential backoff, and timeout resilience#245
Harshit-Maurya838 wants to merge 3 commits into
imDarshanGK:mainfrom
Harshit-Maurya838:feature/improve-llm-timeout-retry-handling-203

Conversation

@Harshit-Maurya838
Copy link
Copy Markdown

Which issue does this PR close?

Closes #203


Description

This PR improves the reliability and observability of LLM integrations in ai_provider.py by introducing configurable retry handling, exponential backoff, structured logging, and comprehensive timeout/error handling while maintaining backward compatibility.

Key Improvements

Retry & Backoff Support

  • Added configurable retry handling for:

    • request timeouts
    • connection failures
    • HTTP 429 rate limits
    • HTTP 5xx provider errors
  • Implemented exponential backoff using:

    LLM_RETRY_BACKOFF * (2 ** attempt)

Improved Logging & Observability

  • Replaced print-based logging with structured logging

  • Added provider-level observability including:

    • provider name
    • retry attempt count
    • latency metrics
    • failure type
    • final request status

Better Error Handling

  • Added granular handling for:

    • httpx.HTTPStatusError
    • httpx.RequestError
    • timeout and connection failures
  • Prevented retries for invalid client-side requests (400/401)

  • Preserved graceful fallback behavior and offline compatibility

Configuration Updates

Added new environment variables:

  • LLM_MAX_RETRIES
  • LLM_RETRY_BACKOFF

Updated:

  • backend/app/config.py
  • .env.example

Documentation

  • Added an "LLM Provider Reliability" section in README
  • Documented retry behavior, fallback handling, and observability improvements

Unit Tests

Added comprehensive test coverage in:

  • backend/tests/test_ai_provider.py

Test scenarios include:

  • successful responses
  • retry exhaustion on timeout
  • retries after HTTP 5xx failures
  • no retry for HTTP 400 errors
  • retry handling for HTTP 429
  • provider-disabled behavior

Type of Change

  • Bug fix
  • Enhancement
  • Tests added/updated
  • Documentation updated

Verification

pytest backend/tests/test_ai_provider.py -v

Test Results

7 passed in 0.58s

Notes

  • The existing call_llm() return signature (str | None) was intentionally preserved to avoid breaking changes.
  • Provider names are inferred internally from LLM_BASE_URL to keep configuration minimal and backward-compatible.

@Harshit-Maurya838
Copy link
Copy Markdown
Author

@imDarshanGK Conflicts has been resolved and all points in issue has been implemented. Please review it.

@Harshit-Maurya838
Copy link
Copy Markdown
Author

@imDarshanGK Please review this PR i resolved the issue and resolve the merge conflicts also

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add detailed error messages and retry guidance for LLM provider timeouts

1 participant