
Add dynamic batching, FP16, and /metrics to the mBERT API#177

Open
ajamous wants to merge 625 commits into `main` from `feat/mbert-dynamic-batching`

Conversation


ajamous commented Apr 15, 2026

Why

Recent GPU load tests on a g4dn.4xlarge (Tesla T4) showed the API serialises one HTTP request per GPU forward pass and pads every input to the full 512-token max length. The result: the GPU sits idle ~93% of the time, sustained throughput caps at ~5 MPS, and a 600 MPS burst takes ~80 minutes to drain — despite the hardware being capable of hundreds of MPS. See the test report in the accompanying issue/discussion for the full numbers.

This PR captures the highest-leverage optimisations identified in that analysis and makes them tunable via environment variables so the bridge team can dial them in per deployment.

What's in the PR

Dynamic batching

  • New DynamicBatcher in src/api_interface/services/batching_service.py
  • Coalesces concurrent single-message requests into padded batches (one tokenizer call, one forward pass, results split back to each caller's asyncio.Future)
  • Configurable collection window (batch_wait_ms) and max_batch_size
  • In-memory metrics: request/batch counters, queue depth, batch-size histogram, inference time
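The core coalescing loop can be sketched as follows. This is a minimal illustration of the technique, not the actual `batching_service.py` code: the class name matches the PR, but `predict_batch` and the exact queue handling are assumptions.

```python
import asyncio
import time

class DynamicBatcher:
    """Coalesce concurrent single-message requests into one batched call.

    `predict_batch` is a hypothetical callable taking a list of texts and
    returning one result per text (one tokenizer call, one forward pass).
    """

    def __init__(self, predict_batch, max_batch_size=32, batch_wait_ms=15):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.batch_wait_ms = batch_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    def start(self):
        self._worker = asyncio.get_running_loop().create_task(self._loop())

    async def submit(self, text: str):
        # each caller parks on its own Future until the batch resolves
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def _loop(self):
        while True:
            # block for the first item, then collect more until the
            # window closes or the batch is full
            item = await self.queue.get()
            batch = [item]
            deadline = time.monotonic() + self.batch_wait_ms / 1000
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            texts = [t for t, _ in batch]
            try:
                results = self.predict_batch(texts)  # one forward pass
                for (_, fut), result in zip(batch, results):
                    fut.set_result(result)
            except Exception as exc:
                # a model error must propagate to every waiting caller
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
```

The key design point is that callers never see the batch: `submit()` looks like a plain single-item async call, and batching is purely a throughput optimisation behind it.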

FP16 + sequence-length fixes

  • FP16 weights on CUDA (guarded; CPU/MPS keep FP32) for ~2× throughput on tensor-core GPUs
  • max_text_length lowered from 512 → 96 with padding='longest' so short SMS no longer waste ~10× the FLOPs
  • torch.inference_mode() in place of torch.no_grad()
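The device/precision guard can be sketched like this. The function name and structure are illustrative assumptions, not the PR's actual `model_loader.py` code:

```python
import torch

def prepare_model(model: torch.nn.Module, use_fp16: bool = True) -> torch.nn.Module:
    """Pick device and precision: FP16 only on CUDA, FP32 everywhere else."""
    if torch.cuda.is_available():
        model = model.cuda()
        if use_fp16:
            model = model.half()  # ~2x throughput on tensor-core GPUs (T4/A10/L4)
    elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        model = model.to("mps")   # MPS stays FP32
    model.eval()
    return model

# inference_mode() is a stricter no_grad(): it also skips view tracking and
# version counters, so outputs cannot require grad at all
@torch.inference_mode()
def run_inference(model, **inputs):
    return model(**inputs)
```

On CPU-only hosts `prepare_model` is a no-op apart from `eval()`, which is what makes the FP16 switch safe to leave on by default.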

Observability

  • GET /metrics Prometheus-compatible endpoint (no extra dep — hand-rolled exposition format)
  • Series: ots_requests_total, ots_batches_total, ots_inference_seconds_total, ots_queue_depth, ots_last_batch_size, ots_batch_size_bucket{le="N"}, ots_api_info{device,fp16,max_text_length,version}
  • Designed for the ots-bridge to scrape and drive adaptive concurrency
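A hand-rolled exposition endpoint is just string formatting. The series names below come from this PR; the renderer itself is a sketch (the real `metrics.py` may structure its state differently). Note that Prometheus histogram buckets are cumulative and must end with `le="+Inf"`:

```python
def render_metrics(requests_total, batches_total, queue_depth, batch_size_buckets):
    """Render counters/gauges/histogram in the Prometheus text format."""
    lines = [
        "# TYPE ots_requests_total counter",
        f"ots_requests_total {requests_total}",
        "# TYPE ots_batches_total counter",
        f"ots_batches_total {batches_total}",
        "# TYPE ots_queue_depth gauge",
        f"ots_queue_depth {queue_depth}",
    ]
    # buckets are cumulative: each le bucket counts all batches of size <= le
    cumulative = 0
    for le in sorted(batch_size_buckets):
        cumulative += batch_size_buckets[le]
        lines.append(f'ots_batch_size_bucket{{le="{le}"}} {cumulative}')
    lines.append(f'ots_batch_size_bucket{{le="+Inf"}} {cumulative}')
    return "\n".join(lines) + "\n"
```

Serving this from a plain FastAPI route with `media_type="text/plain"` is enough for Prometheus to scrape, which is how the PR avoids adding `prometheus_client` as a dependency.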

Config knobs (all OTS_-prefixed env vars)

| Setting | Default | Description |
|---|---|---|
| `batching_enabled` | `true` | Master switch |
| `max_batch_size` | `32` | Max requests per forward pass |
| `batch_wait_ms` | `15` | Collection window |
| `max_text_length` | `96` | Token truncation |
| `use_fp16` | `true` | FP16 on CUDA only |
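Reading these knobs amounts to a thin layer over the environment. A minimal sketch assuming plain `os.environ` (the real `settings.py` may use a settings library instead):

```python
import os

def _get(name, default, cast=str):
    """Read an OTS_-prefixed env var, falling back to the default."""
    raw = os.environ.get(f"OTS_{name.upper()}")
    if raw is None:
        return default
    if cast is bool:
        # accept common truthy spellings so docker-compose files stay forgiving
        return raw.strip().lower() in ("1", "true", "yes", "on")
    return cast(raw)

def load_settings():
    return {
        "batching_enabled": _get("batching_enabled", True, bool),
        "max_batch_size": _get("max_batch_size", 32, int),
        "batch_wait_ms": _get("batch_wait_ms", 15, int),
        "max_text_length": _get("max_text_length", 96, int),
        "use_fp16": _get("use_fp16", True, bool),
    }
```

For example, `OTS_BATCHING_ENABLED=false OTS_MAX_BATCH_SIZE=8` would disable batching while keeping the smaller batch cap ready for when it is re-enabled.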

Tests

  • 10 new pytest tests (8 batcher + 2 metrics endpoint) — all passing
  • Exercise real async logic against a stub torch.nn.Module (no mBERT weights needed)
  • Run locally: pytest src/api_interface/tests/ --asyncio-mode=auto

Docs

  • CLAUDE.md — new "Dynamic Batching" and "Observability: /metrics" sections
  • README.md — updated Performance section + Health Checks section

Test plan

  • Unit tests: 10/10 passing on CPU
  • FastAPI app imports cleanly with /metrics registered in route table
  • Single-request HTTP contract unchanged (/predict/ response shape identical)
  • Backward-compatible: OTS_BATCHING_ENABLED=false falls back to original per-request path
  • GPU benchmark against v2.8-amd64 baseline on a T4 (needs environment with mBERT weights + CUDA — cannot be done in CI)
  • Verify ots-bridge can raise its concurrency throttle (25 → 200+) against the new image

Backward compatibility

  • /predict/ request/response shape is unchanged
  • Per-request thread-pool fallback preserved for batching_enabled=false
  • No new runtime dependencies
  • Existing /health, /audit, /feedback, TMForum endpoints untouched

Expected performance impact (from analysis)

| Configuration | Est. per-batch GPU time (T4, FP16) | Effective MPS |
|---|---|---|
| Today (batch=1, 512 tokens) | ~460 ms | ~5 MPS |
| After this PR (batch=32, 96 tokens, FP16) | ~20–40 ms | ~800–1500 MPS |

Numbers to be validated by the bridge team's benchmark run against the built image.

Files changed

  • New: src/api_interface/services/batching_service.py, src/api_interface/routers/metrics.py, src/api_interface/tests/{__init__,conftest,test_batching_service,test_metrics_endpoint}.py
  • Modified: src/api_interface/services/{model_loader,prediction_service}.py, src/api_interface/config/settings.py, src/api_interface/main.py, src/api_interface/routers/__init__.py, CLAUDE.md, README.md

ajamous and others added 30 commits September 9, 2024 15:10
…RT/training/mbert-mlx-apple-silicon/jinja2-3.1.4

Bump jinja2 from 3.1.3 to 3.1.4 in /src/mBERT/training/mbert-mlx-apple-silicon
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.18 to 1.26.19.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.18...1.26.19)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…RT/training/mbert-mlx-apple-silicon/urllib3-1.26.19

Bump urllib3 from 1.26.18 to 1.26.19 in /src/mBERT/training/mbert-mlx-apple-silicon
…ja2-3.1.4

Bump jinja2 from 3.1.3 to 3.1.4 in /src
…ja2-3.1.4

Bump jinja2 from 3.1.3 to 3.1.4 in /src
Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.2.0 to 1.5.0.
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases)
- [Commits](scikit-learn/scikit-learn@1.2.0...1.5.0)

---
updated-dependencies:
- dependency-name: scikit-learn
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…RT/training/mbert-mlx-apple-silicon/scikit-learn-1.5.0

Bump scikit-learn from 1.2.0 to 1.5.0 in /src/mBERT/training/mbert-mlx-apple-silicon
…T/training/bert-mlx-apple-silicon/jinja2-3.1.4

Bump jinja2 from 3.1.3 to 3.1.4 in /src/BERT/training/bert-mlx-apple-silicon
…RT/training/mbert-mlx-apple-silicon/werkzeug-3.0.3

Bump werkzeug from 2.3.8 to 3.0.3 in /src/mBERT/training/mbert-mlx-apple-silicon
…T/training/bert-mlx-apple-silicon/werkzeug-3.0.3

Bump werkzeug from 2.3.8 to 3.0.3 in /src/BERT/training/bert-mlx-apple-silicon
Correct some broken urls
Correct some broken urls
ajamous and others added 30 commits February 18, 2026 17:30
NumPy <2.4 stored the full BSD license text in the package metadata
License field, which starts with "All rights reserved." — causing SPDX
scanners to misclassify it as a restrictive license. NumPy 2.4+ uses
the PEP 639 License-Expression field with proper SPDX identifiers
(BSD-3-Clause), eliminating the false positive.

Updated across all 6 requirements files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bumps [filelock](https://github.com/tox-dev/py-filelock) from 3.13.1 to 3.20.3.
- [Release notes](https://github.com/tox-dev/py-filelock/releases)
- [Changelog](https://github.com/tox-dev/filelock/blob/main/docs/changelog.rst)
- [Commits](tox-dev/filelock@3.13.1...3.20.3)

---
updated-dependencies:
- dependency-name: filelock
  dependency-version: 3.20.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.6 to 3.1.5.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](pallets/werkzeug@3.0.6...3.1.5)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.1.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
NumPy <2.4 stored the full BSD license text in the package metadata
License field, which starts with "All rights reserved." — causing SPDX
scanners to misclassify it as a restrictive license. NumPy 2.4+ uses
the PEP 639 License-Expression field with proper SPDX identifiers
(BSD-3-Clause), eliminating the false positive.

Updated across all 6 requirements files.
Bumps [flask](https://github.com/pallets/flask) from 2.2.5 to 3.1.3.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.2.5...3.1.3)

---
updated-dependencies:
- dependency-name: flask
  dependency-version: 3.1.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
- Offload synchronous mBERT inference to ThreadPoolExecutor(max_workers=4)
  via asyncio.run_in_executor, keeping FastAPI event loop responsive
- Make audit logging fire-and-forget (non-blocking) in prediction router
- On CUDA GPUs, GIL release during kernel execution enables concurrent
  inferences within a single worker (benchmarked 1.8x on Apple MPS)
- Health checks remain responsive (~6ms) during concurrent inference load
- Add hardware specification sheet (MD + DOCX) for VMware deployments
  targeting 25 TPS / <500ms latency with three configuration options
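The executor offload described in this commit can be sketched as follows. The function names and the "ham"/"spam" labels are placeholders, not the project's real prediction code; the pool size matches the commit:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# one shared pool for blocking inference, sized as in the commit above
_executor = ThreadPoolExecutor(max_workers=4)

def _classify_sync(text: str) -> str:
    # placeholder for the synchronous mBERT forward pass
    return "spam" if "win a prize" in text.lower() else "ham"

async def classify(text: str) -> str:
    loop = asyncio.get_running_loop()
    # offload the blocking call so the event loop stays free to answer
    # /health and other requests while the GPU/CPU does inference
    return await loop.run_in_executor(_executor, _classify_sync, text)
```

Because PyTorch releases the GIL during CUDA kernel execution, several of these worker threads can genuinely overlap on a GPU, which is what the commit's 1.8x concurrency figure relies on.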
SMPP Interface (src/smpp_interface/):
- SMPP-to-SMPP proxy that classifies SMS via OTS mBERT API before forwarding
- Supports bind_transceiver auth, upstream connection pooling, DLR relay
- UDH/multipart SMS passthrough, round-robin multi-API load balancing
- Production hardened: exponential backoff reconnect, message_store size caps,
  client keepalive, graceful shutdown, structured logging
- Full test suites: 47 basic + 32 advanced tests (UDH, emoji, async, benchmarks)
- Documentation with architecture diagrams (README.md + Word export)

Platform scaling (v2.7):
- nginx.conf upgraded for 10x API instance load balancing with least_conn
- docker-compose configs for 2x and 10x horizontal scaling
- Load balancer test scripts and benchmark tooling
- GPU verification, stress testing, and throughput benchmarking
- Deployment guides, release notes, and performance analysis reports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite openapi.yaml: 1 endpoint -> 16 endpoints covering all API
  families (core, TMF922, TMF688) with full schema definitions
- Fix missing get_model_version() in tmforum_service.py that caused
  500 on every TMF922 job creation
- Fix double-timezone parsing bug in audit_service.py that silently
  broke all TMF688 event listing
…llib3-2.6.3

Dependency update: urllib3 2.6.0 -> 2.6.3 in src/
…yptography-46.0.5

Dependency update: cryptography 44.0.1 -> 46.0.5 in src/
…ck-3.20.3

Dependency update: filelock 3.13.1 -> 3.20.3 in root and src/
…lelock-3.20.3

Dependency update: numpy >=2.4.2, filelock 3.20.3 across all requirements files
…ERT/training/model-training/fonttools-4.61.0

Bump fonttools from 4.43.0 to 4.61.0 in /src/mBERT/training/model-training
…ERT/training/model-training/starlette-0.49.1

Bump starlette from 0.40.0 to 0.49.1 in /src/mBERT/training/model-training
…ERT/training/model-training/torch-2.8.0

Bump torch from 2.7.1 to 2.8.0 in /src/mBERT/training/model-training
…ERT/training/model-training/requests-2.32.4

Bump requests from 2.32.3 to 2.32.4 in /src/mBERT/training/model-training
…ERT/training/model-training/aiohttp-3.13.3

Bump aiohttp from 3.12.14 to 3.13.3 in /src/mBERT/training/model-training
…ERT/training/model-training/urllib3-2.6.3

Bump urllib3 from 2.2.3 to 2.6.3 in /src/mBERT/training/model-training
…ERT/training/model-training/filelock-3.20.3

Bump filelock from 3.13.1 to 3.20.3 in /src/mBERT/training/model-training
…ERT/training/model-training/sentencepiece-0.2.1

Bump sentencepiece from 0.1.99 to 0.2.1 in /src/mBERT/training/model-training
…ERT/training/model-training/protobuf-5.29.6

Bump protobuf from 5.29.5 to 5.29.6 in /src/mBERT/training/model-training
…ERT/training/model-training/cryptography-46.0.5

Bump cryptography from 43.0.1 to 46.0.5 in /src/mBERT/training/model-training
…ERT/training/model-training/pillow-12.1.1

Bump pillow from 10.3.0 to 12.1.1 in /src/mBERT/training/model-training
…ERT/training/model-training/flask-3.1.3

Bump flask from 2.2.5 to 3.1.3 in /src/mBERT/training/model-training
…ERT/training/model-training/werkzeug-3.1.5

Bump werkzeug from 3.0.6 to 3.1.5 in /src/mBERT/training/model-training
Pre-existing dependency conflict: fastapi 0.115.14 requires
starlette <0.47.0 but requirements.txt pinned starlette 0.49.1.
…eprecated encode_plus

The Docker builds using requirements-security.txt had transformers>=4.53.0 (unpinned
upper bound), which pulled a newer incompatible version where BertTokenizer.encode_plus
was removed. This caused all classification requests to fail with:
  "BertTokenizer has no attribute encode_plus"

Changes:
- Pin transformers==4.53.0 in requirements-security.txt (matches requirements.txt)
- Add upper bounds to torch, huggingface-hub, safetensors, numpy, peft to prevent
  similar untested major version upgrades from breaking Docker builds
- Replace all tokenizer.encode_plus() calls with tokenizer() across the codebase
  (the __call__ method is the modern, forward-compatible API that accepts identical
  parameters)

Affected files: prediction_service.py, test_sms.py, stressTest_500.py,
stressTest_1000_mlx.py, train_ots.py, train_ots_improved.py, compare_models.py,
train_incremental.py, train_enhanced_multilingual.py

https://claude.ai/code/session_018ERyTcyaXRpheHinYgiypJ
Under load, the API currently serialises one HTTP request per GPU forward
pass and pads every input to the full 512-token max length. Recent load
tests on a g4dn.4xlarge (Tesla T4) confirmed this leaves the GPU idle
~93% of the time and caps sustained throughput at ~5 MPS — a 600 MPS
burst takes ~80 minutes to drain even though the hardware can do
hundreds of MPS.

This change introduces:

- DynamicBatcher service that coalesces concurrent single-message
  requests into padded batches (configurable max size / wait window).
  One tokenizer call, one forward pass, results split back to each
  caller's asyncio Future.
- FP16 weights on CUDA for ~2x throughput on T4/A10/L4 tensor cores,
  guarded so CPU/MPS keep FP32.
- max_text_length lowered from 512 -> 96 with dynamic padding
  ('longest') so short SMS no longer waste ~10x the FLOPs.
- torch.inference_mode() in place of torch.no_grad() for a small but
  free speedup and cleaner semantics.
- /metrics Prometheus-compatible endpoint (no extra dep) exposing
  request/batch counters, queue depth, batch-size histogram, and
  inference time, so ots-bridge can drive adaptive concurrency.

All new knobs are env-var tunable: OTS_BATCHING_ENABLED,
OTS_MAX_BATCH_SIZE, OTS_BATCH_WAIT_MS, OTS_MAX_TEXT_LENGTH, OTS_USE_FP16.
Docs updated in CLAUDE.md and README.md.
Covers the async batching logic end-to-end against a stub model +
tokenizer (no mBERT weights required):

- single request returns the correct label
- 8 concurrent submissions coalesce into one batch
- max_batch_size is respected (10 requests => batches of <=4)
- partial batches flush after batch_wait_ms, not later
- model errors propagate to every future in the batch
- metrics counters increment correctly
- shutdown fails in-flight requests instead of hanging
- init_batcher honours OTS_BATCHING_ENABLED=false

The /metrics endpoint is exercised with FastAPI TestClient in both
the disabled-batcher and active-batcher states, asserting the
Prometheus exposition format (counters, gauges, histogram buckets).

Run:
    pytest src/api_interface/tests/ --asyncio-mode=auto