Add dynamic batching, FP16, and /metrics to the mBERT API #177
Open
…RT/training/mbert-mlx-apple-silicon/jinja2-3.1.4 Bump jinja2 from 3.1.3 to 3.1.4 in /src/mBERT/training/mbert-mlx-apple-silicon
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.18 to 1.26.19.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.18...1.26.19)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…RT/training/mbert-mlx-apple-silicon/urllib3-1.26.19 Bump urllib3 from 1.26.18 to 1.26.19 in /src/mBERT/training/mbert-mlx-apple-silicon
…ja2-3.1.4 Bump jinja2 from 3.1.3 to 3.1.4 in /src
Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.2.0 to 1.5.0.
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases)
- [Commits](scikit-learn/scikit-learn@1.2.0...1.5.0)

---
updated-dependencies:
- dependency-name: scikit-learn
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
…RT/training/mbert-mlx-apple-silicon/scikit-learn-1.5.0 Bump scikit-learn from 1.2.0 to 1.5.0 in /src/mBERT/training/mbert-mlx-apple-silicon
…T/training/bert-mlx-apple-silicon/jinja2-3.1.4 Bump jinja2 from 3.1.3 to 3.1.4 in /src/BERT/training/bert-mlx-apple-silicon
…RT/training/mbert-mlx-apple-silicon/werkzeug-3.0.3 Bump werkzeug from 2.3.8 to 3.0.3 in /src/mBERT/training/mbert-mlx-apple-silicon
…T/training/bert-mlx-apple-silicon/werkzeug-3.0.3 Bump werkzeug from 2.3.8 to 3.0.3 in /src/BERT/training/bert-mlx-apple-silicon
fix url
Correct some broken urls
NumPy <2.4 stored the full BSD license text in the package metadata License field, which starts with "All rights reserved." — causing SPDX scanners to misclassify it as a restrictive license. NumPy 2.4+ uses the PEP 639 License-Expression field with proper SPDX identifiers (BSD-3-Clause), eliminating the false positive. Updated across all 6 requirements files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bumps [filelock](https://github.com/tox-dev/py-filelock) from 3.13.1 to 3.20.3.
- [Release notes](https://github.com/tox-dev/py-filelock/releases)
- [Changelog](https://github.com/tox-dev/filelock/blob/main/docs/changelog.rst)
- [Commits](tox-dev/filelock@3.13.1...3.20.3)

---
updated-dependencies:
- dependency-name: filelock
  dependency-version: 3.20.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.6 to 3.1.5.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](pallets/werkzeug@3.0.6...3.1.5)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-version: 3.1.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
NumPy <2.4 stored the full BSD license text in the package metadata License field, which starts with "All rights reserved." — causing SPDX scanners to misclassify it as a restrictive license. NumPy 2.4+ uses the PEP 639 License-Expression field with proper SPDX identifiers (BSD-3-Clause), eliminating the false positive. Updated across all 6 requirements files.
Bumps [flask](https://github.com/pallets/flask) from 2.2.5 to 3.1.3.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.2.5...3.1.3)

---
updated-dependencies:
- dependency-name: flask
  dependency-version: 3.1.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
- Offload synchronous mBERT inference to ThreadPoolExecutor(max_workers=4) via the event loop's run_in_executor, keeping the FastAPI event loop responsive
- Make audit logging fire-and-forget (non-blocking) in the prediction router
- On CUDA GPUs, GIL release during kernel execution enables concurrent inferences within a single worker (benchmarked 1.8x on Apple MPS)
- Health checks remain responsive (~6ms) during concurrent inference load
- Add hardware specification sheet (MD + DOCX) for VMware deployments targeting 25 TPS / <500ms latency with three configuration options
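The offloading pattern in the first bullet can be sketched as follows. This is a minimal illustration, not the project's code: `blocking_predict` and `predict` are hypothetical names, and the sleep stands in for a synchronous model forward pass.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Pool size matches the max_workers=4 mentioned in the commit message.
_executor = ThreadPoolExecutor(max_workers=4)


def blocking_predict(text: str) -> str:
    """Hypothetical stand-in for synchronous model inference."""
    time.sleep(0.05)  # simulates the blocking forward pass
    return "spam" if "win" in text.lower() else "ham"


async def predict(text: str) -> str:
    """Run the blocking call in the thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_predict, text)
```

While `predict` is awaiting the executor, the event loop can keep serving other coroutines such as health checks.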
SMPP Interface (src/smpp_interface/):
- SMPP-to-SMPP proxy that classifies SMS via OTS mBERT API before forwarding
- Supports bind_transceiver auth, upstream connection pooling, DLR relay
- UDH/multipart SMS passthrough, round-robin multi-API load balancing
- Production hardened: exponential backoff reconnect, message_store size caps, client keepalive, graceful shutdown, structured logging
- Full test suites: 47 basic + 32 advanced tests (UDH, emoji, async, benchmarks)
- Documentation with architecture diagrams (README.md + Word export)

Platform scaling (v2.7):
- nginx.conf upgraded for 10x API instance load balancing with least_conn
- docker-compose configs for 2x and 10x horizontal scaling
- Load balancer test scripts and benchmark tooling
- GPU verification, stress testing, and throughput benchmarking
- Deployment guides, release notes, and performance analysis reports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite openapi.yaml: 1 endpoint -> 16 endpoints covering all API families (core, TMF922, TMF688) with full schema definitions
- Fix missing get_model_version() in tmforum_service.py that caused 500 on every TMF922 job creation
- Fix double-timezone parsing bug in audit_service.py that silently broke all TMF688 event listing
…llib3-2.6.3 Dependency update: urllib3 2.6.0 -> 2.6.3 in src/
…yptography-46.0.5 Dependency update: cryptography 44.0.1 -> 46.0.5 in src/
…ck-3.20.3 Dependency update: filelock 3.13.1 -> 3.20.3 in root and src/
…lelock-3.20.3 Dependency update: numpy >=2.4.2, filelock 3.20.3 across all requirements files
…ERT/training/model-training/fonttools-4.61.0 Bump fonttools from 4.43.0 to 4.61.0 in /src/mBERT/training/model-training
…ERT/training/model-training/starlette-0.49.1 Bump starlette from 0.40.0 to 0.49.1 in /src/mBERT/training/model-training
…ERT/training/model-training/torch-2.8.0 Bump torch from 2.7.1 to 2.8.0 in /src/mBERT/training/model-training
…ERT/training/model-training/requests-2.32.4 Bump requests from 2.32.3 to 2.32.4 in /src/mBERT/training/model-training
…ERT/training/model-training/aiohttp-3.13.3 Bump aiohttp from 3.12.14 to 3.13.3 in /src/mBERT/training/model-training
…ERT/training/model-training/urllib3-2.6.3 Bump urllib3 from 2.2.3 to 2.6.3 in /src/mBERT/training/model-training
…ERT/training/model-training/filelock-3.20.3 Bump filelock from 3.13.1 to 3.20.3 in /src/mBERT/training/model-training
…ERT/training/model-training/sentencepiece-0.2.1 Bump sentencepiece from 0.1.99 to 0.2.1 in /src/mBERT/training/model-training
…ERT/training/model-training/protobuf-5.29.6 Bump protobuf from 5.29.5 to 5.29.6 in /src/mBERT/training/model-training
…ERT/training/model-training/cryptography-46.0.5 Bump cryptography from 43.0.1 to 46.0.5 in /src/mBERT/training/model-training
…ERT/training/model-training/pillow-12.1.1 Bump pillow from 10.3.0 to 12.1.1 in /src/mBERT/training/model-training
…ERT/training/model-training/flask-3.1.3 Bump flask from 2.2.5 to 3.1.3 in /src/mBERT/training/model-training
…ERT/training/model-training/werkzeug-3.1.5 Bump werkzeug from 3.0.6 to 3.1.5 in /src/mBERT/training/model-training
Pre-existing dependency conflict: fastapi 0.115.14 requires starlette <0.47.0 but requirements.txt pinned starlette 0.49.1.
…eprecated encode_plus

The Docker builds using requirements-security.txt had transformers>=4.53.0 (no upper bound), which pulled a newer, incompatible version in which BertTokenizer.encode_plus was removed. This caused all classification requests to fail with:
"BertTokenizer has no attribute encode_plus"

Changes:
- Pin transformers==4.53.0 in requirements-security.txt (matches requirements.txt)
- Add upper bounds to torch, huggingface-hub, safetensors, numpy, peft to prevent similar untested major version upgrades from breaking Docker builds
- Replace all tokenizer.encode_plus() calls with tokenizer() across the codebase (the __call__ method is the modern, forward-compatible API that accepts identical parameters)

Affected files: prediction_service.py, test_sms.py, stressTest_500.py, stressTest_1000_mlx.py, train_ots.py, train_ots_improved.py, compare_models.py, train_incremental.py, train_enhanced_multilingual.py

https://claude.ai/code/session_018ERyTcyaXRpheHinYgiypJ
…endencies-VlFFc
Under load, the API currently serialises one HTTP request per GPU forward
pass and pads every input to the full 512-token max length. Recent load
tests on a g4dn.4xlarge (Tesla T4) confirmed this leaves the GPU idle
~93% of the time and caps sustained throughput at ~5 messages per second
(MPS). A 600 MPS burst takes ~80 minutes to drain even though the
hardware can do hundreds of MPS.
This change introduces:
- DynamicBatcher service that coalesces concurrent single-message
requests into padded batches (configurable max size / wait window).
One tokenizer call, one forward pass, results split back to each
caller's asyncio Future.
- FP16 weights on CUDA for ~2x throughput on T4/A10/L4 tensor cores,
guarded so CPU and Apple MPS (Metal) devices keep FP32.
- max_text_length lowered from 512 -> 96 with dynamic padding
('longest') so short SMS no longer waste ~10x the FLOPs.
- torch.inference_mode() in place of torch.no_grad() for a small but
free speedup and cleaner semantics.
- /metrics Prometheus-compatible endpoint (no extra dep) exposing
request/batch counters, queue depth, batch-size histogram, and
inference time, so ots-bridge can drive adaptive concurrency.
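
The coalescing behaviour described in the first bullet can be sketched roughly as below. This is an illustrative sketch, not the code in batching_service.py: the class name `DynamicBatcherSketch`, the `predict_batch` callable, and the `batch_sizes` list are assumptions made for the example, and shutdown handling is deliberately simplified.

```python
import asyncio
from typing import Callable, List, Optional


class DynamicBatcherSketch:
    """Illustrative batcher: coalesces concurrent submit() calls into batches."""

    def __init__(self, predict_batch: Callable[[List[str]], List[str]],
                 max_batch_size: int = 32, batch_wait_ms: float = 15.0) -> None:
        # predict_batch is a hypothetical stand-in for "tokenize once, run one
        # forward pass, return one label per input text".
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._wait_s = batch_wait_ms / 1000.0
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None
        self.batch_sizes: List[int] = []  # sizes of flushed batches, for inspection

    async def start(self) -> None:
        # Create the queue inside the running loop so it binds to that loop.
        self._queue = asyncio.Queue()
        self._worker = asyncio.create_task(self._run())

    async def stop(self) -> None:
        # Simplified shutdown: cancels the worker, does not fail pending futures.
        if self._worker is not None:
            self._worker.cancel()
            try:
                await self._worker
            except asyncio.CancelledError:
                pass

    async def submit(self, text: str) -> str:
        assert self._queue is not None, "call start() first"
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # wait for the first request
            deadline = loop.time() + self._wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # wait window elapsed: flush a partial batch
            self.batch_sizes.append(len(batch))
            texts = [text for text, _ in batch]
            try:
                # Run the blocking model call off the event loop.
                labels = await loop.run_in_executor(None, self._predict_batch, texts)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)  # every caller in the batch sees it
                continue
            for (_, future), label in zip(batch, labels):
                if not future.done():
                    future.set_result(label)
```

With max_batch_size=4 and ten concurrent submit() calls, this sketch flushes batches of at most four; a trailing partial batch is flushed once the wait window elapses.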
All new knobs are env-var tunable: OTS_BATCHING_ENABLED,
OTS_MAX_BATCH_SIZE, OTS_BATCH_WAIT_MS, OTS_MAX_TEXT_LENGTH, OTS_USE_FP16.
Docs updated in CLAUDE.md and README.md.
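
One way to read these knobs from the environment is sketched below. The env-var names and defaults (true, 32, 15, 96, true) are the ones given in this PR; the `BatchingSettings` class, the `_env_bool` helper, the set of strings treated as true, and `effective_fp16` are assumptions for illustration, and the real settings.py may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean env var; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class BatchingSettings:
    batching_enabled: bool = True
    max_batch_size: int = 32
    batch_wait_ms: int = 15
    max_text_length: int = 96
    use_fp16: bool = True

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "BatchingSettings":
        env = os.environ if env is None else env
        return cls(
            batching_enabled=_env_bool(env, "OTS_BATCHING_ENABLED", cls.batching_enabled),
            max_batch_size=int(env.get("OTS_MAX_BATCH_SIZE", cls.max_batch_size)),
            batch_wait_ms=int(env.get("OTS_BATCH_WAIT_MS", cls.batch_wait_ms)),
            max_text_length=int(env.get("OTS_MAX_TEXT_LENGTH", cls.max_text_length)),
            use_fp16=_env_bool(env, "OTS_USE_FP16", cls.use_fp16),
        )

    def effective_fp16(self, device: str) -> bool:
        # FP16 only applies on CUDA; CPU and Apple MPS stay on FP32.
        return self.use_fp16 and device == "cuda"
```

Passing a plain dict to `from_env` instead of relying on `os.environ` keeps the parsing easy to test.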
The tests cover the async batching logic end-to-end against a stub
model + tokenizer (no mBERT weights required):
- single request returns the correct label
- 8 concurrent submissions coalesce into one batch
- max_batch_size is respected (10 requests => batches of <=4)
- partial batches flush after batch_wait_ms, not later
- model errors propagate to every future in the batch
- metrics counters increment correctly
- shutdown fails in-flight requests instead of hanging
- init_batcher honours OTS_BATCHING_ENABLED=false
The /metrics endpoint is exercised with FastAPI TestClient in both
the disabled-batcher and active-batcher states, asserting the
Prometheus exposition format (counters, gauges, histogram buckets).
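
For reference, a hand-rolled renderer for the Prometheus text exposition format might look like the sketch below. The metric names are taken from this PR; the function name `render_metrics`, the bucket boundaries, and the `_sum` / `_count` / `+Inf` series (which follow the usual Prometheus histogram convention) are assumptions, since the actual metrics.py is not shown here.

```python
from typing import Iterable, List, Sequence


def render_metrics(requests_total: int, batches_total: int, queue_depth: int,
                   batch_sizes: Sequence[int],
                   buckets: Iterable[int] = (1, 2, 4, 8, 16, 32)) -> str:
    """Render counters, a gauge, and a cumulative histogram as plain text."""
    lines: List[str] = [
        "# TYPE ots_requests_total counter",
        f"ots_requests_total {requests_total}",
        "# TYPE ots_batches_total counter",
        f"ots_batches_total {batches_total}",
        "# TYPE ots_queue_depth gauge",
        f"ots_queue_depth {queue_depth}",
        "# TYPE ots_batch_size histogram",
    ]
    # Histogram buckets are cumulative: each counts observations <= its bound.
    for upper in buckets:
        count = sum(1 for size in batch_sizes if size <= upper)
        lines.append(f'ots_batch_size_bucket{{le="{upper}"}} {count}')
    lines.append(f'ots_batch_size_bucket{{le="+Inf"}} {len(batch_sizes)}')
    lines.append(f"ots_batch_size_sum {sum(batch_sizes)}")
    lines.append(f"ots_batch_size_count {len(batch_sizes)}")
    return "\n".join(lines) + "\n"
```

A test can then assert on individual lines of the returned string, which is roughly what the endpoint tests described above check.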
Run:
pytest src/api_interface/tests/ --asyncio-mode=auto
Why
Recent GPU load tests on a g4dn.4xlarge (Tesla T4) showed the API serialises one HTTP request per GPU forward pass and pads every input to the full 512-token max length. The result: the GPU sits idle ~93% of the time, sustained throughput caps at ~5 MPS (messages per second), and a 600 MPS burst takes ~80 minutes to drain, despite the hardware being capable of hundreds of MPS. See the test report in the accompanying issue/discussion for the full numbers.

This PR captures the highest-leverage optimisations identified in that analysis and makes them tunable via environment variables so the bridge team can dial them in per deployment.
What's in the PR
Dynamic batching
- DynamicBatcher in src/api_interface/services/batching_service.py
- Coalesces concurrent single-message requests into one padded batch: one tokenizer call, one forward pass, results split back to each caller (asyncio.Future)
- Configurable wait window (batch_wait_ms) and max_batch_size

FP16 + sequence-length fixes

- FP16 weights on CUDA for ~2x throughput on T4/A10/L4 tensor cores; CPU and Apple MPS keep FP32
- max_text_length lowered from 512 → 96 with padding='longest' so short SMS no longer waste ~10× the FLOPs
- torch.inference_mode() in place of torch.no_grad()

Observability

- GET /metrics Prometheus-compatible endpoint (no extra dep, hand-rolled exposition format)
- Exposes ots_requests_total, ots_batches_total, ots_inference_seconds_total, ots_queue_depth, ots_last_batch_size, ots_batch_size_bucket{le="N"}, ots_api_info{device,fp16,max_text_length,version}
- Intended for ots-bridge to scrape and drive adaptive concurrency

Config knobs (all OTS_-prefixed env vars)

| Setting | Default |
| --- | --- |
| batching_enabled | true |
| max_batch_size | 32 |
| batch_wait_ms | 15 |
| max_text_length | 96 |
| use_fp16 | true |

Tests

- Batching and /metrics tests run against a stub tokenizer and a stub torch.nn.Module (no mBERT weights needed)
- Run with: pytest src/api_interface/tests/ --asyncio-mode=auto

Docs

- CLAUDE.md: new "Dynamic Batching" and "Observability: /metrics" sections
- README.md: updated Performance section + Health Checks section

Test plan

- /metrics registered in route table
- … (/predict/ response shape identical)
- OTS_BATCHING_ENABLED=false falls back to original per-request path
- … v2.8-amd64 baseline on a T4 (needs environment with mBERT weights + CUDA; cannot be done in CI)
- … ots-bridge can raise its concurrency throttle (25 → 200+) against the new image

Backward compatibility

- /predict/ request/response shape is unchanged
- … batching_enabled=false
- /health, /audit, /feedback, TMForum endpoints untouched

Expected performance impact (from analysis)
Numbers to be validated by the bridge team's benchmark run against the built image.
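
As a rough sanity check on the padding figure, the number of tokens the model processes per batch can be compared under the old and new padding strategies. The token lengths below are hypothetical, not measured data, and token count is only a coarse proxy for FLOPs (attention cost grows faster than linearly with sequence length), so this is an order-of-magnitude illustration rather than a benchmark.

```python
from typing import List


def padded_token_count(lengths: List[int], strategy: str, max_length: int) -> int:
    """Total tokens processed for one batch after truncation and padding."""
    truncated = [min(n, max_length) for n in lengths]
    if strategy == "max_length":
        # Every sequence is padded to the configured maximum.
        return len(truncated) * max_length
    if strategy == "longest":
        # Every sequence is padded only to the longest one in the batch.
        return len(truncated) * max(truncated)
    raise ValueError(f"unknown padding strategy: {strategy}")


# Hypothetical token counts for five short SMS messages (not measured data).
sms_lengths = [18, 25, 31, 40, 47]

old_cost = padded_token_count(sms_lengths, "max_length", 512)  # 5 * 512 = 2560
new_cost = padded_token_count(sms_lengths, "longest", 96)      # 5 * 47 = 235
ratio = old_cost / new_cost                                    # about 10.9
```

For this made-up batch the padded token count drops by roughly an order of magnitude, in line with the ~10× estimate quoted for short SMS.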
Files changed
- src/api_interface/services/batching_service.py, src/api_interface/routers/metrics.py, src/api_interface/tests/{__init__,conftest,test_batching_service,test_metrics_endpoint}.py
- src/api_interface/services/{model_loader,prediction_service}.py, src/api_interface/config/settings.py, src/api_interface/main.py, src/api_interface/routers/__init__.py, CLAUDE.md, README.md