On-Device-Real-Estate-Assistant is a prototype real-estate assistant that runs entirely on the device. The project fine-tunes a FLAN-T5 question-answering model for the real-estate domain, exports it to ONNX for Android inference, and benchmarks multiple optimization strategies in an Android ARM64 environment.
Team: Phong Cao, Trang Tran, Mai Do
School: Worcester Polytechnic Institute
The current repository is organized as a runnable project, not as a notebook dump. The final benchmark aggregate is results/all_benchmarks.json, and the generated report plots are in benchmarks/visualizations/tradeoff_plots.
- Run a real-estate question-answering model locally on a phone with limited resources.
- Compare model optimization strategies for on-device deployment.
- Measure both answer quality and device efficiency.
- Package the Android inference path with ONNX Runtime.
- Keep the final benchmark result reproducible and easy to inspect.
    app/android/                       Android app project
    models/
      flan_t5_zillow_final1/           Hugging Face FLAN-T5 model assets
      whisper_model/                   Whisper speech model assets
      export_to_onnx.py                PyTorch/Hugging Face -> ONNX export script
      zillow_flan_t5_finetune.ipynb    Fine-tuning notebook for the FLAN-T5 base model
    benchmarks/
      data/flan_t5_baseline/           Fixed QA pair cache and eval split
      requirements.txt                 Python benchmark dependencies
      visualizations/                  Plot generator and final SVG charts
    results/
      all_benchmarks.json              Final Android benchmark aggregate
    src/
      benchmarking/                    Benchmark runner, split builder, metrics
      optimization/                    Pruning/quantization strategy code

The full project pipeline is:

    User input
      -> typed text
      -> voice input -> phone speech-to-text -> text

    Text prompt
      -> FLAN-T5 real-estate question-answering model
      -> optional optimization experiments
           -> quantization: FP16, BF16, INT8
           -> pruning: attention, MLP, global unstructured pruning
           -> combined pruning + quantization

    Selected / exported model
      -> models/export_to_onnx.py
      -> ONNX encoder and decoder files
      -> app/android/app/src/main/assets/onnx_model/

    Android phone
      -> ONNX Runtime Android
      -> local inference
      -> generated answer displayed in the app

The Android app does not run PyTorch or TensorFlow directly. It loads the exported .onnx encoder and decoder files through ONNX Runtime Android. Because ONNX Runtime expects numeric tensors instead of raw text, the app also bundles the matching FLAN-T5 tokenizer file, spiece.model. A small native C++ SentencePiece bridge loads that file, converts user text into the token IDs expected by the ONNX model, and decodes generated token IDs back into readable text.
The project starts with fine-tuning a base FLAN-T5 model on the real estate Q&A domain:
Setup:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install torch transformers datasets evaluate rouge-score

Fine-tuning process (see models/zillow_flan_t5_finetune.ipynb):
- Load the `zillow/real_estate_v1` dataset from Hugging Face
- Extract Q&A pairs with conversational context from raw messages
- Create train/validation/test splits (80% / 10% / 10%)
- Tokenize inputs and targets separately with max lengths 512 / 256
- Train using `Seq2SeqTrainer` with:
  - Base model: `google/flan-t5-base`
  - Optimizer: AdamW with learning rate `2e-5`
  - Scheduler: Cosine with warmup
  - Epochs: 20, batch size: 16
  - Evaluation metric: ROUGE-L F1
- Evaluate on test set using ROUGE-1, ROUGE-2, ROUGE-L metrics
- Save fine-tuned model to `models/flan_t5_zillow_final1/`

The fine-tuned model serves as the baseline for all subsequent optimization experiments.
The benchmark compares optimization families that are common for on-device transformer deployment:
- Quantization: `fp16`, `bf16`, and `int8`
- Pruning: unstructured attention, MLP, and global pruning
- Combined pipelines: pruning plus quantization
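
To make the "pruning plus quantization" idea concrete, here is a minimal pure-Python sketch. It is not the code in src/optimization/ (which is not shown here): the function names and the toy weight list are hypothetical. It zeroes the 10% of weights with the smallest absolute value (L1 magnitude pruning) and then stores the result in half precision.

```python
import struct


def l1_unstructured_prune(weights, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude."""
    n_prune = int(round(amount * len(weights)))
    if n_prune == 0:
        return list(weights)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def to_fp32_bytes(weights):
    """Pack weights as 4-byte single-precision floats."""
    return struct.pack(f"<{len(weights)}f", *weights)


def to_fp16_bytes(weights):
    """Pack weights as 2-byte IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(weights)}e", *weights)


# Hypothetical toy "layer" with ten weights.
weights = [0.50, -0.02, 0.31, 0.01, -0.75, 0.09, -0.44, 0.27, 0.63, -0.12]

pruned = l1_unstructured_prune(weights, amount=0.10)
fp32_size = len(to_fp32_bytes(pruned))
fp16_size = len(to_fp16_bytes(pruned))

print(pruned)
print(fp32_size, fp16_size)
```

In this sketch the pruned weight is still stored as a zero, so pruning alone does not shrink the dense representation, while the half-precision cast halves the bytes per weight.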
Each model is evaluated against the same fixed benchmark split:
- Pair cache: benchmarks/data/flan_t5_baseline/qa_pairs.jsonl
- Split manifest: benchmarks/data/flan_t5_baseline/split_manifest.json
- Source dataset: `zillow/real_estate_v1`
- Eval split: `10%`
- Split seed: `42`
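
A fixed seed is what makes the eval split identical across runs. The sketch below shows one way a seeded 10% hold-out could be drawn. It is an illustration under assumptions, not the logic of `src.benchmarking.build_flan_t5_split`; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_examples, eval_fraction=0.10, seed=42):
    """Return (train_indices, eval_indices) from a seeded shuffle."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_eval = round(n_examples * eval_fraction)
    eval_idx = sorted(indices[:n_eval])
    train_idx = sorted(indices[n_eval:])
    return train_idx, eval_idx


train_idx, eval_idx = split_indices(100)
# Calling again with the same seed yields the same split.
train_again, eval_again = split_indices(100)
print(len(train_idx), len(eval_idx), eval_idx == eval_again)
```

Writing the resulting indices to a manifest file, as the project does with split_manifest.json, removes any dependence on the random number generator at evaluation time.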
Measured quality metrics:
- Token F1
- ROUGE-L F1
- Exact match
- Eval loss when available
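
For readers unfamiliar with these metrics, the following sketch shows common definitions of Token F1, exact match, and ROUGE-L F1 on whitespace tokens. The normalization (lowercasing and whitespace splitting) is an assumption; the project's own metric code in src/benchmarking/ may normalize differently.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def f1_from_counts(overlap, n_pred, n_ref):
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def token_f1(prediction, reference):
    """Bag-of-tokens F1 between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1_from_counts(overlap, len(pred), len(ref))


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    return f1_from_counts(lcs_length(pred, ref), len(pred), len(ref))


pred = "The house has three bedrooms"
ref = "the house has four bedrooms"
print(round(token_f1(pred, ref), 3), round(rouge_l_f1(pred, ref), 3), exact_match(pred, ref))
```

Token F1 ignores word order while ROUGE-L rewards tokens that appear in the same order, which is why the two can diverge on reordered answers.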
Measured efficiency metrics:
- Disk size
- Parameter count
- Model load time
- RSS memory before/after load
- Mean, P50, and P95 latency
- Examples per second
- Generated tokens per second
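
The latency statistics can be summarized from a list of per-example timings. The sketch below is a hedged illustration: `summarize_latencies` is a hypothetical helper, the nearest-rank percentile method is an assumption, and deriving examples per second as 1000 / mean latency is an assumption that is only approximately consistent with the figures reported later in this document.

```python
import math
import statistics


def summarize_latencies(latencies_ms):
    """Summarize per-example latencies given in milliseconds."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of
        # the samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    mean_ms = statistics.fmean(ordered)
    return {
        "mean_ms": mean_ms,
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "examples_per_sec": 1000.0 / mean_ms,
    }


# Made-up timings, for illustration only.
summary = summarize_latencies([100.0, 120.0, 110.0, 130.0, 140.0])
print(summary)
```

P95 is reported alongside the mean because a few slow examples can dominate perceived responsiveness on a phone even when the average looks acceptable.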
To run the benchmark on a physical Android device using Termux:
Setup:
- Install Termux from F-Droid or the Google Play Store.
- Open Termux and update packages:

      pkg update && pkg upgrade

- Install Python and required build tools:

      pkg install python python-dev clang

- Create and activate a Python virtual environment:

      python -m venv /data/data/com.termux/files/home/benchmark_env
      source /data/data/com.termux/files/home/benchmark_env/bin/activate

- Install Python benchmark dependencies (run from the project root, after the project files below have been transferred):

      pip install --upgrade pip
      pip install -r benchmarks/requirements.txt

Transfer Project Files:
- Copy the project to device storage (use ADB or file transfer):

      adb push /path/to/On-Device-Real-Estate-Assistant /sdcard/

  Then in Termux:

      cp -r /sdcard/On-Device-Real-Estate-Assistant ~/
      cd ~/On-Device-Real-Estate-Assistant

Run Benchmark:
- Run the benchmark harness on a specific model:

      python -m src.benchmarking.benchmark_flan_t5 \
        --model-path models/flan_t5_zillow_final1 \
        --split-manifest benchmarks/data/flan_t5_baseline/split_manifest.json \
        --output-dir benchmarks/runs/termux_results \
        --device auto

- For the baseline model only:

      python -m src.benchmarking.benchmark_flan_t5 \
        --model-path models/flan_t5_zillow_final1 \
        --split-manifest benchmarks/data/flan_t5_baseline/split_manifest.json \
        --output-dir benchmarks/runs/baseline_results \
        --device cpu

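- Retrieve results. The benchmark writes to the copy of the project in the Termux home directory, so copy the output back to shared storage in Termux first, then pull it from the host machine:

      mkdir -p /sdcard/On-Device-Real-Estate-Assistant/benchmarks/runs && cp -r benchmarks/runs/termux_results /sdcard/On-Device-Real-Estate-Assistant/benchmarks/runs/
      adb pull /sdcard/On-Device-Real-Estate-Assistant/benchmarks/runs/termux_results /local/path/
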
Notes:
- ARM64 Termux environment is significantly slower than x86_64 systems
- Expected baseline latency on ARM64: ~5-10 seconds per inference (vs. 1-2 seconds on desktop)
- Allow 30+ minutes for a full benchmark run on a single model
- Monitor device temperature; add breaks between runs if needed
- Use `--device cpu` to force CPU inference if GPU is unavailable
The final retained benchmark file is:
results/all_benchmarks.json
The fastest model in the final Android benchmark is:
FP16 + Unstructured MLP L1 10 Prune
It reaches 120.6 ms mean latency, 180.8 ms P95 latency, and 8.29 examples/sec. Its Token F1 is 0.1851, which retains about 81.0% of the baseline Token F1 while cutting disk size by about 50.5%.
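
The retention and size-reduction percentages quoted in this section follow from simple ratios against the baseline. A short sketch of the arithmetic, using the numbers reported in this document (the helper names are hypothetical):

```python
def retention_pct(value, baseline):
    """Share of the baseline metric that a model keeps, in percent."""
    return 100.0 * value / baseline


def reduction_pct(value, baseline):
    """Relative decrease from the baseline, in percent."""
    return 100.0 * (1.0 - value / baseline)


# Baseline: Token F1 0.2285, 311.11 MB on disk.
# FP16 + Unstructured MLP L1 10 Prune: Token F1 0.1851, 154.01 MB.
f1_retention = retention_pct(0.1851, 0.2285)
size_reduction = reduction_pct(154.01, 311.11)

print(round(f1_retention, 1), round(size_reduction, 1))
```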
The highest-quality model is the full baseline:
Baseline
It has the best Token F1 at 0.2285, but it is large and slow on Android ARM64 Termux: 311.11 MB, 5476.5 ms mean latency, and only 0.1826 examples/sec.
The smallest models are:
Dynamic INT8 PTQ
Unstructured MLP L1 10 Prune + Dynamic INT8 PTQ
Both are 126.58 MB, a 59.3% disk-size reduction from baseline. Dynamic INT8 PTQ preserves 0.2193 Token F1, or about 96.0% of baseline quality, but it is much slower than the baseline in this run at 15236.9 ms mean latency.
The best simple balanced choice is:
FP16
It keeps 0.2208 Token F1, or 96.6% of baseline, reduces disk size by 50.5%, and improves latency from 5476.5 ms to 162.9 ms.
The best practical interpretation is:
- `FP16` and `BF16` provide the strongest quality/latency/size tradeoff in the retained results.
- `FP16 + Unstructured MLP L1 10 Prune` is the fastest model, but it gives up more quality than plain `FP16`.
- `Dynamic INT8 PTQ` gives the best compression while preserving quality, but the Android ARM64 CPU latency is too high for interactive use.
- Structured pruning reduces parameter count and can improve latency, but the quality/size tradeoff is weaker than plain `FP16`.
- The current best deployment candidate is `FP16` if latency and quality are both important.

| Model | Family | Token F1 | Retention | Size (MB) | Size reduction | Mean latency (ms) | P95 latency (ms) | Examples/s |
|---|---|---|---|---|---|---|---|---|
| `FP16 + Unstructured MLP L1 10 Prune` | combined | 0.1851 | 81.0% | 154.01 | 50.5% | 120.6 | 180.8 | 8.2905 |
| `Unstructured MLP L1 10 Prune` | prune | 0.2146 | 93.9% | 307.93 | 1.0% | 135.7 | 190.7 | 7.3673 |
| `Structured MLP Intermediate 10 Prune + FP16` | combined | 0.1776 | 77.7% | 149.00 | 52.1% | 162.9 | 273.0 | 6.1389 |
| `FP16` | quant | 0.2208 | 96.6% | 154.01 | 50.5% | 162.9 | 271.5 | 6.1392 |
| `Unstructured Attention L1 10 Prune` | prune | 0.2074 | 90.8% | 307.93 | 1.0% | 164.1 | 235.4 | 6.0929 |
| `BF16 + Unstructured MLP L1 10 Prune` | combined | 0.1868 | 81.8% | 154.01 | 50.5% | 169.1 | 287.4 | 5.9147 |
| `Structured Attention Head 1 Prune` | prune | 0.2052 | 89.8% | 295.35 | 5.1% | 192.5 | 394.1 | 5.1953 |
| `Structured MLP Intermediate 10 Prune` | prune | 0.2017 | 88.3% | 297.90 | 4.2% | 192.7 | 295.4 | 5.1901 |
| `BF16` | quant | 0.2221 | 97.2% | 154.01 | 50.5% | 241.1 | 394.2 | 4.1471 |
| `Unstructured Global L1 10 Prune` | prune | 0.1976 | 86.5% | 307.93 | 1.0% | 335.8 | 1739.0 | 2.9779 |
| `Structured Attention Head 1 Prune + BF16` | combined | 0.1809 | 79.2% | 147.72 | 52.5% | 459.6 | 920.1 | 2.1756 |
| `Unstructured Global L1 20 Prune` | prune | 0.1664 | 72.8% | 307.93 | 1.0% | 1237.9 | 6331.2 | 0.8078 |
| `Baseline` | baseline | 0.2285 | 100.0% | 311.11 | 0.0% | 5476.5 | 9020.4 | 0.1826 |
| `Dynamic INT8 PTQ` | quant | 0.2193 | 96.0% | 126.58 | 59.3% | 15236.9 | 19735.6 | 0.0656 |
| `Unstructured MLP L1 10 Prune + Dynamic INT8 PTQ` | combined | 0.1884 | 82.5% | 126.58 | 59.3% | 15430.8 | 19503.6 | 0.0648 |

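
One way to read the table is to keep only the models that no other model beats on both Token F1 and mean latency at once (a Pareto filter). The sketch below applies that idea to the Token F1 and mean-latency columns copied from the table. It deliberately ignores disk size, and the `pareto_front` helper is hypothetical rather than part of the project.

```python
# (model, Token F1, mean latency in ms), copied from the results table.
ROWS = [
    ("FP16 + Unstructured MLP L1 10 Prune", 0.1851, 120.6),
    ("Unstructured MLP L1 10 Prune", 0.2146, 135.7),
    ("Structured MLP Intermediate 10 Prune + FP16", 0.1776, 162.9),
    ("FP16", 0.2208, 162.9),
    ("Unstructured Attention L1 10 Prune", 0.2074, 164.1),
    ("BF16 + Unstructured MLP L1 10 Prune", 0.1868, 169.1),
    ("Structured Attention Head 1 Prune", 0.2052, 192.5),
    ("Structured MLP Intermediate 10 Prune", 0.2017, 192.7),
    ("BF16", 0.2221, 241.1),
    ("Unstructured Global L1 10 Prune", 0.1976, 335.8),
    ("Structured Attention Head 1 Prune + BF16", 0.1809, 459.6),
    ("Unstructured Global L1 20 Prune", 0.1664, 1237.9),
    ("Baseline", 0.2285, 5476.5),
    ("Dynamic INT8 PTQ", 0.2193, 15236.9),
    ("Unstructured MLP L1 10 Prune + Dynamic INT8 PTQ", 0.1884, 15430.8),
]


def pareto_front(rows):
    """Keep models not dominated on (higher Token F1, lower latency)."""
    front = []
    for name, f1, ms in rows:
        dominated = any(
            other_f1 >= f1 and other_ms <= ms and (other_f1 > f1 or other_ms < ms)
            for _, other_f1, other_ms in rows
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(ROWS)
for name in front:
    print(name)
```

Under these two criteria alone, both INT8 variants drop out because `FP16` is both faster and slightly higher in Token F1. Their advantage is disk size, which this filter does not consider.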
Setup:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r benchmarks/requirements.txt

Generate the tradeoff plots:

    python benchmarks/visualizations/generate_tradeoff_plots.py

Output:
benchmarks/visualizations/tradeoff_plots/
Open the generated HTML report:
benchmarks/visualizations/tradeoff_plots/index.html
Run the baseline benchmark:

    python -m src.benchmarking.benchmark_flan_t5 \
      --model-path models/flan_t5_zillow_final1 \
      --split-manifest benchmarks/data/flan_t5_baseline/split_manifest.json \
      --output-dir benchmarks/runs/flan_t5_baseline/results \
      --device auto

Rebuild the benchmark split. This downloads/rebuilds the benchmark pair cache from zillow/real_estate_v1:

    python -m src.benchmarking.build_flan_t5_split \
      --output-dir benchmarks/data/flan_t5_baseline

Export the model to ONNX:

    python models/export_to_onnx.py

Output:
app/android/app/src/main/assets/onnx_model/
Build the Android app:

    cd app/android
    ./gradlew assembleDebug

The Android app loads:
app/android/app/src/main/assets/onnx_model/encoder_model.onnx
app/android/app/src/main/assets/onnx_model/decoder_model.onnx
app/android/app/src/main/assets/onnx_model/decoder_with_past_model.onnx
The Android app uses:

    implementation("com.microsoft.onnxruntime:onnxruntime-android:1.17.3")

At runtime, MainActivity.kt copies the onnx_model asset folder into app storage, creates ONNX Runtime sessions for the encoder and decoder, tokenizes with SentencePiece, and decodes greedily token by token.
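
Greedy token-by-token decoding can be sketched independently of ONNX Runtime. The Python below is an illustration of the loop only, not the Kotlin code in MainActivity.kt: `step_fn` is a hypothetical stand-in for one decoder pass, and the toy scores are made up. The IDs follow the usual T5 convention (0 as the decoder start/pad token, 1 as end-of-sequence), which is assumed to hold for the exported model.

```python
def greedy_decode(step_fn, start_id, eos_id, max_new_tokens):
    """Greedy decoding: at each step, append the highest-scoring token.

    step_fn is a hypothetical callable standing in for one decoder pass.
    It receives the token IDs produced so far and returns one score
    (logit) per vocabulary entry.
    """
    tokens = [start_id]
    for _ in range(max_new_tokens):
        logits = step_fn(tokens)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == eos_id:
            break  # stop as soon as end-of-sequence is predicted
        tokens.append(next_id)
    return tokens[1:]  # drop the start token


# Toy stand-in for the decoder: it always follows a fixed script of IDs.
SCRIPT = [4, 2, 5, 1]  # the final 1 plays the role of end-of-sequence
VOCAB_SIZE = 6


def toy_step(tokens):
    target = SCRIPT[len(tokens) - 1]
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


generated = greedy_decode(toy_step, start_id=0, eos_id=1, max_new_tokens=16)
print(generated)
```

In the real app, the returned IDs would then go through the SentencePiece bridge to become text. The `max_new_tokens` cap bounds latency if the model never emits end-of-sequence.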
The app provides two ways to input questions:
- Text Input — Type your real-estate question directly using the phone keyboard
- Voice Input — Tap the microphone button to record speech, which is converted to text using Google Speech Recognition. This service works offline if you have downloaded the English language package beforehand.
Once the input text is ready, the question is sent to the FP16 FLAN-T5 model running on-device. The model processes the input and generates an answer specific to the real-estate domain, which is then displayed on the screen.
Current limitations:
- The app currently runs FLAN-T5 ONNX inference, but the retained benchmark results include separate Android/Termux-style PyTorch measurements for optimized artifacts.
- Whisper assets are retained, but the current Android app uses Android speech recognition for microphone input rather than the bundled Whisper models.
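- The measured INT8 models preserve Token F1 well (about 96.0% of baseline for Dynamic INT8 PTQ) but are too slow in the retained benchmark.
- The faster combined FP16/BF16 + pruning variants lose roughly 18-22% of baseline Token F1 and need quality debugging before they are useful.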
- The benchmark quality scores are low overall, so future work should improve prompt parity, decoding settings, and evaluation data quality.
Future work:
- Align Android decoding settings with the Python benchmark settings.
- Benchmark the exact ONNX Android app path, not only Android/Termux model artifacts.
- Investigate why the combined FP16/BF16 + pruning variants lose roughly 18-22% of baseline Token F1, while plain FP16 and BF16 retain about 97%.
- Add an end-to-end scripted smoke test for export -> Android asset validation.
- Evaluate ONNX Runtime execution providers and decoder-with-past usage for latency.
- Decide whether Whisper should be integrated directly or removed from the active scope.


