- Hardware: Jetson Orin Nano
- Docker: jetson-containers (NanoLLM)
`git clone https://github.com/Desire32/jetson-tensorrt-onnx-research.git`
- Example commands:
  - `python3 main.py` (loads TinyLlama with MLC + INT3 from config by default)
  - `python3 main.py --api hf` (loads TinyLlama from config without quantization)
  - `python3 main.py --model "X"` (choose your own LLM model to load)
  - `python3 main.py --embed "Y"` (choose your own embedding model to load)
  - `python3 main.py --model "X" --quantization "q8f16_ft"` (load your LLM model with a specific quantization type)
- `--model`: HuggingFace repo/model name, or path to an HF model checkpoint.
  The task was done using `TinyLlama/TinyLlama-1.1B-Chat-v1.0` and `princeton-nlp/Sheared-LLaMA-1.3B`.
- `--quantization`: type of quantization (works only for MLC).
  - `q0f32`, `q0f16`: plain model (no quantization)
  - `q3f16_0`, `q3f16_1`: INT3
  - `q4f16_0`, `q4f16_1`, `q4f16_ft`: INT4
  - `q8f16_ft`: INT8
- `--max_new_tokens`: maximum number of new tokens to generate.
- `--embed`: embedding model. Was tested with `sentence-transformers/all-MiniLM-L6-v2` and `sentence-transformers/all-MiniLM-L12-v2`.
- `--api`: backend / API. Possible values:
  - `mlc`: the main one to use
  - `awq`: requires the latest version of JetPack
  - `hf`: base version without any modifications
- `--test-mode`: debug mode with predefined templates.
- `--chunk_size`: size of the chunks the text is split into.
- `--chunk_overlap`: overlap between adjacent chunks.
- `--top_k`: number of nearest documents returned during retrieval.
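The chunking flags above can be sketched as a sliding window. This is an illustrative guess at their semantics (character-based windows with a fixed overlap), not the repo's actual splitter, and `chunk_text` is a hypothetical helper:

```python
# Hypothetical sketch of what --chunk_size / --chunk_overlap control.
# Assumption: character-based sliding windows; the real splitter may
# be token-based instead.

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", 4, 2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `--top_k`, retrieval then returns the `top_k` chunks whose embeddings are nearest to the query embedding.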
- A `ValueError` may be caused by a critical security vulnerability in `torch.load` (CVE-2025-32434).
  PyTorch version requirement: `torch >= 2.6` (does not apply to `safetensors`).
  Details: https://nvd.nist.gov/vuln/detail/CVE-2025-32434
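A minimal guard for this can be sketched as below. The CVE and the `torch >= 2.6` fix are real; the helper functions themselves are hypothetical and are written without importing torch so the version logic is visible on its own:

```python
# Hedged sketch: decide whether a pickle checkpoint may be loaded with
# torch.load, based on the installed torch version string. Assumption:
# safetensors files bypass pickle entirely and are unaffected by the CVE.

def torch_load_is_safe(torch_version: str) -> bool:
    """True if this torch version is patched for CVE-2025-32434 (>= 2.6)."""
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    return (major, minor) >= (2, 6)

def checkpoint_load_strategy(path: str, torch_version: str) -> str:
    """Pick a loading strategy for a checkpoint file."""
    if path.endswith(".safetensors"):
        return "safetensors"   # e.g. safetensors.torch.load_file(path)
    if torch_load_is_safe(torch_version):
        return "torch.load"    # patched torch, pickle loading acceptable
    raise ValueError(
        "torch.load is vulnerable (CVE-2025-32434); upgrade to torch >= 2.6 "
        "or use a .safetensors checkpoint"
    )
```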
- Sweep objective: maximize `tokens_per_second` under limited memory.
- Key hyperparameters:
  - KV-cache configuration
  - TensorRT / ONNX execution parameters
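The sweep can be skeletoned as a budget-constrained grid search. Everything below is an assumption for illustration: a real run would launch `main.py` per configuration and parse its reported throughput, while `benchmark_tokens_per_second` here is only a stub:

```python
import itertools

MEMORY_BUDGET_MB = 2048  # assumed budget; an Orin Nano shares RAM with the GPU

def benchmark_tokens_per_second(kv_cache_mb: int, batch_size: int) -> float:
    """Stub standing in for a real timed generation run (not a measurement)."""
    return kv_cache_mb / (10.0 * batch_size)  # placeholder throughput model

def sweep(kv_cache_options, batch_options):
    """Grid-search configurations, skipping those over the memory budget."""
    best = None
    for kv_mb, bs in itertools.product(kv_cache_options, batch_options):
        if kv_mb * bs > MEMORY_BUDGET_MB:
            continue  # configuration would not fit in memory
        tps = benchmark_tokens_per_second(kv_mb, bs)
        if best is None or tps > best[0]:
            best = (tps, kv_mb, bs)
    return best  # (tokens_per_second, kv_cache_mb, batch_size)
```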
- Techniques to further optimize (probably):
  - Huffman / entropy coding
  - Sparse weights
  - Weight pruning (`torch.nn.utils.prune`)
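The entropy-coding idea above can be illustrated with plain Huffman coding. This is a self-contained sketch, not anything from the repo; it assumes the weights are already INT-quantized, so only a handful of distinct symbols occur and frequent values get short codes:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table {symbol: bitstring} from a symbol sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single symbol gets a 1-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(symbols, table):
    """Concatenate the codewords for a symbol sequence."""
    return "".join(table[s] for s in symbols)

# 16 INT-quantized weights with a skewed distribution: a fixed 2-bit code
# would need 32 bits, while the Huffman code needs only 28.
weights = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
table = huffman_code(weights)
print(len(encode(weights, table)))  # → 28
```

Sparse weights and `torch.nn.utils.prune` attack the same memory budget from the other side, by zeroing weights rather than compressing their encoding.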