This repository optimizes LLM inference performance using multi-processing and multi-threading in C.
Multiprocessing
├── common.h # common and helper macro defns, read through first
├── main.c
├── inference.c # [your task] template for inference child process implementation
├── Makefile # makefile for the project
├── model.h # GPT model definition, modification not allowed
└── avg_cpu_use.py # utility to parse the log and calculate average CPU usage

Optimization of LLM Performance using Multiprocessing - Divide User Prompt Acceptance and Inference
- The LLM used is based on SmolLM by HuggingfaceTB.
- Llama3, an open-source variation of GPT, and a complete single-threaded LLM inference engine are provided as the starting point.
- The inference framework is based on the open-source project llama2.c by Andrej Karpathy.
$ wget -O model.bin https://huggingface.co/huangs0/llama2.c/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/llama2.c/resolve/main/tokenizer.bin
or with Makefile (recommended)
$ make prepare

Then compile:
$ gcc -o inference inference.c -O3 -lm
# or with Makefile (recommended)
$ make -B inference # -B := force rebuild
Please use the -lm flag to link the math library and the -O3 flag to apply the best optimization allowed within the C standard.
$ ./main <seed> 2>log
# Put your prompt when >>> appears
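The split named in the title (prompt acceptance in the main process, generation in a child) can be sketched with fork() and pipes. The `roundtrip` helper below is illustrative only, not this repository's code: its child merely echoes the prompt back where a real inference child would run the model.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

// Send a prompt to a forked "inference" child over one pipe and read its
// reply over another. The child here just echoes "seen:<prompt>"; a real
// implementation would run the model instead. Returns reply length or -1.
int roundtrip(const char *prompt, char *buf, size_t cap) {
    int to_child[2], to_parent[2];
    if (pipe(to_child) == -1 || pipe(to_parent) == -1) return -1;

    pid_t pid = fork();
    if (pid == 0) {                         // child: inference side
        close(to_child[1]); close(to_parent[0]);
        char in[256];
        ssize_t n = read(to_child[0], in, sizeof(in) - 1);
        if (n < 0) n = 0;
        in[n] = '\0';
        char out[300];
        int m = snprintf(out, sizeof(out), "seen:%s", in);
        write(to_parent[1], out, m);
        _exit(0);
    }
    close(to_child[0]); close(to_parent[1]); // parent: prompt acceptance side
    write(to_child[1], prompt, strlen(prompt));
    close(to_child[1]);                      // EOF so the child's read returns
    ssize_t n = read(to_parent[0], buf, cap - 1);
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    if (n < 0) return -1;
    buf[n] = '\0';
    return (int)n;
}
```

The key design point is that each pipe is one-directional: the parent closes the ends it does not use so that EOF propagates correctly.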
All statistics of a process are obtained from /proc/{pid}/stat and saved in log.txt:
| Item | Description |
|---|---|
| pid | Process ID |
| tcomm | Executable Name |
| state | Running Status |
| policy | Scheduling Policy |
| nice | Nice value |
| vsize | Virtual Memory Size |
| task_cpu | CPU id of the process scheduled to |
| utime | Running time of the process spent in user mode, in clock ticks (typically 10 ms each) |
| stime | Running time of the process spent in system mode, in clock ticks (typically 10 ms each) |
Please check the proc(5) manpage for more information.
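A minimal parser for the two timing fields this project logs might look like the following sketch; it assumes only the field layout documented in proc(5), where utime and stime are fields 14 and 15, and comm (field 2) may itself contain spaces, so scanning starts after the last ')'.

```c
#include <stdio.h>
#include <string.h>

// Extract utime (field 14) and stime (field 15) from one /proc/<pid>/stat
// line. Fields 3..13 (state plus ten numeric fields) are skipped.
// Returns 0 on success, -1 on a malformed line.
int parse_utime_stime(const char *line, unsigned long *utime, unsigned long *stime) {
    const char *p = strrchr(line, ')');  // comm may contain ') ', take the last one
    if (!p) return -1;
    char state;
    long skip;
    int n = sscanf(p + 1, " %c %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld %lu %lu",
                   &state, &skip, &skip, &skip, &skip, &skip,
                   &skip, &skip, &skip, &skip, &skip, utime, stime);
    return n == 13 ? 0 : -1;
}
```

Reading the whole line once and parsing it in memory avoids racing against the kernel updating the file between reads.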
Before the first generation, the main process sets the scheduling policy and nice value of the inference process using SYS_sched_setattr.
- Normal Policies:
  - SCHED_OTHER: the default scheduling policy of Linux, also named SCHED_NORMAL
  - SCHED_BATCH: for non-interactive, CPU-intensive workloads
  - SCHED_IDLE: for low-priority background tasks
- Real-time Policies:
  - SCHED_FIFO: First-In-First-Out Policy with Preemption
  - SCHED_RR: Round-Robin Policy
  - SCHED_DEADLINE: Earliest Deadline First with Preemption
For normal policies, the scheduling priority is configured via the nice value, an integer between -20 (highest priority) and +19 (lowest priority), with 0 as the default.
Please check the sched_setattr(2) manpage for more information.
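Since glibc provides no wrapper for this system call, it is invoked through syscall(2) with a hand-declared struct sched_attr, following the layout in the sched_setattr(2) manpage. The helper name below is illustrative, not this repository's code; note that raising the nice value (lowering priority) needs no special privilege.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

// Layout from sched_setattr(2); the kernel uses the size field to accept
// older/shorter versions of this struct.
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;      // used by the normal policies
    uint32_t sched_priority;  // used by SCHED_FIFO / SCHED_RR
    uint64_t sched_runtime;   // used by SCHED_DEADLINE
    uint64_t sched_deadline;
    uint64_t sched_period;
};

// Set a process's policy and nice value; pid 0 means the calling process.
// Policy constants: SCHED_OTHER = 0, SCHED_BATCH = 3, SCHED_IDLE = 5.
static int set_policy_and_nice(pid_t pid, uint32_t policy, int32_t nice_val) {
    struct sched_attr attr = {0};
    attr.size = sizeof(attr);
    attr.sched_policy = policy;
    attr.sched_nice = nice_val;
    return syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

In this project's setting, the main process would call this on the inference child's pid right before the first generation.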
Thus, to accelerate the GPT model and get a faster response, it is critical to speed up the matrix-vector multiplication and the multi-head attention computation.
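A common way to speed up the matrix-vector multiplication with pthreads is to partition the output rows across a fixed pool of threads. This is a sketch under my own naming (thread count, `Task` struct), not the repository's implementation:

```c
#include <pthread.h>

#define N_THREADS 4

typedef struct {
    const float *mat, *vec;
    float *out;
    int cols, start, end;   // this thread computes rows [start, end)
} Task;

static void *matvec_worker(void *arg) {
    Task *t = arg;
    for (int i = t->start; i < t->end; i++) {
        float sum = 0.0f;
        for (int j = 0; j < t->cols; j++)
            sum += t->mat[i * t->cols + j] * t->vec[j];
        t->out[i] = sum;     // rows are disjoint, so no locking is needed
    }
    return NULL;
}

// out = mat (rows x cols) * vec (cols), rows split evenly across threads.
void matvec(const float *mat, const float *vec, float *out, int rows, int cols) {
    pthread_t tid[N_THREADS];
    Task task[N_THREADS];
    int chunk = (rows + N_THREADS - 1) / N_THREADS;  // ceil(rows / N_THREADS)
    for (int k = 0; k < N_THREADS; k++) {
        int start = k * chunk;
        int end = (start + chunk > rows) ? rows : start + chunk;
        task[k] = (Task){mat, vec, out, cols, start, end};
        pthread_create(&tid[k], NULL, matvec_worker, &task[k]);
    }
    for (int k = 0; k < N_THREADS; k++)
        pthread_join(tid[k], NULL);
}
```

Because each thread writes a disjoint slice of `out`, the workers need no synchronization beyond the final joins; in a real engine the threads would typically be created once and reused across layers rather than spawned per call.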
$ make prepare # will download only if not already present
# or manually download via wget (forces a re-download, not recommended)
$ wget -O model.bin https://huggingface.co/huangs0/smollm/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/smollm/resolve/main/tokenizer.bin

$ make -B
# or manually via gcc
$ gcc -o parallel parallel.c -O2 -lm -lpthread