LLM - Large Language Model

This repository optimizes LLM inference throughput using multi-processing and multi-threading in C.

1. Multi-Processing

1.1 File Structure

Multiprocessing
├── common.h # common and helper macro defns, read through first
├── main.c
├── inference.c # [your task] template for inference child process implementation
├── Makefile # makefile for the project
├── model.h # GPT model definition, modification not allowed
└── avg_cpu_use.py # Utility to parse the log and calculate average cpu usage

1.2 Objective

Optimize LLM performance with multiprocessing by separating user prompt acceptance from inference.

1.3 Model Description

  • The LLM used is based on SmolLM by HuggingFaceTB.

  • A complete single-threaded LLM inference engine for Llama3, an open-source GPT-style model, is provided as the starting point.

  • The inference framework is based on the open-source project llama2.c by Andrej Karpathy.

1.4 How it works

Please download the model and tokenizer to the same folder:
$ wget -O model.bin https://huggingface.co/huangs0/llama2.c/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/llama2.c/resolve/main/tokenizer.bin
or with Makefile (recommended)
$ make prepare
Compile it with level-3 optimization and link the math library:
$ gcc -o inference inference.c -O3 -lm
or with the Makefile (recommended):
$ make -B inference # -B forces a rebuild even if the binary is up to date

The -lm flag links the math library, and -O3 applies the highest optimization level allowed within the C standard.

$ ./main <seed> 2>log
# Put your prompt when >>> appears

The main process collects the running status of the inference process.

All statistics of a process are read from /proc/{pid}/stat and saved to the log file. The fields are as follows:

Item      Description
pid       Process ID
tcomm     Executable name
state     Running state
policy    Scheduling policy
nice      Nice value
vsize     Virtual memory size
task_cpu  CPU the process is scheduled on
utime     Time spent in user mode, in units of 10 ms
stime     Time spent in system mode, in units of 10 ms

Please check the proc(5) man page for more information.

1.5 Scheduling Policy, Nice, and Priority Settings

Before the first generation, the main process sets the scheduling policy and nice value of the inference process via the SYS_sched_setattr system call.

  • Normal policies:
    • SCHED_OTHER: the default Linux scheduling policy, also named SCHED_NORMAL
    • SCHED_BATCH: for non-interactive, CPU-intensive workloads
    • SCHED_IDLE: for low-priority background tasks
  • Real-time policies:
    • SCHED_FIFO: first-in-first-out policy with preemption
    • SCHED_RR: round-robin policy
    • SCHED_DEADLINE: earliest-deadline-first with preemption

For normal policies, scheduling priority is configured via the nice value, an integer between -20 (highest priority) and +19 (lowest), with 0 as the default.

Please check the sched_setattr(2) man page for more information.

2. Multi-Threading

GPT leverages multi-head attention, a mechanism that lets each token draw on important information from earlier tokens in the sequence.

Thus, to accelerate GPT and get faster responses, it is critical to speed up matrix-vector multiplication and multi-head attention computation.

2.1 Prepare Environment

Download model files

$ make prepare # downloads only if the files do not exist
# or manually via wget (always re-downloads; not recommended)
$ wget -O model.bin https://huggingface.co/huangs0/smollm/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/smollm/resolve/main/tokenizer.bin

Compile and run the inference program

$ make -B
 # or manually via gcc
$ gcc -o parallel parallel.c -O2 -lm -lpthread

To Be Updated
