LLM - Large Language Model

This repository optimizes LLM inference throughput using multi-processing and multi-threading in C.

1. Multi-Processing

1.1 File Structure

Multiprocessing
├── common.h # common and helper macro defns, read through first
├── main.c
├── inference.c # [your task] template for inference child process implementation
├── Makefile # makefile for the project
├── model.h # GPT model definition, modification not allowed
└── avg_cpu_use.py # Utility to parse the log and calculate average cpu usage

1.2 Objective

Optimize LLM performance with multiprocessing by separating user prompt acceptance from inference.

1.3 Model Description

  • The LLM used is based on SmolLM by HuggingFaceTB.

  • A complete single-threaded LLM inference engine for Llama3, an open-source GPT-style model, is provided as the starting point.

  • The inference framework is based on the open-source project llama2.c by Andrej Karpathy.

1.4 How it works

Please download the model and tokenizer to the same folder:
$ wget -O model.bin https://huggingface.co/huangs0/llama2.c/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/llama2.c/resolve/main/tokenizer.bin
or with Makefile (recommended)
$ make prepare
Compile it with level-3 optimization and link the math library:
$ gcc -o inference inference.c -O3 -lm
or with the Makefile (recommended):
$ make -B inference # -B forces a rebuild even if the binary is up to date

The -lm flag links the math library, and -O3 applies the highest optimization level allowed within the C standard.

$ ./main <seed> 2>log
# Put your prompt when >>> appears

The main process collects the running status of the inference process.

All statistics of a process are read from /proc/{pid}/stat and saved to the log file. The fields are as follows:

Item      Description
pid       Process ID
tcomm     Executable name
state     Running state
policy    Scheduling policy
nice      Nice value
vsize     Virtual memory size
task_cpu  CPU the process is scheduled on
utime     Time spent in user mode, in units of 10 ms
stime     Time spent in system mode, in units of 10 ms

Please check the proc(5) man page for more information.

1.5 Scheduling Policy, Nice, and Priority Settings

Before the first generation, the main process sets the scheduling policy and nice value of the inference process via the SYS_sched_setattr system call.

  • Normal policies:
    • SCHED_OTHER: the default Linux scheduling policy, also named SCHED_NORMAL
    • SCHED_BATCH: for non-interactive, CPU-intensive workloads
    • SCHED_IDLE: for low-priority background tasks
  • Real-time policies:
    • SCHED_FIFO: first-in-first-out policy with preemption
    • SCHED_RR: round-robin policy
    • SCHED_DEADLINE: earliest-deadline-first with preemption

For normal policies, scheduling priority is configured via the nice value, an integer between -20 (highest priority) and +19 (lowest), with 0 as the default.

Please check the sched_setattr(2) man page for more information.

2. Multi-Threading

GPT leverages multi-head attention, a mechanism that lets each token draw on important information from earlier tokens in the sequence.

Thus, to accelerate GPT and get faster responses, it is critical to speed up matrix-vector multiplication and multi-head attention computation.

2.1 Prepare Environment

Download model files

$ make prepare # downloads only if the files do not exist
# or manually via wget (always re-downloads; not recommended)
$ wget -O model.bin https://huggingface.co/huangs0/smollm/resolve/main/model.bin
$ wget -O tokenizer.bin https://huggingface.co/huangs0/smollm/resolve/main/tokenizer.bin

Compile and run the inference program

$ make -B
 # or manually via gcc
$ gcc -o parallel parallel.c -O2 -lm -lpthread

To Be Updated
