GXpert

GXpert is an example that mimics the MoE (Mixture of Experts) architecture using Go. It is not a framework or a new solution, but simply a methodological demonstration.

In this example, vLLM serves as the model backend, and OpenAI function calling (the tools API) acts as the gating layer that decides which expert should handle a request.

Alternatives are also possible. Because Triton support on Windows is limited, this example serves two separate vLLM instances, but serving a single Triton instance with a vLLM backend is equally feasible.

Instead of routing through OpenAI function calling, RAG (Retrieval-Augmented Generation) backed by a vector database can be used whenever there are clear selection criteria. Since this is merely a function-separation example, encoder-based architectures such as BERT can act as the router as well, and models like T5, LLaMA, or Gemma 1B are further alternative implementations.
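
As an illustration of the criteria-based alternative, here is a minimal sketch of embedding-based routing. Everything in it is hypothetical: routeByEmbedding and cosine are not part of GXpert, and embed stands in for whatever encoder you choose (a BERT-style model, a sentence embedder behind HTTP, etc.).

package router

import "math"

// cosine returns the cosine similarity of two equally sized vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// routeByEmbedding picks the expert whose description embedding is closest
// to the query embedding. descEmbeddings maps expert_name to a precomputed
// embedding of its expert_description; embed is any encoder of your choice.
func routeByEmbedding(query string, descEmbeddings map[string][]float64, embed func(string) []float64) string {
	q := embed(query)
	best, bestScore := "", math.Inf(-1)
	for name, d := range descEmbeddings {
		if s := cosine(q, d); s > bestScore {
			best, bestScore = name, s
		}
	}
	return best
}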

When to Use

Currently available MoE-based LLMs, such as Mixtral-8x7B, are far too heavy for individuals or small organizations to serve in practice.

While 7B models are still popular in smaller communities, there's growing interest in 1B and 3B models as LLMs become increasingly intelligent. Of course, tools like mergekit or mergoo can be used to simulate MoE architectures by merging models, but they come with limitations based on model types. In particular, the inability to merge models like Qwen2.5VL and LLaMA 3.3 is a significant drawback.

GXpert's approach is free from such model type restrictions. It supports integration with vision models and other diverse architectures. While this approach may resemble MCP (Model Context Protocol), its internal mechanism is fundamentally different.

As a result, it's well-suited for deploying multiple specialized small-scale models (e.g., 1B, 3B) across different domains, all within a unified platform that can handle simultaneous inference efficiently.

Example

[
  {
    "expert_name": "expert1",
    "expert_description": "This function provides Kubernetes command-line instructions such as kubectl. It does not directly generate or write YAML",
    "model": "devJy/GXpert-Example-Model1",
    "template": "chat-template",
    "trigger_instruction": "You are an assistant that aids with Kubernetes commands.",
    "temperature": 0.9,
    "max_tokens": 128,
    "host": "http://localhost",
    "port": "10101"
  },
  {
    "expert_name": "expert2",
    "expert_description": "This function generates the Kubernetes YAML file that the user intends to create.",
    "model": "devJy/GXpert-Example-Model2",
    "template": "prompt",
    "trigger_instruction": "Write YAML",
    "temperature": 0.9,
    "max_tokens": 256,
    "host": "http://localhost",
    "port": "10201"
  }
]
  • expert_name: Acts as the identifier for the expert, similar to the name of a function.

  • expert_description: A description of the expert, which can also serve as the function description when registering with services like OpenAI.

  • model: Specifies the model name used for requests in vLLM or Triton backends.

  • template: Optional. Refers to the prompt structure, typically divided into chat-template and prompt. Selection depends on how instructions were handled during fine-tuning.

  • trigger_instruction: Represents the system prompt or trigger instruction used during fine-tuning.
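
For reference, these JSON keys map onto a Go struct along the following lines. This is only a sketch inferred from the spec above; the actual models.Expert definition in the repository may differ.

// Expert mirrors one entry of expert_spec.json (field names inferred
// from the JSON keys shown above; the real definition may differ).
type Expert struct {
	ExpertName         string  `json:"expert_name"`
	ExpertDescription  string  `json:"expert_description"`
	Model              string  `json:"model"`
	Template           string  `json:"template"`            // "chat-template" or "prompt"
	TriggerInstruction string  `json:"trigger_instruction"` // optional system prompt or prompt prefix
	Temperature        float64 `json:"temperature"`         // sampling temperature passed to vLLM
	MaxTokens          int     `json:"max_tokens"`          // generation limit passed to vLLM
	Host               string  `json:"host"`                // expert endpoint host
	Port               string  `json:"port"`                // expert endpoint port
}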

CAUTION The two Expert models used in the example were tuned on very limited data and with minimal sophistication, purely for demonstration purposes. As a result, following the example as-is may lead to many errors. Please understand that these models are not intended to showcase performance, but rather to illustrate the structure and concept of the system.

  • Expert1 – Tuned for responding to basic Kubernetes command queries
  • Expert2 – Tuned for generating Kubernetes YAML configurations

How it works

To run vLLM directly, simply provide your Hugging Face token in the docker-compose file and start the service.

docker-compose up

The model environment will be automatically set up during startup, and the server itself has no special dependencies.

Apart from the OpenAI API key, no additional configuration is required; just run:

go run cma/main.go

This will launch the server and enable immediate use.

Register Function Tool

// Build the gating request: every expert is registered as an OpenAI tool,
// so the gating model can decide which expert should answer the query q.
toolCall = func(q string) openai.ChatCompletionNewParams {
	return openai.ChatCompletionNewParams{
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage(q),
		},
		Tools: func() []openai.ChatCompletionToolParam {
			expertParams := make([]openai.ChatCompletionToolParam, 0, len(experts))
			for _, expert := range experts {
				expertParams = append(expertParams, openai.ChatCompletionToolParam{
					Function: openai.FunctionDefinitionParam{
						Name:        expert.ExpertName,
						Description: openai.String(expert.ExpertDescription),
						Parameters:  nil,
					},
				})
				log.Info().Msgf("openai %s_tool register complete", expert.ExpertName)
			}
			return expertParams
		}(),
		Seed:  openai.Int(0),
		Model: openai.ChatModelGPT4o,
	}
}
...
completion, err := agent.Chat.Completions.New(context.Background(), toolCall(q))
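
The snippet above stops at the gating call; the step that turns the model's tool choice into an actual expert request is not shown. Below is a minimal sketch of that dispatch, assuming the openai-go response shape and the vllmRequest map built in the next section (dispatchToolCall itself is a hypothetical helper, not code from the repository).

// dispatchToolCall reads the tool the gating model selected and forwards
// the original query q to that expert's vLLM endpoint. Sketch only.
func dispatchToolCall(completion *openai.ChatCompletion, q string) (string, error) {
	if len(completion.Choices) == 0 || len(completion.Choices[0].Message.ToolCalls) == 0 {
		return "", errors.New("no expert selected by the gating model")
	}
	name := completion.Choices[0].Message.ToolCalls[0].Function.Name
	call, ok := vllmRequest[name] // built by buildVllmEndPoint below
	if !ok {
		return "", fmt.Errorf("unknown expert: %s", name)
	}
	return call(q)
}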

Register vLLM API

func buildVllmEndPoint(experts []models.Expert) error {
	for _, expert := range experts {
		expert := expert // per-iteration copy: the closures below outlive the loop, and before Go 1.22 they would otherwise all capture the final loop value
		addr := fmt.Sprintf("%s:%s", expert.Host, expert.Port)
		switch expert.Template {
		case "chat-template":
			vllmRequest[expert.ExpertName] = func(q string) (string, error) {
				payload := xform{}
				payload.Model = expert.Model
				payload.Temperature = expert.Temperature
				payload.MaxTokens = expert.MaxTokens
				if expert.TriggerInstruction != "" {
					payload.Messages = append(payload.Messages, RoleTable{
						Role:    "system",
						Content: expert.TriggerInstruction,
					})
				}
				payload.Messages = append(payload.Messages, RoleTable{
					Role:    "user",
					Content: q,
				})
				fmt.Println("prepare payload : ", payload)
				serializer, err := json.Marshal(payload)
				if err != nil {
					return "", err
				}
				c := &http.Client{Timeout: 45 * time.Second}

				req, err := http.NewRequest("POST", fmt.Sprintf("%s/v1/chat/completions", addr), bytes.NewBuffer(serializer))
				if err != nil {
					return "", err
				}
				req.Header.Set("Content-Type", "application/json")

				resp, err := c.Do(req)
				if err != nil {
					return "", err
				}
				defer resp.Body.Close()
				respBody, err := io.ReadAll(resp.Body)
				if err != nil {
					return "", err
				}
				out := VlResponse{}
				err = json.Unmarshal(respBody, &out)
				if err != nil {
					return "", err
				}
				return out.Choices[0].Message.Content, nil
			}
			log.Info().Msgf("[%s] gating layer add complete\nLayerDescription:%s", expert.ExpertName, expert.ExpertDescription)
		case "prompt":
			vllmRequest[expert.ExpertName] = func(q string) (string, error) {
				payload := xform{}
				payload.Model = expert.Model
				payload.Temperature = expert.Temperature
				payload.MaxTokens = expert.MaxTokens
				if expert.TriggerInstruction != "" {
					payload.Prompt = fmt.Sprintf("%s%s", expert.TriggerInstruction, q)
				} else {
					payload.Prompt = q
				}
				fmt.Println("prepare payload : ", payload)
				serializer, err := json.Marshal(payload)
				if err != nil {
					return "", err
				}
				c := &http.Client{Timeout: 45 * time.Second}

				req, err := http.NewRequest("POST", fmt.Sprintf("%s/v1/completions", addr), bytes.NewBuffer(serializer))
				if err != nil {
					return "", err
				}
				req.Header.Set("Content-Type", "application/json")

				resp, err := c.Do(req)
				if err != nil {
					return "", err
				}
				defer resp.Body.Close()
				respBody, err := io.ReadAll(resp.Body)
				if err != nil {
					return "", err
				}
				out := VlResponse{}
				err = json.Unmarshal(respBody, &out)
				if err != nil {
					return "", err
				}
				return out.Choices[0].Text, nil
			}
			log.Info().Msgf("[%s] gating layer add complete\nLayerDescription:%s", expert.ExpertName, expert.ExpertDescription)
		default:
			log.Error().Msg("unsupported [unknown] template")
			return errors.New("unsupported template type")
		}
	}
	return nil
}
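
The xform, RoleTable, and VlResponse types used above are not shown in this README. Based on how they are used and on vLLM's OpenAI-compatible /v1/chat/completions and /v1/completions schemas, they would need to look roughly like this (a sketch; the repository's actual definitions may carry more fields):

// RoleTable is one chat message in the OpenAI-compatible format.
type RoleTable struct {
	Role    string `json:"role"` // "system" | "user" | "assistant"
	Content string `json:"content"`
}

// xform is the request payload; Messages is used for chat-template
// experts, Prompt for prompt experts.
type xform struct {
	Model       string      `json:"model"`
	Messages    []RoleTable `json:"messages,omitempty"`
	Prompt      string      `json:"prompt,omitempty"`
	Temperature float64     `json:"temperature"`
	MaxTokens   int         `json:"max_tokens"`
}

// VlResponse covers both endpoints: Text is filled by /v1/completions,
// Message.Content by /v1/chat/completions.
type VlResponse struct {
	Choices []struct {
		Text    string `json:"text"`
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	} `json:"choices"`
}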

Internally, the system is designed to be flexibly extensible based on the expert_spec.json configuration.

Each expert's endpoint is built at server startup. While it is certainly possible to support dynamic endpoint updates after startup, managing the spec.json in such cases can become complex and error-prone.

Moreover, dynamically adding or removing endpoints during runtime would require concurrency control mechanisms (e.g., Mutex) to ensure thread safety, which would introduce additional locking overhead.

Therefore, in the current architecture, the system favors a lightweight and flexible structure where all expert functions are read and initialized once at startup, allowing the use of a simple map without mutexes for efficient access and expansion.
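
A sketch of the startup path this describes (the file name is taken from the README; loadExperts is a hypothetical helper, and the actual wiring in the repository may differ):

// vllmRequest is the plain expert registry: filled once at startup by
// buildVllmEndPoint and only read afterwards, so no mutex is needed.
var vllmRequest = map[string]func(q string) (string, error){}

// loadExperts reads expert_spec.json into the expert list.
func loadExperts(path string) ([]models.Expert, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var experts []models.Expert
	if err := json.Unmarshal(raw, &experts); err != nil {
		return nil, err
	}
	return experts, nil
}

// At startup:
//   experts, err := loadExperts("expert_spec.json")
//   ...
//   err = buildVllmEndPoint(experts)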

There are two available APIs:

/v1/direct/call: Sends a request to a specific expert selected explicitly by the user.

Example 1 Case (screenshots e1, e1_1, e1_2)

Example 2 Case (screenshots e2, e2_1)

/v1/inference/smartcall: Sends a request without specifying an expert. The system automatically selects the most appropriate expert based on the input and returns the result.

Example 1 Case (screenshot s1)

Example 2 Case (screenshot s2)
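
For concreteness, here is a hedged sketch of how the two endpoints might be called from Go. The server address and the JSON field names (expert_name, query) are assumptions made for this illustration only; check the handler code for the real schema.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Base address and body fields below are illustrative assumptions.
	const base = "http://localhost:8080"

	// /v1/direct/call: the caller names the expert explicitly.
	direct := `{"expert_name": "expert1", "query": "How do I list pods in all namespaces?"}`
	printResponse(http.Post(base+"/v1/direct/call", "application/json", strings.NewReader(direct)))

	// /v1/inference/smartcall: no expert is named; the gating model picks one.
	smart := `{"query": "Write a Deployment YAML for nginx with 3 replicas."}`
	printResponse(http.Post(base+"/v1/inference/smartcall", "application/json", strings.NewReader(smart)))
}

func printResponse(resp *http.Response, err error) {
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}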
