GXpert

GXpert is an example that mimics the MoE (Mixture of Experts) architecture using Go. It is not a framework or a new solution, but simply a methodological demonstration.

In this example, vLLM serves as the model backend, and OpenAI function calling (the tools API) acts as the gating layer that decides which expert should handle a request.

Alternatives are also possible. Because Triton support on Windows is limited, this example serves two separate vLLM instances, but serving a single Triton instance with a vLLM backend is equally feasible.

Instead of routing through OpenAI function calling, RAG (Retrieval-Augmented Generation) backed by a vector database can be used whenever there are clear selection criteria. Since this is merely a function-separation example, encoder-based architectures such as BERT can act as the router as well, and models like T5, LLaMA, or Gemma 1B are further alternative implementations.
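
As an illustration of the criteria-based alternative, here is a minimal sketch of embedding-based routing. Everything in it is hypothetical: routeByEmbedding and cosine are not part of GXpert, and embed stands in for whatever encoder you choose (a BERT-style model, a sentence embedder behind HTTP, etc.).

package router

import "math"

// cosine returns the cosine similarity of two equally sized vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// routeByEmbedding picks the expert whose description embedding is closest
// to the query embedding. descEmbeddings maps expert_name to a precomputed
// embedding of its expert_description; embed is any encoder of your choice.
func routeByEmbedding(query string, descEmbeddings map[string][]float64, embed func(string) []float64) string {
	q := embed(query)
	best, bestScore := "", math.Inf(-1)
	for name, d := range descEmbeddings {
		if s := cosine(q, d); s > bestScore {
			best, bestScore = name, s
		}
	}
	return best
}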

When to Use

Currently available MoE-based LLMs, such as Mixtral-8x7B, are far too heavy for individuals or small organizations to serve in practice.

While 7B models are still popular in smaller communities, there's growing interest in 1B and 3B models as LLMs become increasingly intelligent. Of course, tools like mergekit or mergoo can be used to simulate MoE architectures by merging models, but they come with limitations based on model types. In particular, the inability to merge models like Qwen2.5VL and LLaMA 3.3 is a significant drawback.

GXpert's approach is free from such model type restrictions. It supports integration with vision models and other diverse architectures. While this approach may resemble MCP (Model Context Protocol), its internal mechanism is fundamentally different.

As a result, it's well-suited for deploying multiple specialized small-scale models (e.g., 1B, 3B) across different domains, all within a unified platform that can handle simultaneous inference efficiently.

Example

[
  {
    "expert_name": "expert1",
    "expert_description": "This function provides Kubernetes command-line instructions such as kubectl. It does not directly generate or write YAML",
    "model": "devJy/GXpert-Example-Model1",
    "template": "chat-template",
    "trigger_instruction": "You are an assistant that aids with Kubernetes commands.",
    "temperature": 0.9,
    "max_tokens": 128,
    "host": "http://localhost",
    "port": "10101"
  },
  {
    "expert_name": "expert2",
    "expert_description": "This function generates the Kubernetes YAML file that the user intends to create.",
    "model": "devJy/GXpert-Example-Model2",
    "template": "prompt",
    "trigger_instruction": "Write YAML",
    "temperature": 0.9,
    "max_tokens": 256,
    "host": "http://localhost",
    "port": "10201"
  }
]
  • expert_name: Acts as the identifier for the expert, similar to the name of a function.

  • expert_description: A description of the expert, which can also serve as the function description when registering with services like OpenAI.

  • model: Specifies the model name used for requests in vLLM or Triton backends.

  • template: Optional. Refers to the prompt structure, typically divided into chat-template and prompt. Selection depends on how instructions were handled during fine-tuning.

  • trigger_instruction: Represents the system prompt or trigger instruction used during fine-tuning.
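
For reference, these JSON keys map onto a Go struct along the following lines. This is only a sketch inferred from the spec above; the actual models.Expert definition in the repository may differ.

// Expert mirrors one entry of expert_spec.json (field names inferred
// from the JSON keys shown above; the real definition may differ).
type Expert struct {
	ExpertName         string  `json:"expert_name"`
	ExpertDescription  string  `json:"expert_description"`
	Model              string  `json:"model"`
	Template           string  `json:"template"`            // "chat-template" or "prompt"
	TriggerInstruction string  `json:"trigger_instruction"` // optional system prompt or prompt prefix
	Temperature        float64 `json:"temperature"`         // sampling temperature passed to vLLM
	MaxTokens          int     `json:"max_tokens"`          // generation limit passed to vLLM
	Host               string  `json:"host"`                // expert endpoint host
	Port               string  `json:"port"`                // expert endpoint port
}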

CAUTION The two Expert models used in the example were tuned on very limited data and with minimal sophistication, purely for demonstration purposes. As a result, following the example as-is may lead to many errors. Please understand that these models are not intended to showcase performance, but rather to illustrate the structure and concept of the system.

  • Expert1 – Tuned for responding to basic Kubernetes command queries
  • Expert2 – Tuned for generating Kubernetes YAML configurations

How it works

To run vLLM directly, simply provide your Hugging Face token in the docker-compose file and start the service.

docker-compose up

The model environment will be automatically set up during startup, and the server itself has no special dependencies.

Apart from the OpenAI API key, no additional configuration is required; just run:

go run cma/main.go

This will launch the server and enable immediate use.

Register Function Tool

// Build the gating request: every expert is registered as an OpenAI tool,
// so the gating model can decide which expert should answer the query q.
toolCall = func(q string) openai.ChatCompletionNewParams {
	return openai.ChatCompletionNewParams{
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage(q),
		},
		Tools: func() []openai.ChatCompletionToolParam {
			expertParams := make([]openai.ChatCompletionToolParam, 0, len(experts))
			for _, expert := range experts {
				expertParams = append(expertParams, openai.ChatCompletionToolParam{
					Function: openai.FunctionDefinitionParam{
						Name:        expert.ExpertName,
						Description: openai.String(expert.ExpertDescription),
						Parameters:  nil,
					},
				})
				log.Info().Msgf("openai %s_tool register complete", expert.ExpertName)
			}
			return expertParams
		}(),
		Seed:  openai.Int(0),
		Model: openai.ChatModelGPT4o,
	}
}
...
completion, err := agent.Chat.Completions.New(context.Background(), toolCall(q))
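
The snippet above stops at the gating call; the step that turns the model's tool choice into an actual expert request is not shown. Below is a minimal sketch of that dispatch, assuming the openai-go response shape and the vllmRequest map built in the next section (dispatchToolCall itself is a hypothetical helper, not code from the repository).

// dispatchToolCall reads the tool the gating model selected and forwards
// the original query q to that expert's vLLM endpoint. Sketch only.
func dispatchToolCall(completion *openai.ChatCompletion, q string) (string, error) {
	if len(completion.Choices) == 0 || len(completion.Choices[0].Message.ToolCalls) == 0 {
		return "", errors.New("no expert selected by the gating model")
	}
	name := completion.Choices[0].Message.ToolCalls[0].Function.Name
	call, ok := vllmRequest[name] // built by buildVllmEndPoint below
	if !ok {
		return "", fmt.Errorf("unknown expert: %s", name)
	}
	return call(q)
}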

Register vLLM API

func buildVllmEndPoint(experts []models.Expert) error {
	for _, expert := range experts {
		expert := expert // per-iteration copy: the closures below outlive the loop, and before Go 1.22 they would otherwise all capture the final loop value
		addr := fmt.Sprintf("%s:%s", expert.Host, expert.Port)
		switch expert.Template {
		case "chat-template":
			vllmRequest[expert.ExpertName] = func(q string) (string, error) {
				payload := xform{}
				payload.Model = expert.Model
				payload.Temperature = expert.Temperature
				payload.MaxTokens = expert.MaxTokens
				if expert.TriggerInstruction != "" {
					payload.Messages = append(payload.Messages, RoleTable{
						Role:    "system",
						Content: expert.TriggerInstruction,
					})
				}
				payload.Messages = append(payload.Messages, RoleTable{
					Role:    "user",
					Content: q,
				})
				fmt.Println("prepare payload : ", payload)
				serializer, err := json.Marshal(payload)
				if err != nil {
					return "", err
				}
				c := &http.Client{Timeout: 45 * time.Second}

				req, err := http.NewRequest("POST", fmt.Sprintf("%s/v1/chat/completions", addr), bytes.NewBuffer(serializer))
				if err != nil {
					return "", err
				}
				req.Header.Set("Content-Type", "application/json")

				resp, err := c.Do(req)
				if err != nil {
					return "", err
				}
				defer resp.Body.Close()
				respBody, err := io.ReadAll(resp.Body)
				if err != nil {
					return "", err
				}
				out := VlResponse{}
				err = json.Unmarshal(respBody, &out)
				if err != nil {
					return "", err
				}
				return out.Choices[0].Message.Content, nil
			}
			log.Info().Msgf("[%s] gating layer add complete\nLayerDescription:%s", expert.ExpertName, expert.ExpertDescription)
		case "prompt":
			vllmRequest[expert.ExpertName] = func(q string) (string, error) {
				payload := xform{}
				payload.Model = expert.Model
				payload.Temperature = expert.Temperature
				payload.MaxTokens = expert.MaxTokens
				if expert.TriggerInstruction != "" {
					payload.Prompt = fmt.Sprintf("%s%s", expert.TriggerInstruction, q)
				} else {
					payload.Prompt = q
				}
				fmt.Println("prepare payload : ", payload)
				serializer, err := json.Marshal(payload)
				if err != nil {
					return "", err
				}
				c := &http.Client{Timeout: 45 * time.Second}

				req, err := http.NewRequest("POST", fmt.Sprintf("%s/v1/completions", addr), bytes.NewBuffer(serializer))
				if err != nil {
					return "", err
				}
				req.Header.Set("Content-Type", "application/json")

				resp, err := c.Do(req)
				if err != nil {
					return "", err
				}
				defer resp.Body.Close()
				respBody, err := io.ReadAll(resp.Body)
				if err != nil {
					return "", err
				}
				out := VlResponse{}
				err = json.Unmarshal(respBody, &out)
				if err != nil {
					return "", err
				}
				return out.Choices[0].Text, nil
			}
			log.Info().Msgf("[%s] gating layer add complete\nLayerDescription:%s", expert.ExpertName, expert.ExpertDescription)
		default:
			log.Error().Msg("unsupported [unknown] template")
			return errors.New("unsupported template type")
		}
	}
	return nil
}
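
The xform, RoleTable, and VlResponse types used above are not shown in this README. Based on how they are used and on vLLM's OpenAI-compatible /v1/chat/completions and /v1/completions schemas, they would need to look roughly like this (a sketch; the repository's actual definitions may carry more fields):

// RoleTable is one chat message in the OpenAI-compatible format.
type RoleTable struct {
	Role    string `json:"role"` // "system" | "user" | "assistant"
	Content string `json:"content"`
}

// xform is the request payload; Messages is used for chat-template
// experts, Prompt for prompt experts.
type xform struct {
	Model       string      `json:"model"`
	Messages    []RoleTable `json:"messages,omitempty"`
	Prompt      string      `json:"prompt,omitempty"`
	Temperature float64     `json:"temperature"`
	MaxTokens   int         `json:"max_tokens"`
}

// VlResponse covers both endpoints: Text is filled by /v1/completions,
// Message.Content by /v1/chat/completions.
type VlResponse struct {
	Choices []struct {
		Text    string `json:"text"`
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	} `json:"choices"`
}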

Internally, the system is designed to be flexibly extensible based on the expert_spec.json configuration.

Each expert's endpoint is built at server startup. While it is certainly possible to support dynamic endpoint updates after startup, managing the spec.json in such cases can become complex and error-prone.

Moreover, dynamically adding or removing endpoints during runtime would require concurrency control mechanisms (e.g., Mutex) to ensure thread safety, which would introduce additional locking overhead.

Therefore, in the current architecture, the system favors a lightweight and flexible structure where all expert functions are read and initialized once at startup, allowing the use of a simple map without mutexes for efficient access and expansion.
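
A sketch of the startup path this describes (the file name is taken from the README; loadExperts is a hypothetical helper, and the actual wiring in the repository may differ):

// vllmRequest is the plain expert registry: filled once at startup by
// buildVllmEndPoint and only read afterwards, so no mutex is needed.
var vllmRequest = map[string]func(q string) (string, error){}

// loadExperts reads expert_spec.json into the expert list.
func loadExperts(path string) ([]models.Expert, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var experts []models.Expert
	if err := json.Unmarshal(raw, &experts); err != nil {
		return nil, err
	}
	return experts, nil
}

// At startup:
//   experts, err := loadExperts("expert_spec.json")
//   ...
//   err = buildVllmEndPoint(experts)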

There are two available APIs:

/v1/direct/call: Sends a request to a specific expert selected explicitly by the user.

Example 1 Case (screenshots e1, e1_1, e1_2)

Example 2 Case (screenshots e2, e2_1)

/v1/inference/smartcall: Sends a request without specifying an expert. The system automatically selects the most appropriate expert based on the input and returns the result.

Example 1 Case (screenshot s1)

Example 2 Case (screenshot s2)
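
For concreteness, here is a hedged sketch of how the two endpoints might be called from Go. The server address and the JSON field names (expert_name, query) are assumptions made for this illustration only; check the handler code for the real schema.

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Base address and body fields below are illustrative assumptions.
	const base = "http://localhost:8080"

	// /v1/direct/call: the caller names the expert explicitly.
	direct := `{"expert_name": "expert1", "query": "How do I list pods in all namespaces?"}`
	printResponse(http.Post(base+"/v1/direct/call", "application/json", strings.NewReader(direct)))

	// /v1/inference/smartcall: no expert is named; the gating model picks one.
	smart := `{"query": "Write a Deployment YAML for nginx with 3 replicas."}`
	printResponse(http.Post(base+"/v1/inference/smartcall", "application/json", strings.NewReader(smart)))
}

func printResponse(resp *http.Response, err error) {
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}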
