A small, educational AI coding agent built with Python and Google's Gemini API. Give it a natural-language prompt and it will autonomously explore a sandboxed project directory, read files, write or modify code, and run Python scripts — all by calling a small set of tools you've given it permission to use.
This repo ships with a tiny calculator/ app as a demo "workspace" so you can immediately try the agent on something concrete (for example: "fix the bug in the calculator's precedence handling" or "add a power operator to the calculator").
- What this project teaches
- Quick start
- Usage examples
- How it works (high-level)
- Architecture in detail
- The agent loop, step by step
- The four tools
- Security: the working directory sandbox
- Project layout
- Configuration
- Running the unit tests
- Extending the agent
- Troubleshooting
- Glossary
If you're new to LLM-powered "agents", this codebase is a minimal but realistic example of the core pattern that powers tools like Cursor, Claude Code, and OpenAI's function-calling demos. By reading the code you'll learn:
- How a Large Language Model (LLM) can be turned into an agent by giving it tools (also called functions) it can call.
- How function calling works with the Google Gemini API (
google-genaiSDK). - How to design a small agent loop that keeps calling the model until it produces a final answer.
- How to sandbox an agent so it can only touch files inside a designated working directory.
- How to keep a running conversation history that includes both the model's messages and tool results.
The whole thing is < 300 lines of Python, so it's easy to read end-to-end.
- Python 3.13+ (see
.python-version/pyproject.toml) - A Google Gemini API key — get one free from Google AI Studio.
- Recommended:
uvfor fast, reproducible installs.pipworks too.
git clone <your-fork-url> boots-agent
cd boots-agent
# Using uv (recommended — uv.lock is included)
uv sync
# Or using pip
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install "google-genai==1.12.1" "python-dotenv==1.1.0"Create a .env file in the project root:
echo 'GEMINI_API_KEY=your_key_here' > .env.env is git-ignored, so your key stays local.
uv run main.py "list the files in the calculator project and explain what it does"(or python main.py "..." if you're using a regular venv)
Add --verbose to see every tool call, token count, and intermediate response:
uv run main.py --verbose "run the tests in the calculator and tell me if they pass"The agent works on whatever directory is configured as WORKING_DIR in config.py — by default, the bundled calculator/. Try prompts like:
"What files are in this project? Give me a quick summary of each.""Run the unit tests and tell me which ones pass.""There's a bug somewhere in the calculator. Find it and fix it.""Add a new operator '%' (modulo) to the calculator and update the tests.""Refactor render.py to also support a plain-text output format, controlled by a parameter."
Behind the scenes the agent will: list files → read the relevant ones → run them if needed → edit them → re-run to verify → report back to you.
At its core the agent is a small loop around a single function call:
┌───────────────────────────────┐
prompt ──▶ │ send conversation to Gemini │
└───────────────┬───────────────┘
│
┌───────────────▼───────────────┐
│ Did the model call a tool? │
└───────┬───────────────┬───────┘
no │ │ yes
▼ ▼
print final run the tool locally,
response & append its output to
exit the conversation,
loop again
Each iteration the LLM sees:
- the original user prompt,
- everything it has said so far,
- and the results of every tool call it has made.
This is how it can do multi-step work like "open three files, find the bug, write a fix, and run the tests" from a single prompt.
The repo is split into four conceptual layers:
| Layer | Files | Responsibility |
|---|---|---|
| Entry point | main.py |
Parse CLI args, set up the Gemini client, run the agent loop. |
| Tool dispatcher | call_function.py |
Maps tool names the LLM emits to real Python functions; injects the sandbox directory. |
| Tools | functions/*.py |
Concrete implementations: list files, read file, run python, write file. Each file also exports a Schema that tells Gemini how to call it. |
| Prompt / config | prompts.py, config.py |
The system prompt that gives the agent its "personality" and the constants that define the sandbox. |
| Demo workspace | calculator/ |
A sample app the agent operates on — the sandbox. |
def main():
parser = argparse.ArgumentParser(description="AI Code Assistant")
parser.add_argument("user_prompt", type=str, help="Prompt to send to Gemini")
parser.add_argument("--verbose", action="store_true", help="Enable verbose output")
args = parser.parse_args()
load_dotenv()
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
raise RuntimeError("GEMINI_API_KEY environment variable not set")
client = genai.Client(api_key=api_key)
messages = [types.Content(role="user", parts=[types.Part(text=args.user_prompt)])]
if args.verbose:
print(f"User prompt: {args.user_prompt}\n")
for _ in range(20):
result = generate_content(client, messages, args.verbose)
if result:
break
else:
print("Maximum iterations (20) reached without a final response.")
sys.exit(1)Key things to notice:
messagesis a list ofContentobjects — this is the conversation history. We seed it with the user's prompt.- The loop runs at most 20 iterations. This is a safety net: if the model gets stuck in a tool-calling loop, we bail instead of burning your API quota.
generate_content(...)returnsTruewhen the model has produced a normal text answer (we're done) andFalsewhen it called a tool (we need another iteration so the model can react to the tool's result).
def generate_content(client, messages, verbose):
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=messages,
config=types.GenerateContentConfig(
tools=[available_functions], system_instruction=system_prompt
),
)
if response.candidates:
for candidate in response.candidates:
if candidate.content:
messages.append(candidate.content)
if not response.usage_metadata:
raise RuntimeError("Gemini API response appears to be malformed")
if verbose:
print("Prompt tokens:", response.usage_metadata.prompt_token_count)
print("Response tokens:", response.usage_metadata.candidates_token_count)
if not response.function_calls:
print("Response:")
print(response.text)
return True
function_responses = []
for function_call in response.function_calls:
result = call_function(function_call, verbose)
if (
not result.parts
or not result.parts[0].function_response
or not result.parts[0].function_response.response
):
raise RuntimeError(f"Empty function response for {function_call.name}")
if verbose:
print(f"-> {result.parts[0].function_response.response}")
function_responses.append(result.parts[0])
if not function_responses:
raise RuntimeError("No function responses generated, exiting.")
messages.append(types.Content(role="user", parts=function_responses))
return FalseWhat's happening:
- We send the whole conversation plus the tool schemas (
available_functions) and the system prompt to Gemini. - Gemini's response always becomes part of the conversation, whether it's text or a tool call — so the model sees its own reasoning later.
- If there are no function calls, the model has produced a normal answer. Print it and return
True. - If there are function calls, we run them locally with
call_function(...), then append all the results back intomessagesunder the"user"role. The next loop iteration shows those results to the model.
Suppose you run:
uv run main.py "what does main.py in the calculator do?"The loop typically goes:
- Iteration 1 — Gemini receives the prompt. It doesn't know what's in
main.py, so it emits a tool call:get_file_content(file_path="main.py"). We run it, append the file contents to the conversation. Loop continues. - Iteration 2 — Gemini sees the file contents and now has enough information. It emits a normal text response describing what
main.pydoes. We print it and exit.
For a debugging task ("find and fix the bug"), the loop typically grows to 5–10 iterations: list files → read several files → run the tests → write a fixed file → re-run the tests → final summary.
The --verbose flag is great for actually seeing this loop in motion — try it.
The agent is only as powerful as the tools you give it. This project ships with four, each defined in functions/ as a pair of (Python function, Gemini schema).
| Tool | What it does | Required args |
|---|---|---|
get_files_info |
Lists files in a directory; reports size and is_dir. |
(none — defaults to .) |
get_file_content |
Reads up to MAX_CHARS (10 000) characters of a text file. |
file_path |
run_python_file |
Runs a .py file via subprocess, captures stdout/stderr. |
file_path (+ optional args) |
write_file |
Writes/overwrites a text file; creates parent dirs as needed. | file_path, content |
They are registered together in call_function.py:
available_functions = types.Tool(
function_declarations=[
schema_get_files_info,
schema_get_file_content,
schema_run_python_file,
schema_write_file,
]
)
function_map = {
"get_files_info": get_files_info,
"get_file_content": get_file_content,
"run_python_file": run_python_file,
"write_file": write_file,
}available_functions is what we hand to Gemini so it knows what tools exist and how to call them. function_map is what we use locally to actually invoke the right Python function when Gemini emits a call.
Gemini doesn't read your Python source. It only knows about your tools through the schema you declare. For example:
schema_get_files_info = types.FunctionDeclaration(
name="get_files_info",
description="Lists files in a specified directory relative to the working directory, providing file size and directory status",
parameters=types.Schema(
type=types.Type.OBJECT,
properties={
"directory": types.Schema(
type=types.Type.STRING,
description="Directory path to list files from, relative to the working directory (default is the working directory itself)",
),
},
),
)The description fields are how the model decides when to use a tool, and the parameters block tells it the exact shape of the arguments to pass. Clear, accurate descriptions are the single biggest factor in agent quality.
The agent is given file-system access — that's the whole point — but it should never be able to read /etc/passwd or write outside the project. Every tool defends against this the same way:
def get_file_content(working_directory, file_path):
try:
abs_working_dir = os.path.abspath(working_directory)
abs_file_path = os.path.normpath(os.path.join(abs_working_dir, file_path))
if os.path.commonpath([abs_working_dir, abs_file_path]) != abs_working_dir:
return f'Error: Cannot read "{file_path}" as it is outside the permitted working directory'The pattern is:
- Resolve the working directory to an absolute path.
- Join the user-supplied (LLM-supplied!)
file_pathonto it and normalize (os.path.normpathcollapses.., etc.). - Use
os.path.commonpathto verify the result still lives inside the working directory.
Crucially, the LLM does not get to choose the working directory — it's injected by call_function.py from config.WORKING_DIR:
args = dict(function_call.args) if function_call.args else {}
args["working_directory"] = WORKING_DIR
result = function_map[function_name](**args)So even if the model tries to pass working_directory="/etc", that argument gets overwritten before the function is called. The system prompt also tells the model not to bother:
All file paths you provide in function calls must be relative to the
working directory. Do not include the working directory itself in the
paths; it is injected automatically for security reasons.
⚠️ Caveats. This sandbox stops path traversal. It does not stop a malicious script being written and then executed viarun_python_file— that script will run with the same permissions as your user. Don't point this agent at a directory you don't trust, and don't expose it to untrusted users.
boots-agent/
├── main.py # Entry point + agent loop
├── call_function.py # Tool registry & dispatcher
├── config.py # MAX_CHARS, WORKING_DIR
├── prompts.py # System prompt
│
├── functions/ # The tools
│ ├── get_files_info.py
│ ├── get_file_content.py
│ ├── run_python_file.py
│ └── write_file.py
│
├── calculator/ # Demo workspace the agent operates on
│ ├── main.py # CLI entry point: `python main.py "3 + 5"`
│ ├── tests.py # unittest suite
│ ├── pkg/
│ │ ├── calculator.py # Infix-expression evaluator (shunting-yard-ish)
│ │ └── render.py # JSON output formatter
│ └── lorem.txt # Filler file used by tests
│
├── test_get_files_info.py # Manual smoke tests for the tools
├── test_get_file_content.py
├── test_run_python_file.py
├── test_write_file.py
│
├── .env # Your GEMINI_API_KEY (git-ignored)
├── pyproject.toml # Project metadata + dependencies
└── uv.lock # Pinned dependency versions
Everything tweakable lives in config.py:
MAX_CHARS = 10000
WORKING_DIR = "./calculator"MAX_CHARS— maximum number of charactersget_file_contentwill return. Larger files are truncated with a notice appended, so the model knows there's more it didn't see. This keeps the context window (and your bill) under control.WORKING_DIR— the sandbox the agent is allowed to touch. Point this at any directory you want the agent to work on.
The Gemini model is hard-coded in main.py as gemini-2.5-flash. Swap it for gemini-2.5-pro for higher-quality (and pricier) reasoning.
The bundled calculator has its own unit-test suite, runnable directly:
cd calculator
python tests.pyThe smoke tests at the repo root exercise the tool layer (they're not unittest-style — they just print results):
python test_get_files_info.py
python test_get_file_content.py
python test_run_python_file.py
python test_write_file.pyEach test deliberately probes both happy paths and security boundaries (e.g. reading /bin/cat, writing to /tmp/temp.txt) so you can see the sandbox in action.
Want to give the agent a new capability? It's three steps:
- Write the function. Create a new file in
functions/, e.g.functions/delete_file.py. The function must takeworking_directoryas its first argument and apply the samecommonpathsandbox check you'll find in the existing tools. - Declare its schema. In the same file, export a
types.FunctionDeclaration(see any existing tool for a template). Be specific in thedescription— that's how the model learns when to use it. - Register it. Add the schema to
available_functionsand the function tofunction_mapincall_function.py.
That's it — the next time you run the agent, Gemini will see the new tool listed alongside the others.
Other ideas:
- Swap
gemini-2.5-flashfor an OpenAI or Anthropic model — the agent loop pattern is the same; only the SDK calls change. - Add a
--max-iterationsCLI flag instead of the hard-coded20. - Persist
messagesto disk so you can resume a session. - Stream responses instead of waiting for the whole reply each turn.
| Symptom | Likely cause / fix |
|---|---|
RuntimeError: GEMINI_API_KEY environment variable not set |
.env is missing or the key name is wrong. Make sure the file contains GEMINI_API_KEY=.... |
403 / permission denied from Gemini |
Your API key is invalid or your project doesn't have Gemini API access enabled in Google AI Studio. |
Maximum iterations (20) reached without a final response. |
The model is stuck in a tool-calling loop. Try a more specific prompt, or bump the loop limit in main.py. |
Error: "..." is not a directory / ... outside the permitted working directory |
Expected: the sandbox is blocking a path. Check WORKING_DIR in config.py. |
Output is truncated with [...File "..." truncated at 10000 characters] |
Working as designed. Raise MAX_CHARS in config.py if you really need it. |
- LLM (Large Language Model) — the underlying neural network, e.g. Gemini 2.5 Flash, that produces text from text.
- Agent — an LLM wrapped in a loop that lets it take actions (call tools) and react to their results, rather than just answering once.
- Tool / Function calling — the LLM API feature that lets the model emit a structured "I want to call
foo(x=1)" message instead of plain text. Your program runsfooand feeds the result back. - Schema — a JSON-like description of a tool's name, purpose, and parameters. The model uses it to decide when and how to call the tool.
- System prompt — instructions that sit "above" the conversation and shape how the model behaves on every turn. See
prompts.py. - Conversation history (
messages) — the running list of everything the user, the model, and the tools have said, in order. Sent in full on every API call so the model has full context. - Sandbox / working directory — the single directory the agent is allowed to read, write, or execute inside. Any path resolving outside it is rejected.