How synthesize_capability works

Nine steps from name to hot-reload. One missing .py broke everything for an hour.

safe_file_executor had no .py. It had a JSON spec, a description that matched "file write" queries better than the real fs_write capability, and nothing else. Cedar and Cipher routed all file write operations through it for approximately one hour. Every write returned nothing. Neither agent could see why.

The JSON was written directly, bypassing synthesize_capability entirely. That's what made it possible.


What it is

synthesize_capability is the single entry point for runtime capability expansion. It takes a name, a description, and optionally a Python implementation, validates it, writes it to the dynamic tools directory, and hot-reloads it into the running execution engine. All in one call. Agents that want new tools must go through it.

Unlike built-in capabilities, synthesized tools live as .py/.json pairs in /agentOS/tools/dynamic/, are hot-reloaded into the running engine without a restart, and can be archived or blacklisted at runtime.


The nine steps

1. Name sanitization

name = re.sub(r'[^a-zA-Z0-9_]', '_', name)[:60].lower()

Any character that is not alphanumeric or underscore is replaced with _. Truncated to 60 characters, lowercased. Applied before any other check. "My Tool (v2)" becomes "my_tool__v2_".
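A minimal sketch of the sanitization step as described above (the function name is mine):

```python
import re

def sanitize_name(name: str) -> str:
    # Replace every character outside [a-zA-Z0-9_] with an underscore,
    # truncate to 60 characters, then lowercase -- in that order.
    return re.sub(r'[^a-zA-Z0-9_]', '_', name)[:60].lower()
```

Calling `sanitize_name("My Tool (v2)")` yields `"my_tool__v2_"`: the space and both parentheses each become an underscore.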

2. Quality gate

Only runs when implementation is provided. Six rejection patterns are checked first as literal string matches:

Pattern                      Reason
...                          Ellipsis stub
pass\n pass                  Double-pass body
# TODO                       Placeholder comment
# placeholder                Explicit placeholder
{"ok": true                  JSON stub masquerading as Python
raise NotImplementedError    Unimplemented skeleton

If any pattern matches: immediate {"ok": false}. The tool is not written.

After string checks, the implementation is passed to ast.parse(). A SyntaxError returns an error immediately. The AST is then walked for three structural checks: class method rejection (first argument is self), bare pass rejection, and docstring-only rejection (no executable logic beyond a docstring).
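The gate can be sketched roughly as follows. The exact pattern whitespace, error wording, and function name are assumptions; the ordering (literal strings, then parse, then AST walk) follows the text:

```python
import ast

# Literal rejection patterns per the table above; exact whitespace is assumed.
REJECT_PATTERNS = [
    ("...", "ellipsis stub"),
    ("pass\n    pass", "double-pass body"),
    ("# TODO", "placeholder comment"),
    ("# placeholder", "explicit placeholder"),
    ('{"ok": true', "JSON stub masquerading as Python"),
    ("raise NotImplementedError", "unimplemented skeleton"),
]

def quality_gate(implementation: str) -> dict:
    # 1. Cheap literal string checks before any parsing.
    for pattern, reason in REJECT_PATTERNS:
        if pattern in implementation:
            return {"ok": False, "error": f"rejected: {reason}"}
    # 2. Real parse; a SyntaxError is returned as data, not raised.
    try:
        tree = ast.parse(implementation)
    except SyntaxError as exc:
        return {"ok": False, "error": f"SyntaxError: {exc}"}
    # 3. Structural checks on every function definition.
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        if node.args.args and node.args.args[0].arg == "self":
            return {"ok": False, "error": "rejected: class method"}
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            return {"ok": False, "error": "rejected: bare pass"}
        if (len(node.body) == 1 and isinstance(node.body[0], ast.Expr)
                and isinstance(node.body[0].value, ast.Constant)
                and isinstance(node.body[0].value.value, str)):
            return {"ok": False, "error": "rejected: docstring only"}
    return {"ok": True}
```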

3. Auto-stub generation

If implementation is empty or not provided, the function generates a stub:

def {name}(**kwargs):
    return {"ok": True, "capability": name, "description": description, "kwargs": str(kwargs)[:200]}

This passes the quality gate because it has real executable logic and a non-None return. The auto-stub is a valid placeholder the engine can call without failing. It will not trigger ghost detection.

4. Implementation wrapping

If the implementation string does not start with def , it is wrapped in a function signature with **kwargs. If it already starts with def , it is used as-is.
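A sketch of the wrapping logic, assuming four-space indentation for the generated body (the function name is mine):

```python
def wrap_implementation(name: str, implementation: str) -> str:
    # Code that already defines a function is used verbatim.
    if implementation.startswith("def "):
        return implementation
    # Otherwise indent the bare body under a **kwargs signature.
    body = "\n".join("    " + line for line in implementation.splitlines())
    return f"def {name}(**kwargs):\n{body}\n"
```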

5. Dedup guard

if py_path.exists() and (time.time() - py_path.stat().st_mtime) < 90:
    return {"ok": False, "status": "already_deployed", ...}

If a .py with the same sanitized name was written less than 90 seconds ago, the function returns "already_deployed" without writing anything. Prevents agents from repeatedly redeploying the same tool in a failure loop rather than calling what has already been deployed.

6. File write

Two files are written to /agentOS/tools/dynamic/:

{name}.py: the Python implementation.

{name}.json: the spec file:

{
  "name": "{name}",
  "description": "{description}",
  "inputSchema": {"type": "object", "properties": {}, "additionalProperties": true},
  "activated_at": "{iso_timestamp}",
  "proposed_by": "agent"
}

The JSON is what the execution engine reads to populate its capability list and route tool calls. Both files must exist for the capability to work. A JSON without a corresponding .py is the root cause of ghost tools.

7. Hot-reload

POST /tools/reload

After writing, synthesize_capability calls the reload endpoint. The execution engine scans /agentOS/tools/dynamic/ and registers new .py files immediately. No container restart required. The capability is callable starting from the next step in the same goal.

8. Auto-test: exec check

python3 -c "exec(open(path).read())"

The freshly written .py is executed in a subprocess with an 8-second timeout. Catches syntax errors that slipped through the AST parse and runtime import failures for modules unavailable in the container image. If this fails, the spec JSON is removed and the function returns an error.

9. Auto-test: null return check

The module is imported and the function is called with no arguments. Timeout: 12 seconds. If result is None, both files are removed and the function returns an error. This catches implementations that define a function but omit the return statement. They would otherwise pass every static check and fail silently at runtime.


Return values

Condition                 Return
Success                   {"ok": true, "capability": name, "path": "...", "status": "deployed"}
Name/code missing         {"ok": false, "error": "name and code are required"}
Quality gate rejection    {"ok": false, "error": "rejected: {reason}"}
Syntax error              {"ok": false, "error": "SyntaxError: {detail}"}
Dedup guard               {"ok": false, "status": "already_deployed", "message": "..."}
Auto-test failure         {"ok": false, "error": "auto-test failed: {detail}"}
Null return               {"ok": false, "error": "null stub detected — function returned None"}

Ghost tools

A ghost tool is a capability that appears in the execution engine's capability graph but fails silently at runtime. Two forms.

Missing implementation. A .json spec exists with no corresponding .py. The engine registers the capability from the JSON. When any agent routes a call to it, there is no implementation to execute. The call produces no result. No exception propagates to the calling agent.

Null stub. A .py exists and loads, but the function returns None. Passes import, passes syntax checks, appears healthy in the capability list. Only fails when called. The auto-test at step 9 is designed to catch this before deployment, but a null stub can reach production if it was deployed before the null-return check was added.

Once a ghost is registered, every agent that uses the capability graph is affected. If the ghost's description matches a query better than the real implementation, the router prefers it. All calls routed through the ghost produce no output, which looks like success to the calling agent. No error raised, no observable effect. This can persist undetected.


Ghost detection

Two layers in agents/autonomy_loop.py.

Pre-execution check. Before executing any capability, if the cap_id is in the spec registry but not in engine._implementations (the dict of loaded .py modules), the capability is immediately blacklisted for the current cycle without attempting execution.

Null return detection. After execution, a None return is treated as a failure, identical in weight to a raised exception. Null returns increment the failure counter for that cap_id.

After 3 cross-cycle failures, the cap_id is added to /agentOS/memory/broken_tools.json. This file survives daemon restarts. Its contents are injected into the existence prompt on every subsequent cycle:

KNOWN BROKEN TOOLS:
  - safe_file_executor  (3 failures, ghost: no .py found)

The agent sees the broken tool list alongside the capability list. The capability may still appear in both simultaneously until the engine removes it from _implementations.


The safe_file_executor incident

An agent wrote safe_file_executor.json with description "secure file write operations" and no .py. Written directly to /agentOS/tools/dynamic/, bypassing synthesize_capability entirely. No quality gate, no auto-test.

The description matched "file write" queries better than the real fs_write capability, which has a more technical description. The router scored it higher for all file-write routing decisions. Cedar and Cipher routed all file write operations through safe_file_executor for approximately one hour. Every write returned nothing. Every write silently failed.

The incident was resolved by archiving the JSON from /agentOS/tools/dynamic/. fs_write immediately became the router's top match for file-write queries.

The 40/40 goal failure rate that triggered repeated_failure stressors in all three agents during this session is traceable to this ghost. Goals that required writing state, plans, or results all produced no output. The agents could not determine why. They had no visibility into the routing decision or the missing .py.


Currently active (2026-05-03)

Nine synthesized tools working in the current run:

Name                               Status
context_synthesizer                Working
ground_physical_structure          Working
raw_fd_reader                      Working
raw_io_verifier                    Working
raw_io_verify                      Working
synthesize_mock_state              Working
synthetics_verify_registry         Working
unicode_corruption_verifier        Working
validate_capability_registry_gaps  Working
safe_file_executor                 Archived. Ghost stub, no .py. Intercepted writes for ~1 hour.

Edge cases

The implementation parameter gap

The manifest entry for synthesize_capability historically showed (name, description, implementation) as the schema. The function also accepted code as an alias, but agents reading the manifest generated calls like:

synthesize_capability(name="inspect_registry_gaps", description="...")

No implementation, no code. The function returned {"ok": false, "error": "name and code are required"} every time. This pattern persisted for weeks. The first correct call with real Python code in the implementation field was logged as a notable event.

Hardware synthesis attempts

Multiple agents attempted to synthesize hardware monitoring capabilities: thermal sensors, PMIC voltage, hardware interrupt inspection. Implementations referencing /sys/class/thermal/ and hardware-specific /dev nodes passed syntax checks but failed the exec auto-test when the target paths did not exist in the container. Some that reached deployment returned None because the hardware paths resolved to nothing at call time. None are in the active capability list.

Dedup guard in practice

An agent attempting to redeploy validate_capability_registry_gaps within 90 seconds of its last deployment received "already_deployed" and was told to call the existing tool. The guard exists because synthesis failure loops (where an agent repeatedly redeploys the same tool rather than using what's already there) were a real pattern before it was added.


Setup

Windows one-click:

  1. Download the ZIP from releases
  2. Double-click install.bat

Handles Docker, Ollama, model downloads (~7GB), and opens the monitor. stop.bat shuts everything down and clears VRAM.

Mac/Linux:

ollama pull qwen3.5:9b && ollama pull nomic-embed-text
git clone https://github.com/ninjahawk/hollow-agentOS
cd hollow-agentOS
cp config.example.json config.json
docker compose up -d
python thoughts.py

GPU strongly recommended. Planning calls drop from ~40s to ~6s with NVIDIA hardware. Works on CPU.

Repo: github.com/ninjahawk/hollow-agentOS