Inference Options

MemantoClaw supports multiple inference providers. During onboarding, the memantoclaw onboard wizard presents a list of providers to choose from. Your selection determines where the agent’s inference traffic is routed.

How Inference Routing Works

The agent inside the sandbox talks to inference.local. It never connects to a provider directly. OpenShell intercepts inference traffic on the host and forwards it to the provider you selected. Provider credentials stay entirely on the host.
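From inside the sandbox, the gateway simply looks like an HTTP endpoint. As a rough sketch only (the path and payload shape here are assumptions based on common OpenAI-style APIs, not the documented wire format, which depends on the provider you selected):

```shell
# Hypothetical request from inside the sandbox. The agent only ever sees
# inference.local; OpenShell on the host forwards the call to the selected
# provider and attaches the credentials, which never enter the sandbox.
curl http://inference.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.4", "messages": [{"role": "user", "content": "ping"}]}'
```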

Provider Options

Moorcheh Routed Inference: The native MemantoClaw experience. Routes inference directly using your MOORCHEH_API_KEY.
NVIDIA Endpoints: Routes to models hosted on build.nvidia.com (e.g., Nemotron 3 Super).
OpenAI: Routes to the OpenAI API.
Anthropic: Routes to the Anthropic Messages API.
Google Gemini: Routes to Google's OpenAI-compatible endpoint.
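For Moorcheh routed inference, the wizard needs your API key available in the environment. A minimal setup might look like the following (the variable name comes from the table above; the exact onboarding prompts may differ):

```shell
# Export the Moorcheh key on the host before launching the wizard.
# Credentials stay on the host; the sandbox never sees them.
export MOORCHEH_API_KEY="your-key-here"
memantoclaw onboard
```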

Switching Inference Models at Runtime

You can change the active inference model while the sandbox is running. No restart is required. Switching happens through the OpenShell inference route.
# Example for NVIDIA Endpoints
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b

# Example for OpenAI
openshell inference set --provider openai-api --model gpt-5.4

# Example for Anthropic
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6

Cross-Provider Switching

Switching to a different provider family requires updating both the gateway route and the sandbox config.
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6 --no-verify
export MEMANTOCLAW_MODEL_OVERRIDE="anthropic/claude-sonnet-4-6"
export MEMANTOCLAW_INFERENCE_API_OVERRIDE="anthropic-messages"
memantoclaw onboard --resume --recreate-sandbox

Using a Local Inference Server

MemantoClaw can route inference to a model server running on your machine.

Ollama

Ollama is the default local option. The wizard detects it automatically. On Linux with Docker, the sandbox reaches Ollama through http://host.openshell.internal:11434. Make sure Ollama listens on 0.0.0.0:11434.
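By default Ollama binds to 127.0.0.1, which a Docker-backed sandbox cannot reach. One way to expose it, using Ollama's standard OLLAMA_HOST setting (adjust to your setup and firewall policy):

```shell
# Bind Ollama to all interfaces so the sandbox can reach it via
# host.openshell.internal:11434.
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# From the host, verify the server is listening:
curl http://localhost:11434/api/tags
```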

OpenAI/Anthropic Compatible Servers

This option works with vLLM, TensorRT-LLM, llama.cpp, LocalAI, and similar servers. Select “Other OpenAI-compatible endpoint” and enter your base URL (e.g., http://localhost:8000/v1). The wizard will probe /v1/responses and fall back to /v1/chat/completions if streaming events are incompatible.
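As one concrete local setup, vLLM's OpenAI-compatible server exposes the base URL the wizard asks for (the model name below is only an example):

```shell
# Start vLLM's OpenAI-compatible server on the default port 8000.
vllm serve Qwen/Qwen2.5-7B-Instruct

# The base URL to enter in the wizard is then:
#   http://localhost:8000/v1
# Sanity-check the endpoint before running the wizard:
curl http://localhost:8000/v1/models
```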

Experimental Local vLLM & NVIDIA NIM

Set MEMANTOCLAW_EXPERIMENTAL=1 to enable vLLM auto-detection on localhost:8000 or NIM container management on hosts with NIM-capable NVIDIA GPUs.
MEMANTOCLAW_EXPERIMENTAL=1 memantoclaw onboard

Timeout Configuration

Local inference requests use a default timeout of 180 seconds. Increase it if needed:
export MEMANTOCLAW_LOCAL_INFERENCE_TIMEOUT=300
memantoclaw onboard


For complete, unabridged technical details on this topic, refer to the official NVIDIA NemoClaw Documentation. Portions of this guide are summarized and adapted from NVIDIA Corporation (Copyright © 2026), licensed under the Apache License, Version 2.0.