Inference Options
MemantoClaw supports multiple inference providers. During onboarding, the memantoclaw onboard wizard presents a list of providers to choose from. Your selection determines where the agent's inference traffic is routed.
How Inference Routing Works
The agent inside the sandbox talks to inference.local. It never connects to a provider directly. OpenShell intercepts inference traffic on the host and forwards it to the provider you selected. Provider credentials stay entirely on the host.
Provider Options
| Provider | Description |
|---|---|
| Moorcheh Routed Inference | The native MemantoClaw experience. Routes inference directly using your MOORCHEH_API_KEY. |
| NVIDIA Endpoints | Routes to models hosted on build.nvidia.com (e.g., Nemotron 3 Super). |
| OpenAI | Routes to the OpenAI API. |
| Anthropic | Routes to the Anthropic Messages API. |
| Google Gemini | Routes to Google’s OpenAI-compatible endpoint. |
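The table above can be read as a provider-to-endpoint map maintained on the host. The sketch below is illustrative only, assuming a simple dict lookup (it is not MemantoClaw's actual configuration format); the hosted-provider base URLs are those providers' publicly documented API endpoints, and the Moorcheh endpoint is omitted because it is not public.

```python
# Illustrative provider-to-endpoint map -- NOT MemantoClaw's real config
# format. Base URLs are the hosted providers' documented API endpoints.
PROVIDER_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "anthropic": "https://api.anthropic.com/v1",
    "nvidia": "https://integrate.api.nvidia.com/v1",  # build.nvidia.com models
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
}

def resolve_upstream(provider: str) -> str:
    """Return the upstream base URL for a selected provider."""
    if provider not in PROVIDER_BASE_URLS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDER_BASE_URLS[provider]
```

The credentials themselves (e.g., MOORCHEH_API_KEY) are attached host-side when the request is forwarded, which is why they never need to exist inside the sandbox.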
Switching Inference Models at Runtime
You can change the active inference model while the sandbox is running. No restart is required. Switching happens through the OpenShell inference route.
Cross-Provider Switching
Switching to a different provider family requires updating both the gateway route and the sandbox config.
Using a Local Inference Server
MemantoClaw can route inference to a model server running on your machine.
Ollama
Ollama is the default local option. The wizard detects it automatically. On Linux with Docker, the sandbox reaches Ollama through http://host.openshell.internal:11434. Make sure Ollama listens on 0.0.0.0:11434.
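To make Ollama reachable from the Docker sandbox, bind it to all interfaces. OLLAMA_HOST is Ollama's standard environment variable; the invocation below is one common way to set it (a systemd drop-in works too):

```shell
# Bind Ollama to all interfaces so the sandbox can reach it through
# host.openshell.internal:11434 (the default bind is 127.0.0.1, which
# the container cannot reach):
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```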
OpenAI/Anthropic Compatible Servers
Works with vLLM, TensorRT-LLM, llama.cpp, LocalAI, etc. Select “Other OpenAI-compatible endpoint” and enter your base URL (e.g., http://localhost:8000/v1). The wizard will probe /v1/responses and fall back to /v1/chat/completions if streaming events are incompatible.
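The fallback the wizard applies can be sketched as follows. This is an illustration of the probe-then-fall-back behavior, not the wizard's actual code; the function name and the injectable `probe` callable are assumptions made so the logic is self-contained.

```python
# Sketch of the wizard's endpoint selection: prefer the Responses API,
# fall back to Chat Completions when streaming events are incompatible.
def choose_endpoint(base_url: str, probe) -> str:
    """Return the full endpoint URL to use under base_url.

    `probe(url)` should return True when the server at `url` streams
    compatible events, False otherwise.
    """
    responses = f"{base_url}/responses"
    if probe(responses):
        return responses
    return f"{base_url}/chat/completions"
```

For example, with a server that rejects the Responses probe, `choose_endpoint("http://localhost:8000/v1", probe)` resolves to the /v1/chat/completions endpoint.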
Experimental Local vLLM & NVIDIA NIM
Set MEMANTOCLAW_EXPERIMENTAL=1 to enable vLLM auto-detection on localhost:8000 or NIM container management on hosts with NIM-capable NVIDIA GPUs.
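For example, to opt in for a single run (pairing the variable with the onboarding command shown earlier; set it persistently in your shell profile if you want it to stick):

```shell
# Enable experimental local-inference detection for this invocation:
MEMANTOCLAW_EXPERIMENTAL=1 memantoclaw onboard
```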
Timeout Configuration
Local inference requests use a default timeout of 180 seconds. Increase it if local models need longer to respond.

For complete, unabridged technical details on this topic, refer to the official NVIDIA NemoClaw Documentation. Portions of this guide are summarized and adapted from NVIDIA Corporation (Copyright © 2026), licensed under the Apache License, Version 2.0.