On-Prem Configuration

The on-prem wizard wires sensible defaults, but every value is editable. This page is the reference for where each setting lives, what controls it, and how to change it after install.

Configuration Surfaces

On-prem reads from three places, listed here from highest to lowest precedence. When the same setting appears in more than one place, the source higher in the list wins:

Environment variables (or a project .env) — highest precedence.
~/.memanto/on-prem/state.json — set by the on-prem onboarding wizard.
Built-in defaults in Memanto’s Settings model.
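
The lookup order above can be illustrated with a minimal sketch. It is not Memanto's actual implementation: `resolve_setting`, `load_state`, `DEFAULTS`, and `STATE_KEYS` are hypothetical names, and only the `url` to `MOORCHEH_ONPREM_URL` mapping is taken from this page.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the built-in defaults in Memanto's Settings model.
DEFAULTS = {"MOORCHEH_ONPREM_URL": "http://localhost:8080"}

# Assumed mapping from env-var names to state.json keys.
STATE_KEYS = {"MOORCHEH_ONPREM_URL": "url"}


def load_state(path):
    """Return the parsed state file, or {} if it is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}


def resolve_setting(name, env, state):
    """Pick a value: environment first, then state.json, then the default."""
    if env.get(name):
        return env[name]
    key = STATE_KEYS.get(name)
    if key and state.get(key):
        return state[key]
    return DEFAULTS.get(name)
```

Calling `resolve_setting("MOORCHEH_ONPREM_URL", os.environ, load_state(Path.home() / ".memanto/on-prem/state.json"))` would return the environment value if one is set, otherwise the `url` from the state file, otherwise the default.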

The shared ~/.memanto/config.yaml is owned by the cloud backend. The on-prem wizard writes exactly one key there, backend: on-prem, so that subsequent CLI runs know which backend to dispatch to. It leaves everything else in the file untouched.

Environment Variables

These are the on-prem-relevant variables in Settings. Defaults shown.

Variable	Default	Purpose
`MEMANTO_BACKEND`	`cloud`	Set to `on-prem` to route all Moorcheh calls to the local server. The wizard sets this for you; you can also set it inline (`MEMANTO_BACKEND=on-prem memanto status`).
`MOORCHEH_ONPREM_URL`	`http://localhost:8080`	Base URL of the Moorcheh on-prem server. Override if you’ve remapped the port or are running Memanto in a different container.
`MOORCHEH_ONPREM_EMBEDDING_PROVIDER`	(empty)	Surfaced in `memanto status` so you can see at a glance what provider is in use. Auto-populated from `state.json`.
`MOORCHEH_ONPREM_TIMEOUT`	`300`	HTTP read timeout in seconds for the on-prem Moorcheh client. Default is high because first-call LLM cold-starts on Ollama can take 1–2 minutes (model load).
`MOORCHEH_API_KEY`	(empty)	Not required on-prem. The on-prem stack does not consult this.
`HOST`	`0.0.0.0`	Bind host for Memanto’s own REST server (`memanto serve`).
`PORT`	`8000`	Bind port for Memanto’s own REST server.
`ANSWER_MODEL`	`anthropic.claude-sonnet-4-6`	Cloud default. On-prem, the active LLM is sourced from `state.json` (`llm_model`); this env var is ignored unless `state.json` is empty.
`ANSWER_TEMPERATURE`	`0.7`	LLM temperature for `answer.generate`. Honored on both backends.
`ANSWER_LIMIT`	`15`	Number of context memories passed to the LLM for `answer`.
`ANSWER_THRESHOLD`	`0.01`	Confidence threshold for memory relevance during `answer`.
`RECALL_LIMIT`	`10`	Default Top-N results returned by `recall`.
`SUMMARY_MODEL`	`anthropic.claude-sonnet-4-6`	Same backend-awareness rule as `ANSWER_MODEL`.
`ALLOWED_ORIGINS`	`*`	CORS origins for Memanto’s REST API. Restrict in production.
`LOG_LEVEL`	`INFO`	`DEBUG`, `INFO`, `WARNING`, `ERROR`.

Setting Env Vars

For a single command:

MEMANTO_BACKEND=on-prem memanto status

In a project .env (loaded automatically):

MEMANTO_BACKEND=on-prem
MOORCHEH_ONPREM_URL=http://moorcheh.internal:8080
MOORCHEH_ONPREM_TIMEOUT=600
LOG_LEVEL=DEBUG

Globally for your shell (Linux/macOS):

export MEMANTO_BACKEND=on-prem

On Windows PowerShell:

$env:MEMANTO_BACKEND = "on-prem"

On-Prem State File

~/.memanto/on-prem/state.json is the source of truth for on-prem configuration. It is written by the wizard and read by both the CLI and the embedded server. Example contents:

{
  "installed_at": "2026-06-09T14:32:11Z",
  "embedding_provider": "ollama",
  "embedding_model": "nomic-embed-text",
  "llm_provider": "ollama",
  "llm_model": "qwen2.5",
  "url": "http://localhost:8080"
}

Key	Used for
`url`	Exported as `MOORCHEH_ONPREM_URL` at process startup.
`embedding_provider`, `embedding_model`	Re-used when re-onboarding on-prem (lets you switch cloud↔on-prem without re-picking a provider).
`llm_provider`, `llm_model`	Sent as `ai_model` to the on-prem server on every `answer.generate` call. If empty/missing, the on-prem server falls back to whatever LLM is configured in `~/.moorcheh/config.json`.
`installed_at`	Metadata, useful for support diagnostics.

You can edit this file by hand. After saving, restart memanto serve so the server reloads it. CLI commands pick up the change the next time you run one.
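
If you would rather script the edit than open the file in an editor, the sketch below shows one way to do it. `update_state` is a hypothetical helper, not part of Memanto. It writes to a temporary file and renames it into place, so a crash mid-write cannot leave a half-written state.json behind.

```python
import json
import os
import tempfile
from pathlib import Path


def update_state(path, **changes):
    """Merge `changes` into the JSON file at `path`, writing atomically."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    state.update(changes)
    # Write to a temp file in the same directory, then rename over the original.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
            fh.write("\n")
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if anything went wrong
        raise
    return state
```

For example, `update_state(Path.home() / ".memanto/on-prem/state.json", llm_model="llama3.1")` would change only `llm_model` and keep the other keys. The model name here is purely illustrative.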

Moorcheh Server Config

~/.moorcheh/config.json is owned by the moorcheh-client package. The Memanto wizard writes the full embedding + LLM block there before calling moorcheh up, so the on-prem server has both ready on first boot. Schema:

{
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "api_key": null,
    "base_url": "http://ollama:11434"
  },
  "llm": {
    "provider": "ollama",
    "model": "qwen2.5",
    "api_key": null,
    "base_url": "http://ollama:11434"
  }
}

To switch the on-prem server to a different provider after install:

Stop the stack: moorcheh down.
Edit ~/.moorcheh/config.json (or use moorcheh configure interactively).
Restart: moorcheh up.
Update ~/.memanto/on-prem/state.json to match (embedding_provider, llm_model, etc.) — Memanto reads its model id from there.
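
The last step is easy to forget, so a quick consistency check can help. The following is a minimal sketch: `find_mismatches` and `PAIRS` are hypothetical names, and the key pairing is inferred from the two example files on this page rather than from a documented contract.

```python
# Assumed pairing: state.json key -> (config.json section, field).
PAIRS = [
    ("embedding_provider", ("embedding", "provider")),
    ("embedding_model", ("embedding", "model")),
    ("llm_provider", ("llm", "provider")),
    ("llm_model", ("llm", "model")),
]


def find_mismatches(state, config):
    """Return one message per key where state.json and config.json disagree."""
    mismatches = []
    for state_key, (section, field) in PAIRS:
        expected = (config.get(section) or {}).get(field)
        actual = state.get(state_key)
        if actual != expected:
            mismatches.append(
                f"{state_key}: state.json has {actual!r}, config.json has {expected!r}"
            )
    return mismatches
```

Load both files with `json.load` and pass the resulting dicts in. An empty list means the two files agree on provider and model.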

Provider Reference

Ollama (Local, Recommended for Air-Gap)

Embedding model: nomic-embed-text (default; ~270 MB).
LLM model: qwen2.5 (default; ~4.7 GB). Any model you can fetch with ollama pull can be used instead.
API key: none.
Where it runs: sibling Docker container started by moorcheh up.

OpenAI

Embedding model: text-embedding-3-small (default; cheaper) or text-embedding-3-large.
LLM model: gpt-4o-mini (default), gpt-4o, etc.
API key: required; stored in ~/.moorcheh/config.json under embedding.api_key / llm.api_key.

Cohere

Embedding model: embed-english-v3.0 (default) or embed-multilingual-v3.0.
LLM model: command-r-plus-08-2024 (default).
API key: required.

Answer & Recall Tuning

Apart from the answer model, which is sourced differently on each backend, these knobs work identically on cloud and on-prem.

Setting	Env var	Default	What it does
Answer model	`ANSWER_MODEL` (cloud) / `state.json: llm_model` (on-prem)	—	Which LLM `answer` calls.
Answer temperature	`ANSWER_TEMPERATURE`	`0.7`	Higher = more creative, lower = more deterministic.
Answer context size	`ANSWER_LIMIT`	`15`	How many memories to pass as context. Lower for faster answers, higher for better grounding.
Answer threshold	`ANSWER_THRESHOLD`	`0.01`	Memories below this similarity are dropped.
Recall top-N	`RECALL_LIMIT`	`10`	Default page size for `recall`. Override per-call with `--limit`.
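
To make the interaction between the threshold and the context size concrete, here is a minimal sketch of the selection they describe. It is an illustration only: `select_context` and the `score` field are hypothetical, and the real filtering happens inside Memanto and the Moorcheh server.

```python
def select_context(memories, limit=15, threshold=0.01):
    """Keep memories scoring at or above `threshold`, best first, at most `limit`."""
    kept = [m for m in memories if m["score"] >= threshold]
    kept.sort(key=lambda m: m["score"], reverse=True)
    return kept[:limit]
```

Raising the threshold shrinks the candidate pool before the limit applies, so with a strict threshold the LLM may receive fewer than `ANSWER_LIMIT` memories.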

Timeouts

Ollama cold-starts can be slow on first call after moorcheh up. Memanto sets the on-prem client’s read timeout to 300 seconds by default so an initial answer.generate doesn’t fail with a ReadTimeout. Override:

export MOORCHEH_ONPREM_TIMEOUT=600

After the first call the model stays resident in Ollama’s RAM, and subsequent calls return within seconds.
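
If you wrap the on-prem server with your own client, you may want to honor the same variable. The sketch below reads it defensively. `onprem_timeout` is a hypothetical helper, and falling back to 300 seconds on an invalid value is an assumption, not documented Memanto behavior.

```python
def onprem_timeout(env, default=300.0):
    """Read MOORCHEH_ONPREM_TIMEOUT (seconds) from `env`, falling back to `default`."""
    raw = env.get("MOORCHEH_ONPREM_TIMEOUT", "").strip()
    if not raw:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # assumption: ignore values that are not numbers
    return value if value > 0 else default
```

Pass `os.environ` as `env`, then hand the result to your HTTP client as its read timeout.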

Disk Locations Recap

Path	Owner	Editable?
`~/.memanto/on-prem/state.json`	Memanto CLI	Yes — hand-edit then restart CLI/server.
`~/.memanto/.env`	Memanto CLI	Yes — but on-prem does not need a `MOORCHEH_API_KEY`.
`~/.moorcheh/config.json`	`moorcheh-client`	Yes via `moorcheh configure` or by hand.
`~/.moorcheh/uploads/`	`moorcheh-client`	Append-only; staging for `memanto upload` files.

Next Steps

Backend Switching — toggle between cloud and on-prem without losing state.
Self-Hosting Memanto Server — run memanto serve under Docker/Compose/systemd.
Kubernetes Deployment — manifests for a clustered on-prem deployment.
Security & Operations — production hardening checklist.
