Kubernetes Deployment

Memanto is a single-binary FastAPI app — there are no leader-election, sticky-session, or shared-volume concerns. The deployment manifests below are intentionally minimal; adapt the resources, security context, and ingress to fit your cluster. The manifests use the cloud backend by default. The “On-Prem Backend” section at the bottom shows how to add the Moorcheh server (and optionally Ollama) as sibling pods.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: memanto
  labels:
    app: memanto
spec:
  replicas: 3
  selector:
    matchLabels:
      app: memanto
  template:
    metadata:
      labels:
        app: memanto
    spec:
      containers:
        - name: memanto
          image: memanto:latest
          ports:
            - containerPort: 8000
          env:
            - name: MOORCHEH_API_KEY
              valueFrom:
                secretKeyRef:
                  name: memanto-secrets
                  key: moorcheh-api-key
            - name: LOG_LEVEL
              value: "INFO"
            - name: ALLOWED_ORIGINS
              value: "https://app.yourdomain.com"
          livenessProbe:
            httpGet:
              path: /ready          # lightweight, no Moorcheh dependency
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health         # gates traffic on Moorcheh connectivity
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1001
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
Probe note: /ready is the right liveness probe because it always returns 200 once the process is up — using /health would restart pods every time Moorcheh has a hiccup. /health is the right readiness probe because it gates traffic on actual Moorcheh connectivity.
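The same reasoning extends to slow starts. If Memanto takes longer to boot on your nodes than the liveness probe's initial delay allows (an assumption; the source gives no startup timings), a startup probe can hold off the other two probes until the process is up. A minimal sketch of a fragment to add next to livenessProbe in the container spec; the threshold values are illustrative, not measured:

```yaml
          # Liveness and readiness checks are suspended until this probe succeeds once.
          startupProbe:
            httpGet:
              path: /ready          # same lightweight endpoint as the liveness probe
              port: 8000
            periodSeconds: 5
            failureThreshold: 12    # 12 x 5s = up to 60s allowed for startup (illustrative)
```

With this in place, initialDelaySeconds on the liveness probe matters less, because the kubelet only starts liveness checks after the startup probe has passed.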

Service

apiVersion: v1
kind: Service
metadata:
  name: memanto
spec:
  selector:
    app: memanto
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: ClusterIP

Secret

apiVersion: v1
kind: Secret
metadata:
  name: memanto-secrets
type: Opaque
data:
  moorcheh-api-key: <base64-encoded-key>
Alternatively, create the Secret directly from a literal value instead of base64-encoding the key by hand:
kubectl create secret generic memanto-secrets \
  --from-literal=moorcheh-api-key="mk_your_api_key"
You can also source the key from external-secrets, Vault, or your cloud's secret manager. Don't bake API keys into images.
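As a rough illustration of the external-secrets route, the sketch below asks the External Secrets Operator to materialize the same memanto-secrets Secret from an external store. The store name (my-secret-store) and the remote key path are hypothetical, and the apiVersion depends on the operator release installed in your cluster (older releases serve external-secrets.io/v1beta1), so check both before using it:

```yaml
apiVersion: external-secrets.io/v1      # older operator releases use external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: memanto-secrets
spec:
  refreshInterval: 1h                   # how often to re-read the external store
  secretStoreRef:
    name: my-secret-store               # hypothetical; must match a store you have configured
    kind: ClusterSecretStore
  target:
    name: memanto-secrets               # the Kubernetes Secret the Deployment references
  data:
    - secretKey: moorcheh-api-key       # key name expected by the Deployment's secretKeyRef
      remoteRef:
        key: memanto/moorcheh-api-key   # hypothetical path in the external store
```

Because the resulting Secret has the same name and key as the manual one, the Memanto Deployment needs no changes.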

Ingress (TLS Termination)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: memanto
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "25m"   # for file uploads
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - memanto.yourdomain.com
      secretName: memanto-tls
  rules:
    - host: memanto.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: memanto
                port:
                  number: 80

Horizontal Pod Autoscaler

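Memanto is largely I/O-bound, so CPU utilization is only a rough proxy for load. The example below still scales on CPU because that metric works out of the box once metrics-server is installed and the containers declare CPU requests (as the Deployment above does):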
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memanto
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memanto
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
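Because CPU is a noisy signal for an I/O-bound service, it may help to damp scale-down so that brief lulls do not remove pods that are still holding open connections. A minimal sketch of a behavior block that could be added under spec in the HPA above; the window and policy values are illustrative starting points, not recommendations from the Memanto authors:

```yaml
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # act on the highest recommendation of the last 5 minutes
      policies:
        - type: Pods
          value: 1                      # remove at most one pod...
          periodSeconds: 60             # ...per minute
```

Once an HPA manages the Deployment, consider omitting replicas from the Deployment manifest so that re-applying it does not reset the replica count the autoscaler has chosen.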

On-Prem Backend on Kubernetes

To run Moorcheh inside the same cluster as Memanto, add a Moorcheh deployment and point MOORCHEH_ONPREM_URL at the in-cluster Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: moorcheh
spec:
  replicas: 1                       # Moorcheh on-prem is a single-instance service today
  selector:
    matchLabels:
      app: moorcheh
  template:
    metadata:
      labels:
        app: moorcheh
    spec:
      containers:
        - name: moorcheh
          image: moorcheh/moorcheh:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /root/.moorcheh
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2"
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: moorcheh-data
        - name: config
          configMap:
            name: moorcheh-config
---
apiVersion: v1
kind: Service
metadata:
  name: moorcheh
spec:
  selector:
    app: moorcheh
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: moorcheh-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
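One caveat with the manifests above: a Deployment defaults to a rolling update, which starts the new pod before stopping the old one. With a single replica and a ReadWriteOnce volume, the new pod can get stuck waiting for the volume if it is scheduled onto a different node, and two instances would briefly share the data directory if it lands on the same node. A hedged sketch of the change, showing only the fields that differ:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: moorcheh
spec:
  replicas: 1
  strategy:
    type: Recreate    # stop the old pod before creating the new one
  # selector and template are unchanged from the Moorcheh Deployment above
```

The trade-off is a short outage of Moorcheh on every rollout, during which Memanto's /health readiness probe should take its pods out of rotation.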
The ConfigMap below holds the embedding and LLM configuration (replace the values with your provider of choice):
apiVersion: v1
kind: ConfigMap
metadata:
  name: moorcheh-config
data:
  config.json: |
    {
      "embedding": {
        "provider": "openai",
        "model": "text-embedding-3-small",
        "api_key": "<filled-in-by-init-container-or-external-secret>"
      },
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "api_key": "<filled-in-by-init-container-or-external-secret>"
      }
    }
For real deployments, mount provider API keys via a Secret plus an init container, or via the external-secrets operator, instead of embedding them in the ConfigMap.

Then update the Memanto Deployment env block to point at the in-cluster Moorcheh:
env:
  - name: MEMANTO_BACKEND
    value: "on-prem"
  - name: MOORCHEH_ONPREM_URL
    value: "http://moorcheh:8080"
  - name: MOORCHEH_ONPREM_TIMEOUT
    value: "300"
Remove the MOORCHEH_API_KEY env var when on-prem — it isn’t consulted in on-prem mode.
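For the Secret plus init container approach mentioned above, one possible shape is to keep a placeholder in the ConfigMap and have an init container render the final config.json into an emptyDir that Moorcheh mounts at /root/.moorcheh. Everything named here is an assumption for illustration: the placeholder string __OPENAI_API_KEY__ (which would replace the api_key values in the ConfigMap), the Secret moorcheh-provider-keys and its key openai-api-key, and the busybox image. These are the parts of the Moorcheh pod spec that would change:

```yaml
    spec:
      initContainers:
        - name: render-config
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            # Substitute the placeholder and write the rendered file to the shared emptyDir.
            - >-
              sed "s|__OPENAI_API_KEY__|${OPENAI_API_KEY}|g"
              /template/config.json > /rendered/config.json
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: moorcheh-provider-keys   # hypothetical Secret
                  key: openai-api-key            # hypothetical key
          volumeMounts:
            - name: config-template
              mountPath: /template
            - name: config
              mountPath: /rendered
      containers:
        - name: moorcheh
          # image, ports, probes, and resources as in the Deployment above
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /root/.moorcheh
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: moorcheh-data
        - name: config-template                  # the ConfigMap now holds only the template
          configMap:
            name: moorcheh-config
        - name: config                           # rendered config lives here, never in the ConfigMap
          emptyDir: {}
```

The sed expression uses | as its delimiter, so it assumes the key contains neither | nor &; adjust the substitution if your provider's keys can include those characters.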

Ollama on Kubernetes (Optional)

If you want fully local inference, add an Ollama Deployment with persistent storage for model weights:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
            limits:
              memory: "16Gi"
              cpu: "4"
              # nvidia.com/gpu: 1   # if you have GPU nodes + nvidia-device-plugin
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
Update the Moorcheh ConfigMap to use "base_url": "http://ollama:11434" for both embedding and LLM blocks. After deploy, pull your models once:
kubectl exec -it deployment/ollama -- ollama pull nomic-embed-text
kubectl exec -it deployment/ollama -- ollama pull qwen2.5
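Putting the ConfigMap change and the two pulled models together, the updated Moorcheh ConfigMap might look like the sketch below. Only base_url and the model names come from the text above; the "provider": "ollama" value and the omission of api_key are assumptions about Moorcheh's config schema, so verify them against the Moorcheh documentation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: moorcheh-config
data:
  config.json: |
    {
      "embedding": {
        "provider": "ollama",
        "model": "nomic-embed-text",
        "base_url": "http://ollama:11434"
      },
      "llm": {
        "provider": "ollama",
        "model": "qwen2.5",
        "base_url": "http://ollama:11434"
      }
    }
```

After applying the change, Moorcheh may need a restart to pick up the new file, for example with kubectl rollout restart deployment/moorcheh.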
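
Network Policy

Restrict who can reach Memanto and Moorcheh. These policies only take effect if your cluster's CNI plugin enforces NetworkPolicy: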
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: memanto-allow-ingress
spec:
  podSelector:
    matchLabels:
      app: memanto
  policyTypes: ["Ingress"]
  ingress:
    - from:
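        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace by Kubernetes 1.21+;
              # a plain "name" label only exists if you added it yourself.
              kubernetes.io/metadata.name: ingress-nginx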
      ports:
        - protocol: TCP
          port: 8000
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: moorcheh-only-from-memanto
spec:
  podSelector:
    matchLabels:
      app: moorcheh
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: memanto
      ports:
        - protocol: TCP
          port: 8080
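If you also run Ollama in the cluster, the same pattern can be applied to it. The sketch below assumes that Moorcheh is the only workload that needs to call Ollama, which follows from the ConfigMap's base_url setting but is not stated outright in this guide:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-only-from-moorcheh
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: moorcheh
      ports:
        - protocol: TCP
          port: 11434
```

kubectl exec goes through the API server and kubelet rather than the pod network, so the one-time ollama pull commands are not affected by this policy.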

Applying

kubectl apply -f secret.yaml        # first, so the Deployment's secretKeyRef resolves
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
kubectl apply -f hpa.yaml

# Optional on-prem additions:
kubectl apply -f moorcheh.yaml
kubectl apply -f ollama.yaml
kubectl apply -f networkpolicy.yaml

kubectl get pods -l app=memanto -w
kubectl logs -l app=memanto -f
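
As an alternative to applying each file by hand, the same manifests can be listed in a kustomization.yaml and applied in one step with kubectl apply -k . from the directory that contains them. A minimal sketch, assuming the file names used above:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - secret.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - hpa.yaml
  # Optional on-prem additions; uncomment the ones you use:
  # - moorcheh.yaml
  # - ollama.yaml
  # - networkpolicy.yaml
```

This also gives you a single place to set a namespace or common labels later, if you need them.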

Cloud Platform Quick-Reference

The Memanto image is generic — it runs anywhere that can host a container.
| Platform | Notes |
| --- | --- |
| AWS ECS / Fargate | Pull image from ECR. Inject MOORCHEH_API_KEY via AWS Secrets Manager. ALB → port 8000. |
| Google Cloud Run | Inject API key via Secret Manager. Set --port 8000. Concurrency 80–100 per instance is a reasonable starting point. |
| Azure Container Instances / Apps | Inject API key via Azure Key Vault references. |
| DigitalOcean App Platform | Set the env var MOORCHEH_API_KEY in the App definition. |
| Fly.io | flyctl secrets set MOORCHEH_API_KEY=…. Set [http_service] internal_port = 8000. |
For on-prem on managed platforms, the Moorcheh container needs persistent storage and an internal endpoint your Memanto service can reach — most platforms make this straightforward via “sidecar” or “internal service” features.

Next Steps