
Self-Hosted AI on Kubernetes: GPU Inference with Ollama and Open WebUI

A step-by-step companion to Episode 7 of the Kubernetes on Raspberry Pi series. Covers exposing an external GPU machine as a Kubernetes service using EndpointSlices, and deploying Open WebUI.


The Raspberry Pi cluster is great for orchestration, but it has no GPU. Running large language models on ARM Cortex cores isn't practical. What we do have is a Windows PC with an RTX 4090 sitting on the same network, already running Ollama. The question is how to connect the cluster to it cleanly: without complicated driver passthrough, without hacks, and in a way that looks native to anything running inside Kubernetes.

This is the companion article to Episode 7 of the Kubernetes on Raspberry Pi series. We expose Ollama as a native Kubernetes service using EndpointSlices, then deploy Open WebUI to put a polished chat interface on top, accessible at https://ai.spatacoli.xyz.

All configs are in the kubernetes-series GitHub repo under video-07-gpu-inference-ollama/.

The Setup

| Component | Details |
| --- | --- |
| GPU machine | Windows PC, RTX 4090, IP 10.51.50.13 |
| Ollama | Running on Windows, listening on 0.0.0.0:11434 |
| Open WebUI | Deployed in the cluster, connects to Ollama |
| Access URL | https://ai.spatacoli.xyz |

Before anything else, Ollama must be listening on all interfaces, not just localhost. Set the environment variable in Ollama's settings:

OLLAMA_HOST=0.0.0.0

Also ensure Windows Firewall allows inbound traffic on port 11434. Talos doesn't allow SSH onto nodes, so we test from inside the cluster using a temporary pod:
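On Windows, the firewall rule can be created from an elevated PowerShell prompt. A sketch (the rule display name is arbitrary):

```powershell
# Allow inbound TCP on 11434 (Ollama's default port) from the LAN
New-NetFirewallRule -DisplayName "Ollama" `
  -Direction Inbound -Protocol TCP -LocalPort 11434 -Action Allow
```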

kubectl run -it --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  -n default \
  -- http://10.51.50.13:11434/api/tags

You should get a JSON response listing available models. If it times out, check the Windows Firewall before proceeding.

External Services and EndpointSlices

This episode introduces a pattern that makes Kubernetes much more flexible: External Services. Instead of pointing a Service at pods inside the cluster, we point it at an IP address outside it. From the perspective of any pod in the cluster, Ollama looks identical to an internal service. It's reachable by DNS name, not raw IP.

The mechanism is an EndpointSlice with a manual address. Create the namespace, Service, and EndpointSlice together:

# ollama-external.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  # No selector on purpose: endpoints come from the manually
  # managed EndpointSlice below, not from pods.
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: ollama
  namespace: ai
  labels:
    kubernetes.io/service-name: ollama
addressType: IPv4
ports:
  - port: 11434
    protocol: TCP
endpoints:
  - addresses:
      - 10.51.50.13

kubectl apply -f ollama-external.yaml

Any pod in the cluster can now reach Ollama at http://ollama.ai.svc.cluster.local:11434, as if it were running inside Kubernetes.
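Before deploying anything that depends on it, it's worth confirming the Service actually picked up the manual endpoint and resolves from inside the cluster. A quick check, using the same temporary-pod pattern as the earlier curl test:

```shell
# Confirm the EndpointSlice is attached to the Service
kubectl -n ai get endpointslice ollama

# Hit the service DNS name from inside the cluster
kubectl run -it --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  -n ai \
  -- http://ollama.ai.svc.cluster.local:11434/api/tags
```

The second command should return the same model list as the direct-IP test; if it doesn't, check that the EndpointSlice carries the kubernetes.io/service-name label.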

Deploying Open WebUI

Open WebUI is a polished chat interface that connects to Ollama (or any OpenAI-compatible API). The key environment variable OLLAMA_BASE_URL points it at our external service DNS name:

# open-webui.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:latest  # consider pinning a version tag for reproducible deploys
          env:
            - name: OLLAMA_BASE_URL
              value: http://ollama.ai.svc.cluster.local:11434
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: webui-data
              mountPath: /app/backend/data
      volumes:
        - name: webui-data
          persistentVolumeClaim:
            claimName: open-webui-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ai
spec:
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP

Open WebUI stores conversations, settings, and user accounts in /app/backend/data, so it needs its own PVC:

# open-webui-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-pvc
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs
  resources:
    requests:
      storage: 5Gi

Adding an Ingress

The Ingress follows the same pattern as Episodes 5 and 6. We reference letsencrypt-prod in the annotation so cert-manager handles the TLS certificate automatically; no separate Certificate resource is needed:

# open-webui-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui
  namespace: ai
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - ai.spatacoli.xyz
      secretName: open-webui-tls
  rules:
    - host: ai.spatacoli.xyz
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui
                port:
                  number: 8080

kubectl apply -f open-webui-pvc.yaml
kubectl apply -f open-webui.yaml
kubectl apply -f open-webui-ingress.yaml
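After applying, cert-manager's ingress-shim creates a Certificate named after the secretName in the Ingress. If the site doesn't come up with a valid certificate within a minute or two, these are the places to look:

```shell
# The Certificate should reach Ready=True once issuance completes
kubectl -n ai get certificate open-webui-tls

# If issuance stalls, inspect the pending ACME challenges
kubectl -n ai get challenges
```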

Open https://ai.spatacoli.xyz, create your account, and start chatting with models running on your own hardware. The cluster handles routing and TLS; the RTX 4090 handles inference.
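One thing to remember: models are pulled and stored on the GPU machine, not in the cluster. Open WebUI simply lists whatever Ollama already has. To add a model, run a pull on the Windows box (the model name here is just an example):

```shell
# Run on the Windows machine — downloads into Ollama's local model store
ollama pull llama3.1

# Verify it shows up via the API the cluster uses
curl http://10.51.50.13:11434/api/tags
```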

What's Next

In Episode 8 we add centralized logging with Loki and Promtail. When something breaks, you need to know where to look.
