Published on January 12, 2025
The manifests used in this guide are available at https://github.com/g-rabah/namla-usecases-/blob/main/nim-deployment.yaml
NVIDIA NIM microservices accelerate the deployment of foundation models on any cloud or data center. Each NIM is a standalone Docker container optimized for NVIDIA GPUs, leveraging NVIDIA TensorRT-LLM with specialized acceleration profiles for NVIDIA H100, A100, A10, and L40S GPUs.
In this guide, we’ll deploy a NIM on Namla’s Kubernetes cluster and set up two ways to chat with the LLM: a CLI chat for quick terminal interaction and a web UI for a ChatGPT-like experience.
NIMs provide a high-performance solution for AI inference. There’s much more to explore; you can learn more about NVIDIA NIM microservices here.
While there are many ways to run a NIM—from a simple docker run command to more complex Kubernetes scenarios—at Namla, our app orchestration engine is Kubernetes-based, so this guide will focus on that. In this article, I’ll show you how simple it is to run a NIM. In future articles, I’ll dive into advanced serving techniques like the NIM Operator and KServe.
To get started, you’ll need two things, both derived from your NVIDIA NGC API key: an image pull secret (ngc-secret) and an API key secret (ngc-api-secret).
Here’s how these secrets look in Kubernetes:
apiVersion: v1
data:
  .dockerconfigjson: eyJhdXRoc*************EJOYWxaciJ9fX0=
kind: Secret
metadata:
  name: ngc-secret
type: kubernetes.io/dockerconfigjson
---
apiVersion: v1
data:
  NGC_API_KEY: bnZhcGktT************************ZA==
kind: Secret
metadata:
  name: ngc-api-secret
The ngc-secret lets your Kubernetes cluster pull images from NVIDIA’s registry, and the ngc-api-secret holds your API key for downloading models.
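If you’d rather not hand-encode the base64 values above, the same two secrets can be created directly with kubectl. Here’s a minimal sketch, assuming your key is exported in an NGC_API_KEY shell variable ($oauthtoken is the literal username NVIDIA’s registry expects):

# Registry secret so the cluster can pull NIM images from nvcr.io
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Secret holding the API key the NIM container uses to download model weights
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"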
We’re deploying the NIM as a Kubernetes Deployment. Here’s the manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-8b-instruct
  template:
    metadata:
      labels:
        app: llama3-8b-instruct
    spec:
      # pull the NIM image from nvcr.io using the registry secret created above
      imagePullSecrets:
      - name: ngc-secret
      containers:
      - name: llama3-8b-instruct
        image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
        imagePullPolicy: IfNotPresent
        env:
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret
              key: NGC_API_KEY
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /opt/nim/.cache
          name: nim-cache
        securityContext:
          runAsUser: 1000
        livenessProbe:
          httpGet:
            path: /v1/health/live
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
      volumes:
      - name: nim-cache
        hostPath:
          path: /opt/namla/nim/models
          type: DirectoryOrCreate
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
---
apiVersion: v1
kind: Service
metadata:
  name: llama3-8b-instruct-svc
  namespace: default
  labels:
    app: llama3-8b-instruct
spec:
  selector:
    app: llama3-8b-instruct
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
A few key points here:
Models are large, so we’re storing them in a hostPath (/opt/namla/nim/models) to avoid re-downloading them every time the pod restarts. In the next article, we’ll explore using network storage for better caching.
We’re pinning the deployment to a specific node (gcp-a100-id321) equipped with an NVIDIA A100 GPU, using node affinity.
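Once the pod is running (the first startup takes a while because the model weights are downloaded into the cache), you can sanity-check the OpenAI-compatible endpoint from inside the cluster. Here’s a quick sketch using curl from any pod that can reach the service; the model name meta/llama3-8b-instruct is an assumption, so list /v1/models first if you’re unsure:

# List the models this NIM is serving
curl -s http://llama3-8b-instruct-svc:8000/v1/models

# Send a test chat completion request
curl -s http://llama3-8b-instruct-svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'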
Method 1: CLI Tool (gpt-cli)
NVIDIA’s NIM exposes an OpenAI-compatible API, which means you can use tools like gpt-cli.
We’re deploying gpt-cli as a pod in the Namla K8s cluster using the python:3.9-slim image. Instead of building a proper Docker image, I went with the lazy option: install everything at runtime. Each pod startup handles the setup. Future Me can deal with packaging — you’re welcome!
One last thing: for gpt-cli to connect to the NIM’s OpenAI-compatible endpoint, we configure two environment variables: OPENAI_API_KEY (a placeholder value, since the self-hosted NIM doesn’t check it) and OPENAI_BASE_URL, which points to the NIM’s cluster IP service.
Here’s the full gpt-cli pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gpt-command-line-pod
spec:
  containers:
  - name: gpt-command-line
    image: python:3.9-slim
    command:
    - sh
    - -c
    - >
      pip install gpt-command-line && pip install --upgrade openai gpt-cli
      && sleep infinity
    env:
    - name: OPENAI_API_KEY
      value: "null"
    - name: OPENAI_BASE_URL
      value: http://llama3-8b-instruct-svc:8000/v1
    resources:
      requests:
        memory: 128Mi
        cpu: 250m
      limits:
        memory: 256Mi
        cpu: 500m
To chat with NIM using the CLI, you need to execute a terminal session inside the pod. At Namla, this is simplified with a single click on the dashboard, which opens a direct terminal to the pod. However, if you’re using a standard Kubernetes setup, you’ll need to manually exec into the pod using the following command:
kubectl exec -it gpt-command-line-pod -- /bin/bash
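Once inside the pod, you can start a chat session. The rough sketch below assumes the gpt-command-line package installs a gpt entry point and accepts a --model flag; check gpt --help if your installed version differs:

# Start an interactive chat against the NIM (served model name assumed)
gpt --model meta/llama3-8b-instruct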
Method 2: Web UI (Open WebUI)
To provide a user experience similar to ChatGPT, we deploy Open WebUI. It is configured to use the NIM’s OpenAI-compatible API, exposed through the cluster IP service llama3-8b-instruct-svc at http://llama3-8b-instruct-svc:8000/v1. Open WebUI itself is exposed via its own cluster IP service, open-webui-service, on port 3000.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
  labels:
    app: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: open-webui-storage
          mountPath: /app/backend/data
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 1Gi
            cpu: "1"
      restartPolicy: Always
      volumes:
      - name: open-webui-storage
        hostPath:
          path: /opt/namla/open-webui
          type: DirectoryOrCreate
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
spec:
  selector:
    app: open-webui
  ports:
  - protocol: TCP
    port: 3000
    targetPort: 8080
  type: ClusterIP
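The manifest above doesn’t bake in the connection to the NIM; you can point Open WebUI at it from its connection settings in the UI, or preconfigure it through environment variables. Here’s a sketch using kubectl, assuming the OPENAI_API_BASE_URL and OPENAI_API_KEY variables supported by current Open WebUI images:

# Preconfigure Open WebUI to talk to the NIM's OpenAI-compatible endpoint
kubectl set env deployment/open-webui-deployment \
  OPENAI_API_BASE_URL=http://llama3-8b-instruct-svc:8000/v1 \
  OPENAI_API_KEY=none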
For external access, we use a Cloudflared container to establish a secure tunnel back to Cloudflare, removing the need for a public IP. The tunnel is configured in the Cloudflare dashboard to connect to Open WebUI’s cluster IP service at http://open-webui-service:3000. This setup provides a user-friendly URL like www.nim.mydomain.com, ensuring seamless and secure access to the dashboard. And voilà – an instant ChatGPT-like experience!
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-cloudflared-deployment
  labels:
    app: ollama-cloudflared
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-cloudflared
  template:
    metadata:
      labels:
        app: ollama-cloudflared
    spec:
      containers:
      - name: ollama-cloudflared
        image: cloudflare/cloudflared:latest
        args:
        - tunnel
        - --no-autoupdate
        - run
        - --token
        - eyJhIjoiYjdjMjJmNzJhNWNhMjM0Y2QyY2NhYjU1Mjk4***********************
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
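One hardening note: instead of embedding the tunnel token in the Deployment args, you can keep it in a Kubernetes secret and expose it to the container as the TUNNEL_TOKEN environment variable, which cloudflared picks up when no --token flag is passed. A sketch, assuming the token is in a local TUNNEL_TOKEN shell variable:

# Keep the Cloudflare tunnel token out of the manifest
kubectl create secret generic cloudflared-token \
  --from-literal=TUNNEL_TOKEN="$TUNNEL_TOKEN"

The container args then shrink to tunnel --no-autoupdate run, with the variable injected via a secretKeyRef.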
Beyond deploying the NIM on Kubernetes, Namla offers monitoring and observability features that let you continuously track not only your infrastructure but also the NIM itself: real-time visibility into resource consumption such as GPU load, memory, and even power draw, plus per-container resource usage and real-time logs.
In this guide, we:
Deployed the llama3-8b-instruct NIM on a GPU node of Namla’s Kubernetes cluster, with the NGC secrets needed to pull the image and download the model.
Set up a CLI chat by running gpt-cli in a small pod pointed at the NIM’s OpenAI-compatible endpoint.
Deployed Open WebUI for a ChatGPT-like web experience, exposed securely through a Cloudflare tunnel.
In the next article, we’ll look at using network storage for better model caching. For now, enjoy experimenting with your NIM setup – and let me know how you’re using it!
Namla is a proud member of the NVIDIA Inception program.