Published on January 12, 2025
The manifests used in this guide are available at https://github.com/g-rabah/namla-usecases-/blob/main/nim-deployment.yaml
NVIDIA NIM microservices accelerate the deployment of foundation models on any cloud or data center. Each NIM is a standalone Docker container optimized for NVIDIA GPUs, leveraging NVIDIA TensorRT-LLM with specialized acceleration profiles for NVIDIA H100, A100, A10, and L40S GPUs.
In this guide, we’ll deploy a NIM on Namla’s Kubernetes cluster and set up two ways to chat with the LLM: a CLI chat for quick terminal interaction and a web UI for a ChatGPT-like experience.
NIMs provide a high-performance solution for AI inference. There’s much more to explore; you can learn more about NVIDIA NIM microservices here.
While there are many ways to run a NIM—from a simple docker run command to more complex Kubernetes scenarios—at Namla, our app orchestration engine is Kubernetes-based, so this guide will focus on that. In this article, I’ll show you how simple it is to run a NIM. In future articles, I’ll dive into advanced serving techniques like the NIM Operator and KServe.
To get started, you’ll need two things, both derived from your NVIDIA NGC API key: an image pull secret (ngc-secret) and an API key secret (ngc-api-secret).
Here’s how these secrets look in Kubernetes:
apiVersion: v1
data:
  .dockerconfigjson: eyJhdXRoc*************EJOYWxaciJ9fX0=
kind: Secret
metadata:
  name: ngc-secret
type: kubernetes.io/dockerconfigjson
---
apiVersion: v1
data:
  NGC_API_KEY: bnZhcGktT************************ZA==
kind: Secret
metadata:
  name: ngc-api-secret
The ngc-secret lets your Kubernetes cluster pull images from NVIDIA’s registry, and the ngc-api-secret holds your API key for downloading models.
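If you’d rather not hand-encode the base64 values above, the same two secrets can be created directly with kubectl. Here’s a minimal sketch, assuming your key is exported in an NGC_API_KEY shell variable ($oauthtoken is the literal username NVIDIA’s registry expects):

# Registry secret so the cluster can pull NIM images from nvcr.io
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Secret holding the API key the NIM container uses to download model weights
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"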
We’re deploying the NIM as a Kubernetes Deployment. Here’s the manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-8b-instruct
  template:
    metadata:
      labels:
        app: llama3-8b-instruct
    spec:
      # pull the NIM image from nvcr.io using the registry secret created above
      imagePullSecrets:
      - name: ngc-secret
      containers:
      - name: llama3-8b-instruct
        image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
        imagePullPolicy: IfNotPresent
        env:
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret
              key: NGC_API_KEY
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /opt/nim/.cache
          name: nim-cache
        securityContext:
          runAsUser: 1000
        livenessProbe:
          httpGet:
            path: /v1/health/live
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
      volumes:
      - name: nim-cache
        hostPath:
          path: /opt/namla/nim/models
          type: DirectoryOrCreate
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
---
apiVersion: v1
kind: Service
metadata:
  name: llama3-8b-instruct-svc
  namespace: default
  labels:
    app: llama3-8b-instruct
spec:
  selector:
    app: llama3-8b-instruct
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
A few key points here:
Models are large, so we’re storing them in a hostPath (/opt/namla/nim/models) to avoid re-downloading them every time the pod restarts. In the next article, we’ll explore using network storage for better caching.
We’re pinning the deployment to a specific node (gcp-a100-id321) equipped with an NVIDIA A100 GPU, using node affinity.
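Once the pod is running (the first startup takes a while because the model weights are downloaded into the cache), you can sanity-check the OpenAI-compatible endpoint from inside the cluster. Here’s a quick sketch using curl from any pod that can reach the service; the model name meta/llama3-8b-instruct is an assumption, so list /v1/models first if you’re unsure:

# List the models this NIM is serving
curl -s http://llama3-8b-instruct-svc:8000/v1/models

# Send a test chat completion request
curl -s http://llama3-8b-instruct-svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'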
Method 1: CLI Tool (gpt-cli)
NVIDIA’s NIM exposes an OpenAI-compatible API, which means you can use tools like gpt-cli.
We’re deploying gpt-cli as a pod in the Namla K8s cluster using the python:3.9-slim image. Instead of building a proper Docker image, I went with the lazy option: install everything at runtime. Each pod startup handles the setup. Future Me can deal with packaging — you’re welcome!
One last thing: for gpt-cli to connect to the NIM’s OpenAI-compatible endpoint, we configure two environment variables: OPENAI_API_KEY (a placeholder value, since the self-hosted NIM doesn’t check it) and OPENAI_BASE_URL, which points to the NIM’s cluster IP service.
Here’s the full gpt-cli pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gpt-command-line-pod
spec:
  containers:
  - name: gpt-command-line
    image: python:3.9-slim
    command:
    - sh
    - -c
    - >
      pip install gpt-command-line && pip install --upgrade openai gpt-cli
      && sleep infinity
    env:
    - name: OPENAI_API_KEY
      value: "null"
    - name: OPENAI_BASE_URL
      value: http://llama3-8b-instruct-svc:8000/v1
    resources:
      requests:
        memory: 128Mi
        cpu: 250m
      limits:
        memory: 256Mi
        cpu: 500m
To chat with NIM using the CLI, you need to execute a terminal session inside the pod. At Namla, this is simplified with a single click on the dashboard, which opens a direct terminal to the pod. However, if you’re using a standard Kubernetes setup, you’ll need to manually exec into the pod using the following command:
kubectl exec -it gpt-command-line-pod -- /bin/bash
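Once inside the pod, you can start a chat session. The rough sketch below assumes the gpt-command-line package installs a gpt entry point and accepts a --model flag; check gpt --help if your installed version differs:

# Start an interactive chat against the NIM (served model name assumed)
gpt --model meta/llama3-8b-instruct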
Method 2: Web UI (Open WebUI)
To provide a user experience similar to ChatGPT, we deploy Open WebUI. It is configured to use the NIM’s OpenAI-compatible API, exposed through the cluster IP service llama3-8b-instruct-svc at http://llama3-8b-instruct-svc:8000/v1. Open WebUI itself is exposed via its own cluster IP service, open-webui-service, on port 3000.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
  labels:
    app: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: open-webui-storage
          mountPath: /app/backend/data
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 1Gi
            cpu: "1"
      restartPolicy: Always
      volumes:
      - name: open-webui-storage
        hostPath:
          path: /opt/namla/open-webui
          type: DirectoryOrCreate
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
spec:
  selector:
    app: open-webui
  ports:
  - protocol: TCP
    port: 3000
    targetPort: 8080
  type: ClusterIP
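The manifest above doesn’t bake in the connection to the NIM; you can point Open WebUI at it from its connection settings in the UI, or preconfigure it through environment variables. Here’s a sketch using kubectl, assuming the OPENAI_API_BASE_URL and OPENAI_API_KEY variables supported by current Open WebUI images:

# Preconfigure Open WebUI to talk to the NIM's OpenAI-compatible endpoint
kubectl set env deployment/open-webui-deployment \
  OPENAI_API_BASE_URL=http://llama3-8b-instruct-svc:8000/v1 \
  OPENAI_API_KEY=none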
For external access, we use a Cloudflared container to establish a secure tunnel back to Cloudflare, removing the need for a public IP. The tunnel is configured in the Cloudflare dashboard to connect to Open WebUI’s cluster IP service at http://open-webui-service:3000. This setup provides a user-friendly URL like www.nim.mydomain.com, ensuring seamless and secure access to the dashboard. And voilà – an instant ChatGPT-like experience!
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-cloudflared-deployment
  labels:
    app: ollama-cloudflared
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-cloudflared
  template:
    metadata:
      labels:
        app: ollama-cloudflared
    spec:
      containers:
      - name: ollama-cloudflared
        image: cloudflare/cloudflared:latest
        args:
        - tunnel
        - --no-autoupdate
        - run
        - --token
        - eyJhIjoiYjdjMjJmNzJhNWNhMjM0Y2QyY2NhYjU1Mjk4***********************
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - gcp-a100-id321
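One hardening note: instead of embedding the tunnel token in the Deployment args, you can keep it in a Kubernetes secret and expose it to the container as the TUNNEL_TOKEN environment variable, which cloudflared picks up when no --token flag is passed. A sketch, assuming the token is in a local TUNNEL_TOKEN shell variable:

# Keep the Cloudflare tunnel token out of the manifest
kubectl create secret generic cloudflared-token \
  --from-literal=TUNNEL_TOKEN="$TUNNEL_TOKEN"

The container args then shrink to tunnel --no-autoupdate run, with the variable injected via a secretKeyRef.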
Beyond deploying the NIM on Kubernetes, Namla offers monitoring and observability features that let you continuously track not only your infrastructure but also the NIM itself: real-time visibility into resource consumption such as GPU load, memory, and even power draw, plus per-container resource usage and real-time logs.
In this guide, we:
Deployed the llama3-8b-instruct NIM on a GPU node of Namla’s Kubernetes cluster, with the NGC secrets needed to pull the image and download the model.
Set up a CLI chat by running gpt-cli in a small pod pointed at the NIM’s OpenAI-compatible endpoint.
Deployed Open WebUI for a ChatGPT-like web experience, exposed securely through a Cloudflare tunnel.
In the next article, we’ll look at using network storage for better model caching. For now, enjoy experimenting with your NIM setup – and let me know how you’re using it!
Namla is a proud member of the NVIDIA Inception program.