Leveraging Kubernetes for Private LLMs: Lessons from the Edge

Rabah GUEDREZ

Published on July 29, 2024

PrivateGPT (Namla + Jetson + Ollama + Llama 3.1 + Cloudflared)

Seeing how our clients use our Edge to Cloud Orchestration Platform for their distributed edge infrastructure inspired me to write this post. I'll show how you can build your own private ChatGPT with open-source LLMs at the edge of your network, give it access to your data, and use it from any device, anywhere, even while on vacation, without worrying about privacy. This guide will help you leverage edge computing to create a seamless, personalized AI experience that stays with you wherever you go.

First of all, what is Namla? Namla is a cloud-to-edge (or edge-to-cloud) native orchestration platform that lets you manage cloud and edge resources in a single converged plane. Under the hood, we leverage Kubernetes, the leading open-source system for automating deployment, scaling, and management of containerized applications, and we add a lot of enhancements on top without requiring you to learn a new tool. One of our main verticals is EdgeAI: for example, we help clients deploy and manage thousands of Nvidia Jetson devices, small but powerful computers designed for AI applications at the edge. We work closely with Nvidia to support the latest Jetson hardware and JetPack versions, such as JetPack 6, a comprehensive SDK with the latest tools and libraries for building AI applications (and my daily driver these days).

LLMs, sLLMs, and VLMs are taking the world of EdgeAI by storm. I suggest following Dustin Franklin to see the sheer number of use cases he is building with Jetson and JetPack; it will show you the potential of Jetson as a platform for EdgeAI, from robots to drones to video AI.

With the release of Llama 3.1 by Meta, which is giving OpenAI a run for its money and is neck and neck with GPT-4, there is no excuse, if you have an Nvidia Jetson lying around, not to use it to build your own private ChatGPT. And that's exactly what I'm doing.

My goal is to deploy Llama 3.1 8B on a Jetson in our offices in Paris, expose it using an FQDN like privategpt.namla.ai, and use a mobile app to access it from my vacation location using the OpenAI API. Here’s the architecture I came up with:

The Stack

E2E Architecture

Steps:

1-Namla:

  • Onboard an Nvidia Jetson Orin to Namla.
  • Join it to a Kubernetes virtual cluster and label it so the Nvidia GPU discovery DaemonSet gets scheduled onto it (see the sketch below).
  • Verify the setup: Prometheus monitoring is reporting, and the node exposes nvidia.com/gpu: 1 in its Kubernetes capacity.
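
As a rough illustration of the labeling step, this is what it could look like with plain kubectl; the label key is an assumption, since the exact key depends on the nodeSelector your GPU discovery DaemonSet uses:

# Label the Jetson node so the GPU discovery DaemonSet gets scheduled onto it
# (nvidia.com/gpu.present=true is illustrative; match your DaemonSet's nodeSelector).
kubectl label node edge-poc-jetson-1-id498 nvidia.com/gpu.present=true

# Verify the node now advertises the GPU in its Kubernetes capacity (expect "1").
kubectl get node edge-poc-jetson-1-id498 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'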

2-Cloudflare Tunnel:

  • In the Cloudflare console, create a Cloudflare tunnel, copy its token, and generate the Kubernetes manifest for the connector deployment.
  • Create a DNS entry that routes the public hostname to the Cloudflare tunnel (a CLI equivalent is sketched below).
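
If you prefer the CLI to the dashboard, roughly the same result can be achieved with cloudflared itself; the tunnel name below is illustrative, and the public-hostname-to-service mapping (privategpt.namla.ai to http://ollama-service:11434) still has to be configured on the tunnel:

# Authenticate cloudflared against your Cloudflare account.
cloudflared tunnel login

# Create a named tunnel and point a DNS record at it.
cloudflared tunnel create privategpt
cloudflared tunnel route dns privategpt privategpt.namla.ai

# Print the tunnel token to paste into the cloudflared Kubernetes deployment.
cloudflared tunnel token privategpt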

3-Android App:

  • For the browser, I could have deployed OpenWebUI on top of the Ollama API to get a ChatGPT-like experience.
  • For mobile, I found the Android app ChatBoost, which supports the OpenAI API format.
  • I chose this app because I'm used to the OpenAI API and already use it in my Python scripting; its compatibility made it a convenient choice for the demo video below.

This setup allows me to leverage the power of Llama 3.1 on an Nvidia Jetson, creating a robust, private AI experience accessible from anywhere.
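
Since Ollama exposes an OpenAI-compatible /v1 endpoint, any OpenAI-style client can talk to the Jetson through the tunnel. Here is a minimal sketch with curl, assuming the public hostname forwards to the ollama-service on port 11434 and that the llama3.1:8b model has already been pulled:

curl https://privategpt.namla.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [
          {"role": "user", "content": "Write a haiku about edge computing."}
        ]
      }'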

K8s manifest:

Namla Platform provides a native way to deploy Kubernetes applications using a simple manifest or Helm chart. Here's the manifest I used in Namla to deploy everything:

  • Ollama deployment
  • ClusterIP service to expose Ollama within the cluster
  • Cloudflared tunnel for secure external access

This manifest includes all the necessary components for deploying and accessing the application.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: dustynv/ollama:r36.2.0
          command:
            - /bin/sh
          args:
            - -c
            - |
              if [ -f /ollama-bin/ollama ]; then
                cp /ollama-bin/ollama /usr/local/bin/ollama
                chmod 755 /usr/local/bin/ollama
              fi
              ollama serve & sleep infinity
          env:
            - name: OLLAMA_ORIGINS
              value: "*"
            - name: OLLAMA_MODELS
              value: /root/.ollama/models
            - name: OLLAMA_LOGS
              value: /root/.ollama/logs/ollama.log
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
            - name: llama3-volume
              mountPath: /root/.ollama/models/llama3_1
            - name: ollama-bin
              mountPath: /ollama-bin
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: ollama-volume
          hostPath:
            path: /opt/namla/metropolis/ollama
            type: DirectoryOrCreate
        - name: llama3-volume
          hostPath:
            path: /opt/namla/metropolis/llama-models/models/llama3_1
            type: DirectoryOrCreate
        - name: ollama-bin
          emptyDir: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - edge-poc-jetson-1-id498
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args:
            - tunnel
            - --no-autoupdate
            - run
            - --token
            - eyJhIjoiYjdjMjJmNzJhNWNh....

Notes:

  • Ollama deployment: sets up Ollama on the Jetson device. An init container updates the Ollama binary from Dustin's image (the main container then copies it from the /ollama-bin volume into place); I prefer that convenience over building my own image.
  • ClusterIP service: exposes Ollama within the cluster, allowing other workloads to reach it internally.
  • Cloudflare tunnel: creates a secure tunnel from the Jetson device to Cloudflare's cloud, acting as a reverse proxy. There are more elegant solutions using an ingress, but I opted for a quick and straightforward approach.
  • Nvidia GPU Operator: we make sure the Nvidia GPU Operator is deployed so it discovers the GPU and adds it to the Kubernetes cluster as a resource. This lets us request the GPU as a resource in the Ollama deployment instead of running it as a privileged pod, which, as you may know, is not a recommended practice.
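
If the model isn't already staged on the hostPath volume, a quick way to fetch it is to exec into the pod; the deployment name and model tag below are the ones used above, so adjust them if yours differ:

# Pull Llama 3.1 8B into the hostPath-backed model directory so it survives pod restarts.
kubectl exec -it deploy/ollama-deployment -- ollama pull llama3.1:8b

# Confirm the model is available locally.
kubectl exec -it deploy/ollama-deployment -- ollama list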
Screenshot from Namla Apps

Test & Performance:

Namla has an extensive monitoring tool for Edge hardware, encompassing traditional metrics such as CPU, RAM, disk, and networking, along with GPU-specific metrics. For instance, on Nvidia Jetson devices, we have developed a specialized exporter that allows our users to track the performance of their AI models on the GPU.

Let's start by checking that Ollama has taken almost 5GB of GPU RAM to load the LLAMA3.2 8B model, as you can see in the following figure:

Ollama loads Llama 3.1 into GPU RAM

Prompting Ollama directly from the CLI
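
For reference, a one-off prompt straight from the pod looks roughly like this (same deployment name and model tag as above):

# Run a single prompt against the model loaded on the Jetson's GPU.
kubectl exec -it deploy/ollama-deployment -- ollama run llama3.1:8b \
  "Summarize what a Cloudflare tunnel does in one sentence."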

PS: I noticed that while GPU RAM utilization remains stable, since all of the model's layers are loaded onto the GPU, the GPU core utilization jumps to 100% while generating a response and then drops back to idle.

Other GPU metrics

Want to see Namla in action? Follow these easy steps to get started:

  • Request a Namla Demo Account: Fill out our contact form to get access.
  • Access a Kubernetes Virtual Cluster: We'll provide you with a virtual cluster to explore.
  • Onboard Your NVIDIA Jetson Devices: Connect and manage your devices effortlessly.
  • Deploy Your Kubernetes Native Application Manifests: Simplify and streamline your deployment process.
  • Start Interacting with Your EdgeAI: Experience the power of AI at the edge.

Ready to dive in? Fill out the form and get started today!