Published on July 29, 2024
Seeing how our clients use our Edge to Cloud Orchestration Platform for their distributed edge infrastructure inspired me to write this post. In it, I'll show how you can build your own private ChatGPT with the power of open-source LLMs at the edge of your network, give it access to your data, and use it from any device, anywhere, even while on vacation, without worrying about privacy. This guide will help you leverage edge computing to create a seamless, personalized AI experience that stays with you wherever you go.
First of all, what is Namla? Namla is a cloud-to-edge (or edge-to-cloud) native orchestration platform that lets you manage cloud and edge resources in a single converged plane. Under the hood, we leverage Kubernetes, the leading open-source system for automating the deployment, scaling, and management of containerized applications, and we add a lot of enhancements without requiring you to learn a new tool. One of our main verticals is EdgeAI. For example, we help clients deploy and manage thousands of Nvidia Jetson devices, small but powerful computers designed for AI applications at the edge. We work closely with Nvidia to support the latest Jetson hardware and the latest JetPack versions, such as JetPack 6 (which has become my daily driver), a comprehensive SDK that includes the latest tools and libraries for building AI applications.
LLMs, small LLMs (sLLMs), and vision-language models (VLMs) are taking the world of EdgeAI by storm. I suggest following Dustin Franklin to learn about the sheer number of use cases he's building using Jetson and JetPack. This will show you the potential of Jetson as a main platform for EdgeAI, from robots to drones to video AI.
With the release of Llama 3.1 by Meta, which is giving OpenAI a run for its money and is neck and neck with GPT-4, you have no excuse: if you have an Nvidia Jetson lying around, use it to build your private ChatGPT. And that's exactly what I'm doing.
My goal is to deploy Llama 3.1 8B on a Jetson in our Paris office, expose it under an FQDN like privategpt.namla.ai, and use a mobile app that speaks the OpenAI-compatible API to access it from wherever I happen to be on vacation. Here's the architecture I came up with:
The Stack
1-Namla: the orchestration layer, used to deploy and manage the containerized workloads (Ollama and cloudflared) on the Jetson from a single manifest.
2-Cloudflare Tunnel: exposes the Ollama service running on the Jetson under privategpt.namla.ai without opening any inbound ports on the office network (see the sketch below).
3-Android App: an OpenAI-API-compatible chat client on my phone, pointed at the exposed endpoint.
This setup allows me to leverage the power of Llama 3.1 on an Nvidia Jetson, creating a robust, private AI experience accessible from anywhere.
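The Cloudflare side is mostly one-time setup. Here is a minimal sketch using the cloudflared CLI (the tunnel name privategpt is just an example; the hostname-to-service mapping, privategpt.namla.ai to http://ollama-service:11434, and the connector token used by the Deployment further down can also be managed from the Cloudflare Zero Trust dashboard):

# Create a named tunnel and point the public hostname at it
cloudflared tunnel create privategpt
cloudflared tunnel route dns privategpt privategpt.namla.ai

# Print the connector token that the cloudflared Deployment passes via --token
cloudflared tunnel token privategpt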
Namla Platform provides a native way to deploy Kubernetes applications using a simple manifest or Helm chart. Here's the manifest I used in Namla to deploy everything:
This manifest includes all the necessary components for deploying and accessing the application.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          # Ollama container image from dusty-nv, built for Jetson (L4T r36.2 / JetPack 6)
          image: dustynv/ollama:r36.2.0
          command:
            - /bin/sh
          args:
            - -c
            # If an updated ollama binary was dropped into /ollama-bin, install it, then start the server
            - |
              if [ -f /ollama-bin/ollama ]; then
                cp /ollama-bin/ollama /usr/local/bin/ollama
                chmod 755 /usr/local/bin/ollama
              fi
              ollama serve & sleep infinity
          env:
            # Accept requests from any origin and keep models and logs on the mounted volume
            - name: OLLAMA_ORIGINS
              value: "*"
            - name: OLLAMA_MODELS
              value: /root/.ollama/models
            - name: OLLAMA_LOGS
              value: /root/.ollama/logs/ollama.log
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
            - name: llama3-volume
              mountPath: /root/.ollama/models/llama3_1
            - name: ollama-bin
              mountPath: /ollama-bin
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      volumes:
        # hostPath volumes keep Ollama data and the model weights on the device across restarts
        - name: ollama-volume
          hostPath:
            path: /opt/namla/metropolis/ollama
            type: DirectoryOrCreate
        - name: llama3-volume
          hostPath:
            path: /opt/namla/metropolis/llama-models/models/llama3_1
            type: DirectoryOrCreate
        - name: ollama-bin
          emptyDir: {}
      affinity:
        # Pin the workload to a specific Jetson node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - edge-poc-jetson-1-id498
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        # Cloudflare Tunnel connector that publishes the Ollama service under privategpt.namla.ai
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args:
            - tunnel
            - --no-autoupdate
            - run
            - --token
            - eyJhIjoiYjdjMjJmNzJhNWNh....
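With the manifest applied, the model itself still has to be present under the mounted models directory. If it isn't already staged on the hostPath volume, one way to fetch it is to exec into the Ollama container and pull it; here is a sketch that assumes the Deployment runs in the default namespace and that the Ollama build inside dustynv/ollama:r36.2.0 knows the llama3.1:8b tag:

# Pull the 8B model inside the running container (it lands under /root/.ollama/models)
kubectl exec -it deploy/ollama-deployment -- ollama pull llama3.1:8b

# Quick sanity check from inside the cluster before exposing anything
kubectl exec -it deploy/ollama-deployment -- ollama run llama3.1:8b "Say hello from the edge"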
Notes:
1-The cloudflared token is truncated here; replace it with the full token generated for your own tunnel.
2-The nodeAffinity rule pins the deployment to a specific Jetson (edge-poc-jetson-1-id498); adjust the hostname to match your device.
3-Ollama data and model weights live on hostPath volumes, so they persist across pod restarts and don't need to be re-downloaded.
Namla has an extensive monitoring tool for Edge hardware, encompassing traditional metrics such as CPU, RAM, disk, and networking, along with GPU-specific metrics. For instance, on Nvidia Jetson devices, we have developed a specialized exporter that allows our users to track the performance of their AI models on the GPU.
Let's start by checking that Ollama is using almost 5GB of GPU RAM to hold the Llama 3.1 8B model, as you can see in the following figure:
P.S.: I noticed that while GPU RAM utilization remains stable (since all of the model's layers are loaded onto the GPU), GPU core utilization jumps to 100% while generating a response and then drops back to idle.
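From outside the office, any OpenAI-compatible client can drive it. A quick curl against the exposed FQDN might look like this (a sketch, assuming the tunnel forwards privategpt.namla.ai to ollama-service:11434 and that the Ollama build in the image exposes the /v1/chat/completions compatibility endpoint):

curl https://privategpt.namla.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Give me three ideas for a private edge AI demo."}]
  }'

This is the same API surface the Android app talks to, so any client that works with the OpenAI API, with its base URL pointed at the tunnel, works here too.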
Want to see Namla in action? Follow these easy steps to get started:
Ready to dive in? Fill out the form and get started today!