How to Run LLMs from a USB with Llamafile


Piping hot off the virtual presses is llamafile, a new framework that lets you run Large Language Models (LLMs) on your local machine from a single file. The aim of the project is to make LLMs more accessible and easier to distribute, but to me, the clearest, most immediate value is being able to run an LLM off of a USB drive. This makes them completely portable for use on the go! At the same time, Large Language Models are, well, large. By putting them on a USB, you don't have to give up a ton of room on your laptop just to have access to an offline chatbot.

This post is a tutorial for a specific use-case, designed to be readable by someone who isn't that technical. If you checked out the repo and thought, "you know, I could still use a little bit of help getting this going," or maybe you're new to all this and you've never even been to Hugging Face before, that's okay! This post is for you. So first, let's talk about our use-case.

You want to:

  1. Run multiple LLMs that together take up more storage than your computer can spare
  2. Launch your LLMs with a single click
  3. Run them on a Linux machine (this will work on other operating systems, but the tutorial is tuned for Linux. If you want a step-by-step for Windows or Mac, let me know and I will write one!)

What will not be covered is GPU acceleration, which for now is only supported on NVIDIA GPUs (I run AMD, so that's that). You don't need GPU acceleration to run most models; they'll just be a little bit slower and eat up a lot of CPU. It still works, it's just nice to have.

So with all of that said, let’s get started!

First of all, you're going to want Python installed on your machine. Strictly speaking, you only need it if you don't have enough free space on your machine to hold any one LLM you want to download, but, like I said, these things are large, and I certainly didn't have enough room.
Also, if you’re on Linux, go ahead and update your apt repositories (it never hurts):

sudo apt-get update
sudo apt-get upgrade
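If you're not sure whether Python is set up, you can check for it and grab everything the download script will need later on. These commands assume a Debian/Ubuntu-style system, and the huggingface_hub library is the only extra package this tutorial uses:

python3 --version                # check whether Python is already installed
sudo apt-get install -y python3 python3-pip
pip3 install huggingface_hub     # library used by the download script later on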

Okay! Your environment is all set up, let’s really get this going!

Plug in your USB

It sounds obvious, but this is an important step. After you plug in your USB and ensure it's mounted properly, open up your terminal and navigate into the drive's directory. If you don't have enough room on your machine for the models, it's crucial that you perform every remaining step from inside the actual USB. If you don't, you'll run into errors trying to download the models.
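As a rough sketch of what that looks like (the mount point below is just an example; on most Linux desktops, drives show up under /media/<your-username>/<drive-label>):

lsblk                    # find where the USB drive is mounted
cd /media/$USER/llms     # replace "llms" with your drive's actual label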

Download llamafile

Now that you're in your USB directory, go ahead and pull llamafile off of GitHub. You can download it from the releases page and drop it in there, but I recommend curling it:

curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2.1/llamafile-server-0.2.1 >llamafile


This will download the vanilla llamafile code without any weights. Doing it this way will make it easier to use it to run any model you want in the future. Besides, I figure if you felt comfortable downloading the binaries and running them directly from the repo, you probably would have just done that. If it worked, you should now see a file named llamafile sitting inside your USB directory.
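If you want to double-check that the download landed, list the directory contents from inside the USB:

ls -lh    # you should see a file named llamafile in the listing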


Now we just need to make this bad-boy executable. This will be important for later:

chmod +x llamafile


Great! We're almost there. Now you just need a model to run it with! If you have a machine with a lot of room, you can curl it directly off of Hugging Face like you did with llamafile, but if you don't, you can't. In my experience, even when I was inside the USB directory, the download got staged in a local cache before landing on the USB drive, and without enough room on the local machine it simply failed. This is why I recommended installing Python to manage your downloads.
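If you're not sure which situation you're in, df will show you how much free space you have on your machine versus the drive (the second path here assumes your drive is mounted under /media/<your-username>):

df -h /               # free space on your machine's main drive
df -h /media/$USER    # free space on your mounted USB drive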

But before we get into that, let’s pick a model.

Pick a model to download

There are a lot of good options out there for a lot of different use-cases. I won't get into choosing a model, but the popular ones are basically Llama, Mistral, Llava, and a bunch of variants on those three. What's important is that you pick one in GGUF format. There are plenty of them out there, and one user in particular, known only as TheBloke, has published a bunch. Find your favorite model repo whose name ends in -GGUF and you're in business.

As a side note, if you want to download a model with multimodal capabilities (media upload) like Llava, you will also need to download the mmproj gguf file that goes with it. There is also an extra step required to run it, but it's not a big deal, so don't fear it. I will go through the example of Llava with multimodal capabilities; if your model isn't multimodal, you can simply skip the extra step, which I will highlight.

The first step is to identify the repo the model lives in on Hugging Face. For this example, let's use llava-v1.5-7B-GGUF, uploaded by Justine Tunney, one of the creators of llamafile itself!

At the top-left of every repo, you will find the model name. You can highlight and copy-paste it, or simply click the copy symbol to copy it to your clipboard. You'll need this to download the model from Hugging Face. Next, identify the specific file you want; for help picking which one to download, you can get more information from the Model Card tab.

In this case, we will download the Q4 model and the corresponding mmproj file for Q4. If you wanted to curl it, you could do it like this:

curl -L -o llava-v1.5-7b-Q4_K.gguf https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-Q4_K.gguf


Then to curl the mmproj you would just do:

curl -L -o llava-v1.5-7b-mmproj-Q4_0.gguf https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-mmproj-Q4_0.gguf

I DO NOT recommend curling the main model because it is very big, and if you don't have a lot of free space, it will not work. I DO recommend curling the mmproj file because it is quite small, and downloading it with Python isn't worth the effort. If you are not downloading a multimodal model, you can skip the mmproj step, no problem. So let's say, for the sake of argument, that you don't want to curl the main model because you want it downloaded directly onto your flash drive.

Use a simple Python script to download the model

Create a Python file, let's call it download.py, and save it to your flash drive (this uses the huggingface_hub library we installed earlier). For this example, this would be the contents of your Python file:

from huggingface_hub import hf_hub_download

# Define the repository and filename
repo_id = "jartine/llava-v1.5-7B-GGUF"
filename = "llava-v1.5-7b-Q4_K.gguf"

# Define the destination path on your flash drive
local_filepath = "/media/zymazza/llms/models/llava"

# Download the file and print where it ended up
downloaded_path = hf_hub_download(repo_id=repo_id, filename=filename, cache_dir=local_filepath)
print(downloaded_path)

The repo_id can be any repo you want (as long as it hosts GGUF models), and the filename is the specific model file you want from that repo (in this case, the Q4 model). So the repo_id is what you copy/pasted from the top-left of the Hugging Face page, and the filename comes from the file list in the earlier screenshot. This script downloads the file directly to the destination. If you're not sure what the destination path is, go to your terminal where you've already navigated to your flash drive, type pwd, and copy/paste the result into the local_filepath variable. In my case, I made a folder inside my flash drive called models to store all my ggufs in, and the script will park the download in there.
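For example, to create that models folder on the drive and print the full path to paste into local_filepath (the exact path will differ on your system):

mkdir -p models/llava     # folder on the flash drive to hold the downloads
realpath models/llava     # prints something like /media/<you>/<drive>/models/llava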

Once it's configured properly and saved to your flash drive, run it by typing python download.py (or python3 download.py, depending on your installed version); there's no need to chmod it, since you're invoking it through Python. The download can take a good 10 minutes or more depending on the size of the model you've chosen. When it finishes, the script prints the full path to the downloaded file, so keep that handy for the next step.

Test it out

With llamafile and your models installed, you should now be able to run your models locally on your machine! The first thing you need to do is find the downloaded gguf file itself. No, it's not the top-level directory that appeared in your flash drive! It's usually buried under a snapshots directory with a random-looking name (and it's also the path the download script printed when it finished, if you used it). The layout looks something like this:

.
└── models--jartine--llava-v1.5-7B-GGUF
    ├── blobs
    │   └── c91ebf0a628ceb25e374df23ad966cc1bf1514b33fecf4f0073f9619dec5b3f9
    ├── refs
    │   └── main
    └── snapshots
        └── a97b4a03329e8ff35bdd9ae0d44bae42d9703975
            ├── llava-v1.5-7b-mmproj-Q4_0.gguf
            └── llava-v1.5-7b-Q4_K.gguf -> ../../blobs/c91ebf0a628ceb25e374df23a
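Rather than digging through that tree by hand, you can also just search your models folder for gguf files and copy the paths it prints (assuming you put everything under models as above):

find models -name "*.gguf"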

So make sure you get the file path to the actual gguf. You can navigate to it in your terminal, type pwd to copy the path, and cd .. your way back to the top-level directory of your flash drive, or you can copy the file path from your file explorer, depending on your OS. Once you have it, you can run it like this:

./llamafile -m models/llava/models--jartine--llava-v1.5-7B-GGUF/snapshots/a97b4a03329e8ff35bdd9ae0d44bae42d9703975/llava-v1.5-7b-Q4_K.gguf --mmproj models/llava/models--jartine--llava-v1.5-7B-GGUF/snapshots/a97b4a03329e8ff35bdd9ae0d44bae42d9703975/llava-v1.5-7b-mmproj-Q4_0.gguf


./llamafile runs the llamafile program you got from GitHub, and the -m flag points to the model. Replace the path I have here with the actual path to your model (the gguf file). The --mmproj flag is only necessary for multimodal models like Llava and can be omitted if you're running something like Mistral; when you do use it, it also needs your actual file path. Ask ChatGPT for help if you get stuck formatting this command :p
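For a text-only model, the command is the same minus the --mmproj flag. The path below is purely hypothetical, just to show the shape of it; substitute the real path to whatever gguf you downloaded:

./llamafile -m models/mistral/models--TheBloke--Mistral-7B-Instruct-v0.1-GGUF/snapshots/<snapshot-folder>/mistral-7b-instruct-v0.1.Q4_K_M.gguf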

If it worked, a new tab should pop open in your browser at http://localhost:8080/. It may take a few minutes, though!
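If nothing opens on its own, you can check that the server is up and open the page yourself (this assumes the default port of 8080):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/    # should print 200 once the model has loaded
xdg-open http://localhost:8080/                                    # open the chat UI manually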

Rinse and repeat for as many models as you want!

This step should be self-explanatory. Go on Hugging Face and find more models, modify your download script with the relevant information, and go fill up your flash drive!

Launch models from your flash drive with one click

To launch your models with a single click from your flash drive, create a .sh file for each of your models. In this example, we'll stick with Llava, so we can call it llava.sh. Take the command you used to start up the model before and paste it into your .sh file under the bash shebang line, along with a cd line so the relative paths resolve no matter where the script is launched from. In our example, it will look like this:

#!/bin/bash
cd "$(dirname "$0")"  # run from the flash drive so the relative paths below resolve
./llamafile -m models/llava/models--jartine--llava-v1.5-7B-GGUF/snapshots/a97b4a03329e8ff35bdd9ae0d44bae42d9703975/llava-v1.5-7b-Q4_K.gguf --mmproj models/llava/models--jartine--llava-v1.5-7B-GGUF/snapshots/a97b4a03329e8ff35bdd9ae0d44bae42d9703975/llava-v1.5-7b-mmproj-Q4_0.gguf

Save it to your flash drive, then navigate there in your terminal and make it executable with chmod +x llava.sh. Now you can launch your models by clicking that .sh file! Fill up your flash drive and run any model you want with one click. When you're done, unplug the drive, and your computer still has all the storage it had before you started. It's like the models were never even there!

In the end, your flash drive will hold llamafile, your download script, a models folder, and a .sh launcher for each model. Each .sh launches a completely different model that I can run locally with a single click and no internet connection, right off of my USB!
