Running Alpaca Inference on a Laptop
Researchers at Stanford used pre-trained LLaMA models (at various sizes) as a base for fine-tuned models, using a “self-instruct” process to re-train the model on instruction-following data generated with OpenAI’s text-davinci-003.
The Stanford model is named Alpaca.
This results in a ChatGPT-like chatbot, trained for under $600 US on far lower-powered hardware than OpenAI’s GPT models, & able to run inference on consumer hardware.
This raises the possibility that useful LLMs can be run locally, with no external costs (beyond hardware & power) and without data-mining by large companies.
(Another notable project aiming to bring LLMs to the masses is [[OpenAssistant]], something I intend to look into at another time.)
The 7-billion-parameter model that I’m testing with is not fantastic, but it’s potentially usable & worth experimenting with.
People have already engineered packages to make this simple, so I’m following a guide to get it running on my laptop:
Installing the model/web app:
- make sure you have an up to date Node.js
- install the 7B param version of Alpaca (smallest) with:
npx dalai alpaca install 7B
- Takes quite a while - I had a few occasions where the install looked stuck at 100%; after a ^C it showed output as if it was still doing things. By the 3rd attempt, it had got far enough that the output was showing some C compilations & a download of the .bin file for the weights
- serve the web app (see the command sketch after the note below)
Note: there’s a dockerised version here: [GitHub - cocktailpeanut/dalai: The simplest way to run LLaMA on your local machine](https://github.com/cocktailpeanut/dalai)
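To tie the install steps together, here’s a rough sketch of the commands I ran end-to-end. The weights path under ~/dalai is an assumption from my own install, so check where dalai actually put the .bin on your machine:

```sh
# Sketch of the install/serve flow; the weights path below is an assumption -
# look under ~/dalai if your layout differs.
node --version                        # check Node.js is reasonably up to date

npx dalai alpaca install 7B           # install the 7B Alpaca model (can take a while)

# sanity check that the quantised weights actually downloaded
ls -lh ~/dalai/alpaca/models/7B/

npx dalai serve                       # serve the web UI, then open the printed local URL
```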
Dalai: Notes so far
The Dalai UI is rather clunky, making experimentation more difficult.
It uses prompt templates to feed the model, which get in the way of the actual text you’re inputting and add a lot of noise around the responses.
I’ve not managed to work out how much attention the model pays across prompts (i.e. how much earlier context carries over into later responses).
Alternative UIs
Alpaca-Turbo
Provides a much smoother UX & can be fired up via docker-compose.
Need to add a copy of the pre-trained model and store it in the working directory (the root of the cloned repo if you pull in the git repo)
It also presents a web UI to interact with the model.
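A rough sketch of how I’d bring it up with docker-compose; the repo URL and model filename/paths are assumptions (I reused the weights from the dalai install above), so adjust them to match your layout:

```sh
# Sketch only - the repo URL and file paths are assumptions, not verified commands.
git clone https://github.com/ViperX7/Alpaca-Turbo.git
cd Alpaca-Turbo

# drop a copy of the pre-trained model into the working directory (the repo root)
cp ~/dalai/alpaca/models/7B/ggml-model-q4_0.bin ./

docker-compose up -d                  # fire up the web UI in the background
```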
Running in Docker, it subjectively feels slower than using the Dalai web app, with inference taking between 40 & 70 seconds vs ~20-30 in Dalai (albeit the prompts are not the same). For reference, the hardware is a 2.3 GHz 8-core Intel Core i9 (mobile) with 16GB RAM - I only quote the CPU as the GPU appears not to be used in the current config.
This UI is closer to a ChatGPT-type UI, but with many options for prepending templates to the prompts to shape the responses, without those getting in the way of your chat.
It’s not perfect, but is easier to use than the Dalai front end.
Both Dalai and Alpaca-Turbo give access to parameters that vary the inference, so there are options to trade response quality against speed.
Edit: on my laptop, 14 threads (the default is 4) seems to be a sweet spot. Higher than that & I ended up with it never coming back, while still using the CPU.
On Alpaca-Turbo, 14 threads gave roughly half the response time (in the 20-second range) vs. the default 4-thread config.
With 14 threads, Dalai still seems to be slightly quicker (but the prompt templates are different).
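For what it’s worth, the same thread experiment can be run against the compiled chat binary directly, which takes the web UIs out of the timing. The binary and model paths here are assumptions based on what dalai builds (a llama.cpp-style CLI), so treat this as a sketch rather than exact commands:

```sh
# Sketch only - binary and model paths are assumptions; look under ~/dalai
# for whatever dalai actually compiled and downloaded on your machine.
MODEL=~/dalai/alpaca/models/7B/ggml-model-q4_0.bin
BIN=~/dalai/alpaca/main

# compare the default-ish 4 threads against the 14-thread sweet spot
time "$BIN" -m "$MODEL" -t 4  -n 64 -p "Explain what an alpaca is."
time "$BIN" -m "$MODEL" -t 14 -n 64 -p "Explain what an alpaca is."
```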