Practical possibilities for LLM experiments

Ed_S · 26 March 2023 15:09

Large Language Models such as GPT are all very large, so what possibilities do we have for experimenting with them - not merely as passive users of some commercial service (like ChatGPT, or Bing), but as active experimenters?

Can we run on a laptop? What specs? What responsiveness?

Can we run on a Raspberry Pi? What kind of responsiveness?

What models can we choose between, what are they called, how big are they, where do we find them?

Is there a cloud service with cheap or free offerings for this kind of experiment?

How long does it take to get started, and where would we find a friendly guide?

Are there online communities for discussing these kinds of things?

Ed_S · 26 March 2023 15:15

Here are some links and snippets which might help answer some of the questions…

From a github issue on Facebook’s large language model:

Just a report. I’ve successfully run the LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It’s super slow at about 10 sec/token. But it looks like we can run powerful cognitive pipelines on a cheap hardware. It’s awesome. Thank you!
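To put that 10 sec/token figure in context, here is a rough back-of-envelope sketch. The reply lengths are hypothetical values picked for illustration, and the estimate ignores the time spent processing the prompt:

```python
# Back-of-envelope: how long a reply takes at a given generation speed.
SECONDS_PER_TOKEN = 10.0  # figure quoted in the Raspberry Pi report


def reply_time_seconds(n_tokens: int, seconds_per_token: float = SECONDS_PER_TOKEN) -> float:
    """Wall-clock time to generate n_tokens, ignoring prompt processing."""
    return n_tokens * seconds_per_token


if __name__ == "__main__":
    for n_tokens in (20, 100, 250):  # hypothetical reply lengths
        minutes = reply_time_seconds(n_tokens) / 60
        print(f"{n_tokens:>4} tokens: about {minutes:.1f} minutes")
```

At that rate a 100-token reply works out to roughly 17 minutes, so the Pi looks more like a proof of concept than an interactive setup.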

It’s also pretty interesting to see that by harvesting the outputs of a 175B model, you can get a well-optimized 7B model to approach the much larger one in performance in some areas:

Stanford Researchers have taken some off-the-shelf powerful neural net weights (LLaMa), used the outputs from a model hosted on a commercial service (text-davinci-003 by OpenAI) to generate a bunch of instruction-following demonstrations, and smooshed these two together into one model. The result is Alpaca, a language model that gets performance that superficially seems close to GPT3 but costs a fraction as much ($600-ish; $500 for data acquisition from OpenAI and $100 for fine-tuning the model).

Running Alpaca Inference on a Laptop

Researchers at Stanford University used pre-trained LLaMA models (at different sizes) as a base for fine-tuned models, using a “Self-instruct” process to re-train the model on instruction-following examples generated by OpenAI’s text-davinci-003.

The Stanford model is named Alpaca.

This results in a ChatGPT-like chatbot, trained for <$600 US on far lower-powered hardware than OpenAI’s GPT models, & able to run inference on consumer hardware.

This raises the possibility that useful LLMs can be run locally, with no external costs (beyond hardware & power) and without data-mining by large companies.

(Another notable project aiming to bring LLMs to the masses is OpenAssistant, something I intend to look into at another time.)

The 7-billion-parameter model that I’m testing with is not fantastic, but is potentially usable & worth testing with.

People have already engineered packages to make this simple, so I’m following a guide to get it running on my laptop:

Installing the model/web app:

Make sure you have an up-to-date Node.js
Install the 7B param version of Alpaca (smallest) with:
- npx dalai alpaca install 7B
- Takes quite a while - I had a few occasions where the install looked stuck at 100%, but pressing ^C showed output as if it was still doing things. By the 3rd attempt, it had got far enough that the output was showing some C compilations & a download of the .bin file for the weights
Serve the web app:
- npx dalai serve
- this will start a web app running entirely on your machine at: http://localhost:3000/
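
Since the install can look stuck, it may help to check from a script whether the web app is actually answering. The following is a minimal sketch, not part of Dalai: the function name is hypothetical, and the only thing taken from the steps above is the http://localhost:3000/ address.

```python
import urllib.error
import urllib.request


def is_serving(url: str = "http://localhost:3000/", timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP requests at url."""
    # Bypass any system proxy settings so the request goes straight to this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status; it is still up.
        return True
    except OSError:
        # Connection refused, timeout, etc. (URLError is an OSError subclass).
        return False


if __name__ == "__main__":
    print("web app reachable" if is_serving() else "web app not reachable yet")
```

This only shows that a server is listening on that port; it says nothing about whether the model weights finished downloading.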

Note: there’s a dockerised version here: GitHub - cocktailpeanut/dalai: The simplest way to run LLaMA on your local machine

Dalai: Notes so far

The Dalai UI is rather clunky, making experimentation more difficult.

It uses prompt templates to feed the model, which get in the way of the actual text you’re inputting and add a lot of noise around the responses.
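
To illustrate what a prompt template does, here is a small sketch. The template text follows the instruction format published with Stanford Alpaca; whether Dalai uses exactly this wording is an assumption, and both function names are hypothetical helpers rather than anything from Dalai itself.

```python
# Alpaca-style instruction template (assumed; Dalai's own template may differ).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(user_text: str) -> str:
    """Wrap the user's text in the template before it is sent to the model."""
    return ALPACA_TEMPLATE.format(instruction=user_text.strip())


def extract_response(raw_output: str) -> str:
    """Drop the echoed template, keeping only what follows the response marker."""
    marker = "### Response:"
    _, found, tail = raw_output.partition(marker)
    return tail.strip() if found else raw_output.strip()


if __name__ == "__main__":
    prompt = build_prompt("Name three fruits.")
    print(prompt)
    # Pretend the model echoed the prompt and then answered.
    print(extract_response(prompt + "Apple, pear, plum."))
```

A UI that echoes the whole template back shows everything `build_prompt` added, which is the noise described above; stripping it as in `extract_response` is one way a front end could hide it.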

I’ve not managed to work out what the attention across prompts is like.

Alternative UIs

Alpaca-Turbo

Provides a much smoother UX & can be fired up via docker-compose.
You need to add a copy of the pre-trained model and store it in the working directory (the root of the repo, if you pull in the git repo).

It also presents a web UI to interact with the model.

Running in Docker, it subjectively feels slower than the Dalai web app: inference takes between 40 & 70 seconds vs ~20-30 seconds in Dalai (albeit the prompts are not the same). This is on a 2.3 GHz 8-core Intel Core i9 [mobile] with 16GB RAM; I only quote the CPU as the GPU appears not to be used in the current config.

This gives something closer to a ChatGPT-type UI, with many options for prepending templates to the prompts to shape the responses, without those getting in the way of your chat.

It’s not perfect, but is easier to use than the Dalai front end.

Both Dalai and Alpaca-Turbo give access to params to vary the inference, so there will be options to change the quality of & speed up the responses.
Edit: on my laptop, 14 threads (default 4) seems to be a sweet spot. Higher than that & the request never came back, though it was still using the CPU.

On Alpaca-Turbo, 14 threads roughly halved the response time (to somewhere in the 20s of seconds) vs. the default 4-thread config.

With 14 threads, Dalai still seems to be slightly quicker (but the prompt templates are different).
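
Because the comparisons above use different prompts, a fairer test would time the same prompt at each thread count. Here is a minimal sketch of such a harness. `run_inference` is a hypothetical callable standing in for however you invoke the model (neither Dalai’s nor Alpaca-Turbo’s actual API), and `fake_inference` only sleeps so that the example runs on its own.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def time_thread_counts(
    run_inference: Callable[[str, int], str],
    prompt: str,
    thread_counts: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the mean wall-clock seconds per call for each thread count."""
    results: Dict[int, float] = {}
    for n_threads in thread_counts:
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_inference(prompt, n_threads)  # same prompt every time
            durations.append(time.perf_counter() - start)
        results[n_threads] = mean(durations)
    return results


def fake_inference(prompt: str, n_threads: int) -> str:
    """Stand-in for a real model call; just sleeps for less time with more threads."""
    time.sleep(0.02 / n_threads)
    return "..."


if __name__ == "__main__":
    timings = time_thread_counts(fake_inference, "Name three fruits.", [4, 8, 14])
    for n_threads, seconds in timings.items():
        print(f"{n_threads:>2} threads: {seconds:.3f} s per response")
```

Swapping `fake_inference` for a function that sends the prompt to each UI’s backend would give numbers that can be compared directly, as long as the prompt template is held constant too.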