I've been playing around with locally hosted #LLMs using the #Ollama#CLI tool. I've mostly been using models like mistral and dolphin-coder for assistance with textual ideas and issues. More recently I've been using the llava visual model via some simple #Bash#scripting, looping through images and creating description files. I can then grep those files for key words and note the associated filenames. Powerful stuff!
It could be used to fuel an offline assistant that would be able to easily add an appointment to your calendar, open an app, etc. without #privacy issues.
This has come to reality with this proof of concept using the Phi-2 2.8B transformer model running on /e/OS.
It is slow, so not very usable until we have dedicated chips on SoCs, but works (and #opensource !)
@gael I'd say before that happens, running an LLM on your local network is your best bet. Projects like https://ollama.com/ make that incredibly easy. #llm#ollama
Has anyone here worked much with generators in #emacs ?
I am looking for a good solution for streaming outputs in my ollama-elisp-sdk project. I think there's a good angle using generators to make a workflow fairly similar to e.g. the OpenAI API. Not sure yet though.
This past month, I was talking about how I spent $528 to buy a machine with enough guts to run more demanding AI models in Ollama. That is good and all but if you are not on that machine (or at least on the same network), it has limited utility. So, how do you use it if you are at a library or a friend’s house? I just discovered Tailscale. You install the Tailscale app on the server and all of your client devices and it creates an encrypted VPN connection between them. Each device on your “tailnet” has 4 addresses you can use to reference it:
Machine name: my-machine
FQDN: my-machine.tailnet.ts.net
IPv4: 100.X.Y.Z
IPv6: fd7a:115c:a1e0::53
If you remember Hamachi from back in the day, it is kind of the spiritual successor to that.
There is no need to poke holes in your firewall or expose your Ollama install to the public internet. There is even a client for iOS, so you can run it on your iPad. I am looking forward to playing around with it some more.
#Ollama is the easiest way to run local #AI I've tried so far. In 5 minutes you can have a chatbot running on a local model. Dozens of models and UIs to choose from.
Just the speed is not great, but what can I expect on an Intel-only laptop.
Completely forgot I had made this #fountainpen database a while ago when I was bored: https://codeberg.org/bmp/flock, it is written in Go, and was generated with #ollama if I remember correctly. Maybe I'll pick it up again, given that newer models seem to be better.
Back in December, I paid $1,425 to replace my MacBook Pro to make my LLM research at all possible. That had an M1Pro CPU and 32GB of RAM, which (as I said previously) is kind of a bare minimum spec to run a useful local AI. I quickly wished I had enough RAM to run a 70B model, but you can’t upgrade Apple products after the fact and a 70B model needs 64GB of RAM. That led me to start looking for a second-hand Linux desktop that can handle a 70B model.
The Xeon W-2125 has 8 threads and 4 cores, so I think that CPU1-CPU8 are threads. My theory going into this was that the models would go into memory and then the GPU would do all of the processing. The CPU would only be needed to serve the results back to the user. It looks like the full load is going to the CPU. For a moment, I thought that the 8 GB of video RAM was the limitation. That is why I tried running a 7b model for one of the tests. I am still not convinced that Ollama is even trying to use the GPU.
I am using a proprietary Nvidia driver for the GPU but maybe I’m missing something?
I was recently playing around with Stability AI’s Stability Cascade. I might need to run those tests on this machine to see what the result is. It may be an Ollama-specific issue.
Have any questions, comments, or concerns? Please feel free to drop a comment, below. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
I am unsure how to fix "Unable to change power state from D3hot to DO, device inaccessible". I would have expected that installing desktop #ubuntu would have gotten easier over the years.
Well, f***. I thought that running a 70B AI model on a machine with 128 gigabytes of RAM would tax the RAM, not the CPU. Apparently that Xeon processor is the bottleneck.
I should check to make sure that the GPU is in use.
A long long time ago, @arfy made a lua script for Dolphin screen-readers that allowed you to type in plus or minus number of days and get the date. I just asked Dolphin Mixtral to do the same as an apple script using #Ollama running locally and it actually did it. It runs and works just as I wanted. Madness.
set numDays to text returned of (display dialog "Enter the number of days:" default answer "")
set targetDate to current date
set newDate to targetDate + numDays * days
display dialog "The future date will be: " & (newDate as string)
A rule-based inference engine is designed to apply predefined rules to a given set of facts or inputs to derive conclusions or make decisions. It operates by using logical rules, which are typically expressed in an “if-then” format. You can think of it as basically a very complex version of the spell check in your text editor.
What is an AI Model?
AI models employ learning algorithms that draw conclusions or predictions from past data. An AI model’s data can come from various sources such as labeled data for supervised learning, unlabeled data for unsupervised learning, or data generated through interaction with an environment for reinforcement learning. The algorithm is the step-by-step procedure or set of rules that the model follows to analyze data and make predictions. Different algorithms have different strengths and weaknesses, and some are better suited for certain types of problems than others. A model has parameters that are the aspects of the model that are learned from the training data. A model’s complexity can be measured by the number of parameters contained in it but complexity can also depend on the architecture of the model (how the parameters interact with each other) and the types of parameters used.
What is an AI client?
An AI client is how the user interfaces with the rule-based inference engine. Since you can use the engine directly, the engine itself could also be the client. For the most part, you are going to want something web-based or a graphical desktop client, though. Good examples of graphical desktop clients would be MindMac or Ollamac. A good example of a web-based client would be Ollama Web UI. A good example of an application that is both a client and a rule-based inference engine is LM Studio. Most engines have APIs and language-specific libraries, so if you want to you can even write your own client.
What is the best client to use with a Rule-Based Inference Engine?
I like MindMac. I would recommend either that or Ollama Web UI. You can even host both Ollama and Ollama Web UI together using docker compose.
What is the best Rule-Based Inference Engine?
I have tried Ollama, Llama.cpp, and LM Studio. If you are using Windows, I would recommend LM Studio. If you are using Linux or a Mac, I would recommend Ollama.
How much RAM does your computer need to run a Rule-Based Inference Engine?
The RAM requirement is dependent upon what model you are using. If you browse the Ollama library, Hugging Face, or LM Studio‘s listing of models, most listings will list a RAM requirement (example) based on the number of parameters in the model. Most 7b models can run on a minimum of 8GB of RAM while most 70b models will require 64GB of RAM. My Macbook Pro has 32GB of unified memory and struggles to run Wizard-Vicuna-Uncensored 30b. My new AI lab currently has 128GB of DDR4 RAM and I hope that it can run 70b models reliably.
Does your computer need a dedicated GPU to run a Rule-Based Inference Engine?
No, you don’t. You can use just the CPU but if you have an Nvidia GPU, it helps a lot.
I use Digital Ocean or Linode for hosting my website. Can I host my AI there, also?
Yeah, you can. The RAM requirement would make it a bit expensive, though. A virtual machine with 8GB of RAM is almost $50/mo.
Why wouldn’t you use ChatGPT, Copilot, or Bard?
When you use any of them, your interactions are used to reinforce the training of the model. That is an issue for more than the most basic prompts. In addition to that, they cost up to $30/month/user.
Why should you use an open-source LLM?
What opinion does your employer have of this research project?
You would need to direct that question to them. All of these posts should be considered personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
Why are you interested in this technology?
It is a new technology that I didn’t consider wasteful bullshit in the first hour of researching it.
Are you afraid that AI will take your job?
No.
What about image generation?
I used (and liked) Noiselith until it shut down. DiffusionBee works but I think that Diffusers might be the better solution. Diffusers lets you use multiple models and it is easier to use than Stable Diffusion Web UI.
You advocate for not using ChatGPT. Do you use it?
I do. ChatGPT 4 is a 1.74t model. It can do cool things. I have an API key and I use it via MindMac. Using it that way means that I pay based on how much I use it instead of using it via a Pro account, though.
Are you going to only write about AI on here, now?
Nope. I still have other interests. Expect more Vue.js posts and likely something to do with Unity or Unreal at some point.
Is this going to be the last AI FAQ post?
Nope. I still haven’t covered training or fine-tuning.
(1/3) Last Friday, I was planning to watch Masters of the Air ✈️, but my ADHD had different plans 🙃, and I ended up running a short POC and creating a tutorial for getting started with Ollama Python 🚀. The settings are available for both Docker 🐳 and locally.
TLDR: It is straightforward to run LLM models locally with the Ollama Python library. Models with up to ~7B parameters run smoothly with low compute resources.
(2/3) The tutorial focuses on the following topics:
✅ Setting up Ollama server 🦙
✅ Setting up Python environment 🐍
✅ Pulling and running LLM (examples of Mistral, Llama2, and Vicuna)
(3/3) The tutorial will get you to run Ollama inside a dockerized container. Yet, there are some missing pieces, such as mounting LLM models from the local environment to avoid downloading the models during the build time. I plan to explore this topic sometime in the coming weeks.
💰 Rune’s $100k for indie game devs
🤲 The Zed editor is now open source
🦙 Ollama’s new JS & Python libs
🤝 @tekknolagi's Scrapscript story
🗒️ Pooya Parsa's notes from a tired maintainer
🎙 hosted by @jerod
This morning, I posted a "The Basics" blog post about what I learned running #Ollama on the new (second-hand) Macbook Pro that I bought last month. I recently bought a new (second-hand) "God-mode" workstation to take this stuff a little further.
The biggest limitation of something like ChatGPT, Copilot, or Bard is that your data leaves your control when you use the AI. I believe that the future of AI is AI that remains in your control. The only issue with running your own, local AI is that a large learning model (LLM) needs a lot of resources to run. You can’t do it on your old laptop. It can be done, though. Last month, I bought a new Macbook Pro with an M1Pro CPU and 32GB of unified RAM to test this stuff out.
If you are in a similar situation, Mozilla’s Llamafile project is a good first step. A llamafile can run on multiple CPU microarchitectures. It uses Cosmopolitan Libc to provide a single 4GB executable that can run on macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD. It contains a web client, the model file, and the rule-based inference engine. You can just download the binary, execute it, and interact with it through your web browser. This has very limited utility, though.
So, how do you get from a proof of concept to something closer to ChatGPT or Bard? You are going to need a model, a rule-based inference engine or reasoning engine, and a client.
The Rule-Based Inference Engine
A rule-based inference engine is a piece of software that derives answers or conclusions based on a set of predefined rules and facts. You load models into it and it handles the interface between the model and the client. The two major players in the space are Llama.cpp and Ollama. Getting Ollama is as easy as downloading the software and running ollama run [model] from the terminal.
You will notice that the result isn’t easy to parse. Last week, Ollama announced Python and JavaScript libraries to make it much easier.
The Models
A model consists of numerous parameters that adjust during the learning process to improve its predictions. They employ learning algorithms that draw conclusions or predictions from past data. I’m going to be honest with you. This is the bit that I understand the least. The key attributes to be aware of with models are what it is trained on, how many parameters big the model is, and the model’s benchmark numbers.
If you browse Hugging Face or the Ollama model library, you will see that there are plenty of 7b, 13b, and 70b models. That number tells you how many parameters are in the model. Generally, a 70b model is going to be more competent than a 7b model. A 7b model has 7 billion parameters whereas a 70b model has 70 billion parameters. To give you a point of comparison, ChatGPT 4 reportedly has 1.76 trillion parameters.
The number of parameters isn’t the end-all-be-all, though. There are leaderboards and benchmarks (like HellaSwag, ARC, and TruthFulQA) for determining comparative model quality.
If you are running Ollama, downloading and running a new model is as easy as browsing the model library, finding the right one for your purposes, and running ollama run [model] from the terminal. You can manage the installed models from the Ollama Web UI also, though.
The client is what the user of the AI uses to interact with the rule-based inference engine. If you are using Ollama, the Ollama Web UI is a great option. It gives you a web interface that acts and behaves a lot like the ChatGPT web interface. There are also desktop clients like Ollamac and MacGPT but my favorite so far is MindMac. It not only gives you a nice way to switch from model to model but it also gives you the ability to switch between providers (Ollama, OpenAI, Azure, etc).
I have a few big questions, right now. How well does Ollama scale from 1 user to 100 users? How do you finetune a model? How do you secure Ollama? Most interesting to me, how do you implement something like Stable Diffusion XL with this stack? I ordered a second-hand Xeon workstation off of eBay to try to answer some of these questions. In the workplace setting, I’m also curious what safeguards are needed to insulate the company from liability. These are all things that need addressing over time.
I created a new LLM / ML category here and I suspect that this won’t be my last post on the topic. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
Have a question or comment? Please drop a comment, below.