Yesterday, we played with Llama 3 using the Ollama CLI client (or REPL). Today, I figured that we would play with it using the Ollama API. The Ollama API is documented on the project's GitHub repo. Ollama has a client that runs when you run ollama run llama3 and a service that can be accessed from something like MindMac, Amallo, or Enchanted. The service is what starts when you run ollama serve.
In our first Llama 3 post, we asked the model for “a comma-delimited list of cities in Wisconsin with a population over 100,000 people”. Using Postman and the completion API endpoint, you can ask the same thing.
You will notice the stream parameter is set to false in the body. When it is false, the response comes back as a single response object rather than a stream of objects. If you are using the API from a web application, you will want to ask the model for the answer as JSON, and you will probably want to provide an example of how you want the answer formatted.
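As a rough sketch, the same Postman request looks like this in Python. This assumes Ollama's default localhost endpoint on port 11434; the model name and prompt are the ones from the post.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(prompt, model="llama3", stream=False, as_json=False):
    """Build the request body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if as_json:
        # Ask the model to reply with valid JSON; pairing this with an
        # example of the desired shape in the prompt improves results.
        payload["format"] = "json"
    return payload

def ask(prompt, **kwargs):
    """Send one completion request and return the model's answer."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # With stream=False the whole answer arrives as one JSON object.
        return json.loads(resp.read())["response"]

# Example (needs a running server):
# ask("Give me a comma-delimited list of cities in Wisconsin "
#     "with a population over 100,000 people.")
```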
Last week, Meta announced Llama 3. Thanks to Ollama, you can run it pretty easily. There are 8b and 70b variants available. There are also pre-trained or instruction-tuned variants available. I am not seeing it on the Hugging Face leaderboard yet, but what little I have played around with it has been promising.
A major release of Ollama, version 0.1.32, is out. The new version includes:
✅ Improved GPU utilization and memory management to increase performance and reduce error rates
✅ Increased performance on Macs by scheduling large models between the GPU and CPU
✅ Introduced native AI support in Supabase Edge Functions
So, my #Copilot trial just expired, and while it did cut down on some typing, it also made me feel like the quality of my code was lower, and of course it felt dirty to use it considering that it's a license whitewashing machine.
I don't think I will be paying for it; I don't think the results are worth it.
@ainmosni: there is another solution with a free tier - #codeium
In general, as a non-frontend dev, I like how it suggests for HTML, and for Go even minimal placeholder function fill-ins are nice.
But because of the license and not knowing where my code is sent, I'm looking for a self-hosted solution. I found a few options with #ollama, but unfortunately my current 10-year-old HW is not enough for that :P
I've been playing around with locally hosted #LLMs using the #Ollama #CLI tool. I've mostly been using models like mistral and dolphin-coder for assistance with textual ideas and issues. More recently I've been using the llava visual model via some simple #Bash #scripting, looping through images and creating description files. I can then grep those files for keywords and note the associated filenames. Powerful stuff!
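The same image-captioning loop can be sketched in Python against Ollama's HTTP API. This is a minimal sketch, assuming a local server on the default port 11434 and the llava model from the post; the prompt wording and the .jpg/.txt naming scheme are my own illustrative choices.

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes a local server

def description_path(image_path):
    """Map photo.jpg -> photo.txt, next to the original image."""
    return Path(image_path).with_suffix(".txt")

def describe_image(image_path, model="llava"):
    """Ask the llava model for a description of one image."""
    payload = {
        "model": model,
        "prompt": "Describe this image in a few sentences.",
        # The generate endpoint accepts base64-encoded images for
        # multimodal models like llava.
        "images": [base64.b64encode(Path(image_path).read_bytes()).decode()],
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def describe_folder(folder):
    """Write a .txt description next to every .jpg, ready for grep."""
    for image in Path(folder).glob("*.jpg"):
        description_path(image).write_text(describe_image(image))
```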
It could be used to fuel an offline assistant that would be able to easily add an appointment to your calendar, open an app, etc. without #privacy issues.
This has come to reality with this proof of concept using the Phi-2 2.8B transformer model running on /e/OS.
It is slow, so not very usable until we have dedicated chips on SoCs, but works (and #opensource !)
@gael I'd say before that happens, running an LLM on your local network is your best bet. Projects like https://ollama.com/ make that incredibly easy. #llm #ollama
Has anyone here worked much with generators in #emacs ?
I am looking for a good solution for streaming outputs in my ollama-elisp-sdk project. I think there's a good angle using generators to make a workflow fairly similar to e.g. the OpenAI API. Not sure yet though.
This past month, I was talking about how I spent $528 to buy a machine with enough guts to run more demanding AI models in Ollama. That is good and all, but if you are not on that machine (or at least on the same network), it has limited utility. So, how do you use it if you are at a library or a friend’s house? I just discovered Tailscale. You install the Tailscale app on the server and all of your client devices, and it creates an encrypted VPN connection between them. Each device on your “tailnet” has 4 addresses you can use to reference it:
Machine name: my-machine
FQDN: my-machine.tailnet.ts.net
IPv4: 100.X.Y.Z
IPv6: fd7a:115c:a1e0::53
If you remember Hamachi from back in the day, it is kind of the spiritual successor to that.
There is no need to poke holes in your firewall or expose your Ollama install to the public internet. There is even a client for iOS, so you can run it on your iPad. I am looking forward to playing around with it some more.
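As a sketch, pointing a client at the tailnet server is just a matter of swapping the host. The hostname here is the placeholder FQDN from the list above, and 11434 is Ollama's default port; note that the server may need to be started with OLLAMA_HOST=0.0.0.0 so it listens on the tailnet interface rather than only localhost.

```python
import json
import urllib.request

def api_url(host="my-machine.tailnet.ts.net", port=11434):
    """Build the generate-endpoint URL for a given tailnet host.

    Any of the four tailnet addresses works here; the FQDN is the
    easiest to remember.
    """
    return f"http://{host}:{port}/api/generate"

def remote_generate(prompt, model="llama3", url=None):
    """Send a completion request to the Ollama server over the tailnet."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        url or api_url(), data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```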
#Ollama is the easiest way to run local #AI I've tried so far. In 5 minutes you can have a chatbot running on a local model. Dozens of models and UIs to choose from.
The speed is not great, but what can I expect on an Intel-only laptop?
Completely forgot I had made this #fountainpen database a while ago when I was bored: https://codeberg.org/bmp/flock, it is written in Go, and was generated with #ollama if I remember correctly. Maybe I'll pick it up again, given that newer models seem to be better.
Back in December, I paid $1,425 to replace my MacBook Pro to make my LLM research possible at all. That machine has an M1 Pro CPU and 32GB of RAM, which (as I said previously) is about the bare minimum spec to run a useful local AI. I quickly wished I had enough RAM to run a 70B model, but you can’t upgrade Apple products after the fact, and a 70B model needs 64GB of RAM. That led me to start looking for a second-hand Linux desktop that could handle a 70B model.
The Xeon W-2125 has 4 cores and 8 threads, so I think that CPU1-CPU8 are threads. My theory going into this was that the models would go into memory and then the GPU would do all of the processing. The CPU would only be needed to serve the results back to the user. Instead, it looks like the full load is going to the CPU. For a moment, I thought that the 8 GB of video RAM was the limitation. That is why I tried running a 7b model for one of the tests. I am still not convinced that Ollama is even trying to use the GPU.
I am using a proprietary Nvidia driver for the GPU but maybe I’m missing something?
I was recently playing around with Stability AI’s Stable Cascade. I might need to run those tests on this machine to see what the result is. It may be an Ollama-specific issue.
Have any questions, comments, or concerns? Please feel free to drop a comment, below. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
I am unsure how to fix "Unable to change power state from D3hot to D0, device inaccessible". I would have expected that installing desktop #ubuntu would have gotten easier over the years.
Well, f***. I thought that running a 70B AI model on a machine with 128 gigabytes of RAM would tax the RAM, not the CPU. Apparently that Xeon processor is the bottleneck.
I should check to make sure that the GPU is in use.
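One way to check whether the GPU is actually in use is to poll nvidia-smi while Ollama answers a prompt. This is a hedged sketch using nvidia-smi's query-gpu CSV output mode (a real interface of the Nvidia driver tools); the dictionary keys are my own naming.

```python
import subprocess

def parse_gpu_line(line):
    """Parse one CSV line of 'utilization.gpu, memory.used' output."""
    util, mem = (field.strip() for field in line.split(","))
    return {"utilization_pct": int(util), "memory_used_mib": int(mem)}

def gpu_stats():
    """Poll nvidia-smi for GPU utilization (%) and memory used (MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    return [parse_gpu_line(line) for line in out.strip().splitlines()]

# Run gpu_stats() in a loop while a prompt is being answered; if
# utilization stays near zero and memory barely moves, Ollama is not
# actually using the GPU.
```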
A long long time ago, @arfy made a Lua script for Dolphin screen readers that allowed you to type in a plus or minus number of days and get the date. I just asked Dolphin Mixtral to do the same as an AppleScript using #Ollama running locally, and it actually did it. It runs and works just as I wanted. Madness.
-- Prompt for an offset in days and add it to today's date
set numDays to (text returned of (display dialog "Enter the number of days:" default answer "")) as integer
set targetDate to current date
set newDate to targetDate + (numDays * days)
display dialog "The future date will be: " & (newDate as string)
A rule-based inference engine is designed to apply predefined rules to a given set of facts or inputs to derive conclusions or make decisions. It operates by using logical rules, which are typically expressed in an “if-then” format. You can think of it as basically a very complex version of the spell check in your text editor.
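The "if-then" format described above can be shown with a toy sketch. This is an illustrative minimal engine, not anything from a real library: rules are (condition, conclusion) pairs applied repeatedly to a set of facts until nothing new can be derived.

```python
def run_rules(facts, rules):
    """Apply if-then rules to a fact set until a fixed point is reached.

    Each rule is (condition_set, conclusion): IF every fact in
    condition_set is known, THEN add the conclusion as a new fact.
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for condition, conclusion in rules:
            if condition <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Hypothetical example rules, chained so one conclusion feeds the next.
RULES = [
    ({"has_fur", "says_meow"}, "is_cat"),
    ({"is_cat"}, "is_mammal"),
]
```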
What is an AI Model?
AI models employ learning algorithms that draw conclusions or predictions from past data. An AI model’s data can come from various sources such as labeled data for supervised learning, unlabeled data for unsupervised learning, or data generated through interaction with an environment for reinforcement learning. The algorithm is the step-by-step procedure or set of rules that the model follows to analyze data and make predictions. Different algorithms have different strengths and weaknesses, and some are better suited for certain types of problems than others. A model has parameters that are the aspects of the model that are learned from the training data. A model’s complexity can be measured by the number of parameters contained in it but complexity can also depend on the architecture of the model (how the parameters interact with each other) and the types of parameters used.
What is an AI client?
An AI client is how the user interfaces with the rule-based inference engine. Since you can use the engine directly, the engine itself could also be the client. For the most part, you are going to want something web-based or a graphical desktop client, though. Good examples of graphical desktop clients would be MindMac or Ollamac. A good example of a web-based client would be Ollama Web UI. A good example of an application that is both a client and a rule-based inference engine is LM Studio. Most engines have APIs and language-specific libraries, so if you want to you can even write your own client.
What is the best client to use with a Rule-Based Inference Engine?
I like MindMac. I would recommend either that or Ollama Web UI. You can even host both Ollama and Ollama Web UI together using docker compose.
What is the best Rule-Based Inference Engine?
I have tried Ollama, Llama.cpp, and LM Studio. If you are using Windows, I would recommend LM Studio. If you are using Linux or a Mac, I would recommend Ollama.
How much RAM does your computer need to run a Rule-Based Inference Engine?
The RAM requirement is dependent upon what model you are using. If you browse the Ollama library, Hugging Face, or LM Studio's listing of models, most listings will list a RAM requirement (example) based on the number of parameters in the model. Most 7b models can run on a minimum of 8GB of RAM while most 70b models will require 64GB of RAM. My Macbook Pro has 32GB of unified memory and struggles to run Wizard-Vicuna-Uncensored 30b. My new AI lab currently has 128GB of DDR4 RAM and I hope that it can run 70b models reliably.
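Those numbers line up with a back-of-the-envelope calculation. This is a rough sketch under stated assumptions, not a measurement: most Ollama models ship 4-bit quantized, and the overhead factor for the KV cache and runtime bookkeeping is a guess on my part.

```python
def approx_ram_gib(params_billions, bits_per_param=4, overhead=1.2):
    """Back-of-the-envelope RAM (GiB) to hold a quantized model.

    bits_per_param=4 matches the 4-bit quantization most Ollama models
    ship with; overhead is a rough fudge factor for the KV cache and
    runtime bookkeeping, not a measured value.
    """
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 2**30

# A 7b model works out to roughly 4 GiB (comfortable on an 8GB box),
# while a 70b model is roughly 40 GiB, which is why 64GB is recommended.
```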
Does your computer need a dedicated GPU to run a Rule-Based Inference Engine?
No, you don’t. You can use just the CPU but if you have an Nvidia GPU, it helps a lot.
I use Digital Ocean or Linode for hosting my website. Can I host my AI there, also?
Yeah, you can. The RAM requirement would make it a bit expensive, though. A virtual machine with 8GB of RAM is almost $50/mo.
Why wouldn’t you use ChatGPT, Copilot, or Bard?
When you use any of them, your interactions are used to reinforce the training of the model. That is an issue for anything beyond the most basic prompts. In addition, they cost up to $30/month/user.
Why should you use an open-source LLM?
Privacy and cost. Your prompts stay on your own hardware instead of feeding someone else’s training data, and there is no monthly per-user fee.
What opinion does your employer have of this research project?
You would need to direct that question to them. All of these posts should be considered personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
Why are you interested in this technology?
It is a new technology that I didn’t consider wasteful bullshit in the first hour of researching it.
Are you afraid that AI will take your job?
No.
What about image generation?
I used (and liked) Noiselith until it shut down. DiffusionBee works but I think that Diffusers might be the better solution. Diffusers lets you use multiple models and it is easier to use than Stable Diffusion Web UI.
You advocate for not using ChatGPT. Do you use it?
I do. ChatGPT 4 is a 1.74t model. It can do cool things. I have an API key and I use it via MindMac. Using it that way means that I pay based on how much I use it, rather than paying the flat rate for a Pro account.
Are you going to only write about AI on here, now?
Nope. I still have other interests. Expect more Vue.js posts and likely something to do with Unity or Unreal at some point.
Is this going to be the last AI FAQ post?
Nope. I still haven’t covered training or fine-tuning.
(1/3) Last Friday, I was planning to watch Masters of the Air ✈️, but my ADHD had different plans 🙃, and I ended up running a short POC and creating a tutorial for getting started with Ollama Python 🚀. The settings are available for both Docker 🐳 and locally.
TLDR: It is straightforward to run LLM models locally with the Ollama Python library. Models with up to ~7B parameters run smoothly with low compute resources.
(2/3) The tutorial focuses on the following topics:
✅ Setting up Ollama server 🦙
✅ Setting up Python environment 🐍
✅ Pulling and running LLM (examples of Mistral, Llama2, and Vicuna)
(3/3) The tutorial gets you running Ollama inside a Docker container. There are still some missing pieces, such as mounting LLM models from the local environment to avoid downloading them at build time. I plan to explore this topic sometime in the coming weeks.
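The pull-and-run flow from the topics above boils down to very little code with the ollama Python package. This is a minimal sketch, assuming `pip install ollama` and a server reachable on the default port (either `ollama serve` locally or the Docker container from the tutorial); the helper names are my own.

```python
def user_message(content):
    """Build a chat message in the role/content shape the API expects."""
    return {"role": "user", "content": content}

def chat_once(prompt, model="mistral"):
    """One-shot chat against a local Ollama server via the ollama package."""
    import ollama  # pip install ollama; assumes a running server

    # Pull the model first if it is not already downloaded:
    # ollama.pull(model)
    response = ollama.chat(model=model, messages=[user_message(prompt)])
    return response["message"]["content"]

# Example (needs the server): chat_once("Why is the sky blue?")
```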