joe, (edited ) to programming

Yesterday, we played with Llama 3 using the Ollama CLI client (or REPL). Today, I figured we would play with it using the Ollama API. The API is documented in the Ollama GitHub repo. Ollama has a client that runs when you execute ollama run llama3 and a service that can be accessed from apps like MindMac, Amallo, or Enchanted. The service is what starts when you run ollama serve.

In our first Llama 3 post, we asked the model for “a comma-delimited list of cities in Wisconsin with a population over 100,000 people”. Using Postman and the completion API endpoint, you can ask the same thing.

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-20-at-1.30.48%E2%80%AFPM.png?resize=1024%2C811&ssl=1

You will notice the stream parameter is set to false in the body. If the value is false, the response will be returned as a single response object, rather than a stream of objects. If you are using the API with a web application, you will want to ask the model for the answer as JSON and you will probably want to provide an example of how you want the answer formatted.
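For example, a request body along these lines (a sketch, not the exact body from the screenshot below) asks for a single, non-streamed answer and uses Ollama's documented format parameter to force valid JSON output:

{
  "model": "llama3",
  "prompt": "Give me a list of cities in Wisconsin with a population over 100,000 people. Respond as JSON in the form {\"cities\": [\"Example City\"]}.",
  "format": "json",
  "stream": false
}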

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-20-at-1.45.15%E2%80%AFPM.png?resize=1024%2C811&ssl=1

You can use Node and node-fetch to do the same thing.
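A minimal sketch of such a script looks something like this (it assumes Ollama is serving on its default port, 11434, and that the llama3 model has already been pulled; the file name is just an example):

// ask-llama3.mjs - query the Ollama completion endpoint with node-fetch
import fetch from "node-fetch";

const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",
    prompt: "Give me a comma-delimited list of cities in Wisconsin with a population over 100,000 people.",
    stream: false, // return one response object instead of a stream
  }),
});

const data = await response.json();
console.log(data.response);

On Node 18 or newer, the built-in fetch works too, so the node-fetch import is optional there.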

If you run it from the terminal, it will look like this:

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-20-at-2.01.19%E2%80%AFPM.png?resize=1024%2C932&ssl=1

Have any questions, comments, etc? Please feel free to drop a comment, below.

https://jws.news/2024/lets-play-more-with-llama-3/

joe, to ai

Last week, Meta announced Llama 3. Thanks to Ollama, you can run it pretty easily. There are 8b and 70b variants available, in both pre-trained and instruction-tuned forms. I am not seeing it on the Hugging Face leaderboard yet, but the little that I have played around with it has been promising.

Here are two basic test questions:

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-20-at-12.15.45%E2%80%AFPM.png?resize=989%2C1024&ssl=1

https://i0.wp.com/jws.news/wp-content/uploads/2024/04/Screenshot-2024-04-20-at-12.27.47%E2%80%AFPM.png?resize=989%2C1024&ssl=1

Have any questions, comments, etc? Please feel free to drop a comment, below.

https://jws.news/2024/lets-play-with-llama-3/

ramikrispin, to llm
@ramikrispin@mstdn.social avatar

Llama 3 is already available on Ollama 🚀👇🏼

https://ollama.com/library/llama3

#llm #llama3 #ollama #python #DataScience

ramikrispin, to datascience
@ramikrispin@mstdn.social avatar

New release of Ollama 🎉

A major release of Ollama, version 0.1.32, is out. The new version includes:
✅ Improved GPU utilization and memory management to increase performance and reduce the error rate
✅ Increased performance on Mac by scheduling large models between the GPU and CPU
✅ Introduced native AI support in Supabase edge functions

More details in the release notes 👇🏼
https://github.com/ollama/ollama/releases

Image credit: release notes

ainmosni, to programming
@ainmosni@berlin.social avatar

So, my #Copilot trial just expired, and while it did cut down on some typing, it also made me feel like the quality of my code was lower, and of course it felt dirty to use it considering that it's a license whitewashing machine.

I don't think I will be paying for it; I don't think the results are worth it.

#Programming #AI #LLM #GoLang

0xZogG,
@0xZogG@hachyderm.io avatar

@ainmosni: there is another solution that has a free tier -
In general, as a non-frontend dev, I like how it makes suggestions for HTML, and for Go even the minimal placeholder function fill-ins are nice.
But given the license concerns and not knowing where my code is sent, I'm looking for a self-hosted solution. I found a few options, but unfortunately my current 10-year-old HW is not enough for that :P

ironicbadger, to selfhosted
@ironicbadger@techhub.social avatar

I hooked the Enchanted LLM app up to my #selfhosted #ollama instance tonight. Running on the Epyc box with an Nvidia A4000.

I can’t notice a difference in speed between this and the real ChatGPT, tbh. And I own the whole chain locally. Man, this is cool!!

I even shared the Ollama API endpoint with a buddy over Tailscale and now they’re whipping the llamas 🦙 API as well. Super fun times.

https://apps.apple.com/us/app/enchanted-llm/id6474268307

DavidMarzalC, to Symfony Spanish
@DavidMarzalC@mastodon.escepticos.es avatar

We are closing out March with another new episode of "Accesibilidad con Tecnologías libres" (Accessibility with Free Technologies), friends.

At https://accesibilidadtl.gitlab.io/05 you will find the show notes with the topics covered and the links.

If you follow the alias @6706483, you will automatically get the podcast's posts within the Fediverse.

And in case you don't have it yet, the RSS feed for your apps is:
https://accesibilidadtl.gitlab.io/feed

Episode 05 features, among other people:

We hope you find it interesting.

kohelet, to llm
@kohelet@mstdn.social avatar

So,
I downloaded Ollama,
installed a local LLM,
installed the Continue VSCode extension, configured it with my local LLM.

Now I just need something to do with it!
Like, any project at all.
huh.

5am, to LLMs
@5am@fosstodon.org avatar

I've been playing around with locally hosted LLMs using the Ollama tool. I've mostly been using models like mistral and dolphin-coder for assistance with textual ideas and issues. More recently I've been using the llava visual model via some simple scripts, looping through images and creating description files. I can then grep those files for key words and note the associated filenames. Powerful stuff!
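A rough sketch of that kind of loop, written here in Node against Ollama's generate endpoint rather than whatever the poster actually used (the ./images directory and the prompt wording are assumptions):

// describe-images.mjs - ask llava to describe each image and save the text next to it (Node 18+)
import { readdir, readFile, writeFile } from "node:fs/promises";

const dir = "./images"; // hypothetical input directory

for (const name of await readdir(dir)) {
  if (!/\.(jpe?g|png)$/i.test(name)) continue; // skip non-image files
  const image = await readFile(`${dir}/${name}`);
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      prompt: "Describe this image in a few sentences.",
      images: [image.toString("base64")], // Ollama accepts base64-encoded images for multimodal models
      stream: false,
    }),
  });
  const data = await res.json();
  await writeFile(`${dir}/${name}.txt`, data.response); // description file to grep later
}

Something like grep -ril "keyword" images/ then surfaces the matching description files.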

orhun, to rust
@orhun@fosstodon.org avatar

Here is how you can use ChatGPT in your terminal - with an interface! 🔥

🦾 tenere: TUI for LLMs written in Rust.

🚀 Supports ChatGPT, llama.cpp & ollama

🦀 Built with @ratatui_rs

⭐ GitHub: https://github.com/pythops/tenere


ironicbadger, to NixOS
@ironicbadger@techhub.social avatar

A fully self-hosted, totally offline, and local AI using #nixos #tailscale and open-webui as a front-end for #ollama

my_actual_brain, to llm
@my_actual_brain@fosstodon.org avatar

I’m surprised at the performance an LLM can run at on my 8700 CPU.

It’s a bit slow, and I don’t think it’s worth getting a GPU just to make it run faster, but maybe in time I’ll reconsider.

If I were going to get a GPU, what would you recommend?

gael, (edited ) to llm French
@gael@mastodon.social avatar

Imagine an LLM running locally on a smartphone...

It could be used to fuel an offline assistant that would be able to easily add an appointment to your calendar, open an app, etc. without issues.

This has become a reality with this proof of concept using the Phi-2 2.8B transformer model running on /e/OS.

It is slow, so not very usable until we have dedicated chips on SoCs, but it works!

🙏 Stypox


owzim,
@owzim@mastodon.social avatar

@gael I'd say before that happens, running an LLM on your local network is your best bet. Projects like https://ollama.com/ make that incredibly easy. #llm #ollama

dliden, to emacs
@dliden@emacs.ch avatar

Has anyone here worked much with generators in Emacs Lisp?

I am looking for a good solution for streaming outputs in my ollama-elisp-sdk project. I think there's a good angle using generators to make a workflow fairly similar to e.g. the OpenAI API. Not sure yet though.

joe, to random

This past month, I was talking about how I spent $528 to buy a machine with enough guts to run more demanding AI models in Ollama. That is good and all but if you are not on that machine (or at least on the same network), it has limited utility. So, how do you use it if you are at a library or a friend’s house? I just discovered Tailscale. You install the Tailscale app on the server and all of your client devices and it creates an encrypted VPN connection between them. Each device on your “tailnet” has 4 addresses you can use to reference it:

  • Machine name: my-machine
  • FQDN: my-machine.tailnet.ts.net
  • IPv4: 100.X.Y.Z
  • IPv6: fd7a:115c:a1e0::53

If you remember Hamachi from back in the day, it is kind of the spiritual successor to that.

https://i0.wp.com/jws.news/wp-content/uploads/2024/03/Screenshot-2024-03-04-at-2.37.06%E2%80%AFPM.png?resize=1024%2C592&ssl=1

There is no need to poke holes in your firewall or expose your Ollama install to the public internet. There is even a client for iOS, so you can run it on your iPad. I am looking forward to playing around with it some more.
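As a quick example (the machine name is the placeholder from above, and the Ollama server has to be told to listen on more than localhost, for instance by setting OLLAMA_HOST=0.0.0.0 before starting it), any device on the tailnet can hit the API by its MagicDNS name:

// query the remote Ollama box over the tailnet (Node 18+, hostname is a placeholder)
const res = await fetch("http://my-machine.tailnet.ts.net:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model: "llama2", prompt: "Hello from the library!", stream: false }),
});
console.log((await res.json()).response);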

https://jws.news/2024/joe-discovered-tailscale/

#Hamachi #Networking #Ollama #Tailscale

sesivany, (edited ) to ai
@sesivany@floss.social avatar

Ollama is the easiest way to run local AI I've tried so far. In 5 minutes you can have a chatbot running on a local model. Dozens of models and UIs to choose from.
Just the speed is not great, but what can I expect on an Intel-only laptop.

bmp, to random
@bmp@mastodon.sdf.org avatar

Completely forgot I had made this database a while ago when I was bored: https://codeberg.org/bmp/flock. It is written in Go and, if I remember correctly, was generated with an LLM. Maybe I'll pick it up again, given that newer models seem to be better.

davidbisset, to apple
@davidbisset@phpc.social avatar

NotesOllama uses #Ollama to talk to local LLMs in #Apple Notes on #macos. Inspired by Obsidian Ollama. #AI

https://smallest.app/notesollama/

Not sure if I trust this when I can't see the code, though.

joe, to llm

Back in December, I paid $1,425 to replace my MacBook Pro to make my LLM research at all possible. That machine has an M1 Pro CPU and 32GB of RAM, which (as I said previously) is kind of a bare-minimum spec for running a useful local AI. I quickly wished I had enough RAM to run a 70B model, but you can’t upgrade Apple products after the fact, and a 70B model needs 64GB of RAM. That led me to start looking for a second-hand Linux desktop that could handle a 70B model.

I ended up finding a 4-year-old HP Z4 G4 workstation with a Xeon W-2125 processor, 128 GB of DDR4-2666 RAM, a 512GB Samsung NVMe SSD, and an NVIDIA Quadro P4000 GPU with 8GB of GDDR5 memory. I bought it before Ollama released their Windows preview, so I planned to throw the latest Ubuntu LTS on it.

Going into this experiment, I was expecting that Ollama would thrash the GPU and the RAM but would use the CPU sparingly. I was not correct.

This is what the activity monitor looked like when I asked various models to tell me about themselves:

Mixtral

An ubuntu activity monitor while running mixtral

Llama2:70b

An ubuntu activity monitor while running Llama2:70b

Llama2:7b

An ubuntu activity monitor while running llama2:7b

Codellama

An ubuntu activity monitor while running codellama

The Xeon W-2125 has 8 threads and 4 cores, so I think that CPU1-CPU8 are threads. My theory going into this was that the models would go into memory and then the GPU would do all of the processing. The CPU would only be needed to serve the results back to the user. It looks like the full load is going to the CPU. For a moment, I thought that the 8 GB of video RAM was the limitation. That is why I tried running a 7b model for one of the tests. I am still not convinced that Ollama is even trying to use the GPU.

A screenshot of the "additional drivers" screen in ubuntu

I am using a proprietary Nvidia driver for the GPU but maybe I’m missing something?
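A quick sanity check would be to run nvidia-smi in a second terminal while a prompt is generating: if Ollama is actually offloading to the GPU, its process should show up in the process list with a chunk of VRAM allocated and the utilization number should climb; if utilization sits at 0%, everything is running on the CPU.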

I was recently playing around with Stability AI’s Stable Cascade. I might need to run those tests on this machine to see what the result is. It may be an Ollama-specific issue.

Have any questions, comments, or concerns? Please feel free to drop a comment, below. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.

https://jws.news/2024/hp-z4-g4-workstation/

ironicbadger, to selfhosted
@ironicbadger@techhub.social avatar

Did someone say self-hosted LLMs?

joe, to ubuntu
@joe@toot.works avatar

I am unsure how to fix "Unable to change power state from D3hot to D0, device inaccessible". I would have expected that installing desktop Linux would have gotten easier over the years.

joe,
@joe@toot.works avatar

Well, f***. I thought that running a 70B AI model on a machine with 128 gigabytes of RAM would tax the RAM, not the CPU. Apparently that Xeon processor is the bottleneck.

I should check to make sure that the GPU is in use.

The resource monitor in Ubuntu 22 immediately after running code llama 70B through Ollama

FreakyFwoof, to random

A long, long time ago, @arfy made a Lua script for Dolphin screen readers that allowed you to type in a plus or minus number of days and get the resulting date. I just asked Dolphin Mixtral to do the same thing as an AppleScript, using #Ollama running locally, and it actually did it. It runs and works just as I wanted. Madness.

-- Prompt the user for a number of days (entered as text)
set numDays to text returned of (display dialog "Enter the number of days:" default answer "")
-- Start from today's date and add the offset (AppleScript coerces the text to a number; "days" is the number of seconds in a day)
set targetDate to current date
set newDate to targetDate + numDays * days
-- Show the resulting date
display dialog "The future date will be: " & (newDate as string)

joe, to ai

Back in December, I started exploring how all of this AI stuff works. Last week’s post was about the basics of how to run your AI. This week, I wanted to cover some frequently asked questions.

What is a Rule-Based Inference Engine?

A rule-based inference engine is designed to apply predefined rules to a given set of facts or inputs to derive conclusions or make decisions. It operates by using logical rules, which are typically expressed in an “if-then” format. You can think of it as basically a very complex version of the spell check in your text editor.

What is an AI Model?

AI models employ learning algorithms that draw conclusions or predictions from past data. An AI model’s data can come from various sources such as labeled data for supervised learning, unlabeled data for unsupervised learning, or data generated through interaction with an environment for reinforcement learning. The algorithm is the step-by-step procedure or set of rules that the model follows to analyze data and make predictions. Different algorithms have different strengths and weaknesses, and some are better suited for certain types of problems than others. A model has parameters that are the aspects of the model that are learned from the training data. A model’s complexity can be measured by the number of parameters contained in it but complexity can also depend on the architecture of the model (how the parameters interact with each other) and the types of parameters used.

What is an AI client?

An AI client is how the user interfaces with the rule-based inference engine. Since you can use the engine directly, the engine itself could also be the client. For the most part, you are going to want something web-based or a graphical desktop client, though. Good examples of graphical desktop clients would be MindMac or Ollamac. A good example of a web-based client would be Ollama Web UI. A good example of an application that is both a client and a rule-based inference engine is LM Studio. Most engines have APIs and language-specific libraries, so if you want to you can even write your own client.
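For instance, a bare-bones client against Ollama's chat endpoint is only a few lines of Node (a sketch; it assumes a local Ollama server and whatever model you already have pulled):

// minimal-client.mjs - the smallest possible "client" for the Ollama chat API (Node 18+)
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mistral", // any locally pulled model
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What is the capital of Wisconsin?" },
    ],
    stream: false,
  }),
});
const data = await res.json();
console.log(data.message.content); // non-streamed chat responses come back as a single message object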

What is the best client to use with a Rule-Based Inference Engine?

I like MindMac. I would recommend either that or Ollama Web UI. You can even host both Ollama and Ollama Web UI together using docker compose.

What is the best Rule-Based Inference Engine?

I have tried Ollama, Llama.cpp, and LM Studio. If you are using Windows, I would recommend LM Studio. If you are using Linux or a Mac, I would recommend Ollama.

How much RAM does your computer need to run a Rule-Based Inference Engine?

The RAM requirement depends on which model you are using. If you browse the Ollama library, Hugging Face, or LM Studio's listing of models, most will list a RAM requirement (example) based on the number of parameters in the model. Most 7b models can run on a minimum of 8GB of RAM, while most 70b models will require 64GB of RAM. My MacBook Pro has 32GB of unified memory and struggles to run Wizard-Vicuna-Uncensored 30b. My new AI lab currently has 128GB of DDR4 RAM, and I hope that it can run 70b models reliably.

Does your computer need a dedicated GPU to run a Rule-Based Inference Engine?

No, it doesn't. You can run on the CPU alone, but if you have an Nvidia GPU, it helps a lot.

I use Digital Ocean or Linode for hosting my website. Can I host my AI there, also?

Yeah, you can. The RAM requirement would make it a bit expensive, though. A virtual machine with 8GB of RAM is almost $50/mo.

Why wouldn’t you use ChatGPT, Copilot, or Bard?

When you use any of them, your interactions are used to reinforce the training of the model. That is an issue for anything beyond the most basic prompts. In addition to that, they cost up to $30/month/user.

Why should you use an open-source LLM?

What opinion does your employer have of this research project?

You would need to direct that question to them. All of these posts should be considered personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.

Why are you interested in this technology?

It is a new technology that I didn’t consider wasteful bullshit in the first hour of researching it.

Are you afraid that AI will take your job?

No.

What about image generation?

I used (and liked) Noiselith until it shut down. DiffusionBee works but I think that Diffusers might be the better solution. Diffusers lets you use multiple models and it is easier to use than Stable Diffusion Web UI.

You advocate for not using ChatGPT. Do you use it?

I do. ChatGPT 4 is reportedly a model with around 1.7 trillion parameters. It can do cool things. I have an API key and I use it via MindMac. Using it that way means that I pay based on how much I use it rather than paying for a Pro account, though.

Are you going to only write about AI on here, now?

Nope. I still have other interests. Expect more Vue.js posts and likely something to do with Unity or Unreal at some point.

Is this going to be the last AI FAQ post?

Nope. I still haven’t covered training or fine-tuning.

https://jws.news/2024/ai-frequently-asked-questions/

ramikrispin, to python
@ramikrispin@mstdn.social avatar

(1/3) Last Friday, I was planning to watch Masters of the Air ✈️, but my ADHD had different plans 🙃, and I ended up running a short POC and creating a tutorial for getting started with the Ollama Python library 🚀. Setup instructions are available for both Docker 🐳 and a local install.

TLDR: It is straightforward to run LLMs locally with the Ollama Python library. Models with up to ~7B parameters run smoothly with low compute resources.

ramikrispin,
@ramikrispin@mstdn.social avatar

(2/3) The tutorial focuses on the following topics:
✅ Setting up Ollama server 🦙
✅ Setting up Python environment 🐍
✅ Pulling and running LLM (examples of Mistral, Llama2, and Vicuna)

Tutorial 📖: https://github.com/RamiKrispin/ollama-poc

ramikrispin,
@ramikrispin@mstdn.social avatar

(3/3) The tutorial gets you running Ollama inside a Docker container. There are still some missing pieces, such as mounting LLM models from the local environment to avoid downloading them at build time. I plan to explore this topic sometime in the coming weeks.
