Nomecks,

I’d probably try to find a second-hand NVIDIA DGX.

Toes,

4 of whatever modern GPU has the most VRAM currently (so I can run 4 personalities at the same time).

Whatever the best AMD EPYC CPU currently is.

As much ECC RAM as possible.

Waifu themes all over the computer.

Linux, LTS edition.

A bunch of NVMe SSDs configured redundantly.

And 2 RTX 4090s (one for the host and one for me).

possiblylinux127,

And a new car

TechNerdWizard42,

I run it all locally on my laptop. It was about $30k new, but years later you can get them used for about $1k to $2k.

possiblylinux127,

A used mini PC and a nice boat

Sims,

I’m not an expert in any of this, so I’m just wildly speculating in the middle of the night about a huge hypothetical one-person AI lab:

Super high-end equipment would probably eat such a budget quickly (2–5 H100s?), but a ‘small’ rack of 20–25 ordinary GPUs (P40s) with 8 GB+ VRAM, combined with a local petals.dev setup, would be my quick choice.
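For a sense of what that looks like in practice, here’s a minimal Petals client sketch, following the usage shown in the petals.dev README (the model name is just their example; a private swarm would serve whatever model you load onto the P40s):

```python
# Minimal Petals client sketch, per the petals.dev README.
# The model name is only an example from their docs.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Each layer block is served by a different node in the swarm;
# the client only holds the embeddings and the head locally.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A rack of P40s walks into a bar", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```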

However, it’s hard to compete with the cloud on power efficiency, so ongoing power costs would quickly eat the rest of the budget. All non-sensitive traffic should probably go to something like Groq’s cloud, and the rest stay on private servers.

An alternative solution is to go for an NPU setup (TPU, LPU, whatever-PU), and/or even a small power generator (wind, solar, digester/burner) to drive it. A cluster of 50 Orange Pi 5B (RK3588) boards with 32 GB RAM each is within budget (50 × 6 TOPS = 300 TOPS in theory, with 1.6 TB of RAM total, running on about 500 W; see the check below). AFAIK the underlying software stack isn’t there yet for small NPUs, but more and more frameworks beyond CUDA keep popping up (ROCm, Metal, OpenCL, Vulkan, …), so one for NPUs will probably appear soon.
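Quick back-of-envelope check on those cluster numbers (the per-board figures are assumptions based on vendor specs, not measurements):

```python
# Back-of-envelope check on the Orange Pi 5B cluster numbers.
# Assumed per-board specs: ~6 TOPS NPU (RK3588 vendor figure),
# 32 GB RAM, roughly 10 W under load.
boards = 50
tops_per_board = 6
ram_gb_per_board = 32
watts_per_board = 10

print(f"{boards * tops_per_board} TOPS total")            # 300 TOPS
print(f"{boards * ram_gb_per_board / 1000:.1f} TB RAM")   # 1.6 TB
print(f"{boards * watts_per_board} W total")              # 500 W
```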

Transformers rely heavily on multiplications, but BitNet doesn’t (only additions), so perhaps models will move to less power-intensive hardware and model frameworks in the future?
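As a toy illustration of why that matters: with weights constrained to {-1, 0, +1}, a matrix-vector product reduces to additions and subtractions. This is just the arithmetic idea, not the actual BitNet kernels:

```python
# Toy illustration of the BitNet idea: ternary weights make the
# matrix-vector product multiply-free (adds and subtracts only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary weight matrix in {-1, 0, +1}
x = rng.standard_normal(8)

# Add where the weight is +1, subtract where it is -1, skip zeros.
y = np.array([x[w == 1].sum() - x[w == -1].sum() for w in W])

assert np.allclose(y, W @ x)  # matches the ordinary matmul
```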

Last on my mind atm: you would probably also not spend all the money on inference/training compute. Any decent cognitive architecture around a model (agent networks) needs support functions: tool servers, home-served software for agents (forums/communication, scraping, modelling, code testing, statistics, etc.). Basically versions of the tools we ourselves use for different projects and for communication/cooperation in an organization.

0x01,

Why in the world would you need such a large budget? A Mac Studio can run the 70B variant just fine at $12k.

Timely_Jellyfish_2077, (edited )

If possible, to run the upcoming Llama 400B. But this is just hypothetical.

possiblylinux127,

Maybe find a way to cluster GPUs and put a crazy amount of RAM in a machine with a very powerful CPU that has enough memory channels and PCIe lanes to support it.

You’ll also need very fast storage.
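The reason memory channels matter so much: single-stream generation is roughly memory-bandwidth-bound, since every weight has to be streamed once per token. A rough sketch, with illustrative (not measured) numbers:

```python
# Rough upper bound on single-stream generation speed: each token
# requires streaming every weight once, so tokens/s <= bandwidth / model size.
# Both figures below are illustrative assumptions, not measured specs.
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_size_gb

# ~70B model at 4-bit (~40 GB of weights):
print(max_tokens_per_sec(40, 3350))  # ~3.35 TB/s HBM  -> ~84 tok/s ceiling
print(max_tokens_per_sec(40, 460))   # 12-ch DDR5-4800 -> ~11 tok/s ceiling
```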

rufus, (edited )

Hmm, maybe with the next M4 Mac Studio. The current one maxes out at 192 GB of memory, which isn’t enough for a decent quantized version of a 400B model. So either something like 380 GB of (unified) RAM or 8 NVIDIA A6000s.
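Rough math behind that (weights only, ignoring KV cache and activation overhead):

```python
# Approximate weight memory for a quantized 400B model:
# params * bits / 8 bytes, so 1B params at 8-bit is about 1 GB.
params_b = 400  # billions of parameters

for bits in (16, 8, 6, 4):
    gb = params_b * bits / 8
    print(f"{bits}-bit: ~{gb:.0f} GB")
# 16-bit: ~800 GB, 8-bit: ~400 GB, 6-bit: ~300 GB, 4-bit: ~200 GB
```

So even a 4-bit quant blows past 192 GB once you add KV cache, while ~380 GB of unified RAM or 8 × 48 GB A6000s (384 GB) lands around the 6–8 bit range.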

SpaceNoodle,

So the answer would be “an alibi for the other $88k”

slazer2au,

I’ll take ‘Someone got seed funding and now needs progress to unlock the next part of the package’ for $10, please, Alex.

kelvie,

Depends on what you’re doing with it, but prompt/context processing is a lot faster on NVIDIA GPUs than on Apple chips, though if you’re using the same prefix all the time it’s a bit better.

The time to first token is a lot faster on datacenter GPUs, especially as context length increases, and consumer GPUs don’t have enough VRAM.
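A crude way to see why: prefill cost scales with prompt length, roughly 2 × params × prompt_tokens FLOPs, so time to first token depends directly on compute. The throughput figures below are peak-spec assumptions (perfect utilization), so real numbers would be worse:

```python
# Crude time-to-first-token estimate: prefill costs roughly
# 2 * params * prompt_tokens FLOPs. Throughput figures are peak
# spec-sheet assumptions; real utilization is lower.
def ttft_seconds(params_b: float, prompt_tokens: int, tflops: float) -> float:
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (tflops * 1e12)

# 70B model, 8k-token prompt:
print(ttft_seconds(70, 8192, 989))  # H100-class (~989 TFLOPS FP16): ~1.2 s
print(ttft_seconds(70, 8192, 27))   # M2 Ultra-class GPU (~27 TFLOPS): ~42 s
```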
