Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · 19 hours ago

Take long videos on the front and back, silently, then display them with a message they’re “published” along with their photos.

There’d probably be a coup.

brucethemoose@lemmy.world · 2 days ago

The Klingons are practical. If something makes them fitter, they do it.

Mental health rejection culture is more… anti-intellectual? To quote Wikipedia:

It can stem from a distrust of elites or institutions perceived as disconnected from everyday experiences, concerns about cultural identity, or competition in valuing practical knowledge over theoretical or academic expertise. As such, psychological research suggests that certain individuals with anti-intellectual attitudes could sometimes lean towards a display of confidence in their personal experiences rather than trusting authorities,[5] while others adopt their anti-intellectual positions as a reaction to perceived threats to their social status or group identity.

That’s not very Klingon.

brucethemoose@lemmy.world · 3 days ago

Yeah, that fits me.

I mostly only liked the Old Republic stuff, but Andor floored me.

brucethemoose@lemmy.world · 3 days ago

+1

It’s grounded, too. Which I guess falls under ‘good writing’

brucethemoose@lemmy.world · edit-2 3 days ago

This happens in many fandoms, over generations. They get too drunk on nostalgia, to the point of hating minute differences/inconsistencies. Especially the most active members, which (I theorize) tend to be fondest of the originals.

See: the hate for Andor.

brucethemoose@lemmy.world · 5 days ago

Do other instances defederate with them, though?

brucethemoose@lemmy.world · edit-2 5 days ago

I’m in a similar boat, though I’ve been present for some time.

Dbzer0 seems like the best “fit” for me, but practically I just want the instance that’s not defederated/blocking other instances.

…Not sure which that is. But I’d look at Piefed before Lemmy, since they work together, but Piefed seems more desirable feature-wise.

brucethemoose@lemmy.world · edit-2 6 days ago

I dunno what all the fuss is about. GAN upscaling pretty good at this even before “AI” became a household term (and so cursed).

We had upscaling so good a decade ago, for old shows, that no one even notices it’s AI once the “fan cleanup” is published. I worked on one such project.

SOTA these days, as far as I know, is SeedVR2 + oldschool Vapoursynth pre/post processing, but plenty of projects are locally runnable: https://github.com/ByteDance-Seed/SeedVR

Obviously if you put “upscale this” into Sora or whatever, you are going to get lovecraftian horror back… which, to fair, I’d half expect Paramount to do just to get OpenAI into their earnings call.

brucethemoose@lemmy.world · 9 days ago

Yeah…

I think most non-PC-gamer consumers will just go to Android and iOS :(. It’s the simplest path.

Not sure about business. Sheer entrenchment aside, I’ve heard conflicting reports ranging from Windows management systems being so good they’re utterly unparalleled, to Windows systems breaking so much IT is getting frustrated.

brucethemoose@lemmy.world · 12 days ago

website

Bingo.

All the dev work is in the app.

brucethemoose@lemmy.world · edit-2 12 days ago

Did you check drawers?

I leave mine in drawers, somehow.

brucethemoose@lemmy.world · edit-2 14 days ago

It’s all C++ now, so it doesn’t really need docker! I don’t use docker for any ML stuff, just pip/uv venvs.

You might consider Arch (dockerless) ROCM soon; it looks like 7.1 is in the staging repo right now.

brucethemoose@lemmy.world · 14 days ago

Oh, I forgot!

You should check out Lemonade:

https://github.com/lemonade-sdk/lemonade

It’s supports Ryzen NPUs via 2 different runtimes… though apparently not the 8000 series yet?

brucethemoose@lemmy.world · edit-2 14 days ago

Yeah… Even if the LLM is RAM speed constrained, simply using another device to not to interrupt it would be good.

Honestly AMD’s software dev efforts are baffling. They’ve focused on a few on libraries precisely no-one uses, like this: https://github.com/amd/Quark

While ignoring issues holding back entire sectors (like broken flash-attention) with devs screaming about it at the top of their lungs.

Intel suffers from corporate Game of Thrones, but at least they have meaningful contributions in the open source space here, like the SYCL/AMX llama.cpp code or the OpenVINO efforts.

brucethemoose@lemmy.world · edit-2 14 days ago

It still uses memory bandwidth, unfortunately. There’s no way around that, though NPU TTS would still be neat.

…Also, generally, STT responses can’t be streamed, so you mind as well use the iGPU anyway. TTS can be chunked I guess, but do the major implementations do that?

brucethemoose@lemmy.world · edit-2 14 days ago

The IGP is more powerful than the NPU on these things anyway. The NPU us more for ‘background’ tasks, like Teams audio processing or whatever its used for on Windows.

Yeah, in hindsight, AMD should have tasked (and still should task) a few engineers on popular projects (and pushed NPU support harder), but GGML support is good these days. It’s gonna be pretty close to RAM speed-bound for text generation.

brucethemoose@lemmy.world · edit-2 14 days ago

Ah. On an 8000 APU, to be blunt, you’re likely better off with Vulkan + whatever omni models GGML supports these days. Last I checked, TG is faster and prompt processing is close to rocm.

…And yeah, that was total misadvertisement on AMD’s part. They’ve completely diluted the term kinda like TV makers did with ‘HDR’

brucethemoose@lemmy.world · edit-2 14 days ago

You can do hybrid inference of Qwen 30B omni for sure. Or fully offload inference of Vibevoice Large (9B). Or really a huge array of models.

…The limiting factor is free time, TBH. Just sifting through the sea of models, seeing if they work at all, testing if quantization works and such is a huge timesink, especially if you are trying to load stuff with rocm.

brucethemoose@lemmy.world · edit-2 15 days ago

I mean, there are many. TTS and self-hosted automation are huge in the local LLM scene.

We even have open source “omni” models now, that can ingest and output speech tokens directly (which means they get more semantic understanding from tone and such, they ‘choose’ the tone to reply with, and that it’s streamable word-by-word). They support all sorts of tool calling.

…But they aren’t easy to run. It’s still in the realm of homelabs with at least an RTX 3060 + hacky python projects.

If you’re mad, you can self-host Longcat Omni

https://huggingface.co/meituan-longcat/LongCat-Flash-Omni

And blow Alexa out of the water with a MIT-licensed model from, I kid you not, a Chinese food delivery company.

EDIT

For the curious, see:

Audio-text-to-text (and sometimes TTS): https://huggingface.co/models?pipeline_tag=audio-text-to-text&num_parameters=min%3A6B&sort=modified

TTS: https://huggingface.co/models?pipeline_tag=text-to-speech&num_parameters=min%3A6B&sort=modified

“Anything-to-anything,” generally image/video/audio/text -> text/speech: https://huggingface.co/models?pipeline_tag=any-to-any&num_parameters=min%3A6B&sort=modified

Bigger than 6B to exclude toy/test models.

brucethemoose@lemmy.world · 17 days ago

That’s what I was thinking of, thanks.

brucethemoose@lemmy.world · edit-2 1 year ago

Guide to Self Hosting LLMs Faster/Better than Ollama