

Yeah, I am very worried about Trek (and Avatar).
Are Trekkies not in a panic over this? I guess Ellison isn’t super famous, and a lot of criticism is probably censored/deranked, but still.
In my case it’s performance and sheer RAM need.
GLM 4.5 needs like 112GB RAM and absolutely every megabyte of VRAM from the GPU, at least without the quantization getting too compressed to use. I’m already swapping a tiny bit and simply cannot afford the overhead.
I think containers may slow down CPU<->GPU transfers slightly, but don’t quote me on that.
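If you want to sanity-check that, a quick host-to-device copy benchmark run both inside and outside the container will tell you whether the transfer path is actually slower. A minimal sketch, assuming PyTorch with CUDA is available (buffer size and iteration count are arbitrary):

```python
# Measure pinned host -> GPU copy bandwidth; compare bare metal vs. container.
import torch

def h2d_bandwidth_gib_s(size_mb: int = 1024, iters: int = 10) -> float:
    # Pinned (page-locked) host buffer, the fast path for CPU<->GPU transfers.
    host = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8).pin_memory()
    dev = torch.empty_like(host, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dev.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds
    return (size_mb * iters / 1024.0) / seconds  # GiB moved per second

if __name__ == "__main__":
    print(f"host -> device: {h2d_bandwidth_gib_s():.1f} GiB/s")
```

If the numbers match inside and outside the container, the container isn’t your bottleneck.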
There’s no commitment though; if you don’t like it, you can bail and forget about it.
Long form content, in general, seems to be going out of fashion.
It’s not just TV. Short articles outperform long deep dives in the papers. Same with YouTube, where shorter videos win out, which extends to the rise of Shorts. Mobile and ‘short session’ games make up a huge chunk of playtime. I’m not sure about ‘big’ literature, but even fanfiction and amateur works are skewing towards collections of short, fluffy pieces instead of long-form adventures now.
It’s not just strained attention spans either; my impression is that the energy and time people have to devote to long-form stuff are dropping. I know a working couple with no kids that still transitioned from TV to shorter-form YouTube content because they’re just too tired from work + basic life maintenance.
It’s OpenCL, so it should even run on integrated graphics.
Yeah, it’s great! Extremely fast, and the output is marginally smaller than ffmpeg’s, last I checked, though I have not messed with those newer ffmpeg options. And cuetools has some other nice utilities anyway.
Heh, joke’s on you; you’re using the wrong library for obsessive FLAC compression anyway:
0.00017x sounds like a bug though. Maybe it balloons RAM usage enough to trigger swapping?
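One way to test the swapping theory: watch the swap counters while the encode runs. A rough sketch, assuming psutil on Linux; the encode command is just a placeholder for whatever you actually invoke:

```python
# Sample swap usage in the background while an encode runs.
import subprocess
import threading
import time

import psutil

def watch_swap(stop: threading.Event, interval: float = 1.0) -> None:
    baseline = psutil.swap_memory()
    while not stop.is_set():
        time.sleep(interval)
        now = psutil.swap_memory()
        swapped_out = (now.sout - baseline.sout) / 2**30  # sin/sout are Linux counters
        print(f"swap used: {now.used / 2**30:.2f} GiB, swapped out since start: {swapped_out:.2f} GiB")

stop = threading.Event()
threading.Thread(target=watch_swap, args=(stop,), daemon=True).start()
subprocess.run(["your-encoder", "input.wav"])  # placeholder: substitute the real encode command
stop.set()
```

If ‘swapped out since start’ climbs during the run, that would explain the 0.00017x.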
I dunno, still feels like a honeypot to me, lol.
100% a parody instance: https://maga.place/post/2245?scrollToComments=true
Even mentioning Epstein+Trump like that would get a site-wide shadowban if it were posted on /r/conservative.
It’s good though. It’s plausibly pro-Trump.
I’ll make my own LLVM, with blackjack and hookers.
I’d be interested in more partial acceleration/offloading support.
As an example, the denoising AV1 encoders do for grain synthesis is hard, but GPUs are really good at it. It would be awesome if they could offload that step to Vulkan as an option.
Another would be better supported variable frame rate. But that’s a tall ask I guess.
Wonder if I’ll get HW decoding.
It will not :(
It’s basically always fixed function, e.g. they would literally have to etch it into the silicon. If it’s not already there, you ain’t getting it.
The Xbox One did get a GPU shader decoder, IIRC. But that was an absolute crack project built out of necessity and a lot of financial interest (as its CPU was really bad).
Yeah. But it also messes stuff up from the llama.cpp baseline, hides or doesn’t support some features/optimizations, and definitely doesn’t support the more efficient iq_k quants of ik_llama.cpp and its specialized MoE offloading.
And that’s not even getting into the various controversies around ollama (like broken GGUFs or indications they’re going closed source in some form).
…It just depends on how much performance you want to squeeze out, and how much time you want to spend on the endeavor. Small LLMs are kinda marginal though, so IMO it’s only worth it if you really want to tinker; otherwise you’re probably better off spending a few bucks on an API that doesn’t log requests.
In case I miss your reply, assuming a 3080 + 64 GB of RAM, you want the IQ4_KSS version of this (or IQ3_KS, to leave more RAM free for tabs and stuff):
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
Part of it will run on your GPU and part will live in system RAM, but ik_llama.cpp does the quantization split and GPU offloading in a particularly efficient way for these kinds of ‘MoE’ models. Follow the instructions on that page.
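To give a rough idea of what the launch looks like, here’s a sketch (the GGUF path/filename is a placeholder, and the flag values follow the usual pattern from those model cards, so defer to the page if they differ):

```python
# Sketch of launching ik_llama.cpp's llama-server with MoE experts kept in system RAM.
import subprocess

cmd = [
    "./build/bin/llama-server",
    "-m", "/models/GLM-4.5-Air-IQ4_KSS.gguf",  # placeholder path/filename
    "-ngl", "99",        # offload all layers to the GPU...
    "-ot", "exps=CPU",   # ...then override the MoE expert tensors back to system RAM
    "-fa",               # flash attention
    "-c", "32768",       # context length; shrink this if VRAM gets tight
    "--host", "127.0.0.1",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

The `-ot exps=CPU` bit is the important part: it keeps the big expert tensors in system RAM while the dense layers and KV cache stay on the GPU.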
If you ‘only’ have 32GB RAM or less, that’s trickier, and the next question is what kind of speeds you want. But it’s probably best to wait a few days and see how Qwen3 80B looks when it comes out. Or just go with the IQ4_K version of this: https://huggingface.co/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF
And you don’t strictly need the hyper-optimization of ik_llama.cpp for a small model like Qwen3 30B. Something easier like LM Studio or the llama.cpp docker image would be fine.
Alternatively, you could try to squeeze Gemma 27B into that 11GB VRAM, but it would be tight.
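The back-of-the-envelope math on why it’s tight (the bits-per-weight figures here are rough assumptions, and KV cache plus overhead comes on top):

```python
# Rough weight-size estimate: parameters (billions) * bits-per-weight / 8 = GB of weights.
params_b = 27            # Gemma 27B
for bpw in (4.5, 3.5):   # roughly Q4-ish and Q3-ish quantizations
    weights_gb = params_b * bpw / 8
    print(f"~{bpw} bpw -> ~{weights_gb:.1f} GB of weights, before KV cache/overhead")
# ~4.5 bpw -> ~15.2 GB, ~3.5 bpw -> ~11.8 GB, so 11GB VRAM means a low quant or partial offload.
```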
How much system RAM, and what kind? DDR5?
ik doesn’t have great documentation, so it’d be a lot easier for me to just point you places, heh.
At risk of getting more technical, ik_llama.cpp has a good built in webui:
https://github.com/ikawrakow/ik_llama.cpp/
Getting more technical, it’s also way better than ollama. You can run way smarter models on the same hardware than ollama can.
For reference, I’m running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.
And if you want a ‘look this up on the internet for me’ assistant (which you need for them to be truly useful), you need another docker project as well.
…That’s just how LLM self hosting is now. It’s simply too hardware-intensive and ad hoc to be easy, smart, and cheap all at once. You can indeed host a small ‘default’ LLM without much tinkering, but it’s going to be pretty dumb, and pretty slow on ollama defaults.
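For what it’s worth, once any of these servers is running (ik_llama.cpp, llama.cpp, or LM Studio), it exposes an OpenAI-style HTTP API, so hooking up other frontends or search tooling is basically just a POST request. A minimal sketch, assuming the `requests` library and a server listening on 127.0.0.1:8080:

```python
# Query a local llama-server-style endpoint via its OpenAI-compatible chat API.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server mostly ignores this; it serves whatever it loaded
        "messages": [{"role": "user", "content": "Summarize what MoE expert offloading does."}],
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```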
They already have a huge stake, don’t they?
They’re too late anyway, OpenAI is already enshittifying themselves.
Because however one feels about blockchain tech and its future, past companies within the crypto industry are notorious for selling the moon, being shady, and cashing out early. ‘ZCash’ appears to be a good example, particularly because a small group exerts such a high level of control over it.
And if the parallel holds, and at least some of that applies to Jay Graber’s own personal experience and expectations of what a company’s trajectory should look like, it doesn’t bode well for Bluesky.
Mobile 5090 would be an underclocked, binned desktop 5080, AFAIK:
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_50_series
In KCD2 (a fantastic CryEngine game, a great benchmark IMO), the APU is a hair less than half as fast at QHD, 39 FPS vs 84 FPS for the mobile 5090:
https://www.notebookcheck.net/Nvidia-GeForce-RTX-5090-Laptop-Benchmarks-and-Specs.934947.0.html
https://www.notebookcheck.net/AMD-Radeon-8060S-Benchmarks-and-Specs.942049.0.html
Synthetic benchmark comparisons between the two are on those pages as well.
But these are both presumably running at high TDP (150W for the 5090). Also, the mobile 5090 is catastrophically overpriced and inevitably tied to a weaker CPU, whereas the APU is a monster of a CPU. So make of that what you will.
Oh wow, that’s awesome! I didn’t know folks ran TDP tests like this, just that my old 3090 seems to have a minimum sweet spot around that same ~200W based on my own testing, but I figured the 4000 or 5000 series might go lower. Apparently not, at least for the big die.
I also figured the 395 would draw more than 55W, so that’s awesome too! I suspect newer, smaller GPUs like the 9000 or 5000 series still make the value proposition questionable, but you still make an excellent point.
And for reference, I just checked, and my dGPU hovers around 30W idle with no display connected.
To be clear, VMs absolutely have overhead, but Docker/Podman is the open question. Their overhead might be negligible.
And this is a particularly weird scenario (since prompt processing literally has to shuffle ~112GB over the PCIe bus for each batch). Most GPGPU apps aren’t so sensitive to transfer speed/latency.
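Back-of-the-envelope on why that hurts (the ~25 GB/s effective PCIe 4.0 x16 figure is an assumption):

```python
# Time spent just moving CPU-resident weights per prompt-processing batch.
weights_gb = 112   # weights held in system RAM, per the figure above
pcie_gb_s = 25     # rough effective PCIe 4.0 x16 throughput

print(f"~{weights_gb / pcie_gb_s:.1f} s per batch just on transfers")  # ~4.5 s
```

So any extra latency or bandwidth loss in the container’s transfer path gets multiplied across every batch.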