• 1 Post
  • 167 Comments
Joined 2 years ago
Cake day: March 22nd, 2024

  • Yeah. But it also messes stuff up relative to the llama.cpp baseline, hides or doesn’t support some features/optimizations, and definitely doesn’t support the more efficient iq_k quants of ik_llama.cpp or its specialized MoE offloading (rough launch sketch at the end of this comment).

    And that’s not even getting into the various controversies around ollama (like broken GGUFs or indications they’re going closed source in some form).

    …It just depends on how much performance you want to squeeze out, and how much time you want to spend on the endeavor. Small LLMs are kinda marginal though, so IMO that squeezing is important if you really want to try; otherwise one is probably better off spending a few bucks on an API that doesn’t log requests.
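
    To make the “squeezing” concrete, here’s a rough sketch of how one might launch ik_llama.cpp’s server with the MoE experts pinned to system RAM. The binary path, model file, and exact flag spellings are assumptions from my own setup, so verify against the repo’s README for your build:

    ```python
    # Rough sketch of an ik_llama.cpp launch with MoE expert offloading.
    # Assumptions: binary/model paths and exact flag spellings are from
    # memory -- check the ik_llama.cpp README for your build.
    import subprocess

    cmd = [
        "./build/bin/llama-server",   # server binary from an ik_llama.cpp build (path assumed)
        "-m", "model-IQ4_K.gguf",     # hypothetical iq_k-quantized GGUF
        "-ngl", "99",                 # push all layers to the GPU by default...
        "-ot", "exps=CPU",            # ...then override: keep MoE expert tensors in system RAM
        "-fmoe",                      # fused MoE kernels (ik_llama.cpp-specific)
        "-c", "8192",                 # context window
        "--port", "8080",
    ]
    subprocess.run(cmd, check=True)
    ```

    The trick is that -ngl claims everything for the GPU while the -ot override claws the huge expert tensors back to system RAM, which is how giant MoE models end up usable on a single 24 GB card.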




  • At risk of getting more technical, ik_llama.cpp has a good built-in webui:

    https://github.com/ikawrakow/ik_llama.cpp/

    Getting more technical, it’s also way better than ollama: you can run way smarter models on the same hardware.

    For reference, I’m running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.

    And if you want a ‘look this up on the internet for me’ assistant (which these need to be truly useful), you’ll need to run another docker project alongside it as well.

    …That’s just how LLM self-hosting is now. It’s simply too hardware-intensive and ad hoc to be easy, smart, and cheap all at once. You can indeed host a small ‘default’ LLM without much tinkering, but it’s going to be pretty dumb, and pretty slow on ollama defaults.
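
    For instance, once the server is up, the backend behind that webui also speaks an OpenAI-compatible API (true of llama.cpp’s server, and I’m assuming the ik_llama.cpp fork kept it), so wiring it into other tools is a few lines of Python. The port and model name below are assumptions:

    ```python
    # Minimal sketch: talking to the server behind the webui. llama.cpp's
    # server exposes an OpenAI-compatible endpoint; assuming the
    # ik_llama.cpp fork kept it. Port and model name are assumptions.
    import json
    import urllib.request

    payload = {
        "model": "local-model",  # served model name (assumption)
        "messages": [{"role": "user", "content": "What does MoE expert offloading do?"}],
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
    ```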





  • Oh wow, that’s awesome! I didn’t know folks ran TDP tests like this; I just knew my old 3090 seems to have a minimum sweet spot around that same ~200W based on my own testing, but I figured the 4000 or 5000 series might go lower. Apparently not, at least for the big die.

    I also figured the 395 would draw more than 55W! That’s also awesome! I suspect newer, smaller GPUs like the 9000 or 5000 series still make the value proposition questionable, but you still make an excellent point.

    And for reference, I just checked, and my dGPU hovers around 30W idle with no display connected.
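
    For anyone who wants to spot-check their own card, here’s a minimal sketch using the nvidia-ml-py (pynvml) bindings; they read the same power counters nvidia-smi does, and device index 0 is an assumption for a single-GPU rig:

    ```python
    # Minimal sketch for spot-checking GPU power draw yourself, via the
    # nvidia-ml-py (pynvml) bindings -- the same counters nvidia-smi reads.
    # Device index 0 assumes a single-GPU rig.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # API reports milliwatts
    limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
    print(f"current draw: {draw_w:.0f} W (limit {limit_w:.0f} W)")
    pynvml.nvmlShutdown()
    ```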