@tehnomad

tehnomad@lemm.ee · 25 days ago

One thing I would do differently is set up LDAP and OIDC so you can use the same authentication credentials across different apps (at least the ones that support them). I use LLDAP and Authelia for this purpose.

tehnomad@lemm.ee · 28 days ago

I found a VRAM calculator for LLMs here: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

Wow, it seems that a 128K context size does need a lot of VRAM (~55 GB). Qwen 72B will take up another ~39 GB, for roughly 94 GB in total, so you would need either 4x 24 GB Nvidia cards or a Mac Pro with 192 GB of RAM. The cheapest option is probably to deploy GPU instances on a service like Runpod. I think you would have to do a lot of processing before you reach the break-even point of buying your own machine.
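For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch. It only adds the two calculator estimates quoted above and rounds up to whole cards. The function name `cards_needed` is made up for illustration, and the sketch assumes memory splits evenly across cards with no per-card overhead, which is optimistic.

```python
import math


def cards_needed(model_gb: float, context_gb: float, card_gb: float) -> int:
    """Smallest number of identical cards whose combined VRAM covers model + context.

    Assumption: memory splits perfectly across cards and there is no
    per-card overhead, so treat the result as a lower bound.
    """
    return math.ceil((model_gb + context_gb) / card_gb)


# Estimates quoted from the calculator: ~39 GB model, ~55 GB for 128K context.
total_gb = 39 + 55
print(total_gb)                   # combined estimate in GB
print(cards_needed(39, 55, 24))   # number of 24 GB cards
```

Swapping in a smaller context estimate shows how quickly the card count drops.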

tehnomad@lemm.ee · 28 days ago

The context cache doesn’t take up too much memory compared to the model. The main benefit of having a lot of VRAM is that you can run larger models. I think you’re better off buying a 24 GB Nvidia card from a cost and performance standpoint.
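As a rough illustration of how the context cache scales, the usual back-of-the-envelope estimate for a transformer KV cache is 2 (keys and values) x layers x KV heads x head dimension x tokens x bytes per value. The sketch below uses that formula. The layer count, head count, and head dimension are hypothetical placeholders, not the confirmed configuration of any particular model, and 16-bit cache values are assumed.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for a single sequence.

    The leading 2 accounts for storing one key and one value vector
    per token, per KV head, per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

short_ctx = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_tokens=8 * 1024)
long_ctx = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_tokens=128 * 1024)

print(short_ctx / GIB)  # cache size in GiB at 8K tokens
print(long_ctx / GIB)   # cache size in GiB at 128K tokens
```

The cache grows linearly with the number of tokens, so it stays small next to the model weights at modest context lengths and only becomes a major cost at very long contexts.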