Niche Model of the Day: Nemotron 49B 3bpw exl3

turboderp/Llama-3.3-Nemotron-Super-49B-v1-exl3 at 3.0bpw

This is one of the "smartest" models you can fit on a 24GB GPU right now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and it has more STEM world knowledge than QwQ 32B, yet it fits comfortably thanks to the new exl3 quantization!
You need to use a backend that supports exl3, like (at the moment) text-generation-webui or (soon) TabbyAPI.
How does it answer the question: "Using lambda calculus attempt to approach a solution for N=NP."
Heh, calls N=NP out about as politely as it can:
Oh my, this worked much better than I expected. Thanks!