What are your favorite models so far?
I think it's a good idea to share experiences with LLMs here, since benchmarks can only give a very rough overview of how well a model performs.
So please share how much you're using LLMs, what you use them for, and how well they perform at those tasks. For example, here are my answers to these questions:
Usage
I use LLMs daily for work and for random questions that I would previously use web search for.
I mainly use LLMs for reasoning-heavy tasks, such as assisting with math or programming. Other frequent tasks include proofreading, helping with bureaucracy, and assisting with writing when it matters.
Models
The one I find most impressive at the moment is TheBloke/airoboros-l2-70B-gpt4-1.4.1-GGML/airoboros-l2-70b-gpt4-1.4.1.ggmlv3.q2_K.bin. It often manages to reason correctly on questions that most humans would get right but where most other models I tried fail. I was surprised that something using only 2.5 bits per weight on average could produce anything but garbage. The downside is that loading times are rather long (time to first token is almost 50s!), so I only ask it a question when I'm willing to wait. I'd love to hear how the bigger quantizations or the unquantized version perform.
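In case anyone wants to try it, this is roughly how I run it via llama-cpp-python (a minimal sketch; the path and parameters are placeholders, and note that ggmlv3 files need an older llama-cpp-python release, since current ones expect GGUF):

```python
# Minimal sketch: loading the q2_K GGML file with llama-cpp-python.
# Assumes an older release that still reads ggmlv3 files.
from llama_cpp import Llama

llm = Llama(
    model_path="airoboros-l2-70b-gpt4-1.4.1.ggmlv3.q2_K.bin",
    n_ctx=2048,    # context window
    n_threads=8,   # adjust to your CPU
)

out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```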
Another one that made a good impression on me is Qwen-7B-Chat (demo). It manages to correctly answer some questions where even some llama2-70b finetunes fail, but so far I've been getting memory leaks when running it on my M1 Mac in fp16 mode, so I haven't used it a lot. (It seems this has been fixed!)
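For reference, this is roughly how I was running it (a minimal sketch; the chat() helper comes from the model's own remote code, and the generation settings are just the defaults):

```python
# Minimal sketch of running Qwen-7B-Chat on an M1 Mac via PyTorch's MPS
# backend. trust_remote_code=True is required because the model ships its
# own architecture code (see the discussion below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    torch_dtype=torch.float16,  # fp16 mode
    trust_remote_code=True,
).to("mps")

# chat() is provided by the model's remote code, not by transformers itself.
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
```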
All the other models I briefly tried were not very useful. It's nice to be able to run them locally, but they were so much worse than ChatGPT that it's often not even worth considering them.
Bit off-topic, but if I'm looking at this correctly, it uses a custom architecture which requires turning on trust_remote_code, and the code that would be embedded into the models and trusted is not included in the repo. In fact, there's no real code in the repo: it's just a bit of boilerplate to run inference and tests. If so, that's kind of spooky, and I suggest being careful not to run inference on those models outside of a locked-down environment like a container.

I think that's a very relevant comment, and I also got spooked by this before I ran it. But I noticed that the GitHub repo and the Hugging Face repo aren't the same: you can find the remote code in the Hugging Face repo. I also briefly skimmed the code for potential causes of the memory leak, but it's not clear to me what's causing it. It could also be PyTorch or one of the Hugging Face libraries, since MPS support is still very much in beta.
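If anyone else wants to vet the code before turning on trust_remote_code, you can fetch just the Python files from the Hub and read them first (a small sketch using huggingface_hub; the pattern list is an assumption):

```python
# Sketch: download only the custom modeling code from the Hub so it can be
# reviewed before running anything with trust_remote_code=True.
from huggingface_hub import snapshot_download

path = snapshot_download(
    "Qwen/Qwen-7B-Chat",
    allow_patterns=["*.py"],  # just the remote-code files, not the weights
)
print("Inspect the files under:", path)
```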
Ahh, interesting.
I mean, it's published by a fairly reputable organization, so the chances of a problem are fairly low, but I'm not sure there's any guarantee that the compiled Python in the pickle matches the source files there. I wrote my own pickle interpreter a while back, and it's an insane file format; I think it would be nearly impossible to verify something like that. Loading a pickle file with the safety stuff disabled is basically the same as running a .pyc file: it can do anything a Python script can. So I think my caution still applies.
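To make that concrete, here's the standard toy demonstration of why loading an untrusted pickle is arbitrary code execution (the payload here just echoes a string, but it could do anything):

```python
# Toy demonstration: unpickling runs whatever __reduce__ returns.
# A malicious checkpoint could do far worse than echo a string.
import os
import pickle

class Payload:
    def __reduce__(self):
        # Tells pickle to call os.system("echo pwned") on load.
        return (os.system, ("echo pwned",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # prints "pwned" -- code ran just by loading the bytes
```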
From their description here: https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md#model
It doesn't seem like anything super crazy is going on. I doubt the issue would be in Transformers or PyTorch.
I'm not completely sure what you mean by "MPS".