Posts: 1 · Comments: 28 · Joined: 2 yr. ago

  • I have a similar list of prompts/test cases that I use.

    However, my experience has been that all fine-tuned LLaMa models give pretty much the same results. I haven't found a model that passes any of my "test cases" where others have failed (and, until OpenOrca Preview 2, none had failed a test case that others passed). The models all feel about the same in terms of actual ability; the only noticeable difference is that they phrase their answers slightly differently.
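
    Roughly, the setup looks something like this (a minimal sketch assuming llama-cpp-python and local GGUF model files; the model names and test prompts below are just placeholders):

    ```python
    from llama_cpp import Llama

    # Placeholder test prompts, written in a generic Alpaca-style instruction format.
    TEST_PROMPTS = [
        "### Instruction:\nWrite a Python function that reverses a string.\n\n### Response:\n",
        "### Instruction:\nI have 3 apples and eat 2. How many are left?\n\n### Response:\n",
    ]

    # Placeholder local model files for the fine-tunes being compared.
    MODEL_PATHS = ["./wizardlm-13b.gguf", "./openorca-13b.gguf"]

    for path in MODEL_PATHS:
        llm = Llama(model_path=path)
        print(f"===== {path} =====")
        for prompt in TEST_PROMPTS:
            # Deterministic sampling so differences come from the model, not the seed.
            out = llm(prompt, max_tokens=256, temperature=0.0)
            print(out["choices"][0]["text"].strip())
            print("-" * 40)
    ```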

  • Yeah, I'm aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes the documentation contradicts the preset for the model in text-generation-webui, in which case I usually try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate); sometimes I also try a deterministic setting (rough sketch at the end of this comment).

    I wasn't aware that the benchmarks are multi-shot. I haven't looked much into how the benchmarks are actually run, tbh, but this is useful to know for comparison.
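
    For reference, "defaults plus a few regenerations" versus a deterministic run looks roughly like this (a minimal sketch assuming llama-cpp-python and a local GGUF file; the model path and prompt are placeholders, and the exact default values vary between versions):

    ```python
    from llama_cpp import Llama

    # Placeholder model path -- any local GGUF model file.
    llm = Llama(model_path="./model.gguf")

    # Placeholder prompt in a generic Alpaca-style instruction format.
    prompt = "### Instruction:\nExplain why the sky is blue.\n\n### Response:\n"

    # Library defaults (roughly temperature=0.8, top_p=0.95, top_k=40),
    # regenerated a few times to give the model more than one attempt.
    for attempt in range(3):
        out = llm(prompt, max_tokens=256)
        print(f"--- attempt {attempt + 1} ---")
        print(out["choices"][0]["text"].strip())

    # Deterministic setting: temperature 0 makes sampling effectively greedy,
    # so repeated runs produce the same answer.
    out = llm(prompt, max_tokens=256, temperature=0.0)
    print(out["choices"][0]["text"].strip())
    ```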

  • I see your point, and we are currently at the "trying to look good on benchmarks" stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress: researchers and developers look at the benchmarks and say "it's X percent, that's a big improvement" while ignoring real-world performance.

    Questions like "how important is parameter count?" (I think it's more important than people currently acknowledge) are being left unanswered, because meanwhile people are saying "here's a 13B-parameter model that scores X percent compared to GPT-3", as if smaller = better, even though the smaller size may be trading actual reasoning ability for patterns that merely score well on benchmarks. And new training methods (see: Evol-Instruct, Orca) are being developed on the basis of benchmark comparisons, not their real-world performance.

    I get that benchmarks are an important and useful tool, and that performing well on them is a motivating factor in an emerging and competitive industry. But I can't accept such an immediately noticeable decline in real-world performance (the model literally craps itself) compared to previous models while the release simultaneously brags about how outstanding the benchmark numbers are.

  • I am getting very poor results with this model. Its coding ability is noticeably worse than LLaMa 2's. It readily produces output that claims to follow a logical progression of steps, but often the final answer isn't consistent with that reasoning, or the steps themselves aren't actually correct or logical.

    Curious to know if other people who have tried it are getting the same results.

  • I thought the original LLaMa was not particularly good for conversational interaction, since it isn't instruction fine-tuned? As I understand it, its mode of operation is always "continue the provided text": instead of asking "please write an article about...", you would write the title and opening paragraph of the article yourself and the model would continue it.
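
    Something like this, in other words (a minimal sketch assuming llama-cpp-python; the model file and the text being continued are made-up examples):

    ```python
    from llama_cpp import Llama

    # Placeholder path to a base (non-instruct) model file.
    llm = Llama(model_path="./llama-base.gguf")

    # Instead of an instruction ("please write an article about X"),
    # give the base model the opening of the text you want it to continue.
    prompt = (
        "Running Language Models Locally\n\n"
        "Over the past year, running large language models on consumer hardware "
        "has gone from a research curiosity to an everyday hobbyist activity."
    )

    out = llm(prompt, max_tokens=200)
    print(prompt + out["choices"][0]["text"])
    ```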