Posts 89 · Comments 149 · Joined 2 yr. ago

  • Didn't do well at RAG, but maybe that's because RAG is mostly handled by the wrapper rather than relying on the output of the model. I think it'll matter more for langchain/guidance and the example they gave (rough sketch of what I mean below).
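
    Just to illustrate what I mean by "handled by the wrapper": in a typical RAG setup the retrieval and prompt assembly happen outside the model, and the model only ever sees the final stuffed prompt. `search_index` and `generate` below are hypothetical stand-ins for whatever vector store and LLM backend you're using; this is a minimal sketch, not any particular framework's API.

    ```python
    def rag_answer(question: str, search_index, generate, k: int = 3) -> str:
        # 1. The wrapper retrieves the top-k relevant chunks (the model isn't involved here)
        chunks = search_index.search(question, top_k=k)

        # 2. The wrapper stuffs them into a prompt template
        context = "\n\n".join(chunk.text for chunk in chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )

        # 3. Only now does the model matter: it just has to read the context and answer
        return generate(prompt)
    ```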

  • I feel like for non-coding tasks you're sadly better off using a 13B model; CodeLlama lost a lot of knowledge/chattiness from its coding fine-tuning.

    THAT SAID, it actually kind of depends on what you're trying to do: if you're aiming for RP, don't bother; if you're thinking about summarization, logic tasks, or RAG, CodeLlama may do totally fine, so more info would help.

    If you have 24GB of VRAM (my assumption if you can load a 34B) you could also play around with a 70B at 2.4bpw using exllamav2 (if that made no sense, let me know if it interests you and I'll elaborate; there's a rough sketch below), but it'll probably be slower.
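
    Very roughly, loading a 2.4bpw EXL2 quant with the exllamav2 Python API looks something like this. The model directory and sampler settings are placeholders, and the exact API may differ between versions, so treat it as a sketch rather than a recipe:

    ```python
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    # Point at a local EXL2 quant, e.g. a 70B quantized to ~2.4 bits per weight
    config = ExLlamaV2Config()
    config.model_dir = "/models/llama2-70b-exl2-2.4bpw"  # placeholder path
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)   # allocate the KV cache as layers load
    model.load_autosplit(cache)                # split weights across available VRAM

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.7

    print(generator.generate_simple("Tell me about llamas.", settings, 200))
    ```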

  • I LOVE Orca tunes; they almost always end up feeling like smarter versions of the base model, so I'm looking forward to trying this one out when the GPTQ is finished.

    GPTQ/AWQ links:

    https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ

    https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ

    Does sliding window attention speed up inference? I thought it was more about extending the context beyond what the model was trained on. I suppose I could see it being used to drop context, which would save on memory/inference, but I didn't think that was the point of it, just a happy side effect. I could be wrong though (rough sketch of the masking idea below).
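
    For anyone curious what the window actually does, here's a minimal sketch of a sliding-window causal mask (window size made up): each token only attends to the last `window` tokens instead of the whole history, which is where any memory/compute savings would come from.

    ```python
    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        # True means key position j is visible to query position i
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        causal = j <= i                          # can't look at the future
        in_window = (i - j) < window             # only the last `window` tokens
        return causal & in_window

    # Example: 8 tokens with a window of 4 -- token 7 attends to tokens 4..7 only
    print(sliding_window_causal_mask(8, 4).int())
    ```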

  • The abstract is meant to pull in random readers, so it's understandable they'd lay a bit of foundation about what the paper will be about, even if it seems rather simple and unnecessarily wordy

    LoRA is still considered the gold standard in efficient fine-tuning, which is why a lot of comparisons are made to it instead of QLoRA, which is more of a hack layered on top of it. They both have their advantages, but are pretty distinct.

    Another thing worth pointing out is that 4-bit is not actually just converting all 16-bit weights into 4 bits (at least, not in GPTQ style). They also save a quantization scale factor, so there's more information that can be retrieved from the final quantization than just "multiply everything by 4" (there's a tiny sketch at the bottom of this comment).

    QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it's just about the starting and ending state. QLoRA also introduced a lot of other techniques to make it work, like double quantization, the NormalFloat (NF4) data type, and paged optimizers.

    It's also worth pointing out that not understanding it has nothing to do with intellect; it's just a matter of how much foundational knowledge you have. I don't understand most of the math, but I've read enough of the papers to understand to some degree what's going on.

    The one thing I can't quite figure out: I know QLoRA is competitive with a LoRA because it applies adapters to more layers of the transformer than a typical LoRA, but I don't see any specific mention of QA-LoRA following that same method, which I would think is needed to maintain the quality.

    Overall you're right though, this paper is a bit on the weaker side. That said, if it works then it works and it's a pretty decent discovery, but the paper alone doesn't guarantee that.
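
    To illustrate the scale-factor point from above: here's a hand-wavy sketch of group-wise 4-bit quantization and dequantization. Real GPTQ does much more than this (error-aware rounding, per-group scales over e.g. 128 weights), but it shows why storing a scale gives you back more than "multiply by a fixed factor".

    ```python
    import numpy as np

    def quantize_group(weights: np.ndarray):
        # Map the group's range onto 4-bit signed integers and remember the scale
        scale = np.abs(weights).max() / 7.0      # 4-bit signed range is roughly -8..7
        q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
        return q, scale                           # the scale is stored alongside the 4-bit ints

    def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
        # Recover an approximation of the original weights using the stored scale
        return q.astype(np.float32) * scale

    w = np.array([0.02, -0.15, 0.31, -0.07], dtype=np.float32)
    q, scale = quantize_group(w)
    print(q, scale, dequantize_group(q, scale))
    ```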

  • By far the biggest pain point of Sony: their software is clean, stable, and fast, with an acceptable release cadence, but their promise of only 2 years of updates is completely unacceptable these days.

    Wish there was any way at all to influence them

  • There are plenty of smaller projects around that attempt to solve similar problems: metagpt, agent os, gpt-pilot, gpt-engineer, autochain, etc.

    Several of them would, I'm sure, love a hand; you should check them out on GitHub!

  • It seems reasonably realistic if you compare it to Code Interpreter, which was able to recognize packages it hadn't installed and go seek them out. I don't think it's outside the scope for it to recognize which module wasn't installed and install it (rough sketch of the idea below).

    Even now, regular models will suggest the packages you need to install before executing the code they provide.
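
    Purely to illustrate the "recognize and install" loop (this isn't any specific agent's implementation), the mechanical part is pretty simple; the interesting part an agent/LLM has to get right is mapping the module name to the pip package name:

    ```python
    import importlib
    import subprocess
    import sys

    def ensure_module(module_name: str, pip_name: str | None = None):
        # Try to import the module; if it's missing, pip-install it and try again.
        # pip_name handles cases where module and package names differ
        # (e.g. "cv2" -> "opencv-python").
        try:
            return importlib.import_module(module_name)
        except ModuleNotFoundError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
            return importlib.import_module(module_name)

    requests = ensure_module("requests")
    ```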

  • Sure, I can try to add a couple lines on top of the abstract just to give a super brief synopsis

    In this case it would be something like:

    This paper discusses a new technique for creating a LoRA for an already-quantized model. This is distinct from QLoRA, which quantizes the full model on the fly to create a quantized LoRA. With this approach you can take your small quantized model and work with it as-is, saving a ton of resources and speeding up the process massively (rough sketch of the workflow below).
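
    As a very rough illustration of that workflow, here's the generic "attach LoRA adapters to an already-quantized checkpoint" pattern with Hugging Face transformers + peft. To be clear, this is not the paper's actual QA-LoRA algorithm; the model name and hyperparameters are placeholders, and loading a GPTQ checkpoint this way assumes you have the GPTQ integration (optimum/auto-gptq) installed:

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Start from a checkpoint that's already quantized, instead of quantizing
    # a full-precision model on the fly like QLoRA does.
    model_id = "TheBloke/Mistral-7B-OpenOrca-GPTQ"  # placeholder quantized checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Attach small trainable LoRA adapters on top of the frozen quantized weights
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # which projections to adapt is a choice
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable
    ```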

  • This is great and comes with a very interesting model!

    I wonder if they cleverly slide the window in any way or if it's just a naive slide. It could probably be pretty smart if you discarded the tokens that have minimal attention on them anyway, to focus on the important text.

    For now, this is awesome!

  • The good news is that if you do it wrong, much like regular speculative generation, you will still get the same result the full model would output at the end, so there won't be any loss in quality, just a loss in speed.

    It's definitely a good point though: finding the optimal configuration is the difference between a slowdown/minimal speedup and a potentially huge speedup (rough sketch of the accept/reject loop below).
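
    For anyone who hasn't seen it, here's a toy sketch of why the output quality is preserved in greedy speculative decoding. `draft_next` and `target_next` are hypothetical one-token greedy predictors for the draft model and the full model; real implementations verify the whole draft chunk in one batched forward pass, which is where the speedup actually comes from.

    ```python
    def speculative_decode(prompt_tokens, draft_next, target_next, draft_len=4, max_new=64):
        # The draft model guesses a few tokens ahead, the full model checks each
        # guess, and any mismatch falls back to the full model's own token -- so
        # the output is identical to plain greedy decoding with the full model.
        tokens = list(prompt_tokens)
        target_len = len(prompt_tokens) + max_new
        while len(tokens) < target_len:
            # 1. Draft model speculates a short continuation
            draft = []
            for _ in range(draft_len):
                draft.append(draft_next(tokens + draft))

            # 2. Full model verifies the guesses; we keep its token either way
            for guess in draft:
                verified = target_next(tokens)
                tokens.append(verified)
                if verified != guess or len(tokens) >= target_len:
                    break  # throw away the rest of the draft
        return tokens
    ```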

  • I'm not really sure I follow; it's just a simplification. The most appropriate phrasing, I guess, would be "given A belongs to B, does it know B 'owns' A", like the examples given with "A is the son of B, is B the parent of A".

  • To start, everything you're saying is entirely correct

    However, the existence of emergent behaviours like chain-of-thought reasoning shows that there's more to this than pure text prediction; the model picks up patterns it was never explicitly trained on, so it's entirely reasonable to wonder whether it can recognize reversed patterns.

    Hallucinations are a vital part of understanding the models. They might not be long-term problems, but getting models to understand what they actually know to be true is extremely important for the growth and adoption of LLMs.

    I think there's a lot more to the training and generation of text than you're giving it credit for. The simplest way to explain it is that it's text prediction, but there's way too much depth to the training and the model to say that's all it is.

    At the end of the day it's just a fun, thought-provoking post :) but when Andrej Karpathy says he doesn't have a great intuition for how LLM knowledge works (though in fairness he theorizes the same as you, directional learning), I think we can at least agree none of us knows for sure what is correct!

  • LocalLLaMA @sh.itjust.works

    Meta’s Next AI Attack on OpenAI: Free Code-Generating Software

    Selfhosted @lemmy.world

    SimpleSecretsManager: A python library to manage encrypted secrets

    LocalLLaMA @sh.itjust.works

    A note on the importance of prompt and template formatting - as seen from starcoder

    LocalLLaMA @sh.itjust.works

    How is LLaMa.cpp possible? - Article

    LocalLLaMA @sh.itjust.works

    GGUF progressing nicely, ggerganov is back on it tomorrow!

    LocalLLaMA @sh.itjust.works

    Decoding intermediate activations in llama-2-7b — LessWrong

    LocalLLaMA @sh.itjust.works

    Oobabooga text-generation-webui now has ctransformers support!

    LocalLLaMA @sh.itjust.works

    Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications

    LocalLLaMA @sh.itjust.works

    GitHub - neuml/txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

    LocalLLaMA @sh.itjust.works

    WizardLM-70B-V1.0 Released on HF

    LocalLLaMA @sh.itjust.works

    Turbopilot release v0.1.0

    LocalLLaMA @sh.itjust.works

    Chai is running their own open source leaderboard

    LocalLLaMA @sh.itjust.works

    Dolphin 7B by Eric Hartford based on Llama 2 released

    LocalLLaMA @sh.itjust.works

    PSA for any docker users, strange driver mismatch

    LocalLLaMA @sh.itjust.works

    Any suggestions for this community?

    LocalLLaMA @sh.itjust.works

    Open-Orca has released their second preview of OpenChat - Hugging Face

    Lemmy @lemmy.ml

    Best place to host a lemmy wiki

    LocalLLaMA @sh.itjust.works

    Large language models, explained with a minimum of math and jargon

    LocalLLaMA @sh.itjust.works

    Huggingface Text Generation Inference adds exllama support

    LocalLLaMA @sh.itjust.works

    A nice write up for LMQL