BR
Posts: 21 · Comments: 1,978 · Joined: 1 yr. ago

  • What model size/family? What GPU? What context length? There are many different backends with different strengths; it's complicated, but I can tell you the optimal way to run it with a bit more specificity, heh.

  • Kobold.cpp is fantastic. Sometimes there are more optimal ways to squeeze models into VRAM (depends on the model/hardware), but TBH I have no complaints.

    I would recommend croco.cpp, a drop-in fork: https://github.com/Nexesenex/croco.cpp

    It has support for the more advanced quantization schemes of ik_llama.cpp. Specifically, you can get really fast performance offloading MoEs, and you can also use much higher-quality quantizations, with even ~3.2bpw being relatively low loss. You'd have to make the quants yourself, but it's quite doable... just poorly documented, heh.
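
    For what it's worth, here's a rough sketch of the quant-making step, assuming you've already converted the model to an f16 GGUF (with llama.cpp's convert_hf_to_gguf.py) and built ik_llama.cpp. The file names and the IQ4_K type are placeholders; check `llama-quantize --help` for what your build actually supports:

    ```python
    import subprocess

    # Hypothetical file names. imatrix.dat comes from a separate llama-imatrix run
    # over some calibration text; it's optional, but it noticeably helps low-bpw quants.
    src = "model-f16.gguf"
    dst = "model-IQ4_K.gguf"

    subprocess.run(
        ["./llama-quantize", "--imatrix", "imatrix.dat", src, dst, "IQ4_K"],
        check=True,
    )
    ```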

    The other warning I'd have is that some of its default sampling presets are funky, if only because they're from the old days of Pygmalion 6B and Llama 1/2. Newer models like much, much lower temperature and rep penalty.

  • It's kinda a hundred little things all pointing in a bad direction:

    https://old.reddit.com/r/LocalLLaMA/comments/1kg20mu/so_why_are_we_shing_on_ollama_again/

    https://old.reddit.com/r/LocalLLaMA/comments/1ko1iob/ollama_violating_llamacpp_license_for_over_a_year/

    https://old.reddit.com/r/LocalLLaMA/comments/1i8ifxd/ollama_is_confusing_people_by_pretending_that_the/

    I would summarize it as "AI Bro"-like behavior:

    • Signs in the code that they're preparing a commercial version of Ollama, likely dumping the free version as a bait and switch.
    • Heavy online marketing.
    • "Reinventing the wheel" to shut out competition, even when base llama.cpp already has it implemented, like with modelfiles and the ollama API.
    • A lot of inexplicable forked behavior.

    Beyond that:

    • Misnaming models for hype reasons, like calling the tiny DeepSeek distills "Deepseek"
    • Technical screw-ups with the backend, chat templates and such hidden from users, so there's no apparent reason why models are misbehaving.
    • Not actually contributing to the core development of the engine.
    • Social media scummery.
    • Treating the user as 'dumb' by hiding things like the default hard 2048-token context window.
    • Not keeping up with technical innovations, like newer quantizations, SWA, batching, other backend stuff.
    • Bad default quantizations, even beyond the above. For instance, no Google QATs (last I checked), no imatrix, no dynamic quants.

    I could go on forever about more specific dramas, and I don't even remember the half of them. But there are plenty of technical and moral reasons to stay away.

    LM Studio is much better put together if you want 1-click. Truly open solutions that are more DIY (and reward you with dramatically better performance from the understanding/learning) are the way if you have the time/patience to burn.

  • I hate to drone on about this again, but:

    • ollama is getting less and less open, and (IMO) should not be used. If that doesn't concern you, you should be using LM Studio anyway.
    • The model sizes they mention are mostly for old models no-one should be using. The only exception is a 70B MoE (Hunyuan), but I think ollama doesn't even support that?
    • The quantization methods they mention are (comparatively) primitive and low performance, not cutting edge.
    • It mentions q8_0 twice, nonsensically... Um, it makes me think this article is AI slop?

    I'm glad openSUSE is promoting local LLM usage, but please... not ollama, and be more specific.

    And don't use ollama to write it without checking :/

  • making the most with what you have

    That was, indeed, the motto of ML research for a long time. Just hacking out more efficient approaches.

    It's people like Altman that introduced the idea of not innovating and just scaling up what you already have. Hence many in the research community know he's full of it.

  • Oh and to answer this specifically: Nvidia GPUs have been used in ML research forever. It goes back to 2008 and stuff like the desktop GTX 280 and CUDA 1.0, maybe earlier.

    Most "AI accelerators" are basically the same thing these days: overgrown desktop GPUs. They have pixel shaders, ROPs, video encoders and everything, with the one partial exception being the AMD MI300X and beyond (which are missing ROPs).

    CPUs were used, too. In fact, Intel made specific server SKUs for giant AI users like Facebook. See: https://www.servethehome.com/facebook-introduces-next-gen-cooper-lake-intel-xeon-platforms/

  • Machine learning has been a field for years, as others said, yeah, but Wikipedia would be a better expansion of the topic. In a nutshell, it's largely about predicting outputs based on trained input examples.

    It doesn't have to be text. For example, astronomers use it to find certain kinds of objects in raw data feeds. Object recognition (identifying things in pictures with little bounding boxes) is an old art at this point. Series prediction models are a thing, and LanguageTool uses a tiny model to detect commonly confused words for grammar checking. And yes, image hashing is another, though not entirely machine learning based. IDK what TinEye does in their backend, but there are some more "oldschool" approaches using more traditional programming techniques, generating signatures for images that can be easily compared in a huge database.
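
    As a toy example of that signature idea, a perceptual hash is only a few lines with the ImageHash library (the file names are placeholders, and TinEye's real pipeline is proprietary, so this is just the generic flavor of it):

    ```python
    from PIL import Image
    import imagehash  # pip install ImageHash

    # Perceptual hashes turn an image into a short signature; visually similar images
    # get signatures with a small Hamming distance, which is cheap to compare at scale.
    h1 = imagehash.phash(Image.open("photo.jpg"))
    h2 = imagehash.phash(Image.open("photo_scaled.jpg"))

    print(h1 - h2)  # Hamming distance; small means "probably the same image"
    ```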

    You've probably run ML models in photo editors, your TV, your phone (voice recognition), desktop video players or something else without even knowing it. They're tools.

    Separately, image similarity metrics (like LPIPS or SSIM) that measure the difference between two images as a number (where, say, 1 would be a perfect match and 0 totally unrelated) are common components in machine learning pipelines. They aren't all machine learning based themselves: SSIM is classical math, while a few, like LPIPS or VMAF (which Netflix developed for video), are learned metrics.
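
    For instance, SSIM between two same-sized images is basically a one-liner with scikit-image (the file names are placeholders):

    ```python
    from skimage import io
    from skimage.metrics import structural_similarity

    # Load both images as grayscale float arrays in [0, 1]; they must be the same size.
    a = io.imread("frame_a.png", as_gray=True)
    b = io.imread("frame_b.png", as_gray=True)

    # SSIM is ~1.0 for near-identical images and drops toward 0 as they diverge.
    score = structural_similarity(a, b, data_range=1.0)
    print(f"SSIM: {score:.3f}")
    ```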

    Text embedding models do the same with text. They are ML models.
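
    Same idea in code with the sentence-transformers library (the model name is just a common small default, not a recommendation):

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

    # Each sentence becomes a vector; cosine similarity between the vectors approximates
    # how related the two sentences are in meaning (closer to 1 = more similar).
    emb = model.encode(["The cat sat on the mat.", "A kitten is resting on a rug."])
    print(util.cos_sim(emb[0], emb[1]))
    ```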

    LLMs (aka models designed to predict the next 'word' in a block of text, one at a time, as we know them) in particular have an interesting history, going back to (if I even remember the name correctly) BERT in Google's labs. There were also tiny LLMs people did run on personal GPUs before ChatGPT was ever a thing, like the infamous Pygmalion 6B roleplaying bot, a finetune of GPT-J 6B. They were primitive and dumb, but it felt like witchcraft back then (before AI Bro marketers poisoned the well).

  • Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.
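
    The expert-offload trick is basically one flag on the server launch. Here's a rough sketch of how I'd kick it off from Python; the model path, layer count, and tensor-override pattern are placeholders, and the override syntax has shifted between ik_llama.cpp versions, so check your build's --help:

    ```python
    import subprocess

    # Keep attention/shared weights on the GPU, push the big MoE expert tensors to CPU RAM.
    subprocess.run([
        "./llama-server",
        "-m", "some-moe-model.gguf",  # placeholder path
        "-ngl", "99",                 # offload all layers...
        "-ot", "exps=CPU",            # ...but override the expert tensors back to CPU
        "-c", "16384",
    ], check=True)
    ```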

    If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze in 128K context from a 32B in 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.

  • Isn’t that a textbook Fourth Amendment case?

    I know they supposedly have some kind of holding period, and this has been happening to minorities forever. Technically, the mother requested to remain with her children, and there's no mention of her citizenship status in any of the reporting, other than that she likely had a legal visa or something. But she was denied counsel and held. A congresswoman and her office are witnesses.

    It feels so dramatic that you'd think the ACLU or someone would jump on it as a test case.

  • The LLM “engine” is mostly detached from the UI.

    kobold.cpp is actually pretty great, and you can still use it with TabbyAPI (what you run for exllama) and the llama.cpp server.

    I personally love this for writing and testing though:

    https://github.com/lmg-anon/mikupad

    And Open Web UI for more general usage.

    There’s a big backlog of poorly documented knowledge too, heh, just ask if you’re wondering how to cram a specific model in. But the “gist” of the optimal engine rules is:

    • For MoE models (like Qwen3 30B), try ik_llama.cpp, which is a fork specifically optimized for big MoEs partially offloaded to CPU.
    • For Gemma 3 specifically, use the regular llama.cpp server since it seems to be the only thing supporting the sliding window attention (which makes long context easy).
    • For pretty much anything else, if it's supported by exllamav3 and you have a 3060, it's optimal to use that (via its server, which is called TabbyAPI). And you can use its quantized cache (try Q6/5) to easily get long context. There's a minimal client sketch right after this list.
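
    Since TabbyAPI (and the llama.cpp server) speak the OpenAI-style API, pointing a client at them is about this much code; the port and key here are just a default-ish TabbyAPI setup, so adjust to whatever you configured:

    ```python
    from openai import OpenAI

    # TabbyAPI usually listens on port 5000 and expects the api_key from its config;
    # the llama.cpp server works the same way on its own port, typically without a key.
    client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-key")

    resp = client.chat.completions.create(
        model="whatever-you-loaded",  # TabbyAPI serves whichever model you loaded
        messages=[{"role": "user", "content": "Summarize sliding window attention in one line."}],
        max_tokens=128,
        temperature=0.4,
    )
    print(resp.choices[0].message.content)
    ```
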
  • But I remember the context being memory greedy due to being multimodal

    No, it's super efficient! I can run 27B's full 128K on my 3090, easy.

    But you have to use the base llama.cpp server. kobold.cpp doesn't seem to support the sliding window attention (last I checked like two weeks ago), so even a small context takes up a ton there.

    And the image input part is optional. Delete the mmproj file, and it won't load.

    There are all sorts of engine quirks like this, heh, it really is impossible to keep up with.

  • Yeah it’s basically impossible to keep up with new releases, heh.

    Anyway, Gemma 12B is really popular now, and TBH much smarter than Nemo. You can grab a special “QAT” Q4_0 from Google (it works in kobold.cpp, but fits much more context with base llama.cpp) with basically the same performance as unquantized; I'd highly recommend that.

    I'd also highly recommend trying 24B when you get the rig! It’s so much better than Nemo, even more than the size would suggest, so it should still win out even if you have to go down to 2.9 bpw, I’d wager.

    Qwen3 30B A3B is also popular now, and would work on your 3770 and kobold.cpp with no changes (though there are speed gains to be had with the right framework, namely ik_llama.cpp)

    One other random thing: some of kobold.cpp's sampling presets are very funky with newer models. With anything newer than Llama 2, I’d recommend resetting everything to off, then starting with something like 0.4 temp, 0.04 MinP, 1.02/1024 rep penalty and 0.4 DRY, rather than the crazy-high-temp sampling the old presets use (see the sketch below).
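
    And if you'd rather set that programmatically than click through the UI, it looks roughly like this against kobold.cpp's generate endpoint; the field names are from memory of the KoboldAI-style API and the port is kobold's usual default, so double-check against your build:

    ```python
    import requests

    # Conservative sampling for newer models; verify the exact field names locally,
    # since they can shift between kobold.cpp versions.
    payload = {
        "prompt": "Write a two-sentence scene set in a rainy train station.",
        "max_length": 200,
        "temperature": 0.4,
        "min_p": 0.04,
        "rep_pen": 1.02,
        "rep_pen_range": 1024,
        "dry_multiplier": 0.4,  # DRY, if your build exposes it
    }

    r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
    r.raise_for_status()
    print(r.json()["results"][0]["text"])
    ```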

    I can host a specific model/quantization on the kobold.cpp API for you to try if you want, to save tweaking time. Just ask (or PM me, as replies sometimes don’t send notifications).

    Good luck with exams! No worries about response times, /c/localllama is a slow, relaxed community.

  • politics @lemmy.world

    Trump floats regime change in Iran

    World News @lemmy.world

    Israel bombs Iranian state TV during live broadcast

    United States | News & Politics @lemmy.ml

    Scoop: Four reasons Musk attacked Trump's "big beautiful bill"

    World News @lemmy.world

    Israel plans to occupy and flatten all of Gaza if no deal by Trump's trip

    LocalLLaMA @sh.itjust.works

    Qwen3 "Leaked"

    LocalLLaMA @sh.itjust.works

    Niche Model of the Day: Nemotron 49B 3bpw exl3

    LocalLLaMA @sh.itjust.works

    Niche Model of the Day: Openbuddy 25.2q, QwQ 32B with Quantization Aware Training

    Ask Lemmy @lemmy.world

    How do y'all post clips/animations on Lemmy? Only GIF seems to work.

    politics @lemmy.world

    Trump 2.0 initial approval ratings higher than in first term

    politics @lemmy.world

    Behind the Curtain: Meta's make-up-with-MAGA map

    Enough Musk Spam @lemmy.world

    Elon Musk's headline dominance squeezes other CEOs

    politics @lemmy.world

    Trump sides with Musk in H-1B fight

    politics @lemmy.world

    Elon Musk pledges "war" over H-1B visa program, calls opponents racists

    politics @lemmy.world

    Musk calls MAGA element "contemptible fools" as virtual civil war brews

    Technology @lemmy.world

    Shipping Listing Suggests 24GB+ Intel Arc B580

    Selfhosted @lemmy.world

    Guide to Self Hosting LLMs Faster/Better than Ollama

    LocalLLaMA @sh.itjust.works

    Qwen2.5: A Party of Foundation Models!

    Ask Lemmy @lemmy.world

    How does Lemmy feel about "open source" machine learning, akin to the Fediverse vs Social Media?

    World News @lemmy.world

    Pressure grows as "last chance" negotiations for Gaza deal resume

    News @lemmy.world

    Hostage-ceasefire deal talks stall over new Netanyahu demands, Israeli officials say