
BR
Posts
21
Comments
1,987
Joined
1 yr. ago

  • Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.

    If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze 128K of context from a 32B into 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.
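
    Since TabbyAPI and the llama.cpp server both speak the OpenAI-compatible API, here's a minimal sketch of pointing a client at a local instance. The port (5000 is TabbyAPI's usual default; llama.cpp's server typically uses 8080) and the model name are assumptions, so check your own launch config.

    ```python
    # Minimal sketch: querying a local TabbyAPI or llama.cpp server via its
    # OpenAI-compatible endpoint. Port and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:5000/v1",  # TabbyAPI default; llama.cpp server is usually 8080
        api_key="none",                       # local servers generally ignore the key
    )

    response = client.chat.completions.create(
        model="local-model",  # placeholder; the server serves whatever it was launched with
        messages=[{"role": "user", "content": "Give me a one-paragraph summary of sliding window attention."}],
        max_tokens=256,
        temperature=0.7,
    )
    print(response.choices[0].message.content)
    ```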

  • Isn’t that a textbook Fourth Amendment case?

    I know they supposedly have some kind of holding period, and this has been happening to minorities forever, and technically the mother requested she remain with her children, with no mention of her citizenship status in any of the reporting other than that she likely had a legal visa or something. But she was denied counsel and held. A congresswoman and her office are witnesses.

    It feels so dramatic that you'd think the ACLU or someone would jump on it as a test case.

  • The LLM “engine” is mostly detached from the UI.

    kobold.cpp is actually pretty great, and you can still use it with TabbyAPI (what you run for exllama) and the llama.cpp server.

    I personally love this for writing and testing though:

    https://github.com/lmg-anon/mikupad

    And Open Web UI for more general usage.

    There’s a big backlog of poorly documented knowledge too, heh, just ask if you’re wondering how to cram a specific model in. But the gist of the optimal engine rules is (see the sketch after the list):

    • For MoE models (like Qwen3 30B), try ik_llama.cpp, which is a fork specifically optimized for big MoEs partially offloaded to CPU.
    • For Gemma 3 specifically, use the regular llama.cpp server since it seems to be the only thing supporting the sliding window attention (which makes long context easy).
    • For pretty much anything else, if it’s supported by exllamav3 and you have a 3060, it's optimal to use that (via its server, which is called TabbyAPI). And you can use its quantized cache (try Q6/5) to easily get long context.
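
    As a concrete example of the partial-offload case those rules describe, here's a minimal llama-cpp-python sketch. The file name, layer split, and context size are placeholders rather than recommendations; tune n_gpu_layers to whatever fits your VRAM.

    ```python
    # Minimal sketch: loading a GGUF with llama-cpp-python and offloading only part
    # of the model to the GPU. Path and numbers are illustrative placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical local quant
        n_ctx=32768,       # context window to allocate
        n_gpu_layers=28,   # layers offloaded to the GPU; the rest run on CPU
        n_threads=8,       # CPU threads for the non-offloaded layers
    )

    out = llm("Q: What does GQA change about the KV cache?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])
    ```
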
  • But I remember the context being memory greedy due to being multimodal

    No, it's super efficient! I can run 27B's full 128K on my 3090, easy.

    But you have to use the base llama.cpp server. kobold.cpp doesn't seem to support the sliding window attention (last I checked like two weeks ago), so even a small context takes up a ton there.

    And the image input part is optional. Delete the mmproj file, and it won't load.

    There are all sorts of engine quirks like this, heh, it really is impossible to keep up with.
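
    For a rough sense of why the SWA support matters so much here, a back-of-the-envelope sketch (the layer counts and window size below are approximations for Gemma 3's local/global layout, not exact config values):

    ```python
    # Back-of-the-envelope: how sliding window attention shrinks KV cache growth.
    # Approximate Gemma 3 27B layout: ~62 layers, ~5 local (sliding window) layers
    # per global layer, 1024-token window. Check the model config for exact values.
    ctx = 131072                      # requested context (128K)
    window = 1024                     # sliding window for local layers
    n_layers = 62
    n_global = n_layers // 6          # roughly 1 global layer per 5 local ones
    n_local = n_layers - n_global

    full_tokens = n_layers * ctx                              # every layer caches the full context
    swa_tokens = n_global * ctx + n_local * min(ctx, window)  # local layers cap at the window

    print(f"SWA-aware allocation caches ~{full_tokens / swa_tokens:.1f}x fewer KV entries")
    ```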

  • Yeah it’s basically impossible to keep up with new releases, heh.

    Anyway, Gemma 12B is really popular now, and TBH much smarter than Nemo. You can grab a special “QAT” Q4_0 from Google (it works in kobold.cpp, but fits much more context with base llama.cpp) with basically the same performance as unquantized; I'd highly recommend that.

    I'd also highly recommend trying 24B when you get the rig! It’s so much better than Nemo, even more than the size would suggest, so it should still win out even if you have to go down to 2.9 bpw, I’d wager.

    Qwen3 30B A3B is also popular now, and would work on your 3770 and kobold.cpp with no changes (though there are speed gains to be had with the right framework, namely ik_llama.cpp).

    One other random thing: some of kobold.cpp's sampling presets are very funky with new models. I’d recommend resetting everything to off, then starting with something like 0.4 temp, 0.04 MinP, a 0.02/1024 rep penalty and 0.4 DRY for models newer than Llama 2, not the crazy high-temp sampling the presets normally use.
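
    As a sketch, those settings translate roughly into a raw KoboldCpp API call like the one below. The field names follow KoboldCpp's native /api/v1/generate API and port 5001 is its usual default, but double-check the parameter names against your version; the 0.02/1024 rep penalty is expressed here as a 1.02 multiplier over a 1024-token window.

    ```python
    # Sketch: the suggested sampler settings as a KoboldCpp /api/v1/generate request.
    # Field names and the default port (5001) are assumptions worth verifying.
    import requests

    payload = {
        "prompt": "Write a short scene set in a rainy harbor town.",
        "max_length": 256,
        "temperature": 0.4,
        "min_p": 0.04,
        "rep_pen": 1.02,        # the "0.02" rep penalty as a multiplier
        "rep_pen_range": 1024,  # only penalize repeats in the last 1024 tokens
        "dry_multiplier": 0.4,  # DRY repetition penalty strength
        "top_k": 0,             # leave the other samplers effectively off
        "top_p": 1.0,
    }

    r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
    print(r.json()["results"][0]["text"])
    ```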

    I can host a specific model/quantization on the kobold.cpp API for you to try if you want, to save tweaking time. Just ask (or PM me, as replies sometimes don’t send notifications).

    Good luck with exams! No worries about response times, /c/localllama is a slow, relaxed community.

    Yeah, and disapprove +11 (aka 54%) is the lowest poll (which I see as more like +8 considering the 3% 'undecided' block). It's not even close to congressional or direction-of-the-country disapproval polls, and it's still around where he was last time as president.

    I don't mean to sound combative, but a slight plurality of disapproval is not gonna cut the mustard.

  • Like, not as a personal dig, but the overwhelming amount of people complaining about the DNC just aren’t up to date on what’s happening

    Fair point! I am not up to date TBH.

    I guess I'm pretty jaded too. The DNC getting things together!? What is this?

  • It’s not like the sane among us are suddenly going to decide to go along with fascism

    Oh, you underestimate people's self-interest. If Big Tech continues on its trajectory toward a kind of Thiel-ish cyberpunk dystopia, most people are going to go along. Like, even I have super techy naturalized family that keeps using Google or Facebook stuff. It's (seemingly) too essential.

  • That's optimistic.

    It's assuming the Dem Party doesn't sabotage their own candidates. It's assuming they don't campaign like it's 1960 again. It's assuming social media will somehow be reined in.

    It's assuming there will even be a fair environment for an election, instead of the government (and whoever's conflated with them) putting thumbs on the scales kinda like Hungary, or worse. It doesn't take much pressure to sway elections in environments this polarized.

  • We have to shout for him

    We can't.

    The people who need to hear it are in another bubble and never will.

    TBH I dunno how to fix it anymore. Even 'revolution' like many on Lemmy fantasize about will not penetrate, and most people don't want to understand what propaganda and algos are doing to them.

  • You can definitely quantize exl3s yourself; the process is VRAM-light (albeit time-intensive).

    What 13B are you using? FYI, the old Llama2 13B models don’t use GQA, so even their relatively short 4096 context takes up a lot of VRAM. Newer 12Bs and 14Bs are much more efficient (and much smarter TBH).
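
    To put numbers on the GQA point, here's a quick KV cache estimate. The config values are from memory (Llama2 13B: 40 layers, 40 KV heads, head dim 128; a GQA 12B like Mistral Nemo: 40 layers, 8 KV heads, head dim 128), so verify them against the actual model configs.

    ```python
    # Rough fp16 KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes.
    # Model configs below are from memory and should be double-checked.
    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    print(f"Llama2 13B (no GQA) @ 4K ctx : {kv_cache_gib(40, 40, 128, 4096):.2f} GiB")
    print(f"GQA 12B             @ 4K ctx : {kv_cache_gib(40, 8, 128, 4096):.2f} GiB")
    print(f"GQA 12B             @ 32K ctx: {kv_cache_gib(40, 8, 128, 32768):.2f} GiB")
    ```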

  • In the future, when we're transcendent tentacled robofurries doing poly in virtual space, on drugs (think Yivo from Futurama), we will look back in confusion at why so many people hate homosexuality so much. Like... don't they have other things to worry about?

    Or humanity will be all dead, I guess.

    And I'm talking about the mega conservatives protesting this; at least the Vatican is baby stepping and trying to minimize their cruelty.

  • Technology @lemmy.world

    Alleged AMD Strix Halo APU Appears in Benchmark