BR
Posts: 21 · Comments: 1,978 · Joined: 1 yr. ago

  • What model size/family? What GPU? What context length? There are many different backends with different strengths; it's complicated, but I can tell you the optimal way to run it with a bit more specificity, heh.

  • Kobold.cpp is fantastic. Sometimes there are more optimal ways to squeeze models into VRAM (depends on the model/hardware), but TBH I have no complaints.

    I would recommend croco.cpp, a drop-in fork: https://github.com/Nexesenex/croco.cpp

    It has support for the more advanced quantization schemes of ik_llama.cpp. Specifically, you can get really fast performance offloading MoEs, and you can also use much higher-quality quantizations, with even ~3.2bpw being relatively low loss. You'd have to make the quants yourself, but it's quite doable... just poorly documented, heh.
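
    For what it's worth, here's a rough sketch of the quant-making step, assuming you've already converted the model to an f16 GGUF (with llama.cpp's convert_hf_to_gguf.py) and built ik_llama.cpp. The file names and the IQ4_K type are placeholders; check `llama-quantize --help` for what your build actually supports:

    ```python
    import subprocess

    # Hypothetical file names. imatrix.dat comes from a separate llama-imatrix run
    # over some calibration text; it's optional, but it noticeably helps low-bpw quants.
    src = "model-f16.gguf"
    dst = "model-IQ4_K.gguf"

    subprocess.run(
        ["./llama-quantize", "--imatrix", "imatrix.dat", src, dst, "IQ4_K"],
        check=True,
    )
    ```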

    The other warning I'd have is that some of its default sampling presets are funky, if only because they're from the old days of Pygmalion 6B and Llama 1/2. Newer models like much, much lower temperature and rep penalty.

  • It's kinda a hundred little things all pointing in a bad direction:

    https://old.reddit.com/r/LocalLLaMA/comments/1kg20mu/so_why_are_we_shing_on_ollama_again/

    https://old.reddit.com/r/LocalLLaMA/comments/1ko1iob/ollama_violating_llamacpp_license_for_over_a_year/

    https://old.reddit.com/r/LocalLLaMA/comments/1i8ifxd/ollama_is_confusing_people_by_pretending_that_the/

    I would summarize it as "AI Bro"-like behavior:

    • Signs in the code that they're preparing a commercial version of Ollama, likely dumping the free version as a bait and switch.
    • Heavy online marketing.
    • "Reinventing the wheel" to shut out competition, even when base llama.cpp already has it implemented, like with modelfiles and the ollama API.
    • A lot of inexplicable forked behavior.

    Beyond that:

    • Misnaming models for hype reasons, like calling the tiny DeepSeek distills "Deepseek"
    • Technical screw-ups with the backend, chat templates and such hidden from users, so there's no apparent reason why models are misbehaving.
    • Not actually contributing to the core development of the engine.
    • Social media scummery.
    • Treating the user as 'dumb' by hiding things like the default hard 2048-token context window.
    • Not keeping up with technical innovations, like newer quantizations, SWA, batching, other backend stuff.
    • Bad default quantizations, even beyond the above. For instance, no Google QATs (last I checked), no imatrix, no dynamic quants.

    I could go on forever about more specific dramas, and I don't even remember the half of them. But there are plenty of technical and moral reasons to stay away.

    LM Studio is much better put together if you want 1-click. Truly open solutions that are more DIY (and reward you with dramatically better performance from the understanding/learning) are the way if you have the time/patience to burn.

  • I hate to drone on about this again, but:

    • ollama is getting less and less open, and (IMO) should not be used. If that doesn't concern you, you should be using LM Studio anyway.
    • The model sizes they mention are mostly for old models no-one should be using. The only exception is a 70B MoE (Hunyuan), but I think ollama doesn't even support that?
    • The quantization methods they mention are (comparatively) primitive and low performance, not cutting edge.
    • It mentions q8_0 twice, nonsensically... Um, it makes me think this article is AI slop?

    I'm glad openSUSE is promoting local LLM usage, but please... not ollama, and be more specific.

    And don't use ollama to write it without checking :/

  • making the most with what you have

    That was, indeed, the motto of ML research for a long time. Just hacking out more efficient approaches.

    It's people like Altman that introduced the idea of not innovating and just scaling up what you already have. Hence many in the research community know he's full of it.

  • Oh and to answer this specifically: Nvidia GPUs have been used in ML research forever. It goes back to 2008 and stuff like the desktop GTX 280 and CUDA 1.0, maybe earlier.

    Most "AI accelerators" are basically the same thing these days: overgrown desktop GPUs. They have pixel shaders, ROPs, video encoders and everything, with the one partial exception being the AMD MI300X and beyond (which are missing ROPs).

    CPUs were used, too. In fact, Intel made specific server SKUs for giant AI users like Facebook. See: https://www.servethehome.com/facebook-introduces-next-gen-cooper-lake-intel-xeon-platforms/

  • Machine learning has been a field for years, as others said, yeah, but Wikipedia would be a better expansion of the topic. In a nutshell, it's largely about predicting outputs based on trained input examples.

    It doesn't have to be text. For example, astronomers use it to find certain kinds of objects in raw data feeds. Object recognition (identifying things in pictures with little bounding boxes) is an old art at this point. Series prediction models are a thing, and LanguageTool uses a tiny model to detect commonly confused words for grammar checking. And yes, image hashing is another, though not entirely machine learning based. IDK what TinEye does in their backend, but there are some more "oldschool" approaches using more traditional programming techniques, generating signatures for images that can be easily compared in a huge database.
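
    As a toy example of that signature idea, a perceptual hash is only a few lines with the ImageHash library (the file names are placeholders, and TinEye's real pipeline is proprietary, so this is just the generic flavor of it):

    ```python
    from PIL import Image
    import imagehash  # pip install ImageHash

    # Perceptual hashes turn an image into a short signature; visually similar images
    # get signatures with a small Hamming distance, which is cheap to compare at scale.
    h1 = imagehash.phash(Image.open("photo.jpg"))
    h2 = imagehash.phash(Image.open("photo_scaled.jpg"))

    print(h1 - h2)  # Hamming distance; small means "probably the same image"
    ```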

    You've probably run ML models in photo editors, your TV, your phone (voice recognition), desktop video players or something else without even knowing it. They're tools.

    Separately, image similarity metrics (like LPIPS or SSIM) that measure the difference between two images as a number (where, say, 1 would be a perfect match and 0 totally unrelated) are common components in machine learning pipelines. They aren't all machine learning based themselves: SSIM is classical math, while a few, like LPIPS or VMAF (which Netflix developed for video), are learned metrics.
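
    For instance, SSIM between two same-sized images is basically a one-liner with scikit-image (the file names are placeholders):

    ```python
    from skimage import io
    from skimage.metrics import structural_similarity

    # Load both images as grayscale float arrays in [0, 1]; they must be the same size.
    a = io.imread("frame_a.png", as_gray=True)
    b = io.imread("frame_b.png", as_gray=True)

    # SSIM is ~1.0 for near-identical images and drops toward 0 as they diverge.
    score = structural_similarity(a, b, data_range=1.0)
    print(f"SSIM: {score:.3f}")
    ```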

    Text embedding models do the same with text. They are ML models.
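
    Same idea in code with the sentence-transformers library (the model name is just a common small default, not a recommendation):

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model

    # Each sentence becomes a vector; cosine similarity between the vectors approximates
    # how related the two sentences are in meaning (closer to 1 = more similar).
    emb = model.encode(["The cat sat on the mat.", "A kitten is resting on a rug."])
    print(util.cos_sim(emb[0], emb[1]))
    ```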

    LLMs (aka models designed to predict the next 'word' in a block of text, one at a time, as we know them) in particular have an interesting history, going back to (if I even remember the name correctly) BERT in Google's labs. There were also tiny LLMs people did run on personal GPUs before ChatGPT was ever a thing, like the infamous Pygmalion 6B roleplaying bot, a finetune of GPT-J 6B. They were primitive and dumb, but it felt like witchcraft back then (before AI Bro marketers poisoned the well).

  • Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.
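
    The expert-offload trick is basically one flag on the server launch. Here's a rough sketch of how I'd kick it off from Python; the model path, layer count, and tensor-override pattern are placeholders, and the override syntax has shifted between ik_llama.cpp versions, so check your build's --help:

    ```python
    import subprocess

    # Keep attention/shared weights on the GPU, push the big MoE expert tensors to CPU RAM.
    subprocess.run([
        "./llama-server",
        "-m", "some-moe-model.gguf",  # placeholder path
        "-ngl", "99",                 # offload all layers...
        "-ot", "exps=CPU",            # ...but override the expert tensors back to CPU
        "-c", "16384",
    ], check=True)
    ```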

    If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze in 128K context from a 32B in 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.

  • Isn’t that a textbook Fourth Amendment case?

    I know they supposedly have some kind of holding period, and this has been happening to minorities forever. Technically, the mother requested to remain with her children, and there's no mention of her citizenship status in any of the reporting, other than that she likely had a legal visa or something. But she was denied counsel and held. A congresswoman and her office are witnesses.

    It feels so dramatic that you'd think the ACLU or someone would jump on it as a test case.

  • The LLM “engine” is mostly detached from the UI.

    kobold.cpp is actually pretty great, and you can still use it with TabbyAPI (what you run for exllama) and the llama.cpp server.

    I personally love this for writing and testing though:

    https://github.com/lmg-anon/mikupad

    And Open Web UI for more general usage.

    There’s a big backlog of poorly documented knowledge too, heh, just ask if you’re wondering how to cram a specific model in. But the “gist” of the optimal engine rules is:

    • For MoE models (like Qwen3 30B), try ik_llama.cpp, which is a fork specifically optimized for big MoEs partially offloaded to CPU.
    • For Gemma 3 specifically, use the regular llama.cpp server since it seems to be the only thing supporting the sliding window attention (which makes long context easy).
    • For pretty much anything else, if it's supported by exllamav3 and you have a 3060, it's optimal to use that (via its server, which is called TabbyAPI). And you can use its quantized cache (try Q6/5) to easily get long context. There's a minimal client sketch right after this list.
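
    Since TabbyAPI (and the llama.cpp server) speak the OpenAI-style API, pointing a client at them is about this much code; the port and key here are just a default-ish TabbyAPI setup, so adjust to whatever you configured:

    ```python
    from openai import OpenAI

    # TabbyAPI usually listens on port 5000 and expects the api_key from its config;
    # the llama.cpp server works the same way on its own port, typically without a key.
    client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-key")

    resp = client.chat.completions.create(
        model="whatever-you-loaded",  # TabbyAPI serves whichever model you loaded
        messages=[{"role": "user", "content": "Summarize sliding window attention in one line."}],
        max_tokens=128,
        temperature=0.4,
    )
    print(resp.choices[0].message.content)
    ```
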
  • But I remember the context being memory greedy due to being multimodal

    No, it's super efficient! I can run 27B's full 128K on my 3090, easy.

    But you have to use the base llama.cpp server. kobold.cpp doesn't seem to support the sliding window attention (last I checked like two weeks ago), so even a small context takes up a ton there.

    And the image input part is optional. Delete the mmproj file, and it won't load.

    There are all sorts of engine quirks like this, heh, it really is impossible to keep up with.

  • Yeah it’s basically impossible to keep up with new releases, heh.

    Anyway, Gemma 12B is really popular now, and TBH much smarter than Nemo. You can grab a special “QAT” Q4_0 from Google (it works in kobold.cpp, but fits much more context with base llama.cpp) with basically the same performance as unquantized; I'd highly recommend that.

    I'd also highly recommend trying 24B when you get the rig! It’s so much better than Nemo, even more than the size would suggest, so it should still win out even if you have to go down to 2.9 bpw, I’d wager.

    Qwen3 30B A3B is also popular now, and would work on your 3770 and kobold.cpp with no changes (though there are speed gains to be had with the right framework, namely ik_llama.cpp)

    One other random thing: some of kobold.cpp's sampling presets are very funky with newer models. With anything newer than Llama 2, I’d recommend resetting everything to off, then starting with something like 0.4 temp, 0.04 MinP, 1.02/1024 rep penalty and 0.4 DRY, rather than the crazy-high-temp sampling the old presets use (see the sketch below).
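
    And if you'd rather set that programmatically than click through the UI, it looks roughly like this against kobold.cpp's generate endpoint; the field names are from memory of the KoboldAI-style API and the port is kobold's usual default, so double-check against your build:

    ```python
    import requests

    # Conservative sampling for newer models; verify the exact field names locally,
    # since they can shift between kobold.cpp versions.
    payload = {
        "prompt": "Write a two-sentence scene set in a rainy train station.",
        "max_length": 200,
        "temperature": 0.4,
        "min_p": 0.04,
        "rep_pen": 1.02,
        "rep_pen_range": 1024,
        "dry_multiplier": 0.4,  # DRY, if your build exposes it
    }

    r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
    r.raise_for_status()
    print(r.json()["results"][0]["text"])
    ```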

    I can host a specific model/quantization on the kobold.cpp API for you to try if you want, to save tweaking time. Just ask (or PM me, as replies sometimes don’t send notifications).

    Good luck with exams! No worries about response times, /c/localllama is a slow, relaxed community.

  • politics @lemmy.world

    Trump floats regime change in Iran

    World News @lemmy.world

    Israel bombs Iranian state TV during live broadcast

    United States | News & Politics @lemmy.ml

    Scoop: Four reasons Musk attacked Trump's "big beautiful bill"

    World News @lemmy.world

    Israel plans to occupy and flatten all of Gaza if no deal by Trump's trip

    LocalLLaMA @sh.itjust.works

    Qwen3 "Leaked"

    LocalLLaMA @sh.itjust.works

    Niche Model of the Day: Nemotron 49B 3bpw exl3

    LocalLLaMA @sh.itjust.works

    Niche Model of the Day: Openbuddy 25.2q, QwQ 32B with Quantization Aware Training

    Ask Lemmy @lemmy.world

    How do y'all post clips/animations on Lemmy? Only GIF seems to work.

    politics @lemmy.world

    Trump 2.0 initial approval ratings higher than in first term

    politics @lemmy.world

    Behind the Curtain: Meta's make-up-with-MAGA map

    Enough Musk Spam @lemmy.world

    Elon Musk's headline dominance squeezes other CEOs

    politics @lemmy.world

    Trump sides with Musk in H-1B fight

    politics @lemmy.world

    Elon Musk pledges "war" over H-1B visa program, calls opponents racists

    politics @lemmy.world

    Musk calls MAGA element "contemptible fools" as virtual civil war brews

    Technology @lemmy.world

    Shipping Listing Suggests 24GB+ Intel Arc B580

    Selfhosted @lemmy.world

    Guide to Self Hosting LLMs Faster/Better than Ollama

    LocalLLaMA @sh.itjust.works

    Qwen2.5: A Party of Foundation Models!

    Ask Lemmy @lemmy.world

    How does Lemmy feel about "open source" machine learning, akin to the Fediverse vs Social Media?

    World News @lemmy.world

    Pressure grows as "last chance" negotiations for Gaza deal resume

    News @lemmy.world

    Hostage-ceasefire deal talks stall over new Netanyahu demands, Israeli officials say