
  • Well, the numbers I find on Google are: an Nvidia 4090 can transfer 1008 GB/s, and an i9 does something like 90 GB/s. So you'd expect the CPU to be roughly 11 times slower than that GPU at fetching an enormous amount of numbers from memory.

    I think if you double the number of DDR channels for your CPU, and if that also meant your transfer rate would double to 180 GB/s, you'd only be roughly 6 times slower than the 4090. I'm not sure it works exactly like that, but I'd guess so. And there doesn't seem to be a recent i9 with quad-channel memory. So you're stuck with a small fraction of the speed of a GPU if you're set on an i9. That's why I mentioned AMD Epyc or Apple processors: those have a way higher memory throughput.

    And a larger model also means more numbers to transfer. So if you now also use your larger memory to run a 70B parameter model instead of a 12B parameter model (or whatever fits on a GPU), your tokens will come in at roughly a 65th of the speed: the ~11x bandwidth gap times the ~6x larger model. Or phrased differently: you don't wait 6 seconds, you wait 6 and a half minutes. (Quick sketch below.)
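
    A back-of-the-envelope in Python, using the rough figures from above (spec-sheet numbers, not benchmarks):

    ```python
    # Memory-bound inference: each generated token streams (roughly)
    # all model weights from memory once, so speed scales with
    # bandwidth and inversely with model size.
    gpu_bw = 1008   # GB/s, Nvidia RTX 4090 spec figure
    cpu_bw = 90     # GB/s, rough figure for a dual-channel i9

    small_model = 12  # B parameters, fits on the GPU
    large_model = 70  # B parameters, CPU/RAM only

    slowdown = (gpu_bw / cpu_bw) * (large_model / small_model)
    print(f"~{slowdown:.0f}x slower")  # ~65x: 6 s becomes ~6.5 min
    ```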

  • AI inference is memory-bound, so memory bus width is the main bottleneck. I also do AI on an (old) CPU, but the CPU itself is mainly idle, waiting for memory. I'd say it'll likely be very slow, like waiting 10 minutes for a longer answer. I believe that's why the AI people use Apple silicon: the unified memory and its bus width. Or some CPU with multiple memory channels. The CPU speed doesn't really matter, you could choose a way slower one, because the actual multiplications aren't what slows it down. But you seem to be doing the opposite: getting a very fast processor with just 2 memory channels. (Rough bandwidth math below.)
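
    For reference, theoretical peak bandwidth is roughly channels × bus width × transfer rate. A sketch in Python; the DDR5 speeds are assumed example figures:

    ```python
    # Peak DDR bandwidth: channels * channel_width_bytes * megatransfers/s.
    # Each DDR channel is 64 bits = 8 bytes wide.
    def ddr_bandwidth_gbs(channels: int, mt_per_s: int) -> float:
        return channels * 8 * mt_per_s / 1000

    print(ddr_bandwidth_gbs(2, 5600))   # ~89.6 GB/s, dual-channel desktop
    print(ddr_bandwidth_gbs(12, 4800))  # ~460.8 GB/s, e.g. a 12-channel Epyc
    ```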

  • I'd say this is the correct answer. If you're actually using that much RAM, you probably want it connected to the processor with a wide (fast) bus. I rarely see people do that with desktop or gaming processors. It might be useful for some edge cases, but usually you want an Epyc processor or something like that; otherwise it's way too much RAM that isn't connected fast enough.

  • In my experience, idle power consumption mainly depends on the mainboard used. The processors all(?) clock down to some more or less energy-efficient level. But the specific design of the mainboard and the components on it can double or halve energy consumption.

  • I think a lot comes down to usage. It just depends whether you connect 1 camera to Frigate or 6, whether you enable some AI features, whether you download a lot of TV series or just a few and delete old stuff, or use ZFS or other demanding things.

    I personally like to keep the number of servers low, so I probably wouldn't buy server 2 and would instead try to run those services on server 1 as well. I'm not sure. You did a good job separating the stuff, and I think you got some good advice already. I'd add more hard disks, 6 TB wouldn't do it for me. And some space for backups. But you can always keep an eye on actual resource usage and just buy RAM and hard disks as needed, as long as your servers have some slots left for future upgrades. Though I think you already have way more servers and RAM than you'd need; I run about half of those services on a smaller server.

  • I think Germany is alright. I mean, feel free to ask any specific questions. There are a decent number of Germans here on Lemmy.

    You'll certainly learn a lot if you're moving far away, and it'll be quite an adventure... And I suppose there are cultural differences, a slightly different mentality towards things... And I think we sometimes struggle with different problems than America does. We have a good number of our own problems, but they'll be different ones.

    Other than that, I've found countries usually turn out to be different from what you'd think just from reading the news. For better or worse... but usually for the better 😆

  • Sure. I think being honest is a solid choice, generally speaking. There is some etiquette. If you're way too direct, you might be perceived as a creep. But you certainly have to do something, or it won't lead anywhere.

    Telling people you want to stay in contact, or that you think they're attractive, or that you like their outfit, or whatever people do for flirting seems to be alright. Some people crack jokes and try to be funny, or interesting... Whatever floats your boat. I think the one important thing is to read the room. See if they're comfortable, and whether they enjoy talking to you or you've just cornered them and are monologuing. Most (not all) people can do that. And I'd say as long as everyone is comfortable, it's the right thing. I mean, you have to send some signals for them to know what's up with you. So yeah, that kind of directness might be helpful. And after that, spending time together (and not just in a larger group) is a signal too, in my opinion.

    I don't think there is any general, correct way of doing it. It just depends on the situation, on who you are, and especially what the other person likes.

  • I don't know why everyone else here says "No." Maybe it's down to preference. I usually like people not just for their outer appearance, but to a greater degree for their intelligence, wit, humor, a similar perspective on life... And it just takes time to talk about all of that. So I'd rather keep the suggestiveness down and just let things play out. Took me a long time. But everyone is different.

    I'm not sure I have a good definition of flirting. I'm more of a problem-oriented person; I do whatever gets the job done. If I want to meet someone again, I just tell them that, as you said. And I usually don't have any ulterior motives. I'm currently not in the dating game, so I'm pretty relaxed at parties and social events in that regard. But I think I've always gone to social events to have fun, and not so much to date.

    It depends a bit on who your target audience is. I think it's usually a good idea to roughly be the way you are and not play some role. But I'm not a dating expert, so I might be wrong.

  • I'd say yes. That'd be a clear sign, and bordering on what I'd call flirting. Something like "Hey, I really enjoyed that conversation, let's meet for a coffee some day, how can I text you?"

    I'd say that's polite and does the job. And there's no need to be super explicit, unless you want to initiate a one-night stand.

  • Isn't flirting the accepted way of signaling to another person that you're interested in them in a certain way? I mean, I talk to lots of different people of different genders in my life. And I'm mostly very nice to people and find interesting topics to talk about. But how are they supposed to find out whether it's just a nice conversation, or I want to meet them again, or I want to go on a date with them?

  • Ah, yeah, I forgot about watches and jewelry. I guess you can buy a lot of them and they won't take up that much space. I'd stick with one or two, though. Make it a very nice one you really like and wear it all the time. IMO it doesn't really help if you get 20 half-nice watches and keep 19 of them in one of your multiple wardrobes; that's just hoarding stuff... The same applies to shoes, though you might be allowed a few more pairs of those. But what do I know...

  • Though there has to be more to this story. Chartering an entire private jet costs like a few thousand to 15,000 dollars per hour. You could do that twice a week on that budget. Or buy lots of fancy food, electronic gadgets and Gucci bags, maybe even cars. But don't you quickly run out of space to put them? So how would someone spend 100k?

  • That laptop should be a bit faster than mine. It's a few generations newer, has DDR5 RAM and maybe even proper dual channel. As far as I know, LLM inference is almost always memory-bound. That means the bottleneck is your RAM speed (and how wide the bus between CPU and memory is). So whether you use SYCL, Vulkan or even the CPU cores shouldn't have a dramatic effect. The main thing limiting speed is that the computer has to transfer gigabytes worth of numbers from memory to the processor on each step. So the iGPU or processor spends most of its time waiting for memory transfers. I haven't kept up with development, so I might be wrong here, but I don't think more than single-digit tokens/sec is possible on such a computer. It'd have to be a workstation or server with multiple separate memory banks, or something like a MacBook with Apple silicon and its unified memory. Or a GPU with fast VRAM on it. Though, you might be able to do a bit more than 3 t/s. (Rough ceiling math below.)

    Maybe keep trying the different computation backends. Have a look at your laptop's power settings as well. Mine is a bit slow on the default "balanced" power profile; it speeds up once I set it to "performance" or gaming mode. And if you can't get llama.cpp compiled, maybe just try Ollama or Koboldcpp instead. They use the same framework and might be easier to install. And SYCL might prove to be a bit of a letdown. It's nice, but few people seem to use it, so it might not be very polished or optimized.
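
    For a rough ceiling on tokens/sec, divide memory bandwidth by the model's size in memory. A sketch with assumed example numbers (real speeds come in below this):

    ```python
    # Memory-bound upper bound: each token streams ~all weights once,
    # so tokens/sec can't exceed bandwidth / model size.
    def max_tokens_per_sec(bandwidth_gbs: float, model_gb: float) -> float:
        return bandwidth_gbs / model_gb

    # An 8B model at Q4 quantization is ~4.5 GB in RAM; ~50 GB/s is
    # an assumed figure for decent dual-channel DDR5.
    print(max_tokens_per_sec(50, 4.5))  # ~11 t/s theoretical ceiling
    ```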

  • I'm not sure what kind of laptop you own. Mine does about 2-3 tokens/sec running an 8B parameter model, so your last try seems about right. Concerning the memory: llama.cpp can load models "memory-mapped". That means the system decides which parts to load into memory. The model might be all in there, but it doesn't count as active memory usage; I believe it counts towards the "cached" value in the statistics. If you want to make sure, you have to force it not to memory-map the model. In llama.cpp that's the --no-mmap parameter; I have no idea how to do it in gpt4all-chat. But I'd say it's already loaded in your case, it just doesn't show up as used memory because of the mmap thing.

    Maybe try a few other programs as well, like one of: Ollama, Koboldcpp, llama.cpp, and see how they do. And I wouldn't run full-precision models on an iGPU. Keep to quantized models: Q8 or Q5... or Q4... (Example below.)
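
    For example, with plain llama.cpp the flag is --no-mmap. In the llama-cpp-python bindings, I believe the equivalent looks like this (the file name is just a placeholder):

    ```python
    from llama_cpp import Llama

    # Load the whole model into RAM instead of memory-mapping it,
    # so it shows up as used memory rather than "cached".
    llm = Llama(
        model_path="model-q4_k_m.gguf",  # placeholder file name
        use_mmap=False,                  # same effect as llama.cpp's --no-mmap
    )
    ```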

  • Couldn't agree more. And a phone number is kind of important. I don't want to hand that out to 50 random companies for "security" and tracking, have them sell it to advertisers, or lose it to hackers, which also happens regularly. And I really don't like to pull down my pants for Discord (or whoever) to inspect my private parts.

    Btw, the cross-post still leads to an error page for me.

  • I think interoperability works with centralized services as well. They can offer an API for other services to hook into, like Reddit had different apps, bots, tools... You can connect your software to Google Cloud even if it's made by a different company... I think interoperability works just fine with both models, at least on the technical side.