
davidagain@lemmy.world
Posts: 5 | Comments: 1,274 | Joined: 1 yr. ago

  • Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

  • Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    "We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

  • It's not completely random, but I'm telling you it fucked up, it fucked up badly, time after time, and I had to check every single thing manually. Its run of correct answers never lasted beyond a handful. If you build something using some equation it invented, you're insane and should quit engineering before you hurt someone.

  • Definitely, but I think that Proud Boys leader who showed he could take a black dildo probably thought he was doing some really clever double bluff thing. But we see you, Gavin McInnes. We see you and the insecurities you're fighting so hard to hide.

  • Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

    People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.

  • This is hilarious. I laughed for some time.

    "Log back in to continue your OralB brushing experience"

    Who thought it would be a good idea to have an online toothbrush, who decided to log customers out after a period of inactivity, and why, for all that is sane in the world, would not being logged in stop you from doing anything at all with your toothbrush!?!

  • Ah, my bad, you're right. For it being consistently correct, I should have done 0.3^10 = 0.0000059049,

    so the chances of it being right ten times in a row are less than one thousandth of a percent.

    No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
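
To make the arithmetic above easy to check, here is a minimal Python sketch. It assumes each attempt succeeds independently with the same probability, which is the simplification the comment makes, not something the benchmark itself claims. The function name `streak_probability` and the 0.3 success rate (rounded from the 30.3 percent top score) are illustrative choices.

```python
def streak_probability(p_success: float, n: int) -> float:
    """Probability of n successes in a row, ASSUMING independent attempts
    that each succeed with probability p_success."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p_success ** n


if __name__ == "__main__":
    p = 0.3  # rounded from the 30.3 percent best score quoted above
    for n in (1, 5, 7, 10):
        prob = streak_probability(p, n)
        # Show both the raw probability and the same value as a percentage.
        print(f"{n:2d} in a row: {prob:.10f} ({prob * 100:.6f} percent)")
```

Under that independence assumption, ten in a row comes out at about 0.00059 percent, which matches the "less than one thousandth of a percent" figure.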

  • If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

  • I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

    In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.