davidagain

6d ago

PieFed.World is now open

How can I subscribe to piefed.world users and communities etc from my lemmy.world account?

6d ago

This will be my last post on Lemmy...

How do I subscribe to a user or community on piefed.world and see it in my lemmy.world feed?

6d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

6d ago

Car crashes have killed and seriously injured roughly the same number of people as shootings in Chicago this year. Only one of these things draws media attention.

I agree it was a dumb comparison to start off with.

I wasn't the one who made it, but the license issue is the logical conclusion if OP insists on the comparison.

6d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Again with dismissing the evidence of my own eyes!

I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

6d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

The CMU boffins put the following models through their paces and evaluated them based on their task success rates. The results were underwhelming.

⚫ Gemini-2.5-Pro (30.3 percent)
⚫ Claude-3.7-Sonnet (26.3 percent)
⚫ Claude-3.5-Sonnet (24 percent)
⚫ Gemini-2.0-Flash (11.4 percent)
⚫ GPT-4o (8.6 percent)
⚫ o3-mini (4.0 percent)
⚫ Gemini-1.5-Pro (3.4 percent)
⚫ Amazon-Nova-Pro-v1 (1.7 percent)
⚫ Llama-3.1-405b (7.4 percent)
⚫ Llama-3.3-70b (6.9 percent)
⚫ Qwen-2.5-72b (5.7 percent)
⚫ Llama-3.1-70b (1.7 percent)
⚫ Qwen-2-72b (1.1 percent)

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

6d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Why are you giving it data

Because there's a button for that.

Its output is dependent on the input

This thing that you said... It's false.

7d ago

Car crashes have killed and seriously injured roughly the same number of people as shootings in Chicago this year. Only one of these things draws media attention.

If guns are so much like cars, why not require a license that you get by passing a written test on gun safety and a practical test on basic competence and safe usage?

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

It's not completely random, but I'm telling you it fucked up, it fucked up badly, time after time, and I had to check every single thing manually. Its run of correct answers never lasted beyond a handful. If you build something using some equation it invented, you're insane and should quit engineering before you hurt someone.

7d ago

What sort of grill needs a firmware update lol

The same kind of grill that can be bricked remotely if you stop paying for software updates.

7d ago

Breaking the generational barriers

Definitely, but I think that Proud Boys leader who showed he could take a black dildo probably thought he was doing some really clever double bluff thing, but we see you, Gavin McInnes. We see you and the insecurities you're fighting so hard to hide.

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.

7d ago

Just.....why?

This is hilarious. I laughed for some time.

"Log back in to continue your OralB brushing experience"

Who thought it would be a good idea to have an online toothbrush, who decided to log customers out after a period of inactivity, and why, for all that is sane in the world, would not being logged in stop you from doing anything at all with your toothbrush!?!

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Ah, my bad, you're right. For it being consistently correct, I should have done 0.3^10 = 0.0000059049,

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
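
For anyone who wants to check the arithmetic, here's a rough sketch. It assumes every task succeeds or fails independently with the same probability, which is my simplification and not something the study claims. The function name `streak_probability` is made up for illustration.

```python
def streak_probability(p_success: float, n_tasks: int) -> float:
    """Chance of n_tasks correct results in a row.

    Assumes each task is independent with the same success rate,
    which is a simplification for illustration only.
    """
    return p_success ** n_tasks


# ~30% success rate (the best score quoted from the article), ten tasks in a row
right_ten_times = streak_probability(0.3, 10)
print(f"{right_ten_times:.10f}")   # as a probability
print(f"{right_ten_times:.6%}")    # as a percentage

# The earlier, mistaken version used the 70% failure rate instead
print(f"{streak_probability(0.7, 10):.4f}")
```

Under that independence assumption, the first figure matches the 0.3^10 value above, and the last line shows where the earlier "2%" figure came from.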

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

So the chances of it being right ten times in a row are 2%.

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

If that’s the quality of answer you’re getting, then it’s a user error

No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

What's 0.7^10?

7d ago

AI agents wrong ~70% of time: Carnegie Mellon study

Human lives are the most important thing of all. Profits are irrelevant compared to human lives. I get that that's not how Bezos sees the world, but he's a monstrous outlier.

1w ago

Evangelical church urges Trump admin to 'execute' LGBTQ Americans

Racist lesbian, no less, who says she wants to bang the black guy.