They let AIs run a company: what happened says a lot about the future of work

2 May 2025

Can AI agents replace human employees? A study from Carnegie Mellon dives in

At a time when artificial intelligence is advancing at lightning speed, people are asking whether AI can take over jobs traditionally held by humans. Researchers from Carnegie Mellon University recently published a study on arXiv that looks into this question. They set out to see whether AI agents could handle roles usually filled by people, and the results have sparked plenty of interest and a fair share of concern.

Diving into the tech behind AI agents

The study brought together a variety of AI systems to see how well they handle work-related tasks. The researchers turned to models from six providers: Claude by Anthropic, GPT-4o by OpenAI, Google Gemini, Amazon Nova, Meta Llama, and Qwen by Alibaba. Each ranks among today’s top AI systems and comes with its own strengths and quirks.

Each AI agent was put into roles like financial analyst, project manager, and software engineer. These jobs were picked to cover a broad spectrum of skills and responsibilities you’d typically see in different professional settings. By using these roles, the team wanted to test if AI can manage complicated tasks that require both technical know-how and smart decision-making.

Assignments given to AI agents

To mimic real-life job scenarios, the AI agents had a range of tasks to tackle. One of these was navigating files to analyze databases, a job that calls for serious attention to detail and data-handling skills. They also took virtual tours to pick new office sites, a test of their spatial reasoning and decision-making.

On top of that, the AI agents “worked” with simulated colleagues on assignments that involved chatting with departments like human resources. This was meant to test how they handle communication and teamwork, which matter in any office.

How did the AI perform?

The results were a mixed bag, with performance varying a lot between platforms. Claude 3.5 Sonnet came out on top, fully completing 24% of the tasks and reaching 34.4% once partial completion was counted, though at an operating cost of $6.34. Meanwhile, Gemini 2.0 Flash completed just 11.4% of its tasks but was much cheaper at only $0.79.

The rest of the agents couldn’t even clear the 10% mark when it came to completing their tasks. This highlights how limited current AI agents still are at handling complex job functions on their own.
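To put the trade-off between the two leading agents in perspective, here is a rough sketch in Python that divides each agent's reported cost by its completion rate. The "cost per percentage point" metric and the names `RESULTS`, `cost_per_point`, and `compare` are illustrative assumptions, not something taken from the study, and the article does not spell out exactly what the dollar figure covers.

```python
# Figures as quoted in the article. What the dollar cost covers
# (per task, per run, etc.) is not specified there, so treat the
# derived numbers as a rough comparison only.
RESULTS = {
    "Claude 3.5 Sonnet": {"completed_pct": 24.0, "cost_usd": 6.34},
    "Gemini 2.0 Flash": {"completed_pct": 11.4, "cost_usd": 0.79},
}


def cost_per_point(cost_usd: float, completed_pct: float) -> float:
    """Dollars spent per percentage point of tasks completed (hypothetical metric)."""
    if completed_pct <= 0:
        raise ValueError("completion rate must be positive")
    return cost_usd / completed_pct


def compare(results: dict) -> dict:
    """Map each agent name to its cost per percentage point completed."""
    return {
        name: cost_per_point(r["cost_usd"], r["completed_pct"])
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, value in compare(RESULTS).items():
        print(f"{name}: ${value:.3f} per percentage point completed")
```

By this rough measure, Claude 3.5 Sonnet costs close to four times as much per percentage point as Gemini 2.0 Flash, while completing roughly twice as many tasks.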

Bumps in the road for AI agents

Even with some wins, the study uncovered quite a few stumbling blocks in how AI agents handle tasks. One major issue was picking up on implicit instructions. For example, some agents didn’t immediately recognize “.docx” as a Microsoft Word file, a basic but telling hiccup.

Social skills were another weak spot. Tasks that needed a bit of nuanced communication ended up incomplete or just not right. Plus, browsing the web turned out to be trickier than expected: obstacles like pop-ups, which a person dismisses without a second thought, tripped the agents up.

Interestingly, some agents tried taking shortcuts during their tasks, skipping over the tougher parts and then wrongly concluding they’d nailed it.

What these findings mean for the future job scene

The study’s findings show that while AI agents can handle certain specialized tasks pretty well, they’re not ready to completely replace human workers. This might be a bit of a relief for those worried about losing their jobs to robots.

As we head further into our digital future, it’s important to keep these limitations in mind and to build AI systems that work alongside people rather than taking over completely. The study gets us thinking about a future where AI is used wisely to boost productivity without sacrificing human touches like creativity and empathy, qualities that, for now, remain uniquely ours.