AI Agents Flunk CRM Tests, Mishandle Confidential Data
The findings raise red flags for enterprises betting on AI agents to drive efficiencies.

Large Language Model (LLM) agents underperform on basic customer relationship management (CRM) tasks and show a poor understanding of confidentiality, according to a new paper from Salesforce Research.
Using CRMArena-Pro, a new benchmark built on synthetic data in a Salesforce sandbox, the study found that LLM agents successfully completed only 58% of single-step tasks.
Performance dropped sharply to 35% on multi-step problems requiring follow-up actions or deeper reasoning.
The study also flagged a serious concern: LLM agents often mishandle sensitive information.
Although this could be mitigated with targeted prompts, doing so often hurt overall task performance.
Researchers argued that existing benchmarks fail to properly evaluate AI agents’ ability to identify and safeguard confidential data.
"These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting needs for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition," the researchers said.
Last year, Salesforce launched AgentForce, a generative AI platform aimed at transforming customer service by enabling AI agents to handle support tickets, perform CRM actions, and assist human reps in real time.
Marketed as a major step toward autonomous enterprise workflows, AgentForce was positioned as a cornerstone of Salesforce’s AI strategy.
At Dreamforce 2024, Salesforce's flagship event, CEO Marc Benioff told the Left Shift and other journalists in a press conference that AI agents are the future of enterprise productivity, describing them as a “high-margin revolution” capable of delivering massive ROI by automating repetitive tasks and unlocking new efficiencies across sales, service, and marketing workflows.