• tfowinder@beehaw.org
    9 hours ago

    Well, the article says that the best-performing AI agent was able to complete about 30% of the tasks given to it, such as searching the web and communicating with coworkers. I think this is interesting:

    CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers

    “We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks”

    Personally, I believe this is impressive.