Benchmark suite for evaluating AI agents:
FieldWorkArena
World's leading benchmark suite to accelerate AI deployment in field operations

Update

  • Feb. 27, 2025: The factory dataset has been released for global use as FieldWorkArena V1.1.
  • Evaluation software: GitHub
  • Evaluation dataset: Application form

AI agent deployment and evaluation

AI agents are being considered as a way to address challenges faced by many workplaces, such as an aging population, labor shortages, and delays in decision-making. To help improve the capabilities of AI agents, we have developed and released a benchmark suite that extends evaluation methods for web operations to field operations.

By FieldWorkArena

FieldWorkArena is a groundbreaking benchmark suite for evaluating AI agents. Using data and tasks from Fujitsu's actual factories and warehouses, it quantitatively evaluates how effectively AI agents work in the field. This clarifies the challenges of AI adoption and provides evidence to support applying AI in the field.

The benefits of FieldWorkArena

  1. Objective AI Performance Assessment
    • You can evaluate AI agents in a real-world environment and objectively measure their performance.
  2. Rapid AI development cycle
    • Accelerate development of AI agents through efficient testing with benchmarks.
  3. Reliable AI deployment
    • Reduce risk and increase the success rate of AI deployments.
  4. Improvement of efficiency and safety in field operations
    • Through the selection and development of high-performance AI agents, we will improve the efficiency and safety of on-site operations.
  5. Accelerating the evolution of AI technologies
    • Accelerate research and development of AI technologies by providing standardized benchmarks.

Technical Overview

Target Industry/Users

The main targets are the manufacturing and logistics industries, including factories and warehouses. Users include developers of AI agents and companies seeking to improve efficiency and safety management in field operations.

Challenges in Target Industry and Operations

  • Safety and manufacturing near-miss incidents occur daily in field operations, and serious incidents must be prevented.
  • The field generates a huge amount of data, including images and documents, which makes it difficult to extract and analyze information.
  • There is no way to link incidents to corporate systems.

Technical Challenges

AI technologies such as multimodal LLMs (for example, GPT-4o) and AI agents can be used to solve the above problems. However, full-scale adoption has not yet been achieved, for the following reasons.

  • It is unclear whether existing AI technologies can handle today's complex workflows.
  • It is difficult to process the various data formats obtained in the field (text, images, video, logs, etc.) in an integrated way.
  • No technology has been established for selecting appropriate sources and performing tasks autonomously depending on the situation.

Solutions

FieldWorkArena from Fujitsu is a benchmark suite for AI agents that includes more than 40 types of data (images, operation manuals) from two real-world scenes, as well as around 500 field-specific tasks with their correct answers. You can quantitatively evaluate the extent to which existing multimodal LLMs and AI agents under research and development can support various tasks in the field. FieldWorkArena can be used to clarify the issues to be solved and to provide evidence when applying AI in the field.
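
As a rough illustration of what scoring an agent against tasks and correct answers could look like, here is a minimal Python sketch. It is not the FieldWorkArena evaluation software: the `Task` fields, the `score_agent` function, and the exact-match comparison are all assumptions made for this example, and the real suite may define tasks and metrics differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    # Hypothetical task record; these field names are illustrative
    # and are not taken from the FieldWorkArena dataset schema.
    task_id: str
    category: str        # e.g. "planning", "action", "reporting"
    question: str
    correct_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def score_agent(agent: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Return exact-match accuracy per category, plus an 'overall' entry.

    Exact match is an assumption for this sketch; a real benchmark may use
    fuzzy matching or other evaluation functions.
    """
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for task in tasks:
        totals[task.category] = totals.get(task.category, 0) + 1
        if normalize(agent(task.question)) == normalize(task.correct_answer):
            hits[task.category] = hits.get(task.category, 0) + 1
    scores = {cat: hits.get(cat, 0) / n for cat, n in totals.items()}
    total = sum(totals.values())
    scores["overall"] = sum(hits.values()) / total if total else 0.0
    return scores
```

Here `agent` is any callable that takes a task question and returns an answer string. Reporting accuracy per category makes it easier to see where an agent falls short, for example in reporting tasks versus planning tasks.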


Fujitsu's Technological Advantage

  • No other company has a benchmark suite for evaluating AI agent performance that consists of real-world data and tasks from settings such as factories and warehouses (as of January 2025).
  • Provides benchmarks that comprehensively address various types of field operations: work planning, action, and reporting.
  • Collaboration with Carnegie Mellon University (CMU), the world leader in AI agent benchmarking.

The benefits of FieldWorkArena (Detailed version)

  • Provide standard benchmarks for the development and evaluation of field support AI agents
  • Contribute to improving the efficiency, safety and productivity of factory, warehouse and other manufacturing operations
  • Stimulate research and development of AI agents for field work support in the research community

Use Cases

  • End users:
    • Existing AI agents and AI technologies such as multimodal LLMs can be validated.
    • By browsing the leaderboard, users can select the best-performing AI technology.
  • App developers:
    • By evaluating AI technologies under research and development, such as AI agents and multimodal LLMs, on this benchmark, developers can demonstrate superiority over existing technologies.

Case studies

  • Evaluation of an AI agent that detects near-miss events from on-site camera footage and automatically reports them to the appropriate person
    • Detection and reporting of health and safety violations in warehouse operations
    • Confirmation of compliance with operating procedures in the assembly process of parts and materials
  • Retail scenes and tasks using CG data are planned for the future


Program and Data