
Benchmark suite for evaluating AI agents:
FieldWorkArena

World's leading benchmark suite to accelerate AI deployment in field operations

Update

Aug. 22nd, 2025: FieldWorkArena V2.1 released, with ambiguous query texts corrected.
June 30th, 2025: FieldWorkArena V2.0 released, with the retail dataset available for global use.
May 30th, 2025: FieldWorkArena V1.1 released, with the warehouse dataset available for global use.
Feb. 27th, 2025: FieldWorkArena V1.0 released, with the factory dataset available for global use.
To download the data from Hugging Face, you must first apply using the form below.
Evaluation software: GitHub
Evaluation dataset: Application form
Technical paper: arXiv

AI agent deployment and evaluation

AI agents are being considered as a way to address challenges faced by many workplaces, such as an aging population, labor shortages, and delays in decision-making. To help improve the capabilities of AI agents, we have developed and released a benchmark suite that extends evaluation methods for web operations to field operations.

By FieldWorkArena

FieldWorkArena is a groundbreaking benchmark suite for evaluating AI agents. Using data and tasks from Fujitsu's actual factories and warehouses, it quantitatively evaluates how effectively AI agents work in the field. This clarifies the challenges of AI adoption and provides evidence to support applying AI in the field.

The benefits of FieldWorkArena

Objective AI Performance Assessment	You can evaluate AI agents in a real-world environment and objectively measure their performance.
Rapid AI development cycle	Accelerate development of AI agents through efficient testing with benchmarks.
Reliable AI deployment	Reduce risk and increase success for AI deployments.
Improvement of efficiency and safety in field operations	Through the selection and development of high-performance AI agents, we will improve the efficiency and safety of on-site operations.
Accelerating the evolution of AI technologies	Accelerate research and development of AI technologies by providing standardized benchmarks.

Technical Overview

Target Industry/Users

The main targets are the manufacturing and logistics industries, including factories and warehouses. Users include developers of AI agents and companies seeking to improve efficiency and safety management in field operations.

Challenges in Target Industry and Operations

Near-miss incidents related to safety and manufacturing occur daily in field operations, and serious incidents must be prevented.
There is a huge amount of data including images and documents in the field, making it difficult to extract and analyze information.
There is no way to link incidents to corporate systems.

Technical Challenges

AI technologies such as multimodal LLMs (for example, GPT-4o) and AI agents can be used to solve the above problems. However, full-scale introduction has not been achieved for the following reasons.

It is unclear how well existing AI technologies can handle today's complex workflows.
It is difficult to process the various data formats obtained in the field (text, images, video, logs, etc.) in an integrated way.
No technology has been established for selecting appropriate sources and performing tasks autonomously, depending on the situation.

Solutions

FieldWorkArena from Fujitsu is a benchmark suite for AI agents that includes more than 40 types of data (images, operation manuals) from 2 real-world scenes, as well as around 500 field-specific tasks with correct answers. You can quantitatively evaluate how well existing multimodal LLMs, as well as AI agents still under research and development, can support various tasks in the field. FieldWorkArena can be used to clarify the issues that remain to be solved and to provide evidence when applying AI in the field.
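
As a rough illustration of what "quantitatively evaluate" can mean for tasks paired with correct answers, the sketch below scores an agent per task category. It is not the FieldWorkArena evaluation software: the `Task` record, the exact-match scoring rule, and the `always_yes` agent are all hypothetical and are used only to show the shape of such an evaluation. The category names follow the work planning, action, and reporting types mentioned on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    # Hypothetical task record; NOT the actual FieldWorkArena schema.
    task_id: str
    category: str   # e.g. "planning", "action", "reporting"
    query: str
    expected: str   # correct answer paired with the task


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def evaluate(tasks: List[Task], agent: Callable[[str], str]) -> Dict[str, float]:
    """Return exact-match accuracy per category (an assumed scoring rule)."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for task in tasks:
        totals[task.category] = totals.get(task.category, 0) + 1
        if normalize(agent(task.query)) == normalize(task.expected):
            correct[task.category] = correct.get(task.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in totals.items()}


# Made-up sample tasks, for illustration only.
SAMPLE_TASKS = [
    Task("t1", "planning", "How many steps are in the manual?", "5"),
    Task("t2", "reporting", "Is the worker wearing a helmet?", "Yes"),
    Task("t3", "reporting", "Is the aisle blocked?", "No"),
]


def always_yes(query: str) -> str:
    """Toy stand-in for an AI agent: answers "yes" to every query."""
    return "yes"


scores = evaluate(SAMPLE_TASKS, always_yes)
print(scores)
```

Reporting accuracy per category rather than as a single number makes it easier to see which kinds of field tasks an agent struggles with.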


Fujitsu's Technological Advantage

No other company has a benchmark suite for evaluating AI agent performance that consists of real-world data and tasks from settings such as factories and warehouses (as of January 2025).
Provides benchmarks that comprehensively address various types of field operations: work planning, action, and reporting.
Collaboration with Carnegie Mellon University (CMU), the world leader in AI agent benchmarking.

The benefits of FieldWorkArena (Detailed version)

Provide standard benchmarks for the development and evaluation of field support AI agents
Contribute to improving the efficiency, safety, and productivity of factory, warehouse, and other manufacturing operations
Stimulate research and development of AI agents for field work support in the research community

Use Cases

End users:
- Existing AI agents and AI technologies such as multimodal LLMs can be validated.
- By browsing the leaderboard, users can select the AI technology best suited to their needs.
App developers:
- By evaluating AI technologies under research and development, such as AI agents and multimodal LLMs, on this benchmark, developers can demonstrate advantages over existing technologies.

Case studies

Evaluation of an AI agent that detects near-miss events and work content from on-site camera footage and automatically reports them to the appropriate person
- Detection and reporting of health and safety violations in warehouse operations
- Confirmation of compliance with operating procedures in the assembly process of parts and materials
- Confirming employee work and customer purchasing behavior in retail settings


FieldWork as a Knowledge

Video data in FieldWorkArena is published along with structured data represented as Knowledge Graphs (KGs), using KG-enhanced RAG (Retrieval-Augmented Generation) technology. This enables natural-language search over video information.

Two Types of Knowledge Graphs (KGs) Published
- Video KG: A graph individually constructed for each video based on its content
- Video-Collection KG: A graph aggregating the content of multiple videos in the dataset
Use cases
- Using the Video KG, users can search for related information within a video in natural language and perform question answering.
- Using the Video-Collection KG, users can search for videos in the dataset that contain scenes relevant to natural‑language queries.
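
To make the two graph types concrete, here is a minimal sketch of how they might be queried. It does not reflect the published KG format or the KG-enhanced RAG pipeline: the triples, the video IDs, and both search functions are hypothetical, and simple keyword overlap stands in for real natural-language retrieval.

```python
import re
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical data: one small graph per video (the "Video KG" idea).
# The dictionary as a whole plays the role of a "Video-Collection KG".
VIDEO_KGS: Dict[str, List[Triple]] = {
    "video_001": [("worker", "carries", "box"), ("worker", "wears", "helmet")],
    "video_002": [("forklift", "moves", "pallet"), ("worker", "crosses", "aisle")],
}


def tokens(text: str) -> Set[str]:
    """Split a query into lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_video_kg(kg: List[Triple], query: str) -> List[Triple]:
    """Return triples from one video's graph that share a word with the query."""
    q = tokens(query)
    return [t for t in kg if q & {part.lower() for part in t}]


def search_collection(kgs: Dict[str, List[Triple]], query: str) -> List[str]:
    """Return IDs of videos with matching triples, most matches first."""
    hits = {vid: len(search_video_kg(kg, query)) for vid, kg in kgs.items()}
    ranked = sorted(hits.items(), key=lambda item: -item[1])
    return [vid for vid, count in ranked if count > 0]


within_video = search_video_kg(VIDEO_KGS["video_001"], "What does the worker wear?")
across_videos = search_collection(VIDEO_KGS, "Which video shows a forklift?")
print(within_video)
print(across_videos)
```

The first call corresponds to searching within a single video, and the second to finding relevant videos across the dataset. A real system would replace the keyword overlap with retrieval and generation by a language model.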


Program and Data

FieldWorkArena
- Evaluation software: GitHub
- Evaluation dataset: Application form
- Leaderboard: coming soon
FieldWork as a Knowledge
- Evaluation software: GitHub
- Evaluation dataset: Application form (* Same as above)

Inquiry
