Benchmarks
Learn how to run and evaluate your SWE agents using benchmarks
Overview
Benchmarking is a crucial step in evaluating the performance and capabilities of your Software Engineering (SWE) agents. The SWE Development Kit (swekit) provides tools and methods to run standardized benchmarks, allowing you to assess and compare different agent implementations.
SWE-Bench
SWE-Bench is a comprehensive benchmark suite designed specifically for evaluating SWE agents. It includes a variety of real-world software engineering tasks from popular Python open-source projects, providing a robust testing environment.
Running SWE-Bench
To run the SWE-Bench benchmark on your agent:
Prerequisites
Ensure Docker is installed and running on your system.
Run the Benchmark
Navigate to the agent directory and run:
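```bash
python benchmark.py
```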
- By default, `python benchmark.py` runs only one test instance.
- Specify a test split to run more tests, e.g., `--test-split=1:300` runs 300 tests, as shown below.
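For example, to run 300 test instances:

```bash
python benchmark.py --test-split=1:300
```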
Workspace Environments
You can run the benchmarks in different sandbox environments. The default Docker workspace requires no additional configuration.
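As a rough sketch of how a different workspace might be selected, composio's toolset accepts a `workspace_config`; the exact wiring in your generated agent may differ, and the alternatives below assume the corresponding sandbox provider is set up, so treat this as illustrative rather than definitive:

```python
# A minimal sketch, assuming your agent is built on composio's ComposioToolSet.
# Non-Docker workspaces may require provider credentials (e.g., an E2B or
# Fly.io API key); check the composio docs for specifics.
from composio import ComposioToolSet, WorkspaceType

# Default: run tools inside a local Docker container.
toolset = ComposioToolSet(workspace_config=WorkspaceType.Docker())

# Alternatives, if the corresponding sandbox provider is configured:
# toolset = ComposioToolSet(workspace_config=WorkspaceType.E2B())
# toolset = ComposioToolSet(workspace_config=WorkspaceType.FlyIO())
```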
Implementation Details
We use SWE-Bench-Docker to ensure that each test instance runs in an isolated container with its specific environment and Python version.
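Conceptually, the harness behaves like the loop below. The image names, entrypoint arguments, and instance IDs are hypothetical and only illustrate the one-container-per-instance model, not the actual SWE-Bench-Docker API:

```python
# Hypothetical sketch of per-instance isolation (names are illustrative,
# not the real SWE-Bench-Docker interface).
import subprocess

def run_instance(instance_id: str, image: str) -> int:
    """Run a single benchmark instance in its own throwaway container."""
    return subprocess.call([
        "docker", "run", "--rm",        # container is removed after the run
        image,                          # image pinned to the instance's env/Python
        "run-instance", instance_id,    # hypothetical entrypoint arguments
    ])

# Each instance gets a fresh container, so side effects and failures
# cannot leak between test cases.
for instance_id in ["django__django-11099", "astropy__astropy-6938"]:
    run_instance(instance_id, image=f"swebench/{instance_id}")
```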
Next Steps
- Explore the Development Guide to learn how to extend your SWE agent’s functionality by adding new tools or extending existing ones.
- Check out the Workspace Environments section for more details on running your agents in different environments.