Benchmarks
Overview
Benchmarking measures how well your Software Engineering (SWE) agents perform. The SWE Development Kit (swekit) provides tools to run standardized benchmarks, so you can assess and compare different agent implementations.
SWE-Bench
SWE-Bench is a benchmark designed for evaluating SWE agents. It consists of real-world software engineering tasks drawn from popular open-source Python projects.
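To make the scoring concrete, here is a minimal sketch of how a SWE-Bench style result is usually judged: an instance counts as resolved when every test that was expected to go from failing to passing now passes, and every test that already passed still passes. The `InstanceResult` class, the helper names, and the instance IDs are hypothetical and only illustrate the idea; they are not part of swekit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Test outcomes for one benchmark instance (hypothetical structure)."""
    instance_id: str
    fail_to_pass: dict  # test name -> True if it passes after the agent's patch
    pass_to_pass: dict  # test name -> True if it still passes after the patch


def is_resolved(result: InstanceResult) -> bool:
    """An instance is resolved only if all target tests pass and nothing regressed."""
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    return fixed and no_regressions


def resolved_rate(results: list) -> float:
    """Fraction of instances resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only: one resolved instance, one with a regression.
results = [
    InstanceResult("example__repo-1", {"test_bug": True}, {"test_old": True}),
    InstanceResult("example__repo-2", {"test_bug": True}, {"test_old": False}),
]
print(resolved_rate(results))  # 0.5
```

Comparing two agent implementations then comes down to comparing their resolved rates on the same set of instances.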
Running SWE-Bench
To run the SWE-Bench benchmark on your agent:
Workspace Environments
You can run the benchmarks in different sandbox environments:
- Docker (default): no additional configuration required.
- E2B
- FlyIO
Implementation Details
We use SWE-Bench-Docker so that each test instance runs in its own isolated container, with the environment and Python version that instance requires.
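The isolation idea can be sketched as follows: each instance gets its own throwaway container whose image matches the Python version it needs. This is an illustration under stated assumptions, not how SWE-Bench-Docker is implemented. The `TestInstance` class, the `bench-` container name prefix, and the choice of the public `python:<version>-slim` images are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestInstance:
    """One benchmark task (hypothetical structure for illustration)."""
    instance_id: str
    python_version: str  # Python version the target project requires
    test_command: str    # shell command that runs the instance's tests


def docker_command(instance: TestInstance) -> list:
    """Build a `docker run` argument list for a single isolated instance."""
    # Assumption: a stock Python image stands in for a prebuilt per-instance image.
    image = f"python:{instance.python_version}-slim"
    return [
        "docker", "run",
        "--rm",                                     # discard the container afterwards
        "--name", f"bench-{instance.instance_id}",  # one container per instance
        "--network", "none",                        # no network access during tests
        image,
        "sh", "-c", instance.test_command,
    ]


instance = TestInstance("example__repo-1", "3.9", "python -m pytest -q")
print(" ".join(docker_command(instance)))
```

Building the argument list separately from executing it keeps the sketch easy to inspect; passing the list to `subprocess.run` would start the container on a machine with Docker installed.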
Next Steps
- Explore the Development Guide to learn how to extend your SWE agent’s functionality by adding new tools or extending existing ones.
- Check out the Workspace Environments section for more details on running your agents in different environments.