SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
2024-06-02
The SWE-bench project investigates whether language models can automatically resolve real GitHub issues. Its dataset comprises 2,294 issue-pull request pairs from 12 popular Python repositories, and a model's proposed fix is evaluated by running unit tests, with the behavior after the original pull request serving as the reference. The leaderboard ranks models by their performance on this task; at the time of writing, Amazon Q Developer Agent leads.
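To make the unit-test-based evaluation concrete, here is a minimal sketch of how a resolved rate could be computed from per-issue test outcomes. It is not the project's actual harness: the `InstanceResult` record, its field names, and the rule that an issue counts as resolved only when every listed test passes are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes for one issue-pull request pair."""
    instance_id: str
    # Tests the reference pull request is expected to fix: test name -> passed?
    fail_to_pass: dict[str, bool]
    # Tests that passed before the fix and should keep passing: test name -> passed?
    pass_to_pass: dict[str, bool]


def is_resolved(result: InstanceResult) -> bool:
    """Assumed rule: resolved only if every listed test passes after the model's patch."""
    # Require at least one target test so an empty record is never counted as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under these assumptions, a patch that fixes the target test but breaks a previously passing one is not counted as resolved, which is the practical point of verifying with the full set of unit tests instead of only the test tied to the issue.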