Benchmark Results
Reproducible evaluations of AI research skills with swap consistency checks and rubric-based scoring.
Overview
Runs: 3, Cases: 1, Models: 2, Rubric Win: 100%
Best Performing Model
google/gemini-3-flash-preview
Rubric Win: 100%, Skill Score: 7.0, Baseline Score: 4.3
Models
| # | Model | Rubric Win | Blind Win | Skill Score (vs Baseline) | Swap Consistency | Cases |
|---|---|---|---|---|---|---|
| 1 | google/gemini-3-flash-preview | 100% | 100% | 7.0 vs 4.3 | 100% | 2 |
| 2 | google/gemini-2.5-flash | 100% | N/A | 7.0 vs 3.0 | 100% | 1 |
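The Swap Consistency column is not defined on this page. A minimal sketch of one common interpretation follows, assuming the judge sees the two outputs twice, with their presentation order swapped the second time, and a run counts as consistent when both passes pick the same underlying candidate. The function names and the "A"/"B" verdict encoding are hypothetical and are not taken from the benchmark's code.

```python
def resolve_winner(position, order):
    """Map a positional verdict ("A" or "B") to the candidate shown in that slot."""
    if position not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {position!r}")
    return order[0] if position == "A" else order[1]


def is_swap_consistent(verdict_original, verdict_swapped,
                       candidates=("skill", "baseline")):
    """Return True when both passes pick the same candidate.

    Assumption: the first pass shows candidates in the given order and the
    second pass shows them in reverse order.
    """
    first = resolve_winner(verdict_original, candidates)
    second = resolve_winner(verdict_swapped, candidates[::-1])
    return first == second


def swap_consistency_pct(verdict_pairs):
    """Percentage of (original, swapped) verdict pairs that agree."""
    if not verdict_pairs:
        raise ValueError("need at least one verdict pair")
    checks = [is_swap_consistent(a, b) for a, b in verdict_pairs]
    return 100.0 * sum(checks) / len(checks)
```

Under this reading, a judge that always prefers whichever output appears first would answer "A" in both passes and be flagged as inconsistent, which is the position bias the check is meant to expose.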
Case Runs
| # | Case | Model | Rubric Winner | Skill Score | Baseline Score |
|---|---|---|---|---|---|
| 1 | Japanese Empire Communication Empire | google/gemini-3-flash-preview | Skill | 7.0 | 4.5 |
| 2 | Japanese Empire Communication Empire | google/gemini-3-flash-preview | Skill | 7.0 | 4.0 |
| 3 | Japanese Empire Communication Empire | google/gemini-2.5-flash | Skill | 7.0 | 3.0 |
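The per-model figures in the Models table appear to be aggregates of these runs. The sketch below assumes a plain mean per model, rounded half-up to one decimal; that rule is an assumption, chosen because it reproduces the 4.3 baseline reported for google/gemini-3-flash-preview from its two runs (4.5 and 4.0). The function names are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Rows copied from the Case Runs table:
# (model, rubric winner, skill score, baseline score)
CASE_RUNS = [
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.5"),
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.0"),
    ("google/gemini-2.5-flash", "Skill", "7.0", "3.0"),
]


def one_decimal(value):
    """Round half-up to one decimal place (assumed rounding rule)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarize(runs):
    """Group runs by model and compute win rate and mean scores."""
    grouped = defaultdict(list)
    for model, winner, skill, baseline in runs:
        grouped[model].append((winner, Decimal(skill), Decimal(baseline)))

    summary = {}
    for model, rows in grouped.items():
        n = len(rows)
        wins = sum(1 for winner, _, _ in rows if winner == "Skill")
        summary[model] = {
            "runs": n,
            "rubric_win_pct": 100 * wins // n,
            "skill_score": one_decimal(sum(s for _, s, _ in rows) / n),
            "baseline_score": one_decimal(sum(b for _, _, b in rows) / n),
        }
    return summary
```

`Decimal` is used instead of `float` because Python's built-in `round(4.25, 1)` rounds half to even and would give 4.2 rather than the 4.3 shown in the table.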
Cases
| # | Case | Skill | Run Count |
|---|---|---|---|
| 1 | Japanese Empire Communication Empire | research.read-journal-article | 3 |
Skills
| # | Skill | Category | Stage | Runs | Cases |
|---|---|---|---|---|---|
| 1 | Read a Historical Journal Article | research | discover | 3 | 3 |