BARS Benchmarks

Benchmark Results

Reproducible evaluations of AI research skills with swap consistency checks and rubric-based scoring.

3

1

2

100%

Best Performing Model

google/gemini-3-flash-preview

Rubric Win: 100%
Skill Score: 7.0
Baseline Score: 4.3
#  Model                          Rubric Win  Blind Win  Skill Score (vs Baseline)  Swap Consistency  Cases
1  google/gemini-3-flash-preview  100%        100%       7.0 vs 4.3                 100%              2
2  google/gemini-2.5-flash        100%        N/A        7.0 vs 3.0                 100%              1
#  Case                                  Model                          Rubric Winner  Skill Score  Baseline Score
1  Japanese Empire Communication Empire  google/gemini-3-flash-preview  Skill          7.0          4.5
2  Japanese Empire Communication Empire  google/gemini-3-flash-preview  Skill          7.0          4.0
3  Japanese Empire Communication Empire  google/gemini-2.5-flash        Skill          7.0          3.0
#  Skill                              Category  Stage     Runs  Cases
1  Read a Historical Journal Article  research  discover  3     3
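
The per-model figures in the leaderboard look like simple aggregates of the per-case rows: for google/gemini-3-flash-preview, baseline scores of 4.5 and 4.0 average to 4.25, which matches the reported 4.3 if values are rounded half-up to one decimal place. The page does not state how it aggregates, so the sketch below is only an assumption: it takes the mean of each score column and the share of runs where the rubric winner is "Skill". The names `RUNS`, `mean_1dp`, and `summarize` are hypothetical and not part of any BARS tooling.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Per-case rows copied from the case table:
# (model, rubric winner, skill score, baseline score).
RUNS = [
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.5"),
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.0"),
    ("google/gemini-2.5-flash", "Skill", "7.0", "3.0"),
]


def mean_1dp(values):
    """Mean of decimal strings, rounded half-up to one decimal place.

    The rounding rule is an assumption. Decimal is used because
    Python's built-in round(4.25, 1) returns 4.2, not 4.3.
    """
    total = sum((Decimal(v) for v in values), Decimal(0))
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarize(runs):
    """Group runs by model and compute assumed leaderboard aggregates."""
    grouped = defaultdict(list)
    for model, winner, skill, baseline in runs:
        grouped[model].append((winner, skill, baseline))

    summary = {}
    for model, rows in grouped.items():
        wins = sum(1 for winner, _, _ in rows if winner == "Skill")
        summary[model] = {
            "rubric_win_pct": 100 * wins // len(rows),
            "skill_score": mean_1dp([s for _, s, _ in rows]),
            "baseline_score": mean_1dp([b for _, _, b in rows]),
            "cases": len(rows),
        }
    return summary


if __name__ == "__main__":
    for model, stats in summarize(RUNS).items():
        print(model, stats)
```

Under these assumptions the computed rubric win rate, skill score, baseline score, and case count agree with the leaderboard rows for both models. Blind win and swap consistency cannot be recomputed from the per-case table, because it does not list the underlying judgments.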