BARS Benchmarks

Benchmark Results

Reproducible evaluations of AI research skills with swap consistency checks and rubric-based scoring.

Overview

Case Runs: 3

Cases Evaluated: 1

Models Tested: 2

Best Rubric Win: 100%

Best Performing Model

google/gemini-3-flash-preview

Rubric Win: 100%, Skill Score: 7.0, Baseline Score: 4.3

Models

#	Model	Rubric Win	Blind Win	Skill Score (vs Baseline)	Swap Consistency	Cases
1	google/gemini-3-flash-preview	100%	100%	7.0 vs 4.3	100%	2
2	google/gemini-2.5-flash	100%	N/A	7.0 vs 3.0	100%	1
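
This page does not define how the Swap Consistency column is computed. One common reading in pairwise judging is that each comparison is run twice with the two outputs in opposite positions, and counts as consistent when the same side wins both times. A minimal sketch under that assumption follows; the `Judge` signature, the function names, and the toy judges are all hypothetical and are not taken from BARS.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: receives the outputs shown in positions
# A and B and returns the winning position, "A" or "B".
Judge = Callable[[str, str], str]


def is_swap_consistent(judge: Judge, skill_output: str, baseline_output: str) -> bool:
    """Return True if the judge picks the same side in both orderings."""
    first = judge(skill_output, baseline_output)   # skill output shown as A
    second = judge(baseline_output, skill_output)  # skill output shown as B
    winner_first = "skill" if first == "A" else "baseline"
    winner_second = "skill" if second == "B" else "baseline"
    return winner_first == winner_second


def swap_consistency_pct(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Percentage of (skill_output, baseline_output) pairs judged consistently."""
    if not pairs:
        raise ValueError("at least one pair is required")
    consistent = sum(is_swap_consistent(judge, s, b) for s, b in pairs)
    return 100.0 * consistent / len(pairs)


# Toy judges for illustration only.
def longer_wins(a: str, b: str) -> str:
    """Content-based judge: the longer text wins regardless of position."""
    return "A" if len(a) >= len(b) else "B"


def always_first(a: str, b: str) -> str:
    """Position-biased judge: whatever is shown first wins."""
    return "A"


if __name__ == "__main__":
    pairs = [("a long, detailed answer", "short")]
    print(swap_consistency_pct(longer_wins, pairs))
    print(swap_consistency_pct(always_first, pairs))
```

Under this reading, a judge that decides on content keeps its verdict when the order is swapped, while a position-biased judge flips it, so a low percentage would flag verdicts that depend on presentation order.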

Case Runs

#	Case	Model	Rubric Winner	Skill Score	Baseline Score
1	Japanese Empire Communication Empire	google/gemini-3-flash-preview	Skill	7.0	4.5
2	Japanese Empire Communication Empire	google/gemini-3-flash-preview	Skill	7.0	4.0
3	Japanese Empire Communication Empire	google/gemini-2.5-flash	Skill	7.0	3.0
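
The per-model figures in the Models table appear to be aggregates of the case runs above: for google/gemini-3-flash-preview, the baseline scores 4.5 and 4.0 average to 4.25, which matches the displayed 4.3 if the page rounds half up. The sketch below reproduces that arithmetic from the rows of this table. The averaging and the half-up rounding rule are assumptions, and the function names are illustrative only.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Rows copied from the Case Runs table:
# (model, rubric winner, skill score, baseline score).
CASE_RUNS = [
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.5"),
    ("google/gemini-3-flash-preview", "Skill", "7.0", "4.0"),
    ("google/gemini-2.5-flash", "Skill", "7.0", "3.0"),
]


def round_half_up(value: Decimal) -> Decimal:
    """Round to one decimal place, with ties going up (assumed display rule)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarize(case_runs):
    """Group runs by model and compute win rate and mean scores."""
    grouped = defaultdict(list)
    for model, winner, skill, baseline in case_runs:
        grouped[model].append((winner, Decimal(skill), Decimal(baseline)))

    summary = {}
    for model, runs in grouped.items():
        n = len(runs)
        wins = sum(1 for winner, _, _ in runs if winner == "Skill")
        summary[model] = {
            "runs": n,
            "rubric_win_pct": round(100 * wins / n),
            "skill_score": round_half_up(sum(s for _, s, _ in runs) / n),
            "baseline_score": round_half_up(sum(b for _, _, b in runs) / n),
        }
    return summary


if __name__ == "__main__":
    for model, stats in summarize(CASE_RUNS).items():
        print(model, stats)
```

`Decimal` is used instead of floats so that the tie at 4.25 is rounded by an explicit rule and does not depend on floating-point behavior.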

Cases

#	Case	Skill	Run Count
1	Japanese Empire Communication Empire	research.read-journal-article	3

Skills

#	Skill	Category	Stage	Runs	Cases
1	Read a Historical Journal Article	research	discover	3	3