Set up the model, dataset, reasoning method, and parameters for your next evaluation run.
Choose a clear title that reads well in tables, exports, and comparisons.
Select the model, inference provider, and reasoning strategy.
Fast, efficient default
Baseline completion without explicit reasoning scaffolding
Pick the best available provider and fall back automatically
Try the primary provider first and fall back only on transient failures
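A minimal sketch of the primary-first policy, assuming providers are plain callables and that transient failures (timeouts, rate limits) surface as a hypothetical `TransientError`. None of these names come from the product's API.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or rate limits."""


def call_with_fallback(prompt, primary, fallbacks):
    """Try the primary provider, then each fallback, on transient failures only."""
    for provider in [primary, *fallbacks]:
        try:
            return provider(prompt)
        except TransientError:
            # Transient failure: move on to the next provider in order.
            continue
    raise TransientError("all providers failed transiently")
```

Any other exception (for example an invalid request) propagates immediately, so a fallback provider is never tried for a non-transient error.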
Only used by the adaptive policy.
Number of initial requests to distribute round-robin across providers before routing on telemetry.
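One way to picture the warm-up behavior, as a sketch only: the router below cycles through providers for the first `warmup_requests` calls, then prefers the provider with the lowest mean observed latency. The class name, the `warmup_requests` parameter, and the choice of latency as the telemetry signal are all assumptions made for illustration.

```python
from collections import defaultdict


class AdaptiveRouter:
    """Hypothetical router: round-robin warm-up, then exploit latency telemetry."""

    def __init__(self, providers, warmup_requests):
        self.providers = list(providers)
        self.warmup_requests = warmup_requests
        self.sent = 0
        self.latencies = defaultdict(list)  # provider name -> observed latencies (s)

    def choose(self):
        if self.sent < self.warmup_requests or not self.latencies:
            # Warm-up phase (or no data yet): cycle through providers in order.
            choice = self.providers[self.sent % len(self.providers)]
        else:
            # Exploit phase: pick the provider with the lowest mean latency.
            choice = min(
                self.latencies,
                key=lambda p: sum(self.latencies[p]) / len(self.latencies[p]),
            )
        self.sent += 1
        return choice

    def record(self, provider, latency_s):
        self.latencies[provider].append(latency_s)
```

A larger warm-up count gives every provider more samples before the router commits, at the cost of sending more early requests to slower providers.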
Choose the evaluation dataset, retrieval strategy, and runtime parameters.
Small diagnostic set for single-hop factual recall; useful for smoke-test comparisons, not for broad generalization claims
Max 500 samples on the current API.
Run the model on the prompt only
Control randomness, output length, and reproducibility.
Enable batching and caching to optimize throughput and reduce costs.
Parallelize eligible prompts to reduce wall-clock time.
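As a rough illustration of the idea, the sketch below fans prompts out over a thread pool and returns results in input order. The `complete` callable and the `max_workers` default are placeholders, not the product's actual batching mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(complete, prompts, max_workers=4):
    """Run `complete` on each prompt concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complete, prompts))
```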
Reuse identical prompts where deterministic settings allow it.
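A minimal sketch of prompt reuse, assuming "deterministic settings" means a temperature of 0 and that a response is safe to reuse only when model, prompt, and sampling parameters all match. The function names and the `complete` signature are hypothetical.

```python
import hashlib
import json


def cache_key(model, prompt, temperature, max_tokens):
    """Stable key over everything that can change the output."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_complete(cache, complete, model, prompt, temperature=0.0, max_tokens=256):
    """Serve identical deterministic requests from `cache` (a plain dict)."""
    if temperature != 0.0:
        # Assumption: sampled outputs are not reusable, so bypass the cache.
        return complete(model, prompt, temperature, max_tokens)
    key = cache_key(model, prompt, temperature, max_tokens)
    if key not in cache:
        cache[key] = complete(model, prompt, temperature, max_tokens)
    return cache[key]
```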
Attach thresholds so new attempts can be checked against a pinned baseline automatically.
Use these thresholds when comparing this experiment against a pinned baseline.
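A baseline check of this kind can be sketched as follows, assuming higher-is-better metrics and thresholds expressed as the maximum allowed drop per metric. The metric names and the dictionary shapes are illustrative, not the backend's actual schema.

```python
def check_against_baseline(baseline, candidate, max_drop):
    """Return the metrics whose drop from the baseline exceeds the allowed threshold.

    All three arguments map metric name -> float; metrics are higher-is-better.
    """
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            failures.append(metric)
    return failures
```

For example, with a baseline accuracy of 0.80, a new attempt at 0.75, and an allowed drop of 0.02, the check flags `accuracy` as a regression; an empty list means the attempt passed.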
Optional metadata and deterministic graders for power users.
Attach this run to a saved prompt lineage if you are tracking prompt versions separately.
Paste a backend-compatible graders config when you want deterministic checks without leaving the UI.
Quick-start templates for common evaluation scenarios.
Start with a smaller QA dataset and compare naive prompting against chain-of-thought (CoT) prompting.
Use the knowledge base dataset with hybrid retrieval to inspect grounded answers.
Run ReAct over the tool-use benchmark with explicit tool access.
Overview of the experiment configuration.
Tips for understanding how settings affect run time and cost.