LLMForge
Evaluation Console
Experiment builder

Configure and launch a new experiment

Set up the model, dataset, reasoning method, and parameters for your next evaluation run.

Basics
Name the experiment

Choose a clear title that reads well in tables, exports, and comparisons.

Configuration
Model and reasoning

Select the model, inference provider, and reasoning strategy.

Fast, efficient default

Baseline completion without explicit reasoning scaffolding

Pick the best available provider and fall back automatically

Try the primary provider first and fall back only on transient failures
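A minimal sketch of the "primary first, fall back only on transient failures" behavior, for illustration only. The `TransientError` class and `call_with_fallback` function are hypothetical names, not part of LLMForge; each provider is assumed to be a plain callable that takes a prompt and returns text.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or rate limits."""


def call_with_fallback(prompt, providers):
    """Try providers in order; move to the next one only after a transient failure.

    Any other exception (for example, an invalid request) propagates immediately,
    because retrying on another provider would not fix it.
    """
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except TransientError as exc:
            last_error = exc  # transient: try the next provider in the list
    raise RuntimeError("all providers failed") from last_error
```

With this shape, the first entry in `providers` plays the role of the primary provider and later entries are only reached when an earlier one raises `TransientError`.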

Only used by the adaptive policy.

Initial requests to round-robin before exploiting telemetry.
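One way to picture the warm-up setting is sketched below: the first requests are spread round-robin across providers to gather telemetry, and later requests go to the provider that has looked best so far. This is an assumption-laden illustration, not the console's actual policy. `AdaptiveRouter`, `warmup_requests`, and the choice of mean latency as the telemetry signal are all hypothetical.

```python
from collections import defaultdict


class AdaptiveRouter:
    """Round-robin for the first `warmup_requests` calls, then exploit telemetry."""

    def __init__(self, providers, warmup_requests):
        self.providers = list(providers)
        self.warmup_requests = warmup_requests
        self.sent = 0
        self.latencies = defaultdict(list)  # provider name -> observed latencies (s)

    def choose(self):
        if self.sent < self.warmup_requests or not self.latencies:
            # Exploration phase: cycle through providers in order.
            choice = self.providers[self.sent % len(self.providers)]
        else:
            # Exploitation phase: pick the lowest mean latency seen so far.
            choice = min(
                self.latencies,
                key=lambda p: sum(self.latencies[p]) / len(self.latencies[p]),
            )
        self.sent += 1
        return choice

    def record(self, provider, latency_s):
        """Store one latency observation for a provider."""
        self.latencies[provider].append(latency_s)
```

A larger warm-up value gives every provider more samples before the router commits, at the cost of sending more early requests to slower providers.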

Evaluation setup
Dataset and runtime

Choose the evaluation dataset, retrieval strategy, and runtime parameters.

Small diagnostic set for single-hop factual recall; useful for smoke comparisons, not broad generalization claims

Max 500 samples on the current API.

Run the model on the prompt only

Generation
Hyperparameters

Control randomness, output length, and reproducibility.

Optimization
Execution options

Enable batching and caching to optimize throughput and reduce costs.

Parallelize eligible prompts to reduce wall-clock time.

Reuse identical prompts where deterministic settings allow it.
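The caching rule above can be sketched as a cache-key function that refuses to produce a key when the settings are not deterministic. This is a hedged illustration: `cache_key` is a hypothetical helper, and treating "temperature 0 or a fixed seed" as the deterministic case is an assumption, not documented LLMForge behavior.

```python
import hashlib
import json


def cache_key(model, prompt, temperature, max_tokens, seed=None):
    """Return a stable key for a request, or None if it should not be cached.

    Assumption: a request is reusable only when sampling is deterministic,
    taken here to mean temperature 0 or an explicitly fixed seed.
    """
    if temperature > 0 and seed is None:
        return None  # non-deterministic sampling: never reuse results
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "seed": seed,
        },
        sort_keys=True,  # stable field order so equal requests hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the model and generation parameters in the key keeps a cached answer from being reused after any setting that could change the output is edited.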

Regression
Regression gates

Attach thresholds so new attempts can be checked against a pinned baseline automatically.

Use these thresholds when comparing this experiment against a pinned baseline.
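As a rough sketch of how threshold-based gates might be evaluated against a pinned baseline, consider the function below. The metric names (`accuracy`, `latency_s`), the threshold parameters, and their default values are all hypothetical and chosen only for illustration; the real gate fields are whatever the form exposes.

```python
def check_gates(baseline, candidate, max_accuracy_drop=0.02, max_latency_increase=0.10):
    """Return the names of gates the candidate run fails relative to the baseline.

    `baseline` and `candidate` are dicts with hypothetical keys:
    "accuracy" (fraction, 0..1) and "latency_s" (seconds).
    """
    failures = []
    # Gate 1: accuracy may not drop by more than the allowed absolute amount.
    if baseline["accuracy"] - candidate["accuracy"] > max_accuracy_drop:
        failures.append("accuracy")
    # Gate 2: latency may not grow by more than the allowed relative amount.
    if candidate["latency_s"] > baseline["latency_s"] * (1 + max_latency_increase):
        failures.append("latency")
    return failures
```

An empty list means the new attempt passes every gate; any returned name identifies the threshold that was exceeded.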

Advanced
Prompt lineage and graders

Optional metadata and deterministic graders for power users.

Attach this run to a saved prompt lineage if you are tracking prompt versions separately.

Paste a backend-compatible graders config when you want deterministic checks without leaving the UI.

Quick starts
Preset configurations

Quick-start templates for common evaluation scenarios.

Summary
Untitled experiment

Overview of the experiment configuration.

NAIVE · trivia_qa · auto
Model: Llama 3.2 (1B)
Samples: 10
Max tokens: 150
Retrieval: none
Complexity: 12%
Field notes
Execution trade-offs

Tips for understanding how settings affect run time and cost.

Retrieval adds context-fetch overhead, especially with hybrid or reranked modes.
More samples and larger max tokens increase result payload size and evaluation time.
Batching and caching are safe to enable: the backend already profiles both in result exports, so their effect on each run stays visible.