LLMForge

Basics

Name the experiment

Choose a clear title that reads well in tables, exports, and comparisons.

Experiment name

Description

Configuration

Model and reasoning

Select the model, inference provider, and reasoning strategy.

Model

Fast, efficient default

Reasoning method

Baseline completion without explicit reasoning scaffolding

Provider

Pick the best available provider and fall back automatically

Routing policy

Try the primary provider first and fall back only on transient failures

Adaptive epsilon

Only used by the adaptive policy.

Exploration window

Initial requests to round-robin before exploiting telemetry.

Strict comparison mode

Pin the first available provider and disable cross-provider fallback for scored runs.
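As a rough illustration of how the adaptive policy settings could interact, the sketch below round-robins across providers for the first few requests (the exploration window) and then mostly picks the provider with the best observed success rate, exploring a random one with probability epsilon. The class name `AdaptiveRouter`, its fields, and the use of success rate as the telemetry signal are all assumptions made for this example; they are not taken from LLMForge itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AdaptiveRouter:
    """Hypothetical epsilon-greedy provider router (illustrative only)."""

    providers: list[str]
    epsilon: float = 0.1         # assumed meaning of "Adaptive epsilon"
    exploration_window: int = 6  # assumed meaning of "Exploration window"
    rng: random.Random = field(default_factory=random.Random)
    requests_seen: int = 0
    attempts: dict[str, int] = field(default_factory=dict)
    successes: dict[str, int] = field(default_factory=dict)

    def success_rate(self, provider: str) -> float:
        # Providers with no recorded attempts score 0.0.
        tried = self.attempts.get(provider, 0)
        return self.successes.get(provider, 0) / tried if tried else 0.0

    def choose(self) -> str:
        index = self.requests_seen
        self.requests_seen += 1
        # Round-robin through providers during the exploration window.
        if index < self.exploration_window:
            return self.providers[index % len(self.providers)]
        # Afterwards, explore a random provider with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.providers)
        # Otherwise exploit the provider with the best observed success rate.
        return max(self.providers, key=self.success_rate)

    def record(self, provider: str, ok: bool) -> None:
        # Feed the outcome of a request back into the telemetry counters.
        self.attempts[provider] = self.attempts.get(provider, 0) + 1
        self.successes[provider] = self.successes.get(provider, 0) + int(ok)
```

Under this reading, strict comparison mode would bypass the router entirely and always return a single pinned provider, so scored runs are not mixed across providers.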

Evaluation setup

Dataset and runtime

Choose the evaluation dataset, retrieval strategy, and runtime parameters.

Dataset

Small diagnostic set for single-hop factual recall; useful for smoke comparisons, not broad generalization claims

Samples

Max 500 samples on the current API.

Retrieval method

Run the model on the prompt only

Generation

Hyperparameters

Control randomness, output length, and reproducibility.

Temperature

Max tokens

Seed

Optimization

Execution options

Enable batching and caching to optimize throughput and reduce costs.

Enable batching

Parallelize eligible prompts to reduce wall-clock time.

Enable caching

Reuse identical prompts where deterministic settings allow it.
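One way to read "where deterministic settings allow it" is that a response is only reused when the request would produce the same output again. The sketch below builds a cache key from the request settings and refuses to cache sampled requests that have no fixed seed. The function name `cache_key` and the determinism rule (temperature 0, or any fixed seed) are assumptions for illustration, not the backend's documented behavior.

```python
import hashlib
import json
from typing import Optional


def cache_key(model: str, prompt: str, temperature: float,
              max_tokens: int, seed: Optional[int]) -> Optional[str]:
    """Return a stable key for a cacheable request, or None if it should not be cached."""
    # Assumption: a request is deterministic if it is greedy (temperature 0)
    # or sampled with a fixed seed.
    deterministic = temperature == 0 or seed is not None
    if not deterministic:
        return None
    # Serialize with sorted keys so the same settings always hash identically.
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "seed": seed,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the model, max tokens, and seed in the key means that changing any of them produces a cache miss rather than a stale answer.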

Regression

Regression gates

Attach thresholds so new attempts can be checked against a pinned baseline automatically.

Enable regression gates

Use these thresholds when comparing this experiment against a pinned baseline.
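A regression gate can be thought of as a per-metric limit on how far a new attempt may fall below the pinned baseline. The sketch below is a minimal version under that assumption: the `Gate` type, the `check_gates` function, and the metric name in the usage example are hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    max_drop: float  # largest tolerated absolute decrease versus the baseline


def check_gates(baseline: dict[str, float],
                candidate: dict[str, float],
                gates: list[Gate]) -> list[str]:
    """Return one message per gate that the candidate run fails."""
    failures = []
    for gate in gates:
        # A positive drop means the candidate scored lower than the baseline.
        drop = baseline[gate.metric] - candidate[gate.metric]
        if drop > gate.max_drop:
            failures.append(
                f"{gate.metric}: dropped {drop:.3f} (allowed {gate.max_drop:.3f})"
            )
    return failures


# Usage: an empty list means every gate passed.
gates = [Gate(metric="accuracy", max_drop=0.05)]
print(check_gates({"accuracy": 0.80}, {"accuracy": 0.78}, gates))
```

Gates on metrics where lower is better (latency, cost) would need the sign of the comparison flipped; that case is left out here to keep the sketch short.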

Advanced

Prompt lineage and graders

Optional metadata and deterministic graders for power users.

Prompt version ID

Attach this run to a saved prompt lineage if you are tracking prompt versions separately.

Deterministic graders JSON

Paste a backend-compatible graders config when you want deterministic checks without leaving the UI.

Quick starts

Preset configurations

Quick-start templates for common evaluation scenarios.

Summary

Untitled experiment

Overview of the experiment configuration.

NAIVE, trivia_qa, auto

Model: Llama 3.2 (1B)

Samples: 10

Max tokens: 150

Retrieval: none

Complexity: 12%

Field notes

Execution trade-offs

Tips for understanding how settings affect run time and cost.

Retrieval adds context-fetch overhead, especially with hybrid or reranked modes.

More samples and larger max tokens increase result payload size and evaluation time.

Batching and caching are safe to enable because the backend already profiles them in result exports.

Configure and launch a new experiment