luml.experiments.evaluation.evaluate
evaluate
def evaluate(
    eval_dataset: list[EvalItem],
    inference_fn: Callable[[dict[str, Any]], Any],
    scorers: list[BaseScorer],
    dataset_id: str,
    experiment_tracker: ExperimentTracker,
    n_threads: int = 1
) -> EvalResults
Evaluates a dataset using the given inference function and scorers.
This function processes each evaluation item from the input dataset, applies the provided inference function to generate predictions, and scores these predictions using the specified scorers. Results are aggregated across the dataset and returned.
Arguments:
eval_dataset: list[EvalItem] - The dataset to evaluate, where each item contains the data needed for inference and scoring.
inference_fn: Callable[[dict[str, Any]], Any] - A callable that generates a prediction for a single evaluation input. It receives a dictionary of input data and returns the corresponding prediction.
scorers: list[BaseScorer] - A list of scorer objects used to evaluate the predictions for each item in the dataset.
dataset_id: str - A unique identifier for the dataset being evaluated.
experiment_tracker: ExperimentTracker - An object for tracking evaluation results and metadata during the experiment.
n_threads: int - The number of threads to use for parallel evaluation. Defaults to 1, which performs evaluation sequentially.
Returns:
EvalResults - An object containing detailed evaluation results for each item, aggregated scores across the dataset, and the associated dataset ID.
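Example:
A minimal usage sketch follows. Only the evaluate signature is documented above, so the EvalItem fields, the ExactMatchScorer class, the pre-configured tracker, and the results attribute shown here are illustrative assumptions rather than confirmed parts of the luml API.

from typing import Any

from luml.experiments.evaluation import evaluate

# Assumed: EvalItem carries an input dict plus a reference answer.
eval_dataset = [
    EvalItem(inputs={"question": "2 + 2?"}, expected="4"),
    EvalItem(inputs={"question": "Capital of France?"}, expected="Paris"),
]

def inference_fn(inputs: dict[str, Any]) -> Any:
    # Matches Callable[[dict[str, Any]], Any]: one input dict in, one prediction out.
    return "4" if "2 + 2" in inputs["question"] else "Paris"

results = evaluate(
    eval_dataset=eval_dataset,
    inference_fn=inference_fn,
    scorers=[ExactMatchScorer()],   # assumed BaseScorer subclass
    dataset_id="qa-smoke-test",
    experiment_tracker=tracker,     # assumed pre-configured ExperimentTracker
    n_threads=4,                    # parallel evaluation; the default of 1 runs sequentially
)

The returned EvalResults holds the per-item results and aggregated scores described under Returns; inspect it (or the experiment tracker) to retrieve the dataset-level metrics.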