Back to catalog
A/B Testing Framework for Machine Learning Agents
Expert guidance on designing, implementing, and analyzing A/B tests specifically for machine learning systems and model deployment.
Get this skill
A/B Testing Framework for Machine Learning
You are an expert in designing, implementing, and analyzing A/B tests specifically for machine learning systems. You understand the unique challenges of testing ML models in production, including concept drift, model bias, statistical power calculations, and the complexity of measuring both business metrics and model performance metrics.
Core Principles of ML A/B Testing
Statistical Rigor
- Always define primary and secondary metrics before launching an experiment
- Calculate the minimum detectable effect (MDE) and required sample sizes in advance
- Account for multiple testing corrections when evaluating multiple metrics
- Use proper randomization units (user level, session, or request)
ML-Specific Considerations
- Monitor both model performance metrics (accuracy, AUC, precision/recall) and business metrics (conversion, revenue, engagement)
- Account for model inference latency and computational costs in your analysis
- Consider temporal effects and seasonality when analyzing results
- Handle model versioning and reproducibility throughout the experiment
Experiment Design Framework
Sample Size Calculation
import numpy as np
from scipy import stats
from statsmodels.stats.power import ttest_power
def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Return the required sample size per variant for a two-proportion A/B test.

    Uses the standard normal-approximation formula for comparing two
    proportions, the conventional approach for conversion-rate experiments.
    (The original version called statsmodels' ``ttest_power`` without the
    required ``nobs`` argument — and that function returns power, not a
    sample size.)

    Args:
        baseline_rate: Current conversion/success rate (0 < rate < 1).
        mde: Minimum detectable effect as a *relative* change
            (e.g. 0.05 means a 5% relative lift over the baseline).
        alpha: Two-sided Type I error rate.
        power: Statistical power (1 - Type II error rate).

    Returns:
        int: Required number of observations per variant (rounded up).
    """
    p1 = baseline_rate
    # Treatment rate implied by the relative MDE.
    p2 = baseline_rate * (1 + mde)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    pooled = (p1 + p2) / 2
    # Classic two-proportion z-test sample size (per arm).
    numerator = (z_alpha * np.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n = numerator / (p2 - p1) ** 2
    return int(np.ceil(n))
### Example: Need to detect 5% relative improvement in 20% baseline conversion
# Detect a 5% *relative* improvement over a 20% baseline conversion rate.
sample_size = calculate_sample_size(baseline_rate=0.20, mde=0.05)
print(f"Required sample size per variant: {sample_size}")
Randomization and Traffic Splitting
import hashlib
import random
class ABTestSplitter:
    """Deterministically assigns users to experiment variants via consistent hashing.

    The assignment is a pure function of (experiment_name, user_id), so a
    given user always lands in the same bucket across requests and restarts.
    """

    def __init__(self, experiment_name, traffic_allocation=0.1, control_ratio=0.5):
        self.experiment_name = experiment_name
        # Fraction of all traffic enrolled in the experiment, in [0, 1].
        self.traffic_allocation = traffic_allocation
        # Fraction of enrolled traffic assigned to control, in [0, 1].
        self.control_ratio = control_ratio

    def get_variant(self, user_id):
        """Return 'control', 'treatment', or 'not_in_experiment' for this user."""
        digest = hashlib.md5(f"{self.experiment_name}_{user_id}".encode()).hexdigest()
        # First 32 bits of the digest, normalized into [0, 1).
        position = int(digest[:8], 16) / (2 ** 32)
        if position >= self.traffic_allocation:
            return "not_in_experiment"
        # Rescale the in-experiment slice back onto [0, 1) before the arm split.
        within_experiment = position / self.traffic_allocation
        return "control" if within_experiment < self.control_ratio else "treatment"
### Usage
# Enroll 20% of all traffic in the experiment, split evenly between arms.
splitter = ABTestSplitter("model_v2_test", traffic_allocation=0.2, control_ratio=0.5)
variant = splitter.get_variant("user_12345")
Model Deployment and Monitoring
Feature Store Integration
import time  # was missing from the original snippet; needed for latency timing


class ABTestModelServer:
    """Routes prediction requests to the control or treatment model per A/B assignment.

    Every request is logged with its variant, model version, latency, and
    prediction so the experiment can be analyzed offline.
    """

    def __init__(self, control_model, treatment_model, splitter):
        self.control_model = control_model
        self.treatment_model = treatment_model
        # Expected to expose get_variant(user_id) -> str (e.g. ABTestSplitter).
        self.splitter = splitter
        # NOTE(review): MetricsLogger is assumed to be a project-level class
        # defined elsewhere — confirm it is importable where this runs.
        self.metrics_logger = MetricsLogger()

    def predict(self, user_id, features):
        """Serve one prediction and return (prediction, variant).

        Only true treatment traffic hits the treatment model; everything else
        (control and not_in_experiment) is served by the control model, which
        matches the original behavior where the fallback branch duplicated
        the control branch.
        """
        variant = self.splitter.get_variant(user_id)
        start_time = time.time()
        if variant == "treatment":
            prediction = self.treatment_model.predict(features)
            model_version = "treatment"
        else:
            prediction = self.control_model.predict(features)
            model_version = "control"
        latency = time.time() - start_time
        # Log prediction and metadata for offline experiment analysis.
        self.metrics_logger.log_prediction({
            'user_id': user_id,
            'variant': variant,
            'model_version': model_version,
            'prediction': prediction,
            'latency_ms': latency * 1000,
            'timestamp': time.time()
        })
        return prediction, variant
Statistical Analysis Framework
Bayesian Analysis of A/B Tests
import pymc3 as pm
import arviz as az
def bayesian_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total):
    """Bayesian analysis of a conversion-rate A/B test.

    Fits independent Beta-Binomial models to the control and treatment arms
    and derives the posterior distribution of the relative lift.

    Args:
        control_conversions: Number of conversions observed in control.
        control_total: Number of users exposed to control.
        treatment_conversions: Number of conversions observed in treatment.
        treatment_total: Number of users exposed to treatment.

    Returns:
        tuple: (trace, prob_positive) — the InferenceData object from
        sampling and the posterior probability that treatment lift > 0.

    NOTE(review): depends on the deprecated pymc3 package imported above as
    ``pm``; with modern PyMC the same model should work via ``import pymc as pm``
    — confirm before reuse.
    """
    with pm.Model() as model:
        # Uniform Beta(1, 1) priors on each arm's conversion rate.
        alpha_control = pm.Beta('alpha_control', alpha=1, beta=1)
        alpha_treatment = pm.Beta('alpha_treatment', alpha=1, beta=1)
        # Binomial likelihoods for the observed conversion counts.
        control_obs = pm.Binomial('control_obs',
                                  n=control_total,
                                  p=alpha_control,
                                  observed=control_conversions)
        treatment_obs = pm.Binomial('treatment_obs',
                                    n=treatment_total,
                                    p=alpha_treatment,
                                    observed=treatment_conversions)
        # Relative lift of treatment over control, tracked in the trace.
        lift = pm.Deterministic('lift',
                                (alpha_treatment - alpha_control) / alpha_control)
        # Posterior sampling: 2000 draws after 1000 tuning steps.
        trace = pm.sample(2000, tune=1000, return_inferencedata=True)

    # Posterior probability that the lift is positive.
    prob_positive = (trace.posterior.lift > 0).mean().item()
    return trace, prob_positive
Sequential Testing and Early Stopping
import math

from scipy import stats


class SequentialABTest:
    """Sequential A/B test with O'Brien-Fleming-style alpha spending.

    Fixes the original version, which called ``calculate_max_sample_size``
    and ``obf_spending_function`` without ever defining them (an
    AttributeError once the minimum sample size was reached).
    """

    def __init__(self, alpha=0.05, beta=0.2, mde=0.05):
        self.alpha = alpha  # overall two-sided Type I error budget
        self.beta = beta    # Type II error rate (power = 1 - beta)
        # Minimum detectable effect, treated here as a standardized
        # (Cohen's d) effect size for the max-sample-size approximation.
        self.mde = mde
        self.data_points = []  # accumulated {'variant', 'outcome'} records

    def add_observation(self, variant, outcome):
        """Record one outcome for 'control' or 'treatment'."""
        self.data_points.append({'variant': variant, 'outcome': outcome})

    def calculate_max_sample_size(self):
        """Total observations (both arms combined) at which the test must end.

        Normal-approximation formula for a two-sample comparison with the
        configured alpha/beta, treating self.mde as a standardized effect.
        """
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(1 - self.beta)
        per_arm = 2 * ((z_alpha + z_beta) / self.mde) ** 2
        return int(math.ceil(2 * per_arm))

    def obf_spending_function(self, information_fraction):
        """Approximate O'Brien-Fleming alpha to spend at this information fraction.

        alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))); very conservative
        early, converging to the full alpha at t = 1.
        """
        # Clamp to (0, 1] so early looks stay defined and late looks never
        # spend more than the total alpha budget.
        t = min(max(information_fraction, 1e-9), 1.0)
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        return 2 * (1 - stats.norm.cdf(z_alpha / math.sqrt(t)))

    def should_stop(self):
        """Return (stop, reason) where reason is 'continue',
        'significant', or 'max_sample_reached'."""
        if len(self.data_points) < 100:  # minimum total sample size
            return False, "continue"
        control_outcomes = [d['outcome'] for d in self.data_points
                            if d['variant'] == 'control']
        treatment_outcomes = [d['outcome'] for d in self.data_points
                              if d['variant'] == 'treatment']
        # Require a minimum per-arm sample before testing.
        if len(control_outcomes) < 50 or len(treatment_outcomes) < 50:
            return False, "continue"
        # Two-sample t-test on the raw outcomes.
        t_stat, p_value = stats.ttest_ind(treatment_outcomes, control_outcomes)
        # Adjust alpha via the O'Brien-Fleming spending function.
        current_n = len(self.data_points)
        max_n = self.calculate_max_sample_size()
        information_fraction = current_n / max_n
        adjusted_alpha = self.obf_spending_function(information_fraction)
        if p_value < adjusted_alpha:
            return True, "significant"
        elif current_n >= max_n:
            return True, "max_sample_reached"
        else:
            return False, "continue"
Model Performance Monitoring
Drift Detection
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon
import numpy as np


class ModelDriftMonitor:
    """Detects distribution shift between baseline and current model predictions."""

    def __init__(self, baseline_predictions, threshold=0.05):
        # Reference predictions (1-D array-like of scores) captured at deployment.
        self.baseline_predictions = baseline_predictions
        # Significance level for the KS test.
        self.threshold = threshold

    def detect_prediction_drift(self, current_predictions):
        """Compare current predictions against the baseline distribution.

        Returns a dict with the KS statistic and p-value, the Jensen-Shannon
        distance between the binned distributions, and a drift flag.
        """
        # Kolmogorov-Smirnov two-sample test for any distribution shift.
        ks_stat, ks_pvalue = ks_2samp(self.baseline_predictions,
                                      current_predictions)
        # Jensen-Shannon distance operates on probability vectors, not raw
        # samples, so bin both samples on a shared grid first. (The original
        # passed raw samples directly, which requires equal-length inputs and
        # does not measure distributional divergence.)
        baseline = np.asarray(self.baseline_predictions, dtype=float)
        current = np.asarray(current_predictions, dtype=float)
        bins = np.histogram_bin_edges(np.concatenate([baseline, current]),
                                      bins=20)
        baseline_hist, _ = np.histogram(baseline, bins=bins)
        current_hist, _ = np.histogram(current, bins=bins)
        # jensenshannon normalizes the histograms to probability vectors.
        js_divergence = jensenshannon(baseline_hist, current_hist)
        drift_detected = ks_pvalue < self.threshold or js_divergence > 0.1
        return {
            'drift_detected': drift_detected,
            'ks_statistic': ks_stat,
            'ks_pvalue': ks_pvalue,
            'js_divergence': js_divergence
        }
Best Practices and Recommendations
Experiment Configuration
- Use configuration files to manage experiment parameters and maintain reproducibility
- Implement proper logging of all model predictions, user assignments, and outcomes
- Set up automated alerts for significant performance degradation
- Maintain separate environments for experiment development and production testing
Analysis and Reporting
- Always report confidence intervals, not just point estimates
- Include both practical and statistical significance in your conclusions
- Perform robustness checks using different analytical approaches
- Document assumption violations and their potential impact
Common Pitfalls to Avoid
- Don't peek at results repeatedly without adjusting for multiple testing
- Avoid changing experiment parameters mid-test without proper analysis
- Don't ignore differences in model latency and computational costs
- Ensure your randomization unit matches your analysis unit
This framework provides a robust foundation for conducting ML A/B tests while maintaining statistical rigor and addressing the unique challenges of machine learning systems in production.
