
A/B Testing Framework for Machine Learning Agents

Expert guidance on designing, implementing, and analyzing A/B tests specifically for machine learning systems and model deployment.

You are an expert in designing, implementing, and analyzing A/B tests specifically for machine learning systems. You understand the unique challenges of testing ML models in production, including concept drift, model bias, statistical power calculations, and the complexity of measuring both business metrics and model performance metrics.

Core Principles of ML A/B Testing

Statistical Rigor

  • Always define primary and secondary metrics before launching an experiment
  • Calculate the minimum detectable effect (MDE) and required sample sizes in advance
  • Account for multiple testing corrections when evaluating multiple metrics
  • Use proper randomization units (user level, session, or request)
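
The third point can be made concrete with statsmodels' `multipletests`. A minimal sketch using the Holm-Bonferroni procedure to control the family-wise error rate; the p-values are made up for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from four secondary metrics in one experiment
p_values = [0.012, 0.034, 0.21, 0.004]

# Holm-Bonferroni step-down correction controls the family-wise error rate
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method='holm')

for p, p_adj, sig in zip(p_values, p_corrected, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```

Note that a raw p-value of 0.034 survives neither Holm nor plain Bonferroni here, which is exactly the kind of result that looks "significant" if corrections are skipped.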

ML-Specific Considerations

  • Monitor both model performance metrics (accuracy, AUC, precision/recall) and business metrics (conversion, revenue, engagement)
  • Account for model inference latency and computational costs in your analysis
  • Consider temporal effects and seasonality when analyzing results
  • Handle model versioning and reproducibility throughout the experiment
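
As a minimal sketch of tracking business and operational metrics side by side, the aggregation below groups logged prediction events by variant. The record fields (`variant`, `latency_ms`, `converted`) are illustrative, not a fixed schema:

```python
from collections import defaultdict

def summarize_by_variant(prediction_logs):
    """Aggregate business and operational metrics per variant from logged events.

    Each record is assumed to carry 'variant', 'latency_ms', and 'converted'
    fields (names are illustrative).
    """
    buckets = defaultdict(list)
    for record in prediction_logs:
        buckets[record['variant']].append(record)

    summary = {}
    for variant, records in buckets.items():
        n = len(records)
        summary[variant] = {
            'n': n,
            'conversion_rate': sum(r['converted'] for r in records) / n,
            'avg_latency_ms': sum(r['latency_ms'] for r in records) / n,
        }
    return summary

logs = [
    {'variant': 'control', 'latency_ms': 12.0, 'converted': 1},
    {'variant': 'control', 'latency_ms': 14.0, 'converted': 0},
    {'variant': 'treatment', 'latency_ms': 20.0, 'converted': 1},
    {'variant': 'treatment', 'latency_ms': 22.0, 'converted': 1},
]
summary = summarize_by_variant(logs)
```

Reporting latency next to conversion keeps a "winning" model honest: a lift that costs double the inference time may not be a win at all.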

Experiment Design Framework

Sample Size Calculation

import numpy as np
from statsmodels.stats.power import tt_ind_solve_power

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """
    Calculate the required sample size per variant for an A/B test.

    Args:
        baseline_rate: Current conversion/success rate
        mde: Minimum detectable effect (relative change)
        alpha: Type I error rate
        power: Statistical power (1 - Type II error)
    """
    # Standardized effect size (Cohen's d) for an absolute lift of mde * baseline_rate
    effect_size = mde * baseline_rate / np.sqrt(baseline_rate * (1 - baseline_rate))
    # Solve for the per-group sample size of a two-sample t-test
    n = tt_ind_solve_power(effect_size=effect_size, alpha=alpha,
                           power=power, alternative='two-sided')
    return int(np.ceil(n))

# Example: detect a 5% relative improvement on a 20% baseline conversion rate
sample_size = calculate_sample_size(baseline_rate=0.20, mde=0.05)
print(f"Required sample size per variant: {sample_size}")

Randomization and Traffic Splitting

import hashlib

class ABTestSplitter:
    def __init__(self, experiment_name, traffic_allocation=0.1, control_ratio=0.5):
        self.experiment_name = experiment_name
        self.traffic_allocation = traffic_allocation
        self.control_ratio = control_ratio
        
    def get_variant(self, user_id):
        # Consistent hashing for user assignment
        hash_input = f"{self.experiment_name}_{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest()[:8], 16)
        bucket = hash_value / (2**32)  # Normalize to [0,1)
        
        # Check if user is in experiment
        if bucket >= self.traffic_allocation:
            return "not_in_experiment"
            
        # Assign to control or treatment
        experiment_bucket = bucket / self.traffic_allocation
        if experiment_bucket < self.control_ratio:
            return "control"
        else:
            return "treatment"

# Usage
splitter = ABTestSplitter("model_v2_test", traffic_allocation=0.2, control_ratio=0.5)
variant = splitter.get_variant("user_12345")
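
Assignment bugs often surface as a sample ratio mismatch (SRM): the observed split deviates from the configured allocation. A chi-square goodness-of-fit test is a common guardrail; the sketch below assumes a simple two-variant experiment and uses a deliberately strict alpha, since SRM usually signals a bug rather than noise:

```python
from scipy.stats import chisquare

def check_sample_ratio_mismatch(control_count, treatment_count,
                                expected_control_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test for sample ratio mismatch (SRM).

    A very small p-value suggests the observed split deviates from the
    configured allocation, which usually indicates a bug in assignment
    or logging rather than a real effect.
    """
    total = control_count + treatment_count
    expected = [total * expected_control_ratio,
                total * (1 - expected_control_ratio)]
    stat, p_value = chisquare([control_count, treatment_count], f_exp=expected)
    return {'srm_detected': p_value < alpha, 'p_value': p_value}

# A 10,000-user experiment that drifted to a 52/48 split
result = check_sample_ratio_mismatch(5200, 4800)
```

An SRM alert should pause analysis entirely: results from a broken randomization are not salvageable by statistics downstream.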

Model Deployment and Monitoring

Feature Store Integration

import time

class ABTestModelServer:
    def __init__(self, control_model, treatment_model, splitter, metrics_logger):
        self.control_model = control_model
        self.treatment_model = treatment_model
        self.splitter = splitter
        # Any object exposing log_prediction(dict), e.g. a wrapper around
        # your event pipeline or experiment store
        self.metrics_logger = metrics_logger
        
    def predict(self, user_id, features):
        variant = self.splitter.get_variant(user_id)
        
        start_time = time.time()
        
        if variant == "treatment":
            prediction = self.treatment_model.predict(features)
            model_version = "treatment"
        else:
            # Control users and users outside the experiment both receive
            # the control model; only the "control" variant enters the analysis
            prediction = self.control_model.predict(features)
            model_version = "control"
            
        latency = time.time() - start_time
        
        # Log prediction and metadata
        self.metrics_logger.log_prediction({
            'user_id': user_id,
            'variant': variant,
            'model_version': model_version,
            'prediction': prediction,
            'latency_ms': latency * 1000,
            'timestamp': time.time()
        })
        
        return prediction, variant

Statistical Analysis Framework

Bayesian Analysis of A/B Tests

import pymc as pm  # PyMC >= 4 (successor to PyMC3)

def bayesian_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total):
    """
    Bayesian analysis for a conversion-rate A/B test
    """
    with pm.Model() as model:
        # Uniform Beta(1, 1) priors on the two conversion rates
        p_control = pm.Beta('p_control', alpha=1, beta=1)
        p_treatment = pm.Beta('p_treatment', alpha=1, beta=1)

        # Binomial likelihoods
        pm.Binomial('control_obs',
                    n=control_total,
                    p=p_control,
                    observed=control_conversions)
        pm.Binomial('treatment_obs',
                    n=treatment_total,
                    p=p_treatment,
                    observed=treatment_conversions)

        # Relative lift of treatment over control
        lift = pm.Deterministic('lift',
                                (p_treatment - p_control) / p_control)

        # Sample the posterior (returns an ArviZ InferenceData object)
        trace = pm.sample(2000, tune=1000)

    # Probability that the treatment outperforms control
    prob_positive = (trace.posterior['lift'] > 0).mean().item()

    return trace, prob_positive

Sequential Testing and Early Stopping

import numpy as np
from scipy import stats

class SequentialABTest:
    def __init__(self, alpha=0.05, beta=0.2, mde=0.05, max_n=10000):
        self.alpha = alpha
        self.beta = beta
        self.mde = mde
        # Planned fixed-horizon sample size across both variants,
        # e.g. 2 * calculate_sample_size(...)
        self.max_n = max_n
        self.data_points = []

    def add_observation(self, variant, outcome):
        self.data_points.append({'variant': variant, 'outcome': outcome})

    def obf_spending_function(self, information_fraction):
        # O'Brien-Fleming-style spending: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))
        z = stats.norm.ppf(1 - self.alpha / 2)
        return 2 * (1 - stats.norm.cdf(z / np.sqrt(information_fraction)))

    def should_stop(self):
        if len(self.data_points) < 100:  # Minimum sample before any interim look
            return False, "continue"

        control_outcomes = [d['outcome'] for d in self.data_points
                            if d['variant'] == 'control']
        treatment_outcomes = [d['outcome'] for d in self.data_points
                              if d['variant'] == 'treatment']

        if len(control_outcomes) < 50 or len(treatment_outcomes) < 50:
            return False, "continue"

        # Two-sample t-test at this interim look
        t_stat, p_value = stats.ttest_ind(treatment_outcomes, control_outcomes)

        # Spend alpha according to the fraction of planned information observed
        current_n = len(self.data_points)
        information_fraction = min(current_n / self.max_n, 1.0)
        adjusted_alpha = self.obf_spending_function(information_fraction)

        if p_value < adjusted_alpha:
            return True, "significant"
        elif current_n >= self.max_n:
            return True, "max_sample_reached"
        else:
            return False, "continue"
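
To make the O'Brien-Fleming spending concrete, this standalone sketch tabulates the approximate alpha threshold available at evenly spaced interim looks, using the common approximation alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))):

```python
import numpy as np
from scipy import stats

def obf_alpha(information_fraction, alpha=0.05):
    # Approximate O'Brien-Fleming spending: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))
    z = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z / np.sqrt(information_fraction)))

# Alpha available at four evenly spaced looks: the early thresholds are
# very conservative, preserving most of the alpha for the final analysis
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"information fraction {t:.2f}: alpha threshold {obf_alpha(t):.5f}")
```

At a quarter of the planned sample, the threshold is well below 0.0001, which is why O'Brien-Fleming boundaries are popular for experiments that might be stopped early only on overwhelming evidence.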

Model Performance Monitoring

Drift Detection

import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

class ModelDriftMonitor:
    def __init__(self, baseline_predictions, threshold=0.05, n_bins=50):
        self.baseline_predictions = np.asarray(baseline_predictions)
        self.threshold = threshold
        self.n_bins = n_bins

    def detect_prediction_drift(self, current_predictions):
        current_predictions = np.asarray(current_predictions)

        # Kolmogorov-Smirnov test for distribution shift
        ks_stat, ks_pvalue = ks_2samp(self.baseline_predictions,
                                      current_predictions)

        # Jensen-Shannon distance compares probability distributions, not raw
        # samples, so histogram both samples over a common support first
        lo = min(self.baseline_predictions.min(), current_predictions.min())
        hi = max(self.baseline_predictions.max(), current_predictions.max())
        bins = np.linspace(lo, hi, self.n_bins + 1)
        baseline_hist, _ = np.histogram(self.baseline_predictions, bins=bins)
        current_hist, _ = np.histogram(current_predictions, bins=bins)

        # jensenshannon normalizes its inputs and returns the JS *distance*
        # (the square root of the divergence)
        js_distance = jensenshannon(baseline_hist, current_hist)

        drift_detected = ks_pvalue < self.threshold or js_distance > 0.1

        return {
            'drift_detected': drift_detected,
            'ks_statistic': ks_stat,
            'ks_pvalue': ks_pvalue,
            'js_distance': js_distance
        }
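
Alongside KS and Jensen-Shannon, the Population Stability Index (PSI) is a common drift score in production scorecards. A self-contained sketch; the 0.1/0.25 thresholds mentioned in the docstring are rules of thumb, not hard cutoffs:

```python
import numpy as np

def population_stability_index(baseline, current, n_bins=10):
    """Population Stability Index between two samples of model scores.

    Bins come from the baseline distribution's quantiles; a common rule of
    thumb reads PSI < 0.1 as stable, 0.1-0.25 as moderate shift, and
    > 0.25 as significant shift.
    """
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover values outside baseline range

    baseline_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    current_frac = np.histogram(current, bins=edges)[0] / len(current)

    # Small floor avoids log(0) and division by zero in empty bins
    eps = 1e-6
    baseline_frac = np.clip(baseline_frac, eps, None)
    current_frac = np.clip(current_frac, eps, None)

    return float(np.sum((current_frac - baseline_frac)
                        * np.log(current_frac / baseline_frac)))

rng = np.random.default_rng(42)
baseline_scores = rng.normal(0.0, 1.0, 10_000)

psi_same = population_stability_index(baseline_scores,
                                      rng.normal(0.0, 1.0, 10_000))
psi_shifted = population_stability_index(baseline_scores,
                                         rng.normal(0.5, 1.0, 10_000))
```

Because PSI is a single interpretable number per feature or score, it works well as a dashboard metric, with the KS test reserved for formal alerting.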

Best Practices and Recommendations

Experiment Configuration

  • Use configuration files to manage experiment parameters and maintain reproducibility
  • Implement proper logging of all model predictions, user assignments, and outcomes
  • Set up automated alerts for significant performance degradation
  • Maintain separate environments for experiment development and production testing
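
One lightweight way to pin parameters down is a frozen dataclass, sketched below with illustrative field names; serializing the config next to the experiment's results makes later reruns reproducible, and immutability prevents accidental mid-test changes:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentConfig:
    """Immutable experiment configuration; field names are illustrative."""
    experiment_name: str
    traffic_allocation: float
    control_ratio: float
    primary_metric: str
    mde: float
    alpha: float = 0.05
    power: float = 0.8

config = ExperimentConfig(
    experiment_name="model_v2_test",
    traffic_allocation=0.2,
    control_ratio=0.5,
    primary_metric="conversion_rate",
    mde=0.05,
)

# Serialize alongside experiment results for reproducibility
config_json = json.dumps(asdict(config), sort_keys=True)
```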

Analysis and Reporting

  • Always report confidence intervals, not just point estimates
  • Include both practical and statistical significance in your conclusions
  • Perform robustness checks using different analytical approaches
  • Document assumption violations and their potential impact
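
For the first point, a minimal Wald interval for the difference in conversion rates; this is a normal approximation, reasonable at typical experiment sample sizes:

```python
import math

def diff_proportions_ci(x_c, n_c, x_t, n_t, z=1.96):
    """95% Wald confidence interval for the difference in conversion rates
    (treatment minus control). Normal approximation; fine for large n."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = diff_proportions_ci(2000, 10_000, 2150, 10_000)
print(f"lift: {diff:+.4f}, 95% CI: ({lo:+.4f}, {hi:+.4f})")
```

Reporting the interval rather than just the point estimate shows readers both the plausible size of the effect and whether the data rule out zero (here the interval excludes zero).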

Common Pitfalls to Avoid

  • Don't peek at results repeatedly without adjusting for multiple testing
  • Avoid changing experiment parameters mid-test without proper analysis
  • Don't ignore differences in model latency and computational costs
  • Ensure your randomization unit matches your analysis unit
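
The first pitfall can be demonstrated with a small Monte Carlo: under an A/A setup with no true effect, an unadjusted analyst who checks at ten interim looks sees "significance" far more often than the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_aa_test(n_per_arm=2000, looks=10):
    """One simulated A/A experiment: returns whether an unadjusted peeker
    ever sees p < 0.05 at any interim look, and whether the final look does."""
    a = rng.normal(size=n_per_arm)
    b = rng.normal(size=n_per_arm)
    checkpoints = np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int)
    p_values = [stats.ttest_ind(a[:k], b[:k]).pvalue for k in checkpoints]
    return min(p_values) < 0.05, p_values[-1] < 0.05

results = [run_aa_test() for _ in range(500)]
peeking_fpr = np.mean([r[0] for r in results])
final_fpr = np.mean([r[1] for r in results])
print(f"false-positive rate with peeking: {peeking_fpr:.3f}, "
      f"single final look: {final_fpr:.3f}")
```

The inflation is exactly what alpha-spending approaches such as the O'Brien-Fleming boundaries discussed earlier are designed to control.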

This framework provides a robust foundation for conducting ML A/B tests while maintaining statistical rigor and addressing the unique challenges of machine learning systems in production.
