DistributionRegressor

Nonparametric conditional density estimation using LightGBM with a CDF-based approach.

$ pip install distribution-regressor

When model predictions drive automated decisions, a single number is not enough. Probabilistic prediction outputs the full distribution, revealing uncertainty, skew, and multimodality.

  • Tomorrow's temperature - point prediction: 73.4 F. Distribution: symmetric, low variance; the point estimate works well here.
  • Retail demand - point prediction: 142 units. Distribution: right-skewed with a long tail; the mean underestimates the risk of high demand.
  • Commute time - point prediction: 33.5 min. Distribution: bimodal (20 min or 50 min); the mean of 33.5 almost never happens.

How it works

The target range is discretized into a grid of threshold points. For each training sample and threshold, a binary target indicates whether the true value falls below that threshold. A single LightGBM model learns the conditional CDF F(τ|x) = P(Y ≤ τ | X = x), with a monotone increasing constraint on the threshold feature to ensure a valid CDF. The PMF is recovered by differencing.

  1. Grid - discretize the y-range into n_bins threshold points.
  2. Binary targets - z_ij = 1{y_i ≤ τ_j}.
  3. Expand - cross (x_i, τ_j) pairs with the binary targets.
  4. Train - a single LightGBM model with logistic loss and a monotone constraint on the threshold feature.

At inference, the model predicts F(τ|x) for all grid points, then differences the CDF to recover the probability mass function. This produces a full nonparametric distribution for every input - no distributional assumptions required.
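
The expand-and-difference idea can be sketched in a few lines of NumPy. This is a toy illustration, not the package internals: a hand-written logistic CDF stands in for the trained LightGBM model.

```python
import numpy as np

# Toy data: 4 samples with a scalar target
y = np.array([1.0, 2.5, 3.0, 4.2])
n_bins = 6
grid = np.linspace(y.min(), y.max(), n_bins)        # 1. threshold grid

# 2-3. expand: every sample crossed with every threshold, binary targets
tau = np.tile(grid, len(y))                          # threshold feature
z = (np.repeat(y, n_bins) <= tau).astype(int)        # z_ij = 1{y_i <= tau_j}

# 4. inference stand-in: a logistic CDF plays the role of the trained model
def predicted_cdf(center, thresholds, scale=0.5):
    return 1.0 / (1.0 + np.exp(-(thresholds - center) / scale))

F = predicted_cdf(2.0, grid)                         # F(tau | x) on the grid
pmf = np.diff(F, prepend=0.0)                        # difference CDF -> PMF
pmf /= pmf.sum()                                     # renormalize on the grid
```

The monotone constraint on the threshold feature is what guarantees that the differenced PMF is non-negative.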

Comparison with NGBoost: NGBoost uses natural gradient boosting to fit parameters of a parametric distribution (e.g., Normal, LogNormal). This package is nonparametric - it learns the density shape directly, so it can represent multimodal or irregular distributions without choosing a parametric family.

Installation

$ pip install distribution-regressor

Requires Python ≥ 3.8

From source

# Clone and install in development mode
git clone https://github.com/guyko81/DistributionRegressor.git
cd DistributionRegressor
pip install -e .

Dependencies

All dependencies are installed automatically via pip.

  • LightGBM ≥ 3.0.0
  • NumPy ≥ 1.20.0
  • Pandas ≥ 1.3.0
  • scikit-learn ≥ 1.0.0
  • SciPy ≥ 1.7.0
  • joblib ≥ 1.0.0

Verify installation

from distribution_regressor import DistributionRegressor
print("Ready!")

Quick Start

Install and start predicting distributions in under 10 lines of code. The defaults work well for most problems - just set n_bins and n_estimators.

from distribution_regressor import DistributionRegressor

# Initialize with defaults - this is all you need
model = DistributionRegressor(
    n_bins=50,           # grid resolution
    n_estimators=500,    # LightGBM trees
)

# Train
model.fit(X_train, y_train)

# Point predictions
y_mean = model.predict(X_test)           # expected value
y_mode = model.predict_mode(X_test)      # most likely value

# Uncertainty
intervals = model.predict_interval(X_test, confidence=0.9)
std = model.predict_std(X_test)

# Full distribution
grids, dists, offsets = model.predict_distribution(X_test)

# Evaluate density at specific points
density = model.pdf(X_test, y_test)
nll = model.nll(X_test, y_test)

Visualization

import matplotlib.pyplot as plt

# Predict distribution for a single sample
grids, dists, offsets = model.predict_distribution(X_test[0:1])

plt.plot(grids[0], dists[0], label='Predicted PDF')
plt.axvline(y_test[0], color='r', linestyle='--', label='True Value')
plt.legend()
plt.show()

Output smoothing

The raw predicted PMF can appear stepped or jagged at lower n_bins values. Applying Gaussian smoothing to the CDF before outputs are derived produces smoother densities. Smoothing can be set at construction or changed after training:

# Set at construction
model = DistributionRegressor(n_bins=100, output_smoothing=1)

# Or change after training without refitting
model.set_output_smoothing(output_smoothing=1.5)

# Preserve discrete mass points (e.g., exact zeros)
model.set_output_smoothing(output_smoothing=1, atom_values=0)
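
To see what the smoothing does numerically, here is a minimal sketch of the idea using scipy.ndimage.gaussian_filter1d; the package's exact smoothing and atom handling may differ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# A jagged PMF on a 10-point grid (toy example)
pmf = np.array([0.0, 0.05, 0.3, 0.0, 0.25, 0.1, 0.0, 0.2, 0.1, 0.0])
cdf = np.cumsum(pmf)

# Smooth the CDF with a Gaussian kernel (sigma in grid-index units),
# then difference back to a PMF - the idea behind output_smoothing
cdf_smooth = gaussian_filter1d(cdf, sigma=1.0, mode="nearest")
pmf_smooth = np.clip(np.diff(cdf_smooth, prepend=0.0), 0.0, None)
pmf_smooth /= pmf_smooth.sum()
```

Smoothing the CDF (rather than the PMF directly) keeps the result a valid distribution after differencing and renormalizing.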

Key Parameters

n_bins (default: 50)
Number of grid points for the CDF. Controls the resolution of the predicted distribution. Higher values capture finer detail but increase memory proportionally (n_samples × n_bins expanded rows). 50 is fine for most problems; use 100–200 for complex or wide-range targets.

n_estimators (default: 100)
Number of LightGBM boosting iterations. Since the model learns a CDF (a smooth function), it often benefits from more trees than a standard regression model. 500–1000 with a lower learning rate (0.01–0.05) typically gives the best results.

learning_rate (default: 0.1)
Boosting learning rate. Lower values (0.01–0.05) with more estimators generally produce better-calibrated distributions.

output_smoothing (default: 0)
Gaussian smoothing sigma (in grid-index units) applied to the predicted CDF before deriving outputs. Produces smoother densities. A value of 1–2 works well. Can be changed after training via set_output_smoothing() without refitting.

atom_values (default: None)
Discrete mass points whose CDF jumps should be preserved through smoothing. Useful when the target has exact values that appear frequently (e.g., zero for "store closed"). Only relevant when output_smoothing > 0.

**kwargs
Any LightGBM parameter: max_depth, num_leaves, subsample, colsample_bytree, min_child_samples, etc. These are passed directly to the underlying LGBMRegressor.
Recommended starting point: DistributionRegressor(n_bins=100, n_estimators=1000, learning_rate=0.01) works well across a wide range of problems. Increase n_bins for targets with very wide ranges or fine structure. Adjust LightGBM parameters (num_leaves, subsample) as you would for any gradient boosting model.

Advanced options

These options are available for specific use cases but are not needed for most problems. See the Advanced Options page for details.

use_base_model (default: False)
If True, first trains a base LightGBM regressor, then models the residual distribution. Can improve RMSE but may obscure multimodal structure. Not needed for most use cases.

monte_carlo_training (default: False)
If True, samples a small number of grid points per training sample instead of using the full grid. Reduces memory from n × n_bins to n × mc_samples. Use with mc_resample_freq for better quality.

mc_samples (default: 5)
Number of grid points per sample when monte_carlo_training=True. 20 is a good working value.

mc_resample_freq (default: 100)
Resample MC grid points every N trees. Lower values give better grid coverage (like SGD over threshold space). 100 is a reasonable starting point; 10 gives the best quality at higher overhead.

API Reference

Scikit-learn compatible interface with rich distributional methods.

Constructor

DistributionRegressor(
    n_bins=50,                 # grid resolution (higher = finer distributions)
    n_estimators=100,         # LightGBM trees
    learning_rate=0.1,        # boosting learning rate
    output_smoothing=0,       # Gaussian smoothing on CDF (0 = off)
    atom_values=None,         # preserve discrete mass points through smoothing
    random_state=None,        # random seed
    use_base_model=False,     # advanced: learn residual distribution
    monte_carlo_training=False, # advanced: memory-efficient MC sampling
    mc_samples=5,             # advanced: MC points per sample
    mc_resample_freq=100,     # advanced: resample grid points every N trees
    **kwargs                  # passed to LGBMRegressor
)

Training

model.fit(X, y, sample_weight=None)
Fit the model by learning F(τ|x) = P(Y ≤ τ | X = x). X can be a NumPy array or Pandas DataFrame. sample_weight applies per-sample weights to the expanded training set.
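
As a rough sketch of what this expansion implies (illustrative only, not the package's internal code), crossing a DataFrame with a threshold grid and repeating sample weights looks like:

```python
import numpy as np
import pandas as pd

# Two samples, hypothetical features, a 3-point threshold grid
X = pd.DataFrame({"f1": [0.1, 0.9], "f2": [3.0, 1.0]})
y = np.array([10.0, 20.0])
sample_weight = np.array([1.0, 2.0])
grid = np.array([5.0, 15.0, 25.0])

# Cross every row of X with every threshold
X_exp = X.loc[X.index.repeat(len(grid))].reset_index(drop=True)
X_exp["threshold"] = np.tile(grid, len(X))           # monotone-constrained feature
z = (np.repeat(y, len(grid)) <= X_exp["threshold"].to_numpy()).astype(int)
w_exp = np.repeat(sample_weight, len(grid))          # per-sample weight, per row
```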

Point Predictions

model.predict(X) / model.predict_mean(X)
Expected value E[Y|X] = ∑ grid × pmf. This is the distributional mean, computed from the full predicted PMF.
model.predict_mode(X)
Most likely value (peak of the predicted density). Useful for multimodal distributions where the mean may fall between modes.
model.predict_quantile(X, q=0.5)
Arbitrary quantile(s) via CDF inversion. Pass a scalar for a single quantile, or a list/array for multiple (e.g., q=[0.1, 0.5, 0.9]). Returns shape (n_samples,) or (n_samples, n_quantiles).
model.predict_interval(X, confidence=0.95)
Symmetric prediction interval [lower, upper]. Returns shape (n_samples, 2). Setting confidence=0.9 gives the 5th–95th percentile interval.
model.predict_std(X)
Standard deviation of the predicted distribution: √(E[Y²|X] - E[Y|X]²).
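
The mechanics these methods describe - CDF inversion for quantiles, and moments from the differenced PMF - can be sketched on a toy Uniform(0, 10) grid (an illustration, not the package code):

```python
import numpy as np

grid = np.linspace(0.0, 10.0, 101)
cdf = grid / 10.0                       # F(y) for a Uniform(0, 10) target

def quantile(q):
    """Inverse CDF via linear interpolation on the grid."""
    return np.interp(q, cdf, grid)

lower, upper = quantile(0.05), quantile(0.95)   # 90% central interval

# Moments from the differenced PMF, as predict_mean / predict_std describe
pmf = np.diff(cdf, prepend=0.0)
mean = (grid * pmf).sum()
std = np.sqrt((grid ** 2 * pmf).sum() - mean ** 2)
```

On this grid the recovered std is within about 1e-4 of the analytic Uniform(0, 10) value, 10/√12.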

Full Distribution

model.predict_distribution(X)
Returns (grids, distributions, base_offsets):
  • grids: (n_samples, n_bins) - per-sample grid points in absolute y-space
  • distributions: (n_samples, n_bins) - probability mass (sums to 1 per row)
  • base_offsets: (n_samples,) - base model predictions (zeros when use_base_model=False)

Density Evaluation

model.pdf(X, y, eps=1e-10)
Evaluate predicted density f(y|X) at specific y values via linear interpolation on the PMF grid. Returns eps for y values outside the grid support.
model.logpdf(X, y, eps=1e-10)
Log-density log f(y|X). Equivalent to np.log(model.pdf(X, y, eps)).
model.cdf(X, y)
Cumulative distribution F(y|X) = P(Y ≤ y | X). Returns 0 below grid support, 1 above.
model.ppf(X, q=0.5)
Percent point function (inverse CDF / quantile function). Alias for predict_quantile.
model.nll(X, y, eps=1e-10)
Mean negative log-likelihood - a proper scoring rule for density evaluation. Lower is better. Useful for model comparison and hyperparameter tuning.
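
A minimal sketch of the density-evaluation mechanics described above (illustrative, not the package code): interpolate the per-bin density at arbitrary y, floor values outside the grid support at eps, and average the negative log-density:

```python
import numpy as np

grid = np.linspace(0.0, 4.0, 5)          # bin width = 1.0
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
bin_width = grid[1] - grid[0]

def pdf(y, eps=1e-10):
    # PMF / bin width approximates a density; eps outside the support
    dens = np.interp(y, grid, pmf / bin_width, left=eps, right=eps)
    return np.maximum(dens, eps)

y_obs = np.array([2.0, 3.5, -1.0])       # -1.0 lies outside the grid support
dens = pdf(y_obs)
nll = -np.mean(np.log(dens))             # mean negative log-likelihood
```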

Post-Training Configuration

model.set_output_smoothing(output_smoothing=0, atom_values=None)
Change CDF smoothing after training without refitting. Affects all subsequent predictions. output_smoothing is the Gaussian sigma in grid-index units. atom_values specifies discrete mass points whose CDF jumps are preserved through smoothing.

Citation

@software{distributionregressor2025,
  title  = {DistributionRegressor: Nonparametric Distributional Regression},
  author = {Gabor Gulyas},
  year   = {2025},
  url    = {https://github.com/guyko81/DistributionRegressor}
}

Rossmann Store Sales

Predicting daily sales for a retail store. The model captures day-of-week patterns, promotional effects, and holiday uncertainty - including the hard zero on Sundays.

from distribution_regressor import DistributionRegressor

model = DistributionRegressor(
    n_bins=100, n_estimators=1000, learning_rate=0.01,
    random_state=42,
)
model.fit(train_data[features], train_data['Sales'])

# Prediction intervals for time series visualization
p10 = model.predict_quantile(X, 0.1)
p50 = model.predict_quantile(X, 0.5)
p90 = model.predict_quantile(X, 0.9)

Time series with uncertainty bands

[Figure: Rossmann sales time series with uncertainty bands]
Predicted sales distribution for the test period (last 5 weeks). Blue = actual sales, red bands = 10th–90th and 25th–75th percentile intervals. Sundays (dips to zero) are captured cleanly. The model naturally produces wider uncertainty bands for promotional days and tighter bands for regular patterns.

Individual distribution shapes

Each sample gets a full predicted density. The closed-day prediction collapses to a tight spike at zero, while open days show the characteristic sales distribution with appropriate spread depending on context (promo vs. normal weekday):

[Figure: Individual predicted densities for Rossmann samples]
Predicted p(Sales|x) for selected test days. The closed day shows a sharp spike at zero. Promo days have slightly wider spreads, while regular weekdays are tighter. Red dashed line = actual sales value.

California Housing (vs NGBoost)

Predicting the full house value distribution conditional on location and census features (8 features, 20,640 samples). Compared against NGBoost on test-set negative log-likelihood.

Benchmark results

Method                   Test NLL
DistributionRegressor    0.14
NGBoost (Normal)         0.56
Key takeaway: DistributionRegressor outperforms NGBoost (Normal) by 0.42 nats on California Housing, without requiring any distributional assumptions. House value distributions are often multimodal and bounded, which violates the Normal assumption that NGBoost relies on. Target is median house value in $100k (sklearn default scale).

The limitation of parametric models

NGBoost with a Normal distribution forces every prediction into a symmetric, unimodal bell curve. When the true conditional distribution is multimodal, skewed, or bounded, the parametric assumption breaks down and wastes probability mass on impossible values.

NGBoost can use more complex parametric families to mitigate this, but the user must choose the right family in advance. DistributionRegressor sidesteps this entirely - it learns whatever shape the data requires.

[Figure: DistributionRegressor vs NGBoost on California Housing data]
DistributionRegressor (blue) captures complex conditional densities that NGBoost's Normal (green dashed) cannot represent. The parametric model is forced into a symmetric bell curve regardless of the true data distribution.

Base Model (use_base_model)

Most users do not need this option. The default direct mode (use_base_model=False) learns the full conditional distribution end-to-end and works well for the vast majority of problems. The base model option exists for specific scenarios described below.

When use_base_model=True, the model first trains a standard LightGBM regressor for point predictions, computes out-of-fold residuals, and then learns the residual CDF rather than the raw target CDF. This means the distributional model only needs to capture the uncertainty around the point prediction rather than the full target range.

How it works

  1. Base model - train an LGBM regressor on the full target y.
  2. OOF residuals - compute 5-fold CV residuals r = y - ŷ.
  3. CDF on residuals - learn F(τ|x) on r with per-sample grids.
  4. Shift back - grids + base_pred gives the absolute distribution.
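
The residual pipeline can be sketched with scikit-learn's cross-validation utilities. Ridge stands in for the LightGBM base model here so the example is self-contained; the flow is otherwise the one described above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 2: out-of-fold residuals from 5-fold CV, so the residuals are
# not contaminated by in-sample overfitting
base = Ridge()
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(base, X, y, cv=kf)
residuals = y - oof_pred                 # the CDF model trains on these

# Step 4: at inference, per-sample grids are shifted by the base prediction
base.fit(X, y)
grid_centers = base.predict(X)
```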

When to consider it

use_base_model=False (default)

The CDF model learns p(y|x) directly. This is the recommended approach.

  • Simpler - single model, no residual computation
  • Preserves multimodal and complex distribution shapes
  • Cleaner theoretical foundation
  • Works well across all distribution types
  • Already produces good point predictions (mean, quantiles)

use_base_model=True

First fits a standard LightGBM for point predictions, then models the residual distribution.

  • May marginally improve RMSE on unimodal, well-behaved targets
  • Per-sample adaptive grid centered on base prediction
  • Higher grid resolution in the likely region
  • Trade-off: can obscure multimodal structure in the data
  • Trade-off: adds training complexity (5-fold CV + 2 models)

The trade-off

The base model approach learns a residual distribution centered near zero, which means the CDF model focuses its capacity on the uncertainty around the point prediction rather than the full target range. This can improve RMSE slightly, but adds complexity (5-fold CV + two models) and may introduce subtle shifts or smoothing in the predicted distribution shape.

In practice, the direct model (use_base_model=False) produces distributions that are precise enough for both point prediction and uncertainty quantification. The base model option adds complexity for a marginal RMSE gain that is rarely worth the trade-off.

Calibration and RMSE comparison

[Figure: Calibration comparison with and without base model]
Calibration plot and RMSE comparison on Rossmann test set. Both modes show comparable calibration. The baseline model offers slightly better RMSE on this unimodal dataset, but the difference is marginal.

Distribution shape on multimodal data

On California Housing data - where conditional distributions can be multimodal - both modes capture the overall shape, but the baseline model can shift or smooth out fine structure due to the residual centering:

[Figure: Direct vs baseline on multimodal data]
Direct model (blue) vs. baseline model (orange dashed) on California Housing data. Both capture the overall distribution shape, but the baseline model can introduce subtle shifts and smoothing artifacts from the residual transformation.

# Only use the base model when you specifically need it
model = DistributionRegressor(
    use_base_model=True,
    n_bins=100, n_estimators=1000, learning_rate=0.01,
)

Monte Carlo Training (monte_carlo_training)

By default, the training set is expanded to n_samples × n_bins rows: every sample is paired with every grid point. Monte Carlo training instead samples mc_samples uniform grid points per sample, reducing peak memory significantly. With periodic resampling (mc_resample_freq), different groups of trees train on different random grid points, improving coverage of the threshold space.

How it works

For each training sample, mc_samples threshold points are drawn uniformly from the grid range. The mc_resample_freq parameter controls how often fresh points are drawn:

  • mc_resample_freq=100 (default): Resample every 100 trees. With 1000 trees, the model trains on 10 different random grids. Low overhead.
  • mc_resample_freq=10: Resample every 10 trees. More grid coverage, but higher overhead from repeated dataset construction and LightGBM init_model calls.

Resampling works via LightGBM's incremental training (init_model): each group of trees continues from the previous model but trains on a freshly sampled expanded dataset. The idea is analogous to mini-batch SGD — each batch of trees sees a different random slice of the threshold space.
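
A sketch of the sampling schedule (illustrative; the package's actual draw may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bins, mc_samples = 1000, 100, 20
grid = np.linspace(0.0, 1.0, n_bins)

def sample_round():
    # Draw mc_samples grid points per training sample for one round
    idx = rng.integers(0, n_bins, size=(n_samples, mc_samples))
    return grid[idx]

# n_estimators=1000 with mc_resample_freq=100 -> 10 rounds, each a fresh
# random slice of the threshold space (the mini-batch analogy above)
rounds = [sample_round() for _ in range(1000 // 100)]
rows_per_round = n_samples * mc_samples  # 20k rows vs 100k for the full grid
```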

Memory comparison

monte_carlo_training=False (default)

Each sample is crossed with all grid points.

100k samples × 100 bins = 10M rows

1M samples × 200 bins = 200M rows (may exceed RAM)

monte_carlo_training=True

Each sample gets mc_samples points per resampling round.

100k samples × 20 MC points = 2M rows

1M samples × 20 MC points = 20M rows (manageable)

When to use it

MC training is useful when full grid expansion would exceed available memory — typically with large training sets (100k+ samples) and high n_bins (100+). For small to medium datasets, the full grid is simpler and fast enough.

Resampling (mc_resample_freq) significantly improves MC quality compared to single-shot MC. In our benchmarks on the Rossmann dataset, MC with mc_resample_freq=10 produced distributions visually close to the full grid, with comparable NLL. Results may vary by dataset — we recommend comparing on your own data.

# Full grid (default) - simple, good for small/medium datasets
model = DistributionRegressor(n_bins=100)

# MC with resampling - for large datasets where memory is tight
model = DistributionRegressor(
    monte_carlo_training=True,
    mc_samples=20,
    mc_resample_freq=100,   # resample grid points every 100 trees
    n_bins=100,
)

# MC with frequent resampling - better grid coverage, more overhead
model = DistributionRegressor(
    monte_carlo_training=True,
    mc_samples=20,
    mc_resample_freq=10,    # resample every 10 trees
    n_bins=100,
)

Quality comparison

The chart below compares the full grid against MC with two resampling frequencies on the Rossmann dataset. With mc_resample_freq=10, MC closely tracks the full grid distributions:

[Figure: Full grid vs MC resampling comparison]
Predicted densities on Rossmann test samples. Blue = full grid, orange = MC K=20 with freq=100, green = MC K=20 with freq=10.

Tuning Guide

Practical guidance for getting the best results from DistributionRegressor.

Step 1: Start with good defaults

For most problems, the following configuration works well out of the box:

model = DistributionRegressor(
    n_bins=100,
    n_estimators=1000,
    learning_rate=0.01,
    random_state=42,
)

Step 2: Adjust n_bins for your target range

n_bins controls the resolution of the predicted distribution. Think of it as the number of "buckets" the model uses to represent the density.

  • Narrow, well-behaved targets (temperature, standardized scores): 50
  • Moderate range (retail sales, housing prices): 100
  • Wide range or fine structure (house prices, medical costs): 150–200
  • Very wide or heavy-tailed (insurance claims, financial): 200+ (consider log-transforming the target first)

Higher n_bins gives finer resolution but increases memory linearly. If memory is tight, combine higher n_bins with monte_carlo_training=True and mc_resample_freq (see Monte Carlo Training).

Step 3: Tune the LightGBM parameters

Since DistributionRegressor uses LightGBM under the hood, all standard LGBM tuning strategies apply. Pass any parameter through **kwargs:

model = DistributionRegressor(
    n_bins=100,
    n_estimators=1000,
    learning_rate=0.01,
    # Standard LightGBM parameters
    num_leaves=63,
    max_depth=-1,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_samples=20,
)

Step 4: Evaluate with proper scoring rules

Use negative log-likelihood (NLL) as your primary evaluation metric for the distribution quality. Lower NLL means better-calibrated, sharper distributions. Use RMSE or MAE alongside NLL to evaluate point prediction quality:

# Distribution quality (proper scoring rule)
nll = model.nll(X_test, y_test)

# Point prediction quality
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5  # works on all sklearn versions

# Calibration check: does 90% interval contain ~90% of true values?
iv = model.predict_interval(X_test, confidence=0.9)
coverage = ((y_test >= iv[:, 0]) & (y_test <= iv[:, 1])).mean()
print(f"90% interval coverage: {coverage:.1%}")

Step 5: Apply output smoothing if needed

If the predicted densities look too jagged (especially at lower n_bins), apply Gaussian smoothing. You can experiment with this after training without refitting:

# Try different smoothing levels
for sigma in [0, 0.5, 1, 2]:
    model.set_output_smoothing(sigma)
    nll = model.nll(X_test, y_test)
    print(f"sigma={sigma}: NLL={nll:.4f}")

# If target has discrete atoms (e.g., exact zeros), preserve them
model.set_output_smoothing(output_smoothing=1, atom_values=0)

Common pitfalls:
  • Too few bins: If n_bins is much smaller than the effective range of your target, the predicted distribution will be coarse. Increase n_bins or transform the target.
  • Too few estimators: CDF learning benefits from many small steps. If distributions look underfit (too uniform or too wide), increase n_estimators or reduce learning_rate.
  • Outliers: Extreme outliers stretch the grid range, wasting resolution. Consider clipping or transforming the target before fitting.
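
For the outlier/heavy-tail case, a log-transform workflow can be sketched as follows. Quantiles commute with monotone transforms, so back-transformed quantiles remain valid quantiles of the raw target; the same is not true of the mean:

```python
import numpy as np

# Heavy-tailed toy target; fit the model on log1p(y) instead of y
rng = np.random.default_rng(0)
y = rng.lognormal(mean=2.0, sigma=1.0, size=10_000)
y_log = np.log1p(y)

# Stand-in for model-predicted log-space quantiles
q_log = np.quantile(y_log, [0.1, 0.5, 0.9])
q_raw = np.expm1(q_log)                  # back in original units

# Note: expm1 of a log-space mean is NOT the mean of y - only
# quantiles transfer through the monotone back-transform
```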