Concepts

This page is a stable, linkable reference for SPFlow concepts (separate from the API reference and notebooks).

Shapes and Dimensions

SPFlow modules use a consistent internal shape convention: (features, channels, repetitions). You will often see this displayed as D (features), C (channels), and R (repetitions) in model.to_str() output.

Terminology

  • Features (D): Number of random variables represented by the module (usually len(scope)).

  • Channels (C): Parallel distributions/mixture channels computed in one forward pass.

  • Repetitions (R): Independent parameterizations of the same structure.

Where shapes appear

  • Input data: data is shaped (batch, num_features).

  • Log-likelihood outputs: log_likelihood returns a tensor of shape (batch, out_features, out_channels).

  • Module metadata: spflow.modules.module_shape.ModuleShape stores (features, channels, repetitions).
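
To make the convention concrete, here is a minimal sketch, assuming model is an SPFlow module whose scope covers 5 features:

import torch

data = torch.randn(8, 5)  # (batch, num_features)
log_ll = model.log_likelihood(data)

print(log_ll.shape)  # (batch, out_features, out_channels); for a model rooted in a single node this is often (8, 1, 1)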

Practical tips

  • Use model.to_str() to sanity-check shapes and scopes end-to-end.

  • If data.shape[1] != len(model.scope), check your leaf scopes first.
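
Both checks can be scripted; a minimal sketch, reusing the model and data from the shape example above:

print(model.to_str())  # shows features (D), channels (C), and repetitions (R) for each module

assert data.shape[1] == len(model.scope), "data columns do not match the model scope"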

Scopes and Decomposability

A scope identifies which input variables (features) a module operates on. Scopes enforce the structural constraints that make inference tractable.

What is a scope?

Use spflow.meta.Scope to describe feature indices:

from spflow.meta import Scope

scope = Scope([0, 1, 2])

Rules of thumb

  • Sum nodes combine inputs with the same scope.

  • Product nodes combine inputs with disjoint scopes (decomposability / independence assumption).

Common failure modes

  • Scope mismatch in a Sum: you mixed modules that do not cover the same variables.

  • Overlapping scopes in a Product: you combined two modules that both model the same feature(s).
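
A minimal sketch of both rules, assuming the feature indices of a Scope can be read back via a query attribute (check the API reference):

from spflow.meta import Scope

left = Scope([0, 1])
right = Scope([2, 3])
same = Scope([0, 1])

assert set(left.query).isdisjoint(set(right.query))  # disjoint scopes: valid inputs for a Product
assert set(left.query) == set(same.query)            # identical scopes: valid inputs for a Sum

Combining left and same under a Product would overlap on features 0 and 1; combining left and right under a Sum would be a scope mismatch.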

Missing Data and Evidence

SPFlow uses NaN-based evidence: missing values are represented with torch.nan. This makes it easy to mix observed and unobserved variables in the same tensor.

Log-likelihood with missing data

When computing likelihoods, NaN entries are treated as “unknown” variables to marginalize out:

import torch

data = torch.randn(32, 5)
data[0, 2] = float("nan")  # feature 2 missing for sample 0

log_ll = model.log_likelihood(data)
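
Assuming model is a trained SPFlow module over these 5 features, the NaN entry is marginalized rather than propagated, so the result for sample 0 should still be a valid log-likelihood:

print(log_ll.shape)  # (batch, out_features, out_channels), with batch == 32 here
assert not torch.isnan(log_ll).any()  # the missing value was marginalized, not propagated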

Conditional sampling with evidence

For conditional sampling, you can provide an evidence tensor in which NaN entries mark the variables to be sampled and non-NaN entries are treated as observed evidence:

evidence = torch.full((10, num_features), float("nan"))
evidence[:, 0] = 0.5  # condition on feature 0

samples = model.sample_with_evidence(evidence=evidence)
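
A quick sanity check, assuming num_features equals len(model.scope) and that observed evidence values are passed through to the output unchanged (an assumption worth verifying against the API reference):

assert not torch.isnan(samples).any()  # every NaN position was filled with a sampled value
assert torch.all(samples[:, 0] == 0.5)  # the conditioned feature keeps its observed value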

Caching and Dispatch

Probabilistic circuits are DAGs, and many operations reuse subcomputations. SPFlow provides a lightweight caching mechanism to avoid redundant work during inference, learning, and sampling.

Cache basics

  • Use spflow.utils.cache.Cache to memoize intermediate results across a single traversal.

  • Many modules use the spflow.utils.cache.cached() decorator for operations like log_likelihood.

  • spflow.utils.cache.Cache also provides Cache.extras for storing custom, user-defined information that should be available throughout a recursive traversal.
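
A minimal sketch of the intended usage; how a cache is handed to log_likelihood is an assumption here (the cache keyword argument and the dict-like extras access should be verified against the API reference):

from spflow.utils.cache import Cache

cache = Cache()
cache.extras["run_id"] = "debug-01"  # hypothetical: custom info carried through the traversal

log_ll_first = model.log_likelihood(data, cache=cache)  # first call fills the cache
log_ll_again = model.log_likelihood(data, cache=cache)  # later calls may reuse memoized results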

When you should care

  • Repeatedly calling log_likelihood on the same model inside a loop can be faster if you reuse a cache.

  • Debugging unexpected values is easier if you can control whether cached results are reused.
