Concepts

This page is a stable, linkable reference for SPFlow concepts (separate from the API reference and notebooks).

See also SOCS for details on signed circuits and compatibility.

Shapes and Dimensions

SPFlow modules use a consistent internal shape convention: (features, channels, repetitions). You will often see this displayed as D (features), C (channels), and R (repetitions) in model.to_str() output.

Terminology

  • Features (D): Number of random variables represented by the module (usually len(scope)).

  • Channels (C): Parallel distributions/mixture channels computed in one forward pass.

  • Repetitions (R): Independent parameterizations of the same structure.

Where shapes appear

  • Input data: data is shaped (batch, num_features).

  • Log-likelihood outputs: log_likelihood returns (batch, out_features, out_channels, repetitions).

  • Module metadata: spflow.modules.module_shape.ModuleShape stores (features, channels, repetitions).
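
These shapes compose end-to-end. A minimal sketch, assuming model is an already-built SPFlow module over 5 features:

import torch

data = torch.randn(32, 5)  # input: (batch, num_features)
log_ll = model.log_likelihood(data)

print(log_ll.shape)  # output: (batch, out_features, out_channels, repetitions)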

Practical tips

  • Use model.to_str() to sanity-check shapes and scopes end-to-end.

  • If data.shape[1] != len(model.scope), check your leaf scopes first.
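
Both tips in code, assuming model and data are already defined:

print(model.to_str())  # displays D/C/R shapes and scopes per module

assert data.shape[1] == len(model.scope), "data width does not match the model scope"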

Scopes and Decomposability

A scope identifies which input variables (features) a module operates on. Scopes enforce the structural constraints that make inference tractable.

What is a scope?

Use spflow.meta.Scope to describe feature indices:

from spflow.meta import Scope

scope = Scope([0, 1, 2])  # this module covers features 0, 1, and 2

Rules of thumb

  • Sum nodes combine inputs with the same scope.

  • Product nodes combine inputs with disjoint scopes (decomposability / independence assumption).
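
For illustration, here are both rules stated over plain index sets before wrapping them in Scope (a minimal sketch using only spflow.meta.Scope):

from spflow.meta import Scope

# Product children: scopes must be disjoint.
left_vars, right_vars = [0, 1], [2, 3]
assert set(left_vars).isdisjoint(right_vars)
left, right = Scope(left_vars), Scope(right_vars)

# Sum children: scopes must be identical, e.g. every child of a Sum
# over features 0 and 1 would use Scope([0, 1]).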

Common failure modes

  • Scope mismatch in a Sum: you mixed modules that do not cover the same variables.

  • Overlapping scopes in a Product: you combined two modules that both model the same feature(s).

Missing Data and Evidence

SPFlow uses NaN-based evidence: missing values are represented with torch.nan. This makes it easy to mix observed and unobserved variables in the same tensor.

Log-likelihood with missing data

When computing likelihoods, NaN entries are treated as “unknown” variables to marginalize out:

import torch

data = torch.randn(32, 5)  # model is assumed to cover these 5 features
data[0, 2] = float("nan")  # feature 2 missing for sample 0

log_ll = model.log_likelihood(data)  # NaN entries are marginalized out

Conditional sampling with evidence

For conditional sampling, you can provide an evidence tensor where NaNs indicate values to sample:

num_features = len(model.scope)  # evidence must span the model's full scope
evidence = torch.full((10, num_features), float("nan"))
evidence[:, 0] = 0.5  # condition on feature 0; NaN entries get sampled

samples = model.sample_with_evidence(evidence=evidence)

Differentiable Sampling

The main public sampling APIs are sample, sample_with_evidence, and mpe. All three accept return_leaf_params=True, in which case they return (samples, leaf_param_records).
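
A sketch of these entry points (the keyword names follow this page, but the exact signatures, including num_samples, are assumptions; check the API reference):

# Unconditional sampling.
samples = model.sample(num_samples=10)

# Same call, additionally returning per-leaf parameter records.
samples, leaf_param_records = model.sample(num_samples=10, return_leaf_params=True)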

SPFlow also contains differentiable routing/sampling paths for selected modules and leaves. This is an advanced interface and is not uniformly supported across all components. For APC models specifically:

  • Inference APIs (encode/decode/sampling/likelihood) are available.

  • Model objective APIs (AutoencodingPC.loss_components / loss) are available.

  • Latent stats are available from APC encoders via exact selected latent leaf parameters. Exact latent KL/stat extraction is supported for Normal, Bernoulli, Binomial, and Categorical leaves against fixed canonical priors; unsupported latent families raise explicit errors (no fallback path).

  • Trainer helpers in spflow.zoo.apc.train support lightweight training loops.

Caching and Dispatch

Probabilistic circuits are DAGs, and many operations reuse subcomputations. SPFlow provides a lightweight caching mechanism to avoid redundant work during inference, learning, and sampling.

Cache basics

  • Use spflow.utils.cache.Cache to memoize intermediate results across a single traversal.

  • Many modules use the spflow.utils.cache.cached() decorator for operations like log_likelihood.

  • spflow.utils.cache.Cache also provides Cache.extras for storing custom, user-defined information that should be available throughout a recursive traversal.
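
A sketch of reusing one cache across repeated queries, as suggested below (passing cache= to log_likelihood is an assumption here; check the API reference for how modules accept a cache, and assume model and data as in the earlier examples):

from spflow.utils.cache import Cache

cache = Cache()
cache.extras["run_id"] = "debug-01"  # custom info available during traversal

for _ in range(3):
    # Repeated queries on the same data can reuse memoized intermediates.
    log_ll = model.log_likelihood(data, cache=cache)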

When you should care

  • Repeatedly calling log_likelihood on the same model inside a loop can be faster if you reuse a cache.

  • Debugging unexpected values is easier if you can control whether cached results are reused.
