InterSiteMatchedInterpolation#

class uniharmony.interpolation.InterSiteMatchedInterpolation(alpha: float | tuple[float, float] | list[float] = 0.3, target_tolerance: float | None = None, covariate_tolerance: ArrayLike | None = None, k: int | Literal['max', 'average'] = 1, mode: Literal['pairwise', 'base_to_others'] = 'pairwise', *, concatenate: bool = True, random_state: int | RandomState | None = None)#

Inter-Site Matched Interpolation (ISMI) Harmonization.

This sampler performs cross-site matched interpolation to reduce spurious correlations between site and class labels by generating synthetic samples from matched samples across different sites.

The algorithm iterates through all unique site pairs. The matching strategy prioritizes target value matching first, then refines matches using categorical and continuous covariates.

The interpolation takes a base sample from one site and a matched sample from another (target) site. A synthetic sample is generated by interpolating between the base and target samples using a randomly sampled alpha value. The alpha parameter controls how close the synthetic sample is to the base or target site, allowing a flexible interpolation.

By matching the samples with similar target values and covariates across sites, ISMI creates synthetic samples that mantains the the underlying data distribution while reducing site-specific biases.

Parameters:
alphafloat or tuple[float, float], optional (default 0.3)

Interpolation weight(s). If a float, all interpolations use this constant alpha value. If a tuple (min, max), alpha is sampled uniformly from [min, max] for each interpolation. Alpha must be in the range [0, 1]. Interpretation:

  • alpha = 0.0: synthetic sample equals base site sample

  • alpha = 0.5: synthetic sample is midpoint between sites

  • alpha = 1.0: synthetic sample equals target site sample

Default is 0.3 to keep synthetic samples closer to the base site.

target_tolerancefloat or None, optional (default None)

Tolerance for target value matching. For classification tasks, this should be 0 or None (exact match required). For regression tasks, allows matching targets within ±target_tolerance. If None, exact matching is required. target_tolerance should be in the same units as the target variable (e.g., years for age prediction).

covariate_tolerancearray-like, shape (n_continuous,), optional (default None)

Maximum allowed absolute difference for continuous covariate matching. Must have one value per continuous covariate column. If None, exact matching is required (covariate_tolerance=0). Covariate tolerance should be in the same units as the corresponding covariate (e.g., years for age).

kint or {“max”, “average”}, optional (default 1)

Number of matches to use for interpolation: - int >= 1: Use exactly k matches per sample. If fewer matches

are found, a warning is issued and all available matches are used.

  • "max": Use all available matches for each sample.

  • "average": Interpolate the base sample with the average of all matched samples from the target site.

mode{“pairwise”, “base_to_others”}, optional (default “pairwise”)

Interpolation mode: - "pairwise": Generate all unique site pairs. For n sites,

this produces n*(n-1)/2 pairs.

  • "base_to_others": Use each site as base once against all other sites combined. This mode only works with k=”average”, as the avergage of all matches for all sites is used as target sample.

concatenatebool, optional (default True)

If True, the output dataset includes both original and synthetic samples. If False, only synthetic samples are returned.

random_stateint or np.random.RandomState or None, optional (default None)

The seed of the pseudo random number generator for reproducibility.

Attributes:
interpolator_SamplerMixin

The fitted interpolator instance.

sites_resampled_ndarray of shape (n_samples_new,)

Site labels for the resampled dataset (original + synthetic).

unmatched_samples_dict

Dictionary tracking unmatched samples per direction. Keys are tuples indicating the interpolation direction (source_site, target_site). For base_to_others mode, target is "others". Values are counts of source samples that found no valid matches.

Methods

fit(X, y, **params)

Check inputs and statistics of the sampler.

fit_resample(X, y, sites, *[, ...])

Fit and resample using cross-site matched interpolation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_resample_request(*[, allow_nan, ...])

Configure whether metadata should be requested to be passed to the fit_resample method.

set_params(**params)

Set the parameters of this estimator.

See also

IntraSiteInterpolation

Site-wise class balancing without cross-site interpolation.

Notes

Alpha Reversal Strategy

For computational efficiency, the algorithm processes each unique site pair only once. For a pair (SiteA, SiteB):

  • Forward (A→B): samples alpha from [alpha_min, alpha_max]

  • Reverse (B→A): samples alpha from [1-alpha_max, 1-alpha_min]

This mathematically emulates running B→A with the original alpha range but avoids redundant matching operations.

Matching Hierarchy

Matches are determined in the following priority order:

  1. Target value: Must match exacly for categorical variables

    or within target_tolerance if target is a contunuous variable.

  2. Categorical covariates: Exact match required, checked in column order (first column must match, then second, etc.)

  3. Continuous covariates: Must be within covariate_tolerance for each column

If no samples in the target site satisfy all criteria, the source sample is counted as unmatched and no synthetic sample is generated.

References

[1]

Nieto, N., et al. (2026). Data harmonizing via interpolation applied to brain age prediction. Springer Nature. https://doi.org/10.1007/s44248-026-00100-7

Examples

>>> import numpy as np
>>> from uniharmony.interpolation import InterSiteMatchedInterpolation
>>>
>>> # Generate sample data with 3 sites
>>> rng = np.random.RandomState(42)
>>> X = rng.randn(150, 10)
>>> y = rng.randint(0, 2, 150)
>>> sites = np.array(["A"] * 50 + ["B"] * 50 + ["C"] * 50)
>>>
>>> # Define covariates for matching
>>> categorical_covariate = np.array([["M"], ["F"]] * 75)  # Sex
>>> continuous_covariate = rng.randint(20, 80, (150, 1))  # Age
>>> tolerance = np.array([5.0])  # ±5 years tolerance
>>>
>>> # Create interpolator with pairwise mode and k=2 matches
>>> ismi = InterSiteMatchedInterpolation(
...     alpha=(0.2, 0.4),
...     tolerance=tolerance,
...     k=2,
...     mode="pairwise",
...     random_state=42,
... )
>>>
>>> # Generate harmonized dataset
>>> X_res, y_res = ismi.fit_resample(
...     X, y,
...     sites=sites,
...     categorical_covariate=categorical_covariate,
...     continuous_covariate=continuous_covariate,
... )
>>>
>>> # Check unmatched samples
>>> print(ismi.unmatched_samples_)
{('A', 'B'): 3, ('B', 'A'): 2, ('A', 'C'): 5, ...}
fit_resample(X: ArrayLike, y: ArrayLike, sites: ArrayLike, *, categorical_covariate: ArrayLike | None = None, continuous_covariate: ArrayLike | None = None, allow_nan=False) tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[Any]]]#

Fit and resample using cross-site matched interpolation.

Parameters:
Xarray-like of shape (n_samples, n_features)

Feature matrix containing input samples.

yarray-like of shape (n_samples,)

Target values. Can be integer labels (classification) or continuous values (regression).

sitesarray-like of shape (n_samples,)

Site or domain identifiers for each sample. Resampling is performed between different sites.

categorical_covariatearray-like of shape (n_samples, n_categorical), default=None

Categorical covariates used for hierarchical matching (e.g., sex, diagnosis group). Matching proceeds in column order: the first column must match exactly, then the second, etc.

continuous_covariatearray-like of shape (n_samples, n_continuous), default=None

Continuous covariates used for matching after categorical matching (e.g., age, education years). Matches must be within tolerance for each column.

allow_nanbool, default=None

If allow continuos and categorical covariates to present NaN or not

Returns:
X_resampledndarray of shape (n_samples_new, n_features)

The feature matrix after cross-site interpolation, containing both original and synthetic samples.

y_resampledndarray of shape (n_samples_new,)

The corresponding targets after resampling.

Site labels for the resampled dataset are stored in the attribute sites_resampled_
Raises:
ValueError

If input arrays have incompatible shapes, if fewer than two unique sites are present, if covariates contain invalid values, or if parameters are inconsistent.

set_fit_resample_request(*, allow_nan: bool | None | str = '$UNCHANGED$', categorical_covariate: bool | None | str = '$UNCHANGED$', continuous_covariate: bool | None | str = '$UNCHANGED$', sites: bool | None | str = '$UNCHANGED$') InterSiteMatchedInterpolation#

Configure whether metadata should be requested to be passed to the fit_resample method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit_resample if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit_resample.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
allow_nanstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for allow_nan parameter in fit_resample.

categorical_covariatestr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for categorical_covariate parameter in fit_resample.

continuous_covariatestr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for continuous_covariate parameter in fit_resample.

sitesstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sites parameter in fit_resample.

Returns:
selfobject

The updated object.

Examples#

Multisite Harmonization using Inter-Site Matched Interpolation (ISMI)

Multisite Harmonization using Inter-Site Matched Interpolation (ISMI)