InterSiteMatchedInterpolation#
- class uniharmony.interpolation.InterSiteMatchedInterpolation(alpha: float | tuple[float, float] | list[float] = 0.3, target_tolerance: float | None = None, covariate_tolerance: ArrayLike | None = None, k: int | Literal['max', 'average'] = 1, mode: Literal['pairwise', 'base_to_others'] = 'pairwise', *, concatenate: bool = True, random_state: int | RandomState | None = None)#
Inter-Site Matched Interpolation (ISMI) Harmonization.
This sampler performs cross-site matched interpolation to reduce spurious correlations between site and class labels by generating synthetic samples from matched samples across different sites.
The algorithm iterates through all unique site pairs. The matching strategy prioritizes target value matching first, then refines matches using categorical and continuous covariates.
The interpolation takes a base sample from one site and a matched sample from another (target) site. A synthetic sample is generated by interpolating between the base and target samples using a randomly sampled alpha value. The alpha parameter controls how close the synthetic sample is to the base or target site, allowing a flexible interpolation.
By matching the samples with similar target values and covariates across sites, ISMI creates synthetic samples that mantains the the underlying data distribution while reducing site-specific biases.
- Parameters:
- alphafloat or tuple[float, float], optional (default 0.3)
Interpolation weight(s). If a float, all interpolations use this constant alpha value. If a tuple
(min, max), alpha is sampled uniformly from[min, max]for each interpolation. Alpha must be in the range [0, 1]. Interpretation:alpha = 0.0: synthetic sample equals base site samplealpha = 0.5: synthetic sample is midpoint between sitesalpha = 1.0: synthetic sample equals target site sample
Default is 0.3 to keep synthetic samples closer to the base site.
- target_tolerancefloat or None, optional (default None)
Tolerance for target value matching. For classification tasks, this should be 0 or None (exact match required). For regression tasks, allows matching targets within
±target_tolerance. IfNone, exact matching is required. target_tolerance should be in the same units as the target variable (e.g., years for age prediction).- covariate_tolerancearray-like, shape (n_continuous,), optional (default None)
Maximum allowed absolute difference for continuous covariate matching. Must have one value per continuous covariate column. If
None, exact matching is required (covariate_tolerance=0). Covariate tolerance should be in the same units as the corresponding covariate (e.g., years for age).- kint or {“max”, “average”}, optional (default 1)
Number of matches to use for interpolation: -
int >= 1: Use exactlykmatches per sample. If fewer matchesare found, a warning is issued and all available matches are used.
"max": Use all available matches for each sample."average": Interpolate the base sample with the average of all matched samples from the target site.
- mode{“pairwise”, “base_to_others”}, optional (default “pairwise”)
Interpolation mode: -
"pairwise": Generate all unique site pairs. Fornsites,this produces
n*(n-1)/2pairs."base_to_others": Use each site as base once against all other sites combined. This mode only works with k=”average”, as the avergage of all matches for all sites is used as target sample.
- concatenatebool, optional (default True)
If
True, the output dataset includes both original and synthetic samples. IfFalse, only synthetic samples are returned.- random_stateint or np.random.RandomState or None, optional (default None)
The seed of the pseudo random number generator for reproducibility.
- Attributes:
- interpolator_SamplerMixin
The fitted interpolator instance.
- sites_resampled_ndarray of shape (n_samples_new,)
Site labels for the resampled dataset (original + synthetic).
- unmatched_samples_dict
Dictionary tracking unmatched samples per direction. Keys are tuples indicating the interpolation direction (source_site, target_site). For
base_to_othersmode, target is"others". Values are counts of source samples that found no valid matches.
Methods
fit(X, y, **params)Check inputs and statistics of the sampler.
fit_resample(X, y, sites, *[, ...])Fit and resample using cross-site matched interpolation.
get_metadata_routing()Get metadata routing of this object.
get_params([deep])Get parameters for this estimator.
set_fit_resample_request(*[, allow_nan, ...])Configure whether metadata should be requested to be passed to the
fit_resamplemethod.set_params(**params)Set the parameters of this estimator.
See also
IntraSiteInterpolationSite-wise class balancing without cross-site interpolation.
Notes
Alpha Reversal Strategy
For computational efficiency, the algorithm processes each unique site pair only once. For a pair (SiteA, SiteB):
Forward (A→B): samples alpha from
[alpha_min, alpha_max]Reverse (B→A): samples alpha from
[1-alpha_max, 1-alpha_min]
This mathematically emulates running B→A with the original alpha range but avoids redundant matching operations.
Matching Hierarchy
Matches are determined in the following priority order:
- Target value: Must match exacly for categorical variables
or within
target_toleranceif target is a contunuous variable.
Categorical covariates: Exact match required, checked in column order (first column must match, then second, etc.)
Continuous covariates: Must be within
covariate_tolerancefor each column
If no samples in the target site satisfy all criteria, the source sample is counted as unmatched and no synthetic sample is generated.
References
[1]Nieto, N., et al. (2026). Data harmonizing via interpolation applied to brain age prediction. Springer Nature. https://doi.org/10.1007/s44248-026-00100-7
Examples
>>> import numpy as np >>> from uniharmony.interpolation import InterSiteMatchedInterpolation >>> >>> # Generate sample data with 3 sites >>> rng = np.random.RandomState(42) >>> X = rng.randn(150, 10) >>> y = rng.randint(0, 2, 150) >>> sites = np.array(["A"] * 50 + ["B"] * 50 + ["C"] * 50) >>> >>> # Define covariates for matching >>> categorical_covariate = np.array([["M"], ["F"]] * 75) # Sex >>> continuous_covariate = rng.randint(20, 80, (150, 1)) # Age >>> tolerance = np.array([5.0]) # ±5 years tolerance >>> >>> # Create interpolator with pairwise mode and k=2 matches >>> ismi = InterSiteMatchedInterpolation( ... alpha=(0.2, 0.4), ... tolerance=tolerance, ... k=2, ... mode="pairwise", ... random_state=42, ... ) >>> >>> # Generate harmonized dataset >>> X_res, y_res = ismi.fit_resample( ... X, y, ... sites=sites, ... categorical_covariate=categorical_covariate, ... continuous_covariate=continuous_covariate, ... ) >>> >>> # Check unmatched samples >>> print(ismi.unmatched_samples_) {('A', 'B'): 3, ('B', 'A'): 2, ('A', 'C'): 5, ...}
- fit_resample(X: ArrayLike, y: ArrayLike, sites: ArrayLike, *, categorical_covariate: ArrayLike | None = None, continuous_covariate: ArrayLike | None = None, allow_nan=False) tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[Any]]]#
Fit and resample using cross-site matched interpolation.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Feature matrix containing input samples.
- yarray-like of shape (n_samples,)
Target values. Can be integer labels (classification) or continuous values (regression).
- sitesarray-like of shape (n_samples,)
Site or domain identifiers for each sample. Resampling is performed between different sites.
- categorical_covariatearray-like of shape (n_samples, n_categorical), default=None
Categorical covariates used for hierarchical matching (e.g., sex, diagnosis group). Matching proceeds in column order: the first column must match exactly, then the second, etc.
- continuous_covariatearray-like of shape (n_samples, n_continuous), default=None
Continuous covariates used for matching after categorical matching (e.g., age, education years). Matches must be within
tolerancefor each column.- allow_nanbool, default=None
If allow continuos and categorical covariates to present NaN or not
- Returns:
- X_resampledndarray of shape (n_samples_new, n_features)
The feature matrix after cross-site interpolation, containing both original and synthetic samples.
- y_resampledndarray of shape (n_samples_new,)
The corresponding targets after resampling.
- Site labels for the resampled dataset are stored in the attribute
sites_resampled_
- Raises:
- ValueError
If input arrays have incompatible shapes, if fewer than two unique sites are present, if covariates contain invalid values, or if parameters are inconsistent.
- set_fit_resample_request(*, allow_nan: bool | None | str = '$UNCHANGED$', categorical_covariate: bool | None | str = '$UNCHANGED$', continuous_covariate: bool | None | str = '$UNCHANGED$', sites: bool | None | str = '$UNCHANGED$') InterSiteMatchedInterpolation#
Configure whether metadata should be requested to be passed to the
fit_resamplemethod.Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with
enable_metadata_routing=True(seesklearn.set_config()). Please check the User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed tofit_resampleif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it tofit_resample.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
- Parameters:
- allow_nanstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
allow_nanparameter infit_resample.- categorical_covariatestr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
categorical_covariateparameter infit_resample.- continuous_covariatestr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
continuous_covariateparameter infit_resample.- sitesstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
sitesparameter infit_resample.
- Returns:
- selfobject
The updated object.
Examples#
Multisite Harmonization using Inter-Site Matched Interpolation (ISMI)