Inter-Site Matched Interpolation (ISMI)#

Inter-Site Matched Interpolation (ISMI) is a data harmonization technique that generates synthetic training samples by interpolating between participants matched across different data acquisition sites or scanners. Unlike traditional harmonization methods that remove site effects, ISMI explicitly models cross-site variation by creating intermediate samples between matched pairs, aiming to improve model generalization to unseen scanners and datasets.

This method is particularly effective when:

  • Training data comes from multiple sites with different equipment/protocols

  • You need to balance site representation without discarding real data

  • Site-specific covariate distributions (age, sex) differ between locations

  • Traditional ComBat harmonization removes signal you want to preserve

Mathematical Formulation#

Given two sites \(S_i\) and \(S_j\), for each sample \(\mathbf{x}_k \in S_i\) with target \(y_k\), ISMI finds matches in \(S_j\) where \(y_m \approx y_k\) (and optionally matching categorical/continuous covariates). Synthetic samples are generated via:

\[\mathbf{x}_{\text{synth}} = \mathbf{x}_k + \alpha (\mathbf{x}_m - \mathbf{x}_k)\]
\[y_{\text{synth}} = y_k + \alpha (y_m - y_k) \quad \text{(regression)}\]
\[y_{\text{synth}} = y_k \quad \text{(classification)}\]

Where \(\alpha \in (0, 1)\) controls interpolation strength (default: 0.3). An alpha of approximately 0 will generate a interpolated sample more similar to the base sample. On the other hand, a value closer to 1 will generate an interpolated sample more similar to the target sample.

For regression, targets are interpolated continuously. For classification, targets remain discrete while features are interpolated.


Basic Usage#

from uniharmony.interpolation import InterSiteMatchedInterpolation
import numpy as np

# Your multi-site data
X = np.random.randn(300, 50)  # Features
y = np.random.randint(0, 2, 300)  # Binary targets
sites = np.array(["Site_A"]*100 + ["Site_B"]*100 + ["Site_C"]*100)

# Initialize and fit
ismi = InterSiteMatchedInterpolation(
    alpha=0.3,
    k=2,
    random_state=42
)

X_harmonized, y_harmonized = ismi.fit_resample(X, y, sites=sites)

print(f"Original: {len(X)} samples")
print(f"Harmonized: {len(X_harmonized)} samples")
print(f"Unmatched by direction: {ismi.unmatched_samples_}")

Advanced Usage with Covariates#

For strict matching on demographics (e.g., exact sex match, age within 5 years):

Covariates#

sex = np.array([["M"], ["F"], ...])  # Categorical
age = np.array([[25], [34], ...])    # Continuous

ismi = InterSiteMatchedInterpolation(
    alpha=(0.2, 0.6),           # Random interpolation per sample
    k="average",                 # Interpolate toward mean of matches
    target_tolerance=None,       # Exact class match required
    covariate_tolerance=np.array([5.0]),  # Age within 5 years
    mode="pairwise",
    random_state=42
)

X_res, y_res = ismi.fit_resample(
    X, y, sites=sites,
    categorical_covariate=sex,
    continuous_covariate=age
)

Understanding Unmatched Samples#

After fitting, ismi.unmatched_samples_ contains a dictionary mapping (source_site, target_site) to counts of samples that couldn’t be matched. Important: Asymmetry is Expected The unmatched matrix is directional and typically asymmetric:

{
    ('Site_A', 'Site_B'): 45,   # 45 samples from A couldn't match in B
    ('Site_B', 'Site_A'): 40,   # 40 samples from B couldn't match in A
    ('Site_A', 'Site_C'): 12,
    ('Site_C', 'Site_A'): 8,
    ...
}

Algorithm Modes#

Pairwise Mode (mode="pairwise")#

Creates synthetic samples for every site pair combination (A↔B, A↔C, B↔C). Best for: Understanding specific site-to-site relationships When you want maximum synthetic diversity

Base-to-Others Mode (mode="base_to_others")#

Each site is interpolated against all other sites combined. Forces k=”average”. Best for: Reducing computational cost with many sites Creating “site vs. population” harmonization

Implementation Notes#

Classification: Targets are preserved exactly for synthetic samples (no interpolation of labels) Regression: Targets are interpolated continuously between matched pairs Memory: With k=”max” and many sites, memory usage scales as O(n²). Use k=1 or k=5 for large datasets Validation: Always check unmatched_samples_ after fitting. High unmatched counts (>50%) indicate poor site compatibility for the chosen matching criteria

Reference Implementation#

This implementation generalizes the method described in:

Key differences from the paper:

  • The original implementation was specific to brain age prediction (regression)

  • This version supports both classification and regression

  • Added support for categorical covariates beyond binary sex

  • Generalized to arbitrary site counts and pairing strategies

Source code for original study: Aditi-Asati/Interpolation-And-Brain-Age-Prediction

Citation#

If you use this method in your research, please cite:

@article{nieto2026data,
  title={Data harmonizing via interpolation applied to brain age prediction},
  author={Nieto, Nicol{\'a}s and Asati, Aditi and Jadhav, Kushaj and Patil, Kaustubh R},
  journal={Discover Data},
  volume={4},
  number={1},
  pages={3},
  year={2026},
  publisher={Springer},
  doi={10.1007/s44248-026-00100-7}
}