
IntraSiteInterpolation

Bases: SamplerMixin, BaseEstimator

Intra-Site Interpolation (ISI) Harmonization.

This sampler performs site-wise class balancing to reduce spurious correlations between site membership and class labels.

For each site independently:

  • The majority class is identified.
  • All minority classes are oversampled to match the majority count.
  • Any imblearn-compatible oversampling strategy may be used.

The method supports both binary and multi-class classification and returns a globally concatenated, site-harmonized dataset.
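The site-wise balancing described above can be sketched with NumPy alone. This is an illustrative stand-in that uses plain random duplication, not the library's implementation; the function name `balance_within_sites` and its `seed` argument are assumptions made for this example.

```python
from collections import Counter

import numpy as np


def balance_within_sites(X, y, sites, seed=0):
    """Duplicate minority-class rows at random until every class in a site
    matches that site's majority count (illustrative stand-in only)."""
    rng = np.random.default_rng(seed)
    X_out, y_out, sites_out = [], [], []
    for site in np.unique(sites):
        mask = sites == site
        X_site, y_site = X[mask], y[mask]
        counts = Counter(y_site.tolist())
        target = max(counts.values())  # majority class count in this site
        keep = [np.arange(len(y_site))]  # always keep the original rows
        for label, n in counts.items():
            if n < target:
                pool = np.flatnonzero(y_site == label)
                keep.append(rng.choice(pool, size=target - n, replace=True))
        idx = np.concatenate(keep)
        X_out.append(X_site[idx])
        y_out.append(y_site[idx])
        sites_out.append(np.full(len(idx), site))
    return np.vstack(X_out), np.concatenate(y_out), np.concatenate(sites_out)


# Two sites with opposite class imbalance.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
sites = np.array(["A"] * 5 + ["B"] * 5)

X_bal, y_bal, sites_bal = balance_within_sites(X, y, sites)
for site in ("A", "B"):
    print(site, sorted(Counter(y_bal[sites_bal == site].tolist()).items()))
# A [(0, 4), (1, 4)]
# B [(0, 4), (1, 4)]
```

Before balancing, class 0 dominates site A and class 1 dominates site B, so the label is predictable from the site. After balancing, both classes are equally frequent within each site, which removes that shortcut.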

Parameters:

Name Type Description Default
interpolator str or SamplerMixin instance, optional (default "smote")

The interpolator to use. Can be a str specifying a built-in method or an instance of SamplerMixin. Supported str methods are:

  • "smote": Synthetic Minority Over-sampling Technique
  • "borderline-smote": Borderline-SMOTE
  • "svm-smote": SVM-SMOTE
  • "adasyn": Adaptive Synthetic Sampling
  • "kmeans-smote": KMeans-SMOTE
  • "random": Random Over-Sampling
'smote'
random_state int or RandomState instance or None, optional (default None)

The seed of the pseudo random number generator or RandomState for reproducibility.

None
verbose bool, optional (default False)

If True, logs progress information.

False
**kwargs dict

Additional keyword arguments passed to interpolator.

{}
Source code in src/uniharmony/interpolation/_intra_site.py
class IntraSiteInterpolation(SamplerMixin, BaseEstimator):
    """Intra-Site Interpolation (ISI) Harmonization.

    This sampler performs **site-wise class balancing** to reduce spurious
    correlations between site membership and class labels.

    For each site independently:
    - The majority class is identified.
    - All minority classes are oversampled to match the majority count.
    - Any imblearn-compatible oversampling strategy may be used.

    The method supports both binary and multi-class classification and
    returns a globally concatenated, site-harmonized dataset.

    Parameters
    ----------
    interpolator : str or SamplerMixin instance, optional (default "smote")
        The interpolator to use. Can be a str specifying a built-in method or
        an instance of SamplerMixin.
        Supported str methods are:

          - "smote": Synthetic Minority Over-sampling Technique
          - "borderline-smote": Borderline-SMOTE
          - "svm-smote": SVM-SMOTE
          - "adasyn": Adaptive Synthetic Sampling
          - "kmeans-smote": KMeans-SMOTE
          - "random": Random Over-Sampling

    random_state : int or RandomState instance or None, optional (default None)
        The seed of the pseudo random number generator or RandomState for
        reproducibility.
    verbose : bool, optional (default False)
        If True, logs progress information.
    **kwargs : dict
        Additional keyword arguments passed to ``interpolator``.

    """

    def __init__(
        self,
        interpolator: str | SamplerMixin = "smote",
        *,
        random_state: int | np.random.RandomState | None = None,
        verbose: bool = False,
        **kwargs,
    ) -> None:
        self.interpolator = interpolator
        self.random_state = random_state
        self.verbose = verbose
        self.kwargs = kwargs

    def fit_resample(
        self,
        X: np.ndarray,
        y: np.ndarray,
        *,
        sites: np.ndarray,
    ):
        """Fit and resample the dataset using site-wise harmonization.

        Parameters
        ----------
        X : numpy.ndarray of shape (n_samples, n_features)
            Feature matrix containing the input samples.
        y : numpy.ndarray of shape (n_samples,)
            Target class labels associated with each sample in ``X``.
        sites : numpy.ndarray of shape (n_samples,)
            Site or domain identifiers indicating the origin of each sample.
            Resampling is performed independently within each site.

        Returns
        -------
        X_resampled : numpy.ndarray of shape (n_samples_new, n_features)
            The feature matrix after site-wise oversampling.
        y_resampled : numpy.ndarray of shape (n_samples_new,)
            The corresponding class labels after resampling.

        Raises
        ------
        ValueError
            If ``X``, ``y``, and ``sites`` have incompatible shapes, if fewer
            than two unique sites are present, or if any site contains samples
            from only a single class.

        Notes
        -----
        For each site, the majority class count is used as the target.
        All minority classes within that site are oversampled to match
        this count using the configured interpolator.

        """
        X, y = check_X_y(X, y)
        sites = check_array(sites, ensure_2d=False)

        # Sanity checks for site length and number of sites
        sites_sanity_checks(X, sites)

        # This method needs at least two classes per site
        class_representation_checks(y, sites)

        random_state = check_random_state(self.random_state)
        if isinstance(self.interpolator, str):
            self.interpolator = create_interpolator(
                self.interpolator,
                random_state=random_state,
                **self.kwargs,
            )
        elif isinstance(self.interpolator, SamplerMixin):
            # A user-supplied interpolator must oversample every class except
            # the majority; for imblearn over-samplers "auto" is equivalent
            # to "not majority".
            if self.interpolator.sampling_strategy not in ["auto", "not majority"]:
                raise ValueError("IntraSiteInterpolation requires the interpolator to have `sampling_strategy='not majority'`.")
        else:
            raise ValueError("interpolator must be either a string or an instance of SamplerMixin.")

        X_out, y_out, sites_out = [], [], []

        for site in np.unique(sites):
            mask = sites == site
            X_site, y_site = X[mask], y[mask]

            if self.verbose:
                logger.info(f"[ISI] Site {site}: {Counter(y_site)}")

            X_rs, y_rs = self.interpolator.fit_resample(X_site, y_site)  # type: ignore

            X_out.append(X_rs)
            y_out.append(y_rs)
            sites_out.append(np.full(len(X_rs), site))

        self.sites_resampled_ = np.concatenate(sites_out)

        return np.vstack(X_out), np.concatenate(y_out)

    # ------------------------------------------------------------------ #
    # Compatibility
    # ------------------------------------------------------------------ #
    def _fit_resample(self, X, y, **params):
        """Unused placeholder required by SamplerMixin.

        This sampler overrides ``fit_resample`` directly because it
        requires the additional ``sites`` argument.
        """
        pass

fit_resample(X, y, *, sites)

Fit and resample the dataset using site-wise harmonization.

Parameters:

Name Type Description Default
X numpy.ndarray of shape (n_samples, n_features)

Feature matrix containing the input samples.

required
y numpy.ndarray of shape (n_samples,)

Target class labels associated with each sample in X.

required
sites numpy.ndarray of shape (n_samples,)

Site or domain identifiers indicating the origin of each sample. Resampling is performed independently within each site.

required

Returns:

Name Type Description
X_resampled numpy.ndarray of shape (n_samples_new, n_features)

The feature matrix after site-wise oversampling.

y_resampled numpy.ndarray of shape (n_samples_new,)

The corresponding class labels after resampling.

Raises:

Type Description
ValueError

If X, y, and sites have incompatible shapes, if fewer than two unique sites are present, or if any site contains samples from only a single class.
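The three error conditions can be illustrated with a small validator. This is a sketch under assumptions: `check_sites` is a hypothetical function that mirrors the documented conditions, and the library's own checks may differ in detail and in message wording.

```python
import numpy as np


def check_sites(X, y, sites):
    """Hypothetical validation mirroring the documented error conditions."""
    X, y, sites = np.asarray(X), np.asarray(y), np.asarray(sites)
    if not (len(X) == len(y) == len(sites)):
        raise ValueError("X, y, and sites must have the same number of samples.")
    unique_sites = np.unique(sites)
    if len(unique_sites) < 2:
        raise ValueError("At least two unique sites are required.")
    for site in unique_sites:
        if len(np.unique(y[sites == site])) < 2:
            raise ValueError(f"Site {site!r} contains only one class.")


def raises_value_error(*args):
    """Return True if check_sites rejects the given inputs."""
    try:
        check_sites(*args)
    except ValueError:
        return True
    return False


X = np.zeros((4, 2))
ok = not raises_value_error(X, [0, 1, 0, 1], ["A", "A", "B", "B"])
bad_shape = raises_value_error(X, [0, 1, 0, 1], ["A", "A", "B"])
one_site = raises_value_error(X, [0, 1, 0, 1], ["A", "A", "A", "A"])
one_class = raises_value_error(X, [0, 0, 0, 1], ["A", "A", "B", "B"])
print(ok, bad_shape, one_site, one_class)
# True True True True
```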

Notes

For each site, the majority class count is used as the target. All minority classes within that site are oversampled to match this count using the configured interpolator.
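As a worked example of this rule, the per-site class counts after resampling, and therefore `n_samples_new`, can be computed from `y` and `sites` alone. The helper `expected_counts` is a hypothetical name used only for this illustration.

```python
from collections import Counter, defaultdict


def expected_counts(y, sites):
    """Per-site class counts after balancing to the site's majority count."""
    per_site = defaultdict(Counter)
    for label, site in zip(y, sites):
        per_site[site][label] += 1
    return {
        site: {label: max(counts.values()) for label in counts}
        for site, counts in per_site.items()
    }


y = [0, 0, 0, 1, 1, 2, 0, 1, 1, 1, 1]
sites = ["A"] * 6 + ["B"] * 5

after = expected_counts(y, sites)
n_samples_new = sum(sum(c.values()) for c in after.values())
print(after)          # {'A': {0: 3, 1: 3, 2: 3}, 'B': {0: 4, 1: 4}}
print(n_samples_new)  # 17
```

Site A has class counts 3, 2, and 1, so its target is 3 and it grows from 6 to 9 samples. Site B has counts 1 and 4, so its target is 4 and it grows from 5 to 8 samples, giving 17 samples in total.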

Source code in src/uniharmony/interpolation/_intra_site.py
def fit_resample(
    self,
    X: np.ndarray,
    y: np.ndarray,
    *,
    sites: np.ndarray,
):
    """Fit and resample the dataset using site-wise harmonization.

    Parameters
    ----------
    X : numpy.ndarray of shape (n_samples, n_features)
        Feature matrix containing the input samples.
    y : numpy.ndarray of shape (n_samples,)
        Target class labels associated with each sample in ``X``.
    sites : numpy.ndarray of shape (n_samples,)
        Site or domain identifiers indicating the origin of each sample.
        Resampling is performed independently within each site.

    Returns
    -------
    X_resampled : numpy.ndarray of shape (n_samples_new, n_features)
        The feature matrix after site-wise oversampling.
    y_resampled : numpy.ndarray of shape (n_samples_new,)
        The corresponding class labels after resampling.

    Raises
    ------
    ValueError
        If ``X``, ``y``, and ``sites`` have incompatible shapes, if fewer
        than two unique sites are present, or if any site contains samples
        from only a single class.

    Notes
    -----
    For each site, the majority class count is used as the target.
    All minority classes within that site are oversampled to match
    this count using the configured interpolator.

    """
    X, y = check_X_y(X, y)
    sites = check_array(sites, ensure_2d=False)

    # Sanity checks for site length and number of sites
    sites_sanity_checks(X, sites)

    # This method needs at least two classes per site
    class_representation_checks(y, sites)

    random_state = check_random_state(self.random_state)
    if isinstance(self.interpolator, str):
        self.interpolator = create_interpolator(
            self.interpolator,
            random_state=random_state,
            **self.kwargs,
        )
    elif isinstance(self.interpolator, SamplerMixin):
        # A user-supplied interpolator must oversample every class except
        # the majority; for imblearn over-samplers "auto" is equivalent
        # to "not majority".
        if self.interpolator.sampling_strategy not in ["auto", "not majority"]:
            raise ValueError("IntraSiteInterpolation requires the interpolator to have `sampling_strategy='not majority'`.")
    else:
        raise ValueError("interpolator must be either a string or an instance of SamplerMixin.")

    X_out, y_out, sites_out = [], [], []

    for site in np.unique(sites):
        mask = sites == site
        X_site, y_site = X[mask], y[mask]

        if self.verbose:
            logger.info(f"[ISI] Site {site}: {Counter(y_site)}")

        X_rs, y_rs = self.interpolator.fit_resample(X_site, y_site)  # type: ignore

        X_out.append(X_rs)
        y_out.append(y_rs)
        sites_out.append(np.full(len(X_rs), site))

    self.sites_resampled_ = np.concatenate(sites_out)

    return np.vstack(X_out), np.concatenate(y_out)