🤺 ComBat-based Harmonization#
ComBat - neuroComBat#
ComBat is one of the most widely used statistical harmonization methods [1]. It was originally proposed for genomics to correct batch effects in microarray gene-expression data. In these datasets, samples are processed in batches (e.g., laboratory runs), and systematic technical differences between batches introduce unwanted variability that can obscure biological signals.
The method was later adapted to multi-site neuroimaging datasets, as neuroComBat, where similar sources of variability arise from differences in scanner hardware, acquisition protocols, and preprocessing pipelines [2].
For each feature and each site, ComBat estimates:
a site-specific location parameter (mean shift), and
a site-specific scale parameter (variance scaling).
Additionally, ComBat can preserve biologically relevant variance: variables passed as covariates are modeled linearly, and their estimated effects are retained in the harmonized data.
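The location/scale idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the neuroComBat implementation: it omits the empirical Bayes shrinkage of the site parameters that ComBat performs, and the function name and argument layout are hypothetical. It assumes every site has at least two samples and that the covariates contain no constant column.

```python
import numpy as np

def simple_location_scale_harmonize(X, site, covars):
    """Simplified ComBat-style adjustment (no empirical Bayes shrinkage).

    X: (n_samples, n_features) features.
    site: (n_samples,) site labels.
    covars: (n_samples, n_covariates) biological covariates to preserve.
    """
    X = np.asarray(X, dtype=float)
    site = np.asarray(site)
    covars = np.asarray(covars, dtype=float)
    sites = np.unique(site)

    # Design matrix: one indicator column per site, then the covariates.
    site_ind = (site[:, None] == sites[None, :]).astype(float)
    design = np.hstack([site_ind, covars])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    site_coef, cov_coef = coef[: len(sites)], coef[len(sites):]

    # Grand intercept: sample-size-weighted average of the site intercepts.
    alpha = site_ind.mean(axis=0) @ site_coef
    bio = covars @ cov_coef          # linear covariate effects to keep
    resid = X - design @ coef        # site location is already removed here
    pooled_sd = resid.std(axis=0)

    out = np.empty_like(X)
    for s in sites:
        m = site == s
        # Site-specific scale: residual spread relative to the pooled spread.
        delta = resid[m].std(axis=0) / pooled_sd
        out[m] = resid[m] / delta + alpha + bio[m]
    return out
```

After this adjustment every site shares the same intercept and residual spread per feature, while the fitted covariate effect is added back unchanged.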
Because of its robustness and simplicity, ComBat has become a standard harmonization approach in neuroimaging and has inspired a large family of extensions and variants.
Source code
ComBat-GAM: Preservation of non-linear biological covariate effects#
While neuroComBat can preserve the effects of biological covariates that influence features linearly, non-linear biological relationships (for example, age-related brain trajectories) may not be preserved accurately.
ComBat-GAM extends the original ComBat by allowing non-linear covariate effects using generalized additive models (GAMs) [3].
This is particularly important for neuroimaging variables such as age, which often show strong non-linear effects on brain features.
Key contributions:
modeling of non-linear biological covariates
improved preservation of biological signals
compatibility with ML pipelines
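The difference from the linear model can be written compactly. The notation is an assumption of this sketch and follows the usual ComBat formulation: \(y_{ijv}\) is feature \(v\) for subject \(j\) at site \(i\), \(\mathbf{x}_{ij}\) are the biological covariates, \(\gamma_{iv}\) and \(\delta_{iv}\) are the site location and scale parameters, and \(f_v\) is a smooth function estimated by the GAM.

\[
\text{ComBat:}\quad y_{ijv} = \alpha_v + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}_v + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv}
\]

\[
\text{ComBat-GAM:}\quad y_{ijv} = \alpha_v + f_v(\mathbf{x}_{ij}) + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv}
\]

\[
y_{ijv}^{\text{harmonized}} = \frac{y_{ijv} - \hat{\alpha}_v - \hat{f}_v(\mathbf{x}_{ij}) - \hat{\gamma}_{iv}}{\hat{\delta}_{iv}} + \hat{\alpha}_v + \hat{f}_v(\mathbf{x}_{ij})
\]

Only the covariate term changes; the site parameters are removed in the same way, and the fitted non-linear effect \(\hat{f}_v\) is added back after scaling.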
Source code
CovBat: Harmonization of the covariance matrix#
ComBat assumes that site effects can be fully modeled using feature-wise mean shifts and variance scaling. However, more complex scanner differences, such as differences in feature covariance structure, are not addressed. While standard ComBat aligns means and variances, differences in feature correlations may still remain and affect multivariate analyses.
CovBat extends ComBat by also harmonizing the covariance structure of features across sites, and it has shown improved performance in machine learning applications using neuroimaging features [4].
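One way to picture covariance harmonization is to work on principal component scores: if each site's scores on the leading components share the same mean and variance, the sites' covariance matrices move closer together. The sketch below illustrates that idea only. It assumes the input has already been adjusted for site means and variances, it uses a plain per-site standardization rather than the procedure in [4], and the function name is hypothetical.

```python
import numpy as np

def harmonize_pc_scores(resid, site, n_components):
    """Align per-site mean and variance of the leading PC scores.

    resid: (n_samples, n_features) data assumed to be already adjusted
           for site-wise means and variances.
    site: (n_samples,) site labels.
    n_components: number of leading components to adjust.
    """
    resid = np.asarray(resid, dtype=float)
    site = np.asarray(site)
    center = resid.mean(axis=0)
    centered = resid - center

    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt.T

    k = n_components
    pooled_sd = scores[:, :k].std(axis=0)
    adjusted = scores.copy()
    for s in np.unique(site):
        m = site == s
        block = scores[m, :k]
        # Give every site zero-mean scores with the pooled spread.
        adjusted[m, :k] = (block - block.mean(axis=0)) / block.std(axis=0) * pooled_sd

    # Map the adjusted scores back to feature space.
    return adjusted @ vt + center
```

Components beyond `n_components` are left untouched, so the choice of `n_components` controls how much of the covariance structure is aligned.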
Source code
Neuroharmony: Harmonization based on IQMs#
ComBat cannot be applied to unseen sites, because its location and scale parameters are learned separately for each site.
Neuroharmony is a harmonization approach based on image quality metrics (IQMs) computed from each image rather than on explicit site labels [5].
Because it relies on scanner-related image characteristics instead of site identifiers, it can generalize to previously unseen scanners or sites.
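The generalization argument can be illustrated with a toy model that learns a mapping from IQMs to feature corrections on known sites and then applies it to a scan from a new scanner. This is a sketch of the idea under strong assumptions, not Neuroharmony's implementation: ordinary least squares stands in for whatever regressor the method uses, the corrections are assumed to come from a prior harmonization of the training sites, and all function names are hypothetical.

```python
import numpy as np

def fit_correction_model(iqms, corrections):
    """Least-squares map from IQMs (plus an intercept) to corrections.

    iqms: (n_samples, n_iqms) image quality metrics of training scans.
    corrections: (n_samples, n_features) harmonized minus raw features,
                 assumed to come from harmonizing the known sites.
    """
    iqms = np.asarray(iqms, dtype=float)
    design = np.column_stack([np.ones(len(iqms)), iqms])
    coef, *_ = np.linalg.lstsq(design, corrections, rcond=None)
    return coef

def apply_correction_model(coef, features, iqms):
    """Correct features using only their IQMs; no site label is needed."""
    iqms = np.asarray(iqms, dtype=float)
    design = np.column_stack([np.ones(len(iqms)), iqms])
    return np.asarray(features, dtype=float) + design @ coef
```

Because `apply_correction_model` takes IQMs rather than a site identifier, it can be called on scans from a scanner that was absent during fitting, provided the learned relationship between IQMs and corrections still holds there.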
Source code
Longitudinal ComBat (LongComBat): ComBat adapted for repeated scans#
Longitudinal ComBat adapts the ComBat framework for longitudinal studies, where subjects are scanned repeatedly over time.
Standard ComBat assumes independent observations, an assumption that is violated in repeated-measures designs.
Longitudinal ComBat introduces subject-specific random effects to model within-subject correlations [6].
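In model form, the change amounts to one extra term. The notation is an assumption of this sketch: \(y_{ijv}(t)\) is feature \(v\) for subject \(j\) scanned at site \(i\) at time \(t\), \(\mathbf{x}_{ij}(t)\) are covariates that may include time, \(\gamma_{iv}\) and \(\delta_{iv}\) are the site location and scale parameters, and \(\eta_{jv}\) is a subject-specific random intercept.

\[
y_{ijv}(t) = \alpha_v + \mathbf{x}_{ij}(t)^{\top}\boldsymbol{\beta}_v + \eta_{jv} + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv}(t),
\qquad \eta_{jv} \sim \mathcal{N}(0, \tau_v^2)
\]

The random intercept \(\eta_{jv}\) absorbs the correlation between repeated scans of the same subject, so that it is not mistaken for a site effect or for residual noise.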
Source code
DeepComBat: Hybrid approach of deep learning and ComBat#
DeepComBat integrates ComBat with deep learning-based feature modeling.
The method uses neural networks to capture complex non-linear structure in the data while preserving the statistical harmonization principles of ComBat.
This hybrid approach allows the modeling of complex scanner effects and feature interactions that cannot be captured by linear models [7].
Source code
PrettYharmonize: A framework to integrate ComBat-based harmonization methods into machine learning pipelines#
A common challenge in multi-site datasets occurs when class distributions differ across sites. For example:
Site A contains mostly control subjects
Site B contains mostly patients
In such cases, biological signal and site effects become confounded.
If harmonization removes site effects without accounting for the biological variable, the biological signal may also be partially removed.
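A small simulation makes the confounding concrete. All numbers are illustrative assumptions: one feature, a true patient effect of 1.0, a site shift of 2.0, and sites with a 10%/90% patient share. The "naive" step subtracts each site's mean, while the second variant fits site and label jointly and removes only the site part.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # subjects per site

# Site A: about 10% patients; site B: about 90% patients (label 1 = patient).
site = np.repeat(["A", "B"], n)
label = np.concatenate([rng.random(n) < 0.1, rng.random(n) < 0.9]).astype(int)
true_effect, site_shift = 1.0, 2.0
x = true_effect * label + site_shift * (site == "B") + rng.normal(size=2 * n)

def group_gap(values):
    """Mean difference between patients and controls, pooled over sites."""
    return values[label == 1].mean() - values[label == 0].mean()

# Naive harmonization: subtract each site's mean, ignoring the label.
naive = x.copy()
for s in ("A", "B"):
    naive[site == s] -= x[site == s].mean()

# Covariate-preserving variant: fit site and label jointly,
# then remove only the estimated site intercepts.
design = np.column_stack([site == "A", site == "B", label]).astype(float)
coef, *_ = np.linalg.lstsq(design, x, rcond=None)
site_part = design[:, :2] @ coef[:2]
preserved = x - site_part + site_part.mean()

print("raw gap:      ", group_gap(x))
print("naive gap:    ", group_gap(naive))
print("preserved gap:", group_gap(preserved))
```

In this setup the raw gap is inflated by the site shift, the naive gap falls well below the true effect because part of the patient signal is absorbed into the site means, and the covariate-preserving variant recovers a gap close to the true effect.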
To address this, ComBat allows users to include biological variables as covariates to preserve during harmonization.
However, this introduces an issue in machine-learning pipelines.
If the covariate being preserved is also the target variable of the ML model, then the target must be known during harmonization. In real-world prediction scenarios this information is unavailable, creating a new form of data leakage \({^6}\).
Even if harmonization parameters are estimated using only the training set, the transformation of test data still requires the target variable.
A suitable alternative in these scenarios is PrettYharmonize, which allows ComBat-based harmonization to be integrated into ML pipelines without requiring the target variable during inference [8].
Source code