Characterise a multisite problem
Before applying any harmonization technique, the first step is to understand and characterize our data.
Imports
import matplotlib.pyplot as plt
import seaborn as sns

from uniharmony import verbosity
from uniharmony.datasets import (
    get_multisite_data_statistics,
    make_multisite_classification,
    print_statistics_summary,
)
from uniharmony.plot import plot_features_by_site

sns.set_theme(style="whitegrid")
verbosity("warning")
Data generation
Let’s use the multisite data generator to simulate some data.
Generating example data...
============================================================
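The call that produced this dataset is not shown here. As a rough illustration of what a multisite generator does, the following NumPy-only sketch draws class-dependent features and then applies a per-site shift and scale. The function name `simulate_multisite`, its parameters, and the site-effect model are assumptions made for illustration, not the `make_multisite_classification` implementation; only the sample, site, and class counts mirror the summary reported on this page.

```python
import numpy as np


def simulate_multisite(n_samples=1000, n_features=10, n_sites=5, n_classes=3, seed=0):
    """Hypothetical generator: class signal plus additive/multiplicative site effects."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_samples)

    # Assign sites round-robin, and classes round-robin within each site,
    # so that every site has the same class balance.
    sites = idx % n_sites
    y = (idx // n_sites) % n_classes

    # Class signal: each class has its own centre in feature space.
    class_centers = rng.normal(0.0, 2.0, size=(n_classes, n_features))
    X = class_centers[y] + rng.normal(size=(n_samples, n_features))

    # Site effect (assumed model): per-site, per-feature scale and shift.
    site_scale = rng.uniform(0.5, 1.5, size=(n_sites, n_features))
    site_shift = rng.normal(0.0, 1.0, size=(n_sites, n_features))
    X = X * site_scale[sites] + site_shift[sites]
    return X, y, sites


X, y, sites = simulate_multisite()
print("Samples per site: ", np.bincount(sites).tolist())
print("Samples per class:", np.bincount(y).tolist())
```

Separating the class signal from the site effect in this way makes it easy to check later whether a harmonization method removes the latter without damaging the former.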
Now let’s compute some statistics.
Computing statistics...
============================================================
2026-05-18 13:03:34 [info ] ============================================================
2026-05-18 13:03:34 [info ] DATASET STATISTICS SUMMARY
2026-05-18 13:03:34 [info ] ============================================================
2026-05-18 13:03:34 [info ]
OVERALL:
2026-05-18 13:03:34 [info ] Samples: 1000
2026-05-18 13:03:34 [info ] Features: 10
2026-05-18 13:03:34 [info ] Sites: 5
2026-05-18 13:03:34 [info ] Classes: 3
2026-05-18 13:03:34 [info ]
CLASS DISTRIBUTION:
2026-05-18 13:03:34 [info ] class_0: 335 samples (33.5%)
2026-05-18 13:03:34 [info ] class_1: 335 samples (33.5%)
2026-05-18 13:03:34 [info ] class_2: 330 samples (33.0%)
2026-05-18 13:03:34 [info ]
SITE DISTRIBUTION:
2026-05-18 13:03:34 [info ] site_0: 200 samples (20.0%)
2026-05-18 13:03:34 [info ] site_1: 200 samples (20.0%)
2026-05-18 13:03:34 [info ] site_2: 200 samples (20.0%)
2026-05-18 13:03:34 [info ] site_3: 200 samples (20.0%)
2026-05-18 13:03:34 [info ] site_4: 200 samples (20.0%)
2026-05-18 13:03:34 [info ]
SITE STATISTICS (summary):
2026-05-18 13:03:34 [info ] site_0:
2026-05-18 13:03:34 [info ] Samples: 200
2026-05-18 13:03:34 [info ] Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info ] site_1:
2026-05-18 13:03:34 [info ] Samples: 200
2026-05-18 13:03:34 [info ] Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info ] site_2:
2026-05-18 13:03:34 [info ] Samples: 200
2026-05-18 13:03:34 [info ] Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info ] site_3:
2026-05-18 13:03:34 [info ] Samples: 200
2026-05-18 13:03:34 [info ] Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info ] site_4:
2026-05-18 13:03:34 [info ] Samples: 200
2026-05-18 13:03:34 [info ] Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info ]
FEATURE STATISTICS (first 5 features):
2026-05-18 13:03:34 [info ] feat_0: mean=0.8137, std=1.6804, MAD=1.2405
2026-05-18 13:03:34 [info ] feat_1: mean=-0.2313, std=1.9461, MAD=1.4023
2026-05-18 13:03:34 [info ] feat_2: mean=1.0819, std=2.0994, MAD=1.4588
2026-05-18 13:03:34 [info ] feat_3: mean=0.8475, std=1.6420, MAD=1.2095
2026-05-18 13:03:34 [info ] feat_4: mean=0.7886, std=1.5675, MAD=1.1310
2026-05-18 13:03:34 [info ]
CORRELATIONS:
2026-05-18 13:03:34 [info ] Average Inter-Site Correlation: 0.9485
2026-05-18 13:03:34 [info ] ============================================================
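To make the reported quantities concrete, the sketch below computes the per-feature mean, standard deviation, and MAD, plus an average inter-site correlation, with plain NumPy. It is not the `get_multisite_data_statistics` implementation: MAD is assumed here to be the unscaled median absolute deviation, and the inter-site correlation is assumed to be the mean Pearson correlation between per-site feature-mean profiles. The helper names and the toy data are hypothetical.

```python
import numpy as np


def feature_summary(X):
    """Per-feature mean, sample std (ddof=1 is an assumption), and median absolute deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    return mean, std, mad


def average_inter_site_correlation(X, sites):
    """Mean pairwise Pearson correlation between per-site feature-mean vectors."""
    labels = np.unique(sites)
    site_means = np.stack([X[sites == s].mean(axis=0) for s in labels])
    corr = np.corrcoef(site_means)                # shape: (n_sites, n_sites)
    upper = np.triu_indices(len(labels), k=1)     # each site pair counted once
    return float(corr[upper].mean())


# Toy data: a shared feature profile plus a small per-site shift and unit noise.
rng = np.random.default_rng(0)
n_samples, n_features, n_sites = 1000, 10, 5
sites = np.arange(n_samples) % n_sites
profile = rng.normal(0.0, 2.0, size=n_features)
site_shift = rng.normal(0.0, 0.5, size=(n_sites, n_features))
X = profile + site_shift[sites] + rng.normal(size=(n_samples, n_features))

mean, std, mad = feature_summary(X)
avg_corr = average_inter_site_correlation(X, sites)
for j in range(5):
    print(f"feat_{j}: mean={mean[j]:.4f}, std={std[j]:.4f}, MAD={mad[j]:.4f}")
print(f"Average inter-site correlation: {avg_corr:.4f}")
```

Under this definition, a value close to 1 means the sites share a similar overall feature profile. It does not rule out site-specific shifts or scale differences, so per-site distributions are still worth inspecting.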

Total running time of the script: (0 minutes 3.870 seconds)