Characterise a multisite problem

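Before applying any harmonization technique, the first step is to understand and characterize the data: how many samples each site contributes, how classes are distributed within each site, and how feature distributions differ between sites.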

Imports

import matplotlib.pyplot as plt
import seaborn as sns

from uniharmony import verbosity
from uniharmony.datasets import (
    get_multisite_data_statistics,
    make_multisite_classification,
    print_statistics_summary,
)
from uniharmony.plot import plot_features_by_site


sns.set_theme(style="whitegrid")
verbosity("warning")

Data generation

Let’s use the multisite data generator, make_multisite_classification, to simulate 1000 samples with 10 features and 3 classes spread across 5 sites.

print("Generating example data...")
X, y, sites = make_multisite_classification(
    n_sites=5,
    n_samples=1000,
    n_features=10,
    n_classes=3,
    random_state=42,
)

print("\n" + "=" * 60)
Generating example data...

============================================================
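To make concrete what a multisite dataset looks like, here is a minimal standard-library sketch of one common way to simulate site effects: every site gets its own additive offset and multiplicative scale, applied on top of a class signal shared by all sites. This is an illustrative assumption, not a description of how make_multisite_classification works internally; the function simulate_multisite and all of its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_multisite(n_sites=5, n_samples=1000, n_features=10,
                       n_classes=3, seed=42):
    """Toy multisite generator (hypothetical, not uniharmony's algorithm)."""
    rng = random.Random(seed)
    # One additive offset and one multiplicative scale per site and feature.
    offsets = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_sites)]
    scales = [[rng.uniform(0.5, 1.5) for _ in range(n_features)]
              for _ in range(n_sites)]
    X, y, sites = [], [], []
    for i in range(n_samples):
        site = i % n_sites                   # balanced site assignment
        label = (i // n_sites) % n_classes   # classes balanced within each site
        # Shared class signal: the label shifts the mean of every feature.
        signal = [rng.gauss(float(label), 1.0) for _ in range(n_features)]
        # Site effect: scale the signal, then shift it.
        row = [scales[site][j] * signal[j] + offsets[site][j]
               for j in range(n_features)]
        X.append(row)
        y.append(label)
        sites.append(site)
    return X, y, sites


X_toy, y_toy, sites_toy = simulate_multisite()
print(Counter(sites_toy))  # samples per site
print(Counter(y_toy))      # samples per class
```

In this toy model the class signal is identical across sites, so any difference between the per-site feature distributions comes only from the offsets and scales. That difference is what a harmonization method is meant to remove.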

Now let’s compute summary statistics for the whole dataset and for each site.

print("Computing statistics...")
print("=" * 60)

# Compute statistics
stats = get_multisite_data_statistics(
    X=X,
    y=y,
    sites=sites,
    feature_names=[f"feat_{i}" for i in range(X.shape[1])],
)
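# The summary is emitted through the logger at "info" level, so raise the
# verbosity while printing it and restore the quieter level afterwards
verbosity("info")
print_statistics_summary(stats)
verbosity("warning")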
Computing statistics...
============================================================
2026-05-18 13:03:34 [info     ] ============================================================
2026-05-18 13:03:34 [info     ] DATASET STATISTICS SUMMARY
2026-05-18 13:03:34 [info     ] ============================================================
2026-05-18 13:03:34 [info     ]
OVERALL:
2026-05-18 13:03:34 [info     ]   Samples: 1000
2026-05-18 13:03:34 [info     ]   Features: 10
2026-05-18 13:03:34 [info     ]   Sites: 5
2026-05-18 13:03:34 [info     ]   Classes: 3
2026-05-18 13:03:34 [info     ]
CLASS DISTRIBUTION:
2026-05-18 13:03:34 [info     ]   class_0: 335 samples (33.5%)
2026-05-18 13:03:34 [info     ]   class_1: 335 samples (33.5%)
2026-05-18 13:03:34 [info     ]   class_2: 330 samples (33.0%)
2026-05-18 13:03:34 [info     ]
SITE DISTRIBUTION:
2026-05-18 13:03:34 [info     ]   site_0: 200 samples (20.0%)
2026-05-18 13:03:34 [info     ]   site_1: 200 samples (20.0%)
2026-05-18 13:03:34 [info     ]   site_2: 200 samples (20.0%)
2026-05-18 13:03:34 [info     ]   site_3: 200 samples (20.0%)
2026-05-18 13:03:34 [info     ]   site_4: 200 samples (20.0%)
2026-05-18 13:03:34 [info     ]
SITE STATISTICS (summary):
2026-05-18 13:03:34 [info     ]   site_0:
2026-05-18 13:03:34 [info     ]     Samples: 200
2026-05-18 13:03:34 [info     ]     Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info     ]   site_1:
2026-05-18 13:03:34 [info     ]     Samples: 200
2026-05-18 13:03:34 [info     ]     Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info     ]   site_2:
2026-05-18 13:03:34 [info     ]     Samples: 200
2026-05-18 13:03:34 [info     ]     Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info     ]   site_3:
2026-05-18 13:03:34 [info     ]     Samples: 200
2026-05-18 13:03:34 [info     ]     Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info     ]   site_4:
2026-05-18 13:03:34 [info     ]     Samples: 200
2026-05-18 13:03:34 [info     ]     Class distribution: {'class_0': 67, 'class_1': 67, 'class_2': 66}
2026-05-18 13:03:34 [info     ]
FEATURE STATISTICS (first 5 features):
2026-05-18 13:03:34 [info     ]   feat_0: mean=0.8137, std=1.6804, MAD=1.2405
2026-05-18 13:03:34 [info     ]   feat_1: mean=-0.2313, std=1.9461, MAD=1.4023
2026-05-18 13:03:34 [info     ]   feat_2: mean=1.0819, std=2.0994, MAD=1.4588
2026-05-18 13:03:34 [info     ]   feat_3: mean=0.8475, std=1.6420, MAD=1.2095
2026-05-18 13:03:34 [info     ]   feat_4: mean=0.7886, std=1.5675, MAD=1.1310
2026-05-18 13:03:34 [info     ]
CORRELATIONS:
2026-05-18 13:03:34 [info     ]   Average Inter-Site Correlation: 0.9485
2026-05-18 13:03:34 [info     ] ============================================================
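The mean, std and MAD values in a summary like this are pooled over all sites, which can hide site effects. The sketch below shows how the same quantities could be computed per site by hand, using only the standard library and a tiny made-up feature. It assumes that MAD means the median absolute deviation (the summary does not say which definition is used), and the helper names are hypothetical rather than part of uniharmony.

```python
import statistics
from collections import defaultdict


def median_abs_deviation(values):
    """Median of absolute deviations from the median (assumed MAD definition)."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def feature_stats_by_site(values, sites):
    """Group one feature's values by site and summarise each group."""
    groups = defaultdict(list)
    for value, site in zip(values, sites):
        groups[site].append(value)
    return {
        site: {
            "mean": statistics.mean(vals),
            "std": statistics.stdev(vals),  # sample standard deviation
            "mad": median_abs_deviation(vals),
        }
        for site, vals in sorted(groups.items())
    }


# Made-up values for one feature measured at two sites.
feat = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]
site_labels = ["site_0"] * 4 + ["site_1"] * 4

result = feature_stats_by_site(feat, site_labels)
for site, summary in result.items():
    print(site, summary)
```

In this made-up example the two sites have identical spread but means that differ by 10, which is the signature of a purely additive site effect. A pooled mean and standard deviation would blur that structure.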
# Plot every feature grouped by site, with individual data points overlaid
fig2, ax2 = plot_features_by_site(
    X,
    sites,
    figsize=(14, 7),
    rotation=45,
    show_points=True,
    title="All Features by Site (with individual points)",
)
plt.show()
[Figure: All Features by Site (with individual points)]
