
Characterize a multisite problem using UniHarmony

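The first step before applying any harmonization technique is to understand and characterize the data: how many samples each site contributes, how the classes are distributed within each site, and how the feature statistics look.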

# Imports
from uniharmony._multisite_data_generation import simulate_multisite_data
from uniharmony.multisite_data_characterization import (
    get_site_data_statistics,
    print_statistics_summary,
)

Let's use the multisite data generator to simulate 100 samples with 10 features, spread across 3 sites and 3 classes.

print("Generating example data...")
X, y, sites = simulate_multisite_data(
    n_sites=3,
    n_samples=100,
    n_features=10,
    n_classes=3,
    random_state=42,
    verbose=True,
)

print("\n" + "=" * 60)
Generating example data...
Using balanced classes: [[0.3333333333333333, 0.3333333333333333, 0.3333333333333333], [0.3333333333333333, 0.3333333333333333, 0.3333333333333333], [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]]
Generating 34 samples for site 0
Generating 33 samples for site 1
Generating 33 samples for site 2
Generated 100 samples across 3 sites
Class distribution: [34 33 33]
Site distribution: [34 33 33]

============================================================
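The site and class distributions reported above can be cross-checked by hand. The following is a minimal sketch using only NumPy. It does not call UniHarmony: the `sites` and `y` arrays are stand-in labels built for illustration, and the helpers `count_labels` and `site_by_class_table` are hypothetical names, not part of the library.

```python
import numpy as np

# Stand-in labels (assumption: not produced by UniHarmony).
# 100 samples split over 3 sites as 34/33/33, with random class labels.
rng = np.random.default_rng(42)
sites = np.repeat([0, 1, 2], [34, 33, 33])
y = rng.integers(0, 3, size=sites.size)


def count_labels(labels):
    """Return a {label: count} dict for a 1-D array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}


def site_by_class_table(sites, y):
    """Contingency table with one row per site and one column per class.

    Assumes both label arrays hold integers starting at 0.
    """
    table = np.zeros((sites.max() + 1, y.max() + 1), dtype=int)
    np.add.at(table, (sites, y), 1)  # increment cell (site, class) once per sample
    return table


site_counts = count_labels(sites)
class_counts = count_labels(y)
table = site_by_class_table(sites, y)

print("Site distribution:", site_counts)
print("Class distribution:", class_counts)
print("Site x class table:\n", table)
```

With real data, `sites` and `y` would be the arrays returned by the generator. A row of the table with a near-empty cell indicates a class that is barely represented at that site, which is worth knowing before harmonizing.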

Now let's compute some statistics.

print("Computing statistics...")
print("=" * 60)

# Compute statistics
stats = get_site_data_statistics(
    x=X,
    y=y,
    site_labels=sites,
    feature_names=[f"feat_{i}" for i in range(X.shape[1])],
    compute_comprehensive=True,
    verbose=True,
)

# Print summary
print_statistics_summary(stats)
Computing statistics...
============================================================
Computing statistics for 100 samples, 10 features, 3 sites, 3 classes
  Processing site 0...
  Processing site 1...
  Processing site 2...
  Processing class 0...
  Processing class 1...
  Processing class 2...
============================================================
DATASET STATISTICS SUMMARY
============================================================

OVERALL:
  Samples: 100
  Features: 10
  Sites: 3
  Classes: 3

CLASS DISTRIBUTION:
  class_0: 34 samples (34.0%)
  class_1: 33 samples (33.0%)
  class_2: 33 samples (33.0%)

SITE DISTRIBUTION:
  site_0: 34 samples (34.0%)
  site_1: 33 samples (33.0%)
  site_2: 33 samples (33.0%)

SITE STATISTICS (summary):
  site_0:
    Samples: 34
    Class distribution: {'class_0': 12, 'class_1': 11, 'class_2': 11}
  site_1:
    Samples: 33
    Class distribution: {'class_0': 11, 'class_1': 11, 'class_2': 11}
  site_2:
    Samples: 33
    Class distribution: {'class_0': 11, 'class_1': 11, 'class_2': 11}

FEATURE STATISTICS (first 5 features):
  feat_0: mean=-1.5404, std=3.4675
  feat_1: mean=-1.0764, std=1.6735
  feat_2: mean=2.2555, std=2.2806
  feat_3: mean=1.2875, std=4.7056
  feat_4: mean=1.1251, std=1.8842

CORRELATIONS:
  Average Inter-Site Correlation: 0.0513
============================================================
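The summary above reports feature means and standard deviations pooled over all sites. To see whether sites differ from each other, it can help to compute the same moments per site. The sketch below uses only NumPy on stand-in data with an artificial additive site offset. The data, the helper `per_site_moments`, and the `mean_spread` measure are assumptions made for illustration, not UniHarmony's own computation.

```python
import numpy as np

# Stand-in data (assumption: not produced by UniHarmony).
rng = np.random.default_rng(0)
n_features = 10
sites = np.repeat([0, 1, 2], [34, 33, 33])

# Each site gets its own additive offset per feature to mimic a site effect.
offsets = rng.normal(scale=2.0, size=(3, n_features))
X = rng.normal(size=(sites.size, n_features)) + offsets[sites]


def per_site_moments(X, sites):
    """Return {site: (mean_per_feature, std_per_feature)}."""
    result = {}
    for s in np.unique(sites):
        block = X[sites == s]  # rows belonging to this site
        result[int(s)] = (block.mean(axis=0), block.std(axis=0, ddof=1))
    return result


moments = per_site_moments(X, sites)

# Stack the per-site mean vectors: shape (n_sites, n_features).
site_means = np.vstack([moments[s][0] for s in sorted(moments)])

# Range of the site means for each feature; a large value relative to the
# within-site standard deviation suggests a location shift between sites.
mean_spread = site_means.max(axis=0) - site_means.min(axis=0)

for i in range(5):
    print(f"feat_{i}: spread of site means = {mean_spread[i]:.4f}")
```

Features with a large spread of site means are the ones where a harmonization method has the most to correct.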