.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/01-basic-examples/01-plot_eos_in_ml.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_01-basic-examples_01-plot_eos_in_ml.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_01-basic-examples_01-plot_eos_in_ml.py:


Impact of Effects of Site (EoS) in ML
=====================================

Effects of Site (EoS) can affect ML pipelines in two nearly opposite ways.

- The first effect is to **hinder** the true signal. In this case, the ML model has a harder time finding the true signal, so removing EoS should *improve* classification because the signal-to-noise ratio increases.

- The second effect is to **confound** the true signal. In this case, the ML model can *use* the EoS signal to spuriously improve its performance: the predictions are then based on site effects rather than on true biological signal. Removing EoS will therefore *reduce* the model's measured performance.
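
The two cases above can be sketched on toy data. The snippet below is only an illustration under stated assumptions: it uses plain NumPy and scikit-learn rather than ``uniharmony``, models EoS as a constant shift added to a single feature at one of two hypothetical sites, and "removes" it by subtracting each site's mean, which is a deliberately naive stand-in for a real harmonization method. The helper names ``mean_auc`` and ``center_per_site`` are made up for this sketch.

.. code-block:: Python

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    n_per_site = 300
    site = np.repeat([0, 1], n_per_site)  # two hypothetical sites
    site_shift = 4.0 * site  # additive offset standing in for EoS


    def mean_auc(x, y):
        """Mean 5-fold ROC AUC of a logistic regression on one feature."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        clf = LogisticRegression()
        scores = cross_val_score(clf, x.reshape(-1, 1), y, cv=cv, scoring="roc_auc")
        return scores.mean()


    def center_per_site(x, site):
        """Subtract each site's mean (a naive way to remove the shift)."""
        x = x.copy()
        for s in np.unique(site):
            x[site == s] -= x[site == s].mean()
        return x


    # Case 1 (hinder): real signal, classes balanced within each site.
    y_bal = np.tile([0, 1], n_per_site)  # alternating labels, 50/50 per site
    x_hinder = 1.0 * y_bal + rng.normal(size=y_bal.size) + site_shift

    # Case 2 (confound): no signal, class proportions differ between sites.
    p_class1 = np.where(site == 0, 0.1, 0.9)
    y_imb = (rng.random(site.size) < p_class1).astype(int)
    x_confound = rng.normal(size=site.size) + site_shift

    auc_hinder_raw = mean_auc(x_hinder, y_bal)
    auc_hinder_clean = mean_auc(center_per_site(x_hinder, site), y_bal)
    auc_confound_raw = mean_auc(x_confound, y_imb)
    auc_confound_clean = mean_auc(center_per_site(x_confound, site), y_imb)

    print(f"hinder:   raw={auc_hinder_raw:.2f}  centered={auc_hinder_clean:.2f}")
    print(f"confound: raw={auc_confound_raw:.2f}  centered={auc_confound_clean:.2f}")

On this toy data, centering is expected to raise the AUC in the first case and pull it toward chance in the second. Centering on the full dataset before cross-validation is a simplification kept for brevity; it uses no labels, but a stricter setup would estimate the site means on the training folds only.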

.. GENERATED FROM PYTHON SOURCE LINES 13-15

Imports
-------

.. GENERATED FROM PYTHON SOURCE LINES 15-34

.. code-block:: Python


    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    from uniharmony import verbosity
    from uniharmony.datasets import make_multisite_classification
    from uniharmony.plot import plot_decision_boundary_2d


    sns.set_theme(style="whitegrid")
    verbosity("warning")

    clf = LogisticRegression()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
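
As a hedged aside, the sketch below shows roughly how a classifier and a splitter like the ones defined above are used together. It assumes scikit-learn's ``make_classification`` as stand-in data instead of a multisite dataset, and the ``*_demo`` names are hypothetical.

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Stand-in data: 300 samples, 2 informative features, 2 classes.
    X_demo, y_demo = make_classification(
        n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
    )

    clf_demo = LogisticRegression()
    cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    # One ROC AUC value per fold; each fold keeps the overall class proportions.
    fold_auc = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")
    print(fold_auc.shape)
    print(f"mean CV AUC: {fold_auc.mean():.4f} (std {fold_auc.std():.4f})")

    # cross_val_score fits clones, so the original estimator is still unfitted.
    print(hasattr(clf_demo, "coef_"))

Because ``cross_val_score`` works on clones of the estimator, an explicit ``fit`` call is still needed before a decision boundary can be drawn from the original object.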


.. GENERATED FROM PYTHON SOURCE LINES 35-37

Data generation
---------------

.. GENERATED FROM PYTHON SOURCE LINES 37-62

.. code-block:: Python


    # First, let's create an example without EoS and with classes balanced across sites.
    X, y, sites = make_multisite_classification(
        n_sites=3, n_features=2, signal_strength=1, site_effect_strength=0, signal_type="blobs", random_state=23
    )
    # Create DataFrame for easier plotting
    df = pd.DataFrame(
        {"Feature 1": X[:, 0], "Feature 2": X[:, 1], "Class": [f"Class {c}" for c in y], "Site": [f"Site {s}" for s in sites]}
    )
    # Perform 10-fold stratified cross-validation
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    # Plot with site as hue and class as style
    sns.scatterplot(data=df, x="Feature 1", y="Feature 2", hue="Site", style="Class", s=100, alpha=0.7, ax=ax)
    ax.set_title(f"Data distribution, mean CV AUC: {scores.mean():.4f}", fontsize=14, fontweight="bold")

    plt.tight_layout()

    # Fit the model on all data and plot the decision boundary.
    # This is for visualization only; the actual evaluation was done with cross-validation.
    clf.fit(X, y)
    plot_decision_boundary_2d(ax, clf)


.. image-sg:: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_001.png
   :alt: Data distribution, mean CV AUC: 0.8365
   :srcset: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 63-69

.. code-block:: Python


    X, y, sites = make_multisite_classification(
        n_sites=3, n_features=2, signal_strength=1, site_effect_strength=4, signal_type="blobs", random_state=23
    )


.. GENERATED FROM PYTHON SOURCE LINES 70-72

Plotting
--------

.. GENERATED FROM PYTHON SOURCE LINES 74-75

Create DataFrame for easier plotting

.. GENERATED FROM PYTHON SOURCE LINES 75-94

.. code-block:: Python

    df = pd.DataFrame(
        {"Feature 1": X[:, 0], "Feature 2": X[:, 1], "Class": [f"Class {c}" for c in y], "Site": [f"Site {s}" for s in sites]}
    )
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    # Plot with site as hue and class as style
    sns.scatterplot(data=df, x="Feature 1", y="Feature 2", hue="Site", style="Class", s=100, alpha=0.7, ax=ax)

    ax.set_title(f"Data distribution, mean CV AUC: {scores.mean():.4f}", fontsize=14, fontweight="bold")

    plt.tight_layout()

    # Fit the model on all data and plot the decision boundary.
    # This is for visualization only; the actual evaluation was done with cross-validation.
    clf.fit(X, y)
    plot_decision_boundary_2d(ax, clf)


.. image-sg:: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_002.png
   :alt: Data distribution, mean CV AUC: 0.8317
   :srcset: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 95-122

.. code-block:: Python


    X, y, sites = make_multisite_classification(
        n_sites=3, n_features=2, signal_strength=0, site_effect_strength=4, signal_type="blobs", random_state=23
    )

    # Create DataFrame for easier plotting
    df = pd.DataFrame(
        {"Feature 1": X[:, 0], "Feature 2": X[:, 1], "Class": [f"Class {c}" for c in y], "Site": [f"Site {s}" for s in sites]}
    )

    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

    # Plot with site as hue and class as style
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    sns.scatterplot(data=df, x="Feature 1", y="Feature 2", hue="Site", style="Class", s=100, alpha=0.7, ax=ax)
    ax.set_title(f"Data distribution, mean CV AUC: {scores.mean():.4f}", fontsize=14, fontweight="bold")
    plt.tight_layout()

    # Fit the model on all data and plot the decision boundary.
    # This is for visualization only; the actual evaluation was done with cross-validation.
    clf.fit(X, y)
    plot_decision_boundary_2d(ax, clf)

    print("We don't have real signal, and the classes are equally distributed across sites")
    print(f"Mean CV AUC: {scores.mean():.4f}")


.. image-sg:: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_003.png
   :alt: Data distribution, mean CV AUC: 0.4945
   :srcset: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-05-18 13:03:34 [warning  ] signal_strength is 0. Adding a delta (1e-6) to signal_strength to avoid degenerate data.
    We don't have real signal, and the classes are equally distributed across sites
    Mean CV AUC: 0.4945
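
An AUC close to 0.5 is what a classifier with no usable signal is expected to produce. As a minimal sketch (synthetic labels and random scores, not the data above; all names are hypothetical), the chance level can be checked directly with ``roc_auc_score``:

.. code-block:: Python

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Balanced binary labels and scores that carry no information about them.
    y_demo = np.tile([0, 1], 5000)
    random_scores = rng.normal(size=y_demo.size)

    auc_random = roc_auc_score(y_demo, random_scores)
    # Negating the scores reverses the ranking, which mirrors the AUC around 0.5.
    auc_flipped = roc_auc_score(y_demo, -random_scores)

    print(f"AUC of random scores:  {auc_random:.3f}")
    print(f"AUC of negated scores: {auc_flipped:.3f}")

Values slightly below 0.5 are as plausible as values slightly above it; both reflect sampling noise rather than a model that is systematically wrong.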


.. GENERATED FROM PYTHON SOURCE LINES 123-134

.. code-block:: Python


    X, y, sites = make_multisite_classification(
        n_sites=3,
        n_features=2,
        signal_strength=0,
        site_effect_strength=1,
        signal_type="blobs",
        random_state=23,
        balance_per_site=[0.1, 0.5, 0.9],
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-05-18 13:03:35 [warning  ] signal_strength is 0. Adding a delta (1e-6) to signal_strength to avoid degenerate data.
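
Whether classes are balanced across sites can be checked with a normalized contingency table. The sketch below builds its own labels by hand and assumes that values such as ``[0.1, 0.5, 0.9]`` are read as the fraction of one class at each site; that reading of ``balance_per_site`` is an assumption of this sketch, and the ``*_demo`` names are hypothetical.

.. code-block:: Python

    import numpy as np
    import pandas as pd

    n_per_site = 100
    fractions = [0.1, 0.5, 0.9]  # assumed fraction of class 1 at each site

    sites_demo = np.repeat(np.arange(len(fractions)), n_per_site)
    # For each site, label the first round(f * n) samples as class 1.
    y_demo = np.concatenate(
        [(np.arange(n_per_site) < round(f * n_per_site)).astype(int) for f in fractions]
    )

    # Rows are sites, columns are classes, and each row sums to 1.
    class_share = pd.crosstab(
        pd.Series(sites_demo, name="Site"),
        pd.Series(y_demo, name="Class"),
        normalize="index",
    )
    print(class_share)

Rows that differ strongly from each other indicate that knowing the site already says something about the class.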


.. GENERATED FROM PYTHON SOURCE LINES 135-136

Visualize the data and evaluate the classifier

.. GENERATED FROM PYTHON SOURCE LINES 136-158

.. code-block:: Python


    # Create DataFrame for easier plotting
    df = pd.DataFrame(
        {"Feature 1": X[:, 0], "Feature 2": X[:, 1], "Class": [f"Class {c}" for c in y], "Site": [f"Site {s}" for s in sites]}
    )
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

    # Plot with site as hue and class as style
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    sns.scatterplot(data=df, x="Feature 1", y="Feature 2", hue="Site", style="Class", s=100, alpha=0.7, ax=ax)
    ax.set_title(f"Data distribution, mean CV AUC: {scores.mean():.4f}", fontsize=14, fontweight="bold")
    plt.tight_layout()

    # Fit the model on all data and plot the decision boundary.
    # This is for visualization only; the actual evaluation was done with cross-validation.
    clf.fit(X, y)
    plot_decision_boundary_2d(ax, clf)
    print("We don't have real signal, but now classes are not equally distributed across sites.")
    print("The ML models might pick up on the site differences to fraudulently perform the task.")
    print(f"Mean CV AUC: {scores.mean():.4f}")


.. image-sg:: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_004.png
   :alt: Data distribution, mean CV AUC: 0.4738
   :srcset: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    We don't have real signal, but now classes are not equally distributed across sites.
    The ML models might pick up on the site differences to fraudulently perform the task.
    Mean CV AUC: 0.4738
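
One hedged way to judge how much a model *could* gain from site differences is a site-only baseline: predict the class from the site label alone. The sketch below uses hand-built labels with assumed per-site class fractions, not the data generated above, and all names are hypothetical.

.. code-block:: Python

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    n_per_site = 100
    fractions = [0.1, 0.5, 0.9]  # assumed fraction of class 1 at each site

    sites_demo = np.repeat(np.arange(len(fractions)), n_per_site)
    y_demo = np.concatenate(
        [(np.arange(n_per_site) < round(f * n_per_site)).astype(int) for f in fractions]
    )

    # The only input is the one-hot encoded site label; no real features are used.
    site_onehot = pd.get_dummies(pd.Series(sites_demo)).to_numpy(dtype=float)

    cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    site_only_auc = cross_val_score(
        LogisticRegression(), site_onehot, y_demo, cv=cv_demo, scoring="roc_auc"
    ).mean()
    print(f"Site-only mean CV AUC: {site_only_auc:.4f}")

A site-only AUC well above 0.5 means the site label is informative about the class, so any feature that encodes the site can act as a shortcut. Whether a given model actually exploits it also depends on how strongly the site shows up in the features.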


.. GENERATED FROM PYTHON SOURCE LINES 159-189

.. code-block:: Python


    X, y, sites = make_multisite_classification(
        n_sites=3,
        n_features=2,
        signal_strength=0,
        site_effect_strength=0,
        signal_type="blobs",
        random_state=23,
        balance_per_site=[0.1, 0.5, 0.9],
    )

    # Create DataFrame for easier plotting
    df = pd.DataFrame(
        {"Feature 1": X[:, 0], "Feature 2": X[:, 1], "Class": [f"Class {c}" for c in y], "Site": [f"Site {s}" for s in sites]}
    )
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    # Plot with site as hue and class as style
    sns.scatterplot(data=df, x="Feature 1", y="Feature 2", hue="Site", style="Class", s=100, alpha=0.7, ax=ax)

    ax.set_title(f"Data distribution, mean CV AUC: {scores.mean():.4f}", fontsize=14, fontweight="bold")
    plt.tight_layout()

    # Fit the model on all data and plot the decision boundary.
    # This is for visualization only; the actual evaluation was done with cross-validation.
    clf.fit(X, y)
    plot_decision_boundary_2d(ax, clf)
    print("We don't have real signal, nor site effects. Even with class imbalance across sites, there is nothing to pick up.")
    print(f"Mean CV AUC: {scores.mean():.4f}")


.. image-sg:: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_005.png
   :alt: Data distribution, mean CV AUC: 0.5206
   :srcset: /auto_examples/01-basic-examples/images/sphx_glr_01-plot_eos_in_ml_005.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-05-18 13:03:35 [warning  ] signal_strength is 0. Adding a delta (1e-6) to signal_strength to avoid degenerate data.
    We don't have real signal, nor site effects. Even with class imbalance across sites, there is nothing to pick up.
    Mean CV AUC: 0.5206


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.373 seconds)


.. _sphx_glr_download_auto_examples_01-basic-examples_01-plot_eos_in_ml.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01-plot_eos_in_ml.ipynb <01-plot_eos_in_ml.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01-plot_eos_in_ml.py <01-plot_eos_in_ml.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 01-plot_eos_in_ml.zip <01-plot_eos_in_ml.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_