Tutorial 01 — Quickstart

This notebook walks through a minimal Ageas workflow on synthetic data:

  1. Generate a synthetic AnnData with informative and noise genes

  2. Wrap it in a Multimodal_Corpus (the dataset object Ageas consumes)

  3. Load a model hangar from a config folder

  4. Run n_kfold_selection to pick the best classifiers

  5. Predict cell-type probabilities on the corpus

  6. Call Deck.debrief() to obtain per-class feature importances

The synthetic data has 2 classes and 20 features (2 informative, 2 redundant, 16 noise), so the whole notebook runs in under a minute on CPU.

[1]:
import shutil
import warnings

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

from ageas import Hangar, n_kfold_selection
from ageas.tool import Multimodal_Corpus, make_fake_adata

warnings.filterwarnings("ignore")

1. Synthetic data

make_fake_adata creates an AnnData with:

  • obs['celltype'] — integer cell-type labels

  • var['name'] — gene symbol strings

  • 16 noise genes + 2 informative + 2 redundant by default
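
To make the three gene categories concrete, here is a minimal standard-library sketch of how such a matrix could be generated. It is not the implementation of make_fake_adata: the helper name `make_toy_matrix`, the class shift of 2.0, and the way redundant genes are built (the sum of the informative genes plus small noise) are assumptions chosen for illustration.

```python
import random


def make_toy_matrix(n_cells=100, n_noise=16, n_informative=2, n_redundant=2,
                    shift=2.0, seed=42):
    """Return (X, labels, gene_names) for a two-class toy dataset.

    Hypothetical helper for illustration; not part of Ageas.
    """
    rng = random.Random(seed)
    gene_names = (
        [f"fake_gene_{i}" for i in range(n_noise)]
        + [f"informative_gene_{i}" for i in range(n_informative)]
        + [f"redundant_gene_{i}" for i in range(n_redundant)]
    )
    X, labels = [], []
    for cell in range(n_cells):
        label = cell % 2  # alternate classes so both are balanced
        # Noise genes: same distribution in both classes.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        # Informative genes: class 1 is shifted away from class 0.
        informative = [rng.gauss(shift * label, 1.0) for _ in range(n_informative)]
        # Redundant genes: carry the same signal as the informative ones.
        redundant = [sum(informative) + rng.gauss(0.0, 0.1) for _ in range(n_redundant)]
        X.append(noise + informative + redundant)
        labels.append(label)
    return X, labels, gene_names


X, labels, gene_names = make_toy_matrix()
print(len(X), "cells x", len(X[0]), "genes")
```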

[2]:
adata = make_fake_adata(n_class=2, n_clusters_per_class=1)
print(adata)
adata.obs.head()
Seed set to 42
AnnData object with n_obs × n_vars = 100 × 20
    obs: 'celltype'
    var: 'name'
[2]:
             celltype
fake_cell_0         0
fake_cell_1         1
fake_cell_2         0
fake_cell_3         0
fake_cell_4         0
[3]:
# Save to disk — Multimodal_Corpus reads from a file path.
# In a real workflow, point this at your own .h5ad file.
adata_path = "ageas_tut01.h5ad"
adata.write_h5ad(adata_path)

2. Build corpus

Multimodal_Corpus wraps the AnnData file and tracks the label → integer mapping needed by the classifiers. It also implements PyTorch’s Dataset interface so Ageas can stream batches via DataLoaders.
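
The two responsibilities named above can be sketched without Ageas or PyTorch. A map-style PyTorch Dataset only has to provide `__len__` and `__getitem__`, so the hypothetical `ToyCorpus` class below mimics that shape in plain Python. Its attribute names and the sorted-label encoding are assumptions, not the Multimodal_Corpus implementation.

```python
class ToyCorpus:
    """Hypothetical stand-in: holds rows plus a label -> integer mapping."""

    def __init__(self, rows, raw_labels):
        if len(rows) != len(raw_labels):
            raise ValueError("rows and raw_labels must have the same length")
        self.rows = rows
        # Assign each distinct label a stable integer index (sorted order).
        self.label_dict = {lab: i for i, lab in enumerate(sorted(set(raw_labels)))}
        self.targets = [self.label_dict[lab] for lab in raw_labels]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # A DataLoader-style consumer would call this once per sample.
        return self.rows[idx], self.targets[idx]


toy = ToyCorpus([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], ["T cell", "B cell", "T cell"])
print(toy.label_dict)    # {'B cell': 0, 'T cell': 1}
print(len(toy), toy[0])  # 3 ([0.1, 0.2], 1)
```

With labels that are already the integers 0 and 1, an encoding like this is the identity, which is consistent with the `{0: 0, 1: 1}` label map printed by the cell that follows.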

[4]:
corpus = Multimodal_Corpus(
    adata_path,
    label_key="celltype",
    backed=False,   # load fully into RAM (fine for small data)
)
print("Label map:", corpus.label_dict)
print("Cells:", len(corpus))
Label map: {0: 0, 1: 1}
Cells: 100

3. Load hangar

A Hangar reads a folder of JSON config files, one sub-folder per model family (logreg, svc, xgb, mlp, resnet, rnn). The data/configs/sample_panel directory bundled with the repo is a good starting point for experimentation.
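
As a rough picture of that layout, the sketch below walks a folder with one sub-folder per model family and parses every JSON file it finds. The function `load_config_folder`, the returned dictionary shape, and the example file names and keys are all hypothetical. Hangar's real parsing logic and config schema are not shown in this tutorial.

```python
import json
import tempfile
from pathlib import Path


def load_config_folder(folder):
    """Return {family: {config_name: parsed_json}} for a config folder."""
    configs = {}
    for family_dir in sorted(Path(folder).iterdir()):
        if not family_dir.is_dir():
            continue  # ignore stray files at the top level
        configs[family_dir.name] = {
            path.stem: json.loads(path.read_text())
            for path in sorted(family_dir.glob("*.json"))
        }
    return configs


# Build a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for family, name, body in [
        ("logreg", "demo_a", {"C": 1.0}),      # made-up keys and values
        ("xgb", "demo_b", {"max_depth": 3}),   # made-up keys and values
    ]:
        (Path(tmp) / family).mkdir()
        (Path(tmp) / family / f"{name}.json").write_text(json.dumps(body))
    panel = load_config_folder(tmp)

print(panel)
```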

[5]:
# Adjust this path if you're running from outside the repo root.
config_folder = "../../data/configs/sample_panel"

hangar = Hangar(config_folder)
print(f"Hangar loaded {len(hangar.units)} units")
Hangar loaded 8 units

4. Model selection with n_kfold_selection

n_kfold_selection runs k-fold cross-validation and keeps only the models that clear the retention_point threshold. The survivors are then retrained on the full dataset in a “last mission” pass (skip_final=False).
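
The selection loop can be pictured with a few lines of plain Python. In this sketch, `kfold_indices` and `select_models` are hypothetical helpers, each "model" is just a callable that returns an accuracy for a train/validation split, and the mean over folds is compared with retention_point using a strict `>`. How Ageas actually aggregates fold scores, and what cutoff_point does, is not covered here.

```python
import random


def kfold_indices(n_samples, k, seed=42):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_models(models, n_samples, k, retention_point):
    """Keep models whose mean validation score exceeds retention_point."""
    folds = kfold_indices(n_samples, k)
    survivors = {}
    for name, score_fn in models.items():
        scores = []
        for i, valid_idx in enumerate(folds):
            # Train on every fold except the one held out for validation.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(train_idx, valid_idx))
        mean_score = sum(scores) / k
        if mean_score > retention_point:
            survivors[name] = mean_score
    return survivors


# Stand-in "models" that return a fixed accuracy regardless of the split.
toy_models = {
    "good_model": lambda train, valid: 0.9,
    "coin_flip": lambda train, valid: 0.5,
}
print(select_models(toy_models, n_samples=100, k=2, retention_point=0.5))
# {'good_model': 0.9}
```

Because the comparison is strict, a model sitting exactly at the threshold (here, chance level for two balanced classes) is dropped.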

[6]:
deck = n_kfold_selection(
    hangar=hangar,
    query_dataset=corpus,
    test_dataset=corpus,     # using same data for demo; use a real holdout in practice
    kfold_selection_list=[2],
    valid_fraction=0.1,
    monitor_metric="test.accuracy",
    n_dataloader_workers=1,
    retention_point=0.5,     # keep any model with accuracy > 0.5
    cutoff_point=0.0,
    seed=42,
    verbose=False,
)
shutil.rmtree("cache", ignore_errors=True)

print(f"Surviving units ({len(deck.squad)}):")
for uid in deck.squad:
    print(" ", uid)
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type               | Params | Mode
---------------------------------------------------------
0 | embedder  | Sequential         | 1.5 K  | train
1 | blocks    | ModuleList         | 8.8 K  | train
2 | dropout   | Dropout            | 0      | train
3 | fc        | Linear             | 18     | train
4 | criterion | CrossEntropyLoss   | 0      | train
5 | accuracy  | MulticlassAccuracy | 0      | train
6 | f1        | MulticlassF1Score  | 0      | train
7 | auroc     | MulticlassAUROC    | 0      | train
---------------------------------------------------------
10.3 K    Trainable params
0         Non-trainable params
10.3 K    Total params
0.041     Total estimated model params size (MB)
29        Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type               | Params | Mode
---------------------------------------------------------
0 | embedder  | Sequential         | 1.4 K  | train
1 | maxpool   | MaxPool1d          | 0      | train
2 | blocks    | ModuleList         | 2.8 K  | train
3 | avgpool   | AdaptiveAvgPool1d  | 0      | train
4 | dropout   | Dropout            | 0      | train
5 | fc        | Linear             | 34     | train
6 | criterion | CrossEntropyLoss   | 0      | train
7 | accuracy  | MulticlassAccuracy | 0      | train
8 | f1        | MulticlassF1Score  | 0      | train
9 | auroc     | MulticlassAUROC    | 0      | train
---------------------------------------------------------
4.2 K     Trainable params
0         Non-trainable params
4.2 K     Total params
0.017     Total estimated model params size (MB)
37        Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type               | Params | Mode
---------------------------------------------------------
0 | embedder  | Sequential         | 1.4 K  | train
1 | maxpool   | MaxPool1d          | 0      | train
2 | blocks    | ModuleList         | 2.5 K  | train
3 | avgpool   | AdaptiveAvgPool1d  | 0      | train
4 | dropout   | Dropout            | 0      | train
5 | fc        | Linear             | 10     | train
6 | criterion | CrossEntropyLoss   | 0      | train
7 | accuracy  | MulticlassAccuracy | 0      | train
8 | f1        | MulticlassF1Score  | 0      | train
9 | auroc     | MulticlassAUROC    | 0      | train
---------------------------------------------------------
3.9 K     Trainable params
0         Non-trainable params
3.9 K     Total params
0.016     Total estimated model params size (MB)
30        Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type               | Params | Mode
---------------------------------------------------------
0 | embedder  | Sequential         | 1.5 K  | train
1 | blocks    | ModuleList         | 5.7 K  | train
2 | dropout   | Dropout            | 0      | train
3 | fc        | Linear             | 18     | train
4 | criterion | CrossEntropyLoss   | 0      | train
5 | accuracy  | MulticlassAccuracy | 0      | train
6 | f1        | MulticlassF1Score  | 0      | train
7 | auroc     | MulticlassAUROC    | 0      | train
---------------------------------------------------------
7.2 K     Trainable params
0         Non-trainable params
7.2 K     Total params
0.029     Total estimated model params size (MB)
41        Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=2` reached.
Surviving units (8):
  svc_linear_0
  rnn_lstm_0
  resnet_bottleneck_0
  resnet_basic_0
  logreg_sample_log_reg
  mlp_resEnc_0
  xgb_mn_0
  xgb_mn_1

5. Predict

Deck.predict runs every surviving unit on the corpus and returns an ensemble probability matrix (shape [n_cells, n_classes]) and the true labels.
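
One common way to combine several classifiers is soft voting: average the per-model probability matrices and take the arg-max per cell. The sketch below shows that idea on made-up numbers. Whether Deck.predict uses a plain mean, a weighted mean, or another rule is an assumption that this tutorial does not confirm.

```python
def soft_vote(prob_matrices):
    """Average a list of [n_cells][n_classes] probability matrices."""
    n_models = len(prob_matrices)
    n_cells = len(prob_matrices[0])
    n_classes = len(prob_matrices[0][0])
    return [
        [sum(m[c][k] for m in prob_matrices) / n_models for k in range(n_classes)]
        for c in range(n_cells)
    ]


# Two hypothetical models scoring three cells over two classes.
model_a = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.3, 0.7], [0.1, 0.9]]

ensemble = soft_vote([model_a, model_b])
# Predicted class = column with the highest averaged probability.
predicted = [row.index(max(row)) for row in ensemble]
print(predicted)  # [0, 1, 1]
```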

[7]:
all_preds, all_labels = deck.predict(query_dataset=corpus)

acc   = accuracy_score(all_labels, all_preds.argmax(axis=-1))
auroc = roc_auc_score(all_labels, all_preds[:, 1])

print(f"Ensemble accuracy : {acc:.4f}")
print(f"Ensemble AUROC    : {auroc:.4f}")
Ensemble accuracy : 1.0000
Ensemble AUROC    : 1.0000
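
Both scores are computed on the same corpus that was used for selection and retraining, so they say little about generalisation. A minimal sketch of a stratified holdout split, which samples each class separately so that class proportions stay equal in both parts, is given below. `stratified_holdout` is a hypothetical helper; one option in practice would be to write each index set to its own .h5ad file and wrap it in a separate Multimodal_Corpus.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Split indices into (train, test), sampling each class separately."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for lab in sorted(by_class):
        members = by_class[lab][:]
        rng.shuffle(members)
        # Hold out at least one cell per class.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))  # 80 20
```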

6. Feature importance with Deck.debrief

Deck.debrief calls each unit’s explain() method (coefficients for linear models, SHAP for XGBoost, Integrated Gradients for neural nets), aggregates the per-class scores into a single DataFrame indexed by feature name, and prints each unit’s individual scores.

A positive value in Class_0_Scores means the feature pushes a cell toward class 0; a negative value pushes it away from class 0.
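
The aggregation step can be pictured as combining one score table per unit into a single table keyed by feature name. The sketch below sums per-unit scores. The function `aggregate_scores`, the choice of a sum rather than a mean or a normalised average, and the example numbers are assumptions for illustration, not the Deck.debrief implementation.

```python
def aggregate_scores(per_unit_scores):
    """Sum {feature: [score_class_0, score_class_1, ...]} tables across units."""
    totals = {}
    for unit_scores in per_unit_scores.values():
        for feature, class_scores in unit_scores.items():
            current = totals.setdefault(feature, [0.0] * len(class_scores))
            for k, score in enumerate(class_scores):
                current[k] += score
    return totals


# Made-up per-unit explanations for two features and two classes.
per_unit = {
    "unit_a": {"gene_x": [1.5, -1.5], "gene_y": [-0.25, 0.25]},
    "unit_b": {"gene_x": [0.5, -0.5], "gene_y": [0.75, -0.75]},
}
table = aggregate_scores(per_unit)
for feature, scores in table.items():
    print(feature, scores)
```

In the two-class table printed below, the two columns are exact negatives of each other, so a feature that pushes toward class 0 pushes away from class 1 by the same amount.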

[8]:
importance = deck.debrief(exp_dataset=corpus, verbose=True)
print("\nAggregated importance table:")
importance

Aggregated importance table:
[8]:
                    Class_0_Scores  Class_1_Scores
fake_gene_0              -1.103242        1.103242
fake_gene_1              14.013287      -14.013287
fake_gene_2              -0.503321        0.503321
fake_gene_3               1.012185       -1.012185
fake_gene_4              -1.052482        1.052482
fake_gene_5              -0.549032        0.549032
fake_gene_6               0.704132       -0.704132
fake_gene_7              -1.332842        1.332842
fake_gene_8               0.825337       -0.825337
fake_gene_9               1.004996       -1.004996
fake_gene_10             -0.020199        0.020199
fake_gene_11              0.755457       -0.755457
fake_gene_12             10.399876      -10.399876
fake_gene_13             -0.500709        0.500709
fake_gene_14             -0.491509        0.491509
fake_gene_15             -0.045541        0.045541
informative_gene_0        0.516871       -0.516871
informative_gene_1        0.312430       -0.312430
redundant_gene_0         -0.134174        0.134174
redundant_gene_1         -1.095171        1.095171
[9]:
# Top 5 features driving class 0
importance["Class_0_Scores"].sort_values(ascending=False).head(5)
[9]:
fake_gene_1     14.013287
fake_gene_12    10.399876
fake_gene_3      1.012185
fake_gene_9      1.004996
fake_gene_8      0.825337
Name: Class_0_Scores, dtype: float64
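
Sorting by signed score only surfaces features that favour class 0. To rank features by overall influence in either direction, one option is to sort by absolute value. The sketch below does this in plain Python on a handful of Class_0_Scores copied from the aggregated table. With the full DataFrame, the pandas equivalent would be to sort on `importance["Class_0_Scores"].abs()`.

```python
# Subset of Class_0_Scores copied from the aggregated importance table.
class_0_scores = {
    "fake_gene_1": 14.013287,
    "fake_gene_12": 10.399876,
    "fake_gene_7": -1.332842,
    "fake_gene_0": -1.103242,
    "redundant_gene_1": -1.095171,
    "informative_gene_0": 0.516871,
}

# Rank by magnitude so strong negative scores are not hidden.
ranked = sorted(class_0_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, score in ranked[:5]:
    # In the two-class case a negative class-0 score favours class 1.
    direction = "class 0" if score > 0 else "class 1"
    print(f"{name:<20} {score:>10.6f}  -> {direction}")
```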