Tutorial 01 — Quickstart
This notebook walks through a minimal Ageas workflow on synthetic data:
- Generate a synthetic AnnData with informative and noise genes
- Wrap it in a Multimodal_Corpus (the dataset object Ageas consumes)
- Load a model hangar from a config folder
- Run n_kfold_selection to pick the best classifiers
- Predict cell-type probabilities on the corpus
- Call Deck.debrief() to obtain per-class feature importances
The synthetic data has 2 classes, 100 cells, and 20 features (2 informative, 2 redundant, 16 noise), so the whole notebook runs in under a minute on CPU.
[1]:
import shutil
import warnings
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from ageas import Hangar, n_kfold_selection
from ageas.tool import Multimodal_Corpus, make_fake_adata
warnings.filterwarnings("ignore")
1. Synthetic data
make_fake_adata creates an AnnData with:
- obs['celltype']: integer cell-type labels
- var['name']: gene symbol strings
- 16 noise genes + 2 informative + 2 redundant by default
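To make the feature layout concrete, here is a minimal standard-library sketch of a dataset with the same shape. It is not how make_fake_adata works internally (that is not shown in this notebook); the function name make_toy_matrix, the class shift of 2.0, and the particular linear combinations used for the redundant genes are all assumptions chosen for illustration.

```python
import random


def make_toy_matrix(n_cells=100, n_noise=16, seed=42):
    """Return (X, labels, gene_names) for a 2-class toy dataset.

    Hypothetical stand-in for make_fake_adata, not the Ageas implementation.
    """
    rng = random.Random(seed)
    gene_names = (
        [f"fake_gene_{i}" for i in range(n_noise)]
        + ["informative_gene_0", "informative_gene_1"]
        + ["redundant_gene_0", "redundant_gene_1"]
    )
    X, labels = [], []
    for _ in range(n_cells):
        label = rng.randint(0, 1)
        shift = 2.0 if label == 1 else -2.0  # assumed class separation
        # Noise genes carry no information about the label.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        # Informative genes are shifted in opposite directions per class.
        informative = [rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0)]
        # Redundant genes are linear combinations of the informative ones.
        redundant = [
            0.5 * informative[0] + 0.5 * informative[1],
            informative[0] - informative[1],
        ]
        X.append(noise + informative + redundant)
        labels.append(label)
    return X, labels, gene_names


X, labels, gene_names = make_toy_matrix()
print(len(X), len(X[0]), gene_names[-4:])
```

Only the two informative genes (and the redundant genes derived from them) depend on the label, which is what a feature-importance method would ideally recover.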
[2]:
adata = make_fake_adata(n_class=2, n_clusters_per_class=1)
print(adata)
adata.obs.head()
Seed set to 42
AnnData object with n_obs × n_vars = 100 × 20
obs: 'celltype'
var: 'name'
[2]:
| | celltype |
|---|---|
| fake_cell_0 | 0 |
| fake_cell_1 | 1 |
| fake_cell_2 | 0 |
| fake_cell_3 | 0 |
| fake_cell_4 | 0 |
[ ]:
# Save to disk — Multimodal_Corpus reads from a file path.
# In a real workflow, point this at your own .h5ad file.
adata_path = "ageas_tut01.h5ad"
adata.write_h5ad(adata_path)
2. Build corpus
Multimodal_Corpus wraps the AnnData file and tracks the label → integer mapping needed by the classifiers. It also implements PyTorch’s Dataset interface so Ageas can stream batches via DataLoaders.
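The two responsibilities described above can be sketched without PyTorch, since a map-style dataset only needs `__len__` and `__getitem__`. The class ToyCorpus below is a hypothetical illustration, not the Multimodal_Corpus implementation; only the attribute name label_dict is borrowed from the tutorial, and the sorted-label ordering is an assumption.

```python
class ToyCorpus:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        # Sort the unique labels so the mapping is deterministic (assumption).
        self.label_dict = {lab: i for i, lab in enumerate(sorted(set(labels)))}
        self.targets = [self.label_dict[lab] for lab in labels]

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (features, integer target) pair, as a DataLoader expects.
        return self.features[idx], self.targets[idx]


toy = ToyCorpus(
    features=[[0.1, 2.0], [1.5, 0.3], [0.2, 1.9]],
    labels=["T cell", "B cell", "T cell"],
)
print(toy.label_dict)    # {'B cell': 0, 'T cell': 1}
print(len(toy), toy[0])  # 3 ([0.1, 2.0], 1)
```

With integer labels, as in this tutorial's synthetic data, such a mapping reduces to the identity, which matches the `{0: 0, 1: 1}` label map printed by the next cell.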
[4]:
corpus = Multimodal_Corpus(
    adata_path,
    label_key="celltype",
    backed=False,  # load fully into RAM (fine for small data)
)
print("Label map:", corpus.label_dict)
print("Cells:", len(corpus))
Label map: {0: 0, 1: 1}
Cells: 100
3. Load hangar
A Hangar reads a folder of JSON config files, one sub-folder per model family (logreg, svc, xgb, mlp, resnet, rnn). The data/configs/sample_panel directory bundled with the repo is a good starting point for experimentation.
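The folder layout described above (one sub-folder per family, JSON files inside) can be walked with the standard library alone. The sketch below is an assumption about the general pattern, not the Hangar loader itself: load_configs, the file names, and the parameter keys are invented for illustration, and the real config schema is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path


def load_configs(config_folder):
    """Return {family: {config_name: parsed_json}} for every *.json file."""
    configs = {}
    for family_dir in sorted(Path(config_folder).iterdir()):
        if not family_dir.is_dir():
            continue
        configs[family_dir.name] = {
            path.stem: json.loads(path.read_text())
            for path in sorted(family_dir.glob("*.json"))
        }
    return configs


# Build a throwaway folder so the sketch runs anywhere.
# File names and parameters here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    for family, name, params in [
        ("logreg", "sample_log_reg", {"C": 1.0}),
        ("xgb", "mn_0", {"max_depth": 3}),
        ("xgb", "mn_1", {"max_depth": 5}),
    ]:
        family_dir = Path(tmp) / family
        family_dir.mkdir(exist_ok=True)
        (family_dir / f"{name}.json").write_text(json.dumps(params))

    configs = load_configs(tmp)

n_units = sum(len(family) for family in configs.values())
print(sorted(configs), n_units)  # ['logreg', 'xgb'] 3
```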
[5]:
# Adjust this path if you're running from outside the repo root.
config_folder = "../../data/configs/sample_panel"
hangar = Hangar(config_folder)
print(f"Hangar loaded {len(hangar.units)} units")
Hangar loaded 8 units
4. Model selection with n_kfold_selection
n_kfold_selection runs k-fold cross-validation and keeps only the models that clear the retention_point threshold. The survivors are then retrained on the full dataset in a “last mission” pass (skip_final=False).
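The select-then-retrain idea can be illustrated with a small standard-library sketch. It is not the n_kfold_selection implementation: the helper names, the interleaved (unshuffled) fold assignment, the use of mean held-out accuracy as the monitored metric, and the two toy classifiers are all assumptions made to keep the example deterministic.

```python
from statistics import mean


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved folds (no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def select_models(models, X, y, k=2, retention_point=0.5):
    """Return {name: mean held-out accuracy} for models above the threshold."""
    survivors = {}
    for name, fit in models.items():
        scores = []
        for held_out in kfold_indices(len(X), k):
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            predict = fit([X[i] for i in train], [y[i] for i in train])
            hits = sum(predict(X[i]) == y[i] for i in held_out)
            scores.append(hits / len(held_out))
        if mean(scores) > retention_point:
            survivors[name] = mean(scores)
    return survivors


def fit_threshold(X_train, y_train):
    """Toy classifier: cut feature 0 halfway between the two class means."""
    m0 = mean(x[0] for x, label in zip(X_train, y_train) if label == 0)
    m1 = mean(x[0] for x, label in zip(X_train, y_train) if label == 1)
    cut = (m0 + m1) / 2
    return lambda x: int(x[0] > cut)


def fit_always_zero(X_train, y_train):
    """Toy baseline that ignores its input and always predicts class 0."""
    return lambda x: 0


models = {"threshold": fit_threshold, "always_zero": fit_always_zero}
X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10

survivors = select_models(models, X, y)
# Final pass: refit every survivor on the full dataset.
final_models = {name: models[name](X, y) for name in survivors}
print(sorted(survivors))  # ['threshold']
```

The constant baseline scores exactly 0.5 on the balanced folds, so it does not exceed the threshold and is dropped, while the threshold classifier is kept and refit on all 20 samples.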
[6]:
deck = n_kfold_selection(
    hangar=hangar,
    query_dataset=corpus,
    test_dataset=corpus,  # using same data for demo; use a real holdout in practice
    kfold_selection_list=[2],
    valid_fraction=0.1,
    monitor_metric="test.accuracy",
    n_dataloader_workers=1,
    retention_point=0.5,  # keep any model with accuracy > 0.5
    cutoff_point=0.0,
    seed=42,
    verbose=False,
)
shutil.rmtree("cache", ignore_errors=True)
print(f"Surviving units ({len(deck.squad)}):")
for uid in deck.squad:
    print(" ", uid)
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
| Name | Type | Params | Mode
---------------------------------------------------------
0 | embedder | Sequential | 1.5 K | train
1 | blocks | ModuleList | 8.8 K | train
2 | dropout | Dropout | 0 | train
3 | fc | Linear | 18 | train
4 | criterion | CrossEntropyLoss | 0 | train
5 | accuracy | MulticlassAccuracy | 0 | train
6 | f1 | MulticlassF1Score | 0 | train
7 | auroc | MulticlassAUROC | 0 | train
---------------------------------------------------------
10.3 K Trainable params
0 Non-trainable params
10.3 K Total params
0.041 Total estimated model params size (MB)
29 Modules in train mode
0 Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
| Name | Type | Params | Mode
---------------------------------------------------------
0 | embedder | Sequential | 1.4 K | train
1 | maxpool | MaxPool1d | 0 | train
2 | blocks | ModuleList | 2.8 K | train
3 | avgpool | AdaptiveAvgPool1d | 0 | train
4 | dropout | Dropout | 0 | train
5 | fc | Linear | 34 | train
6 | criterion | CrossEntropyLoss | 0 | train
7 | accuracy | MulticlassAccuracy | 0 | train
8 | f1 | MulticlassF1Score | 0 | train
9 | auroc | MulticlassAUROC | 0 | train
---------------------------------------------------------
4.2 K Trainable params
0 Non-trainable params
4.2 K Total params
0.017 Total estimated model params size (MB)
37 Modules in train mode
0 Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
| Name | Type | Params | Mode
---------------------------------------------------------
0 | embedder | Sequential | 1.4 K | train
1 | maxpool | MaxPool1d | 0 | train
2 | blocks | ModuleList | 2.5 K | train
3 | avgpool | AdaptiveAvgPool1d | 0 | train
4 | dropout | Dropout | 0 | train
5 | fc | Linear | 10 | train
6 | criterion | CrossEntropyLoss | 0 | train
7 | accuracy | MulticlassAccuracy | 0 | train
8 | f1 | MulticlassF1Score | 0 | train
9 | auroc | MulticlassAUROC | 0 | train
---------------------------------------------------------
3.9 K Trainable params
0 Non-trainable params
3.9 K Total params
0.016 Total estimated model params size (MB)
30 Modules in train mode
0 Modules in eval mode
`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
| Name | Type | Params | Mode
---------------------------------------------------------
0 | embedder | Sequential | 1.5 K | train
1 | blocks | ModuleList | 5.7 K | train
2 | dropout | Dropout | 0 | train
3 | fc | Linear | 18 | train
4 | criterion | CrossEntropyLoss | 0 | train
5 | accuracy | MulticlassAccuracy | 0 | train
6 | f1 | MulticlassF1Score | 0 | train
7 | auroc | MulticlassAUROC | 0 | train
---------------------------------------------------------
7.2 K Trainable params
0 Non-trainable params
7.2 K Total params
0.029 Total estimated model params size (MB)
41 Modules in train mode
0 Modules in eval mode
`Trainer.fit` stopped: `max_epochs=2` reached.
Surviving units (8):
svc_linear_0
rnn_lstm_0
resnet_bottleneck_0
resnet_basic_0
logreg_sample_log_reg
mlp_resEnc_0
xgb_mn_0
xgb_mn_1
5. Predict
Deck.predict runs every surviving unit on the corpus and returns an ensemble probability matrix (shape [n_cells, n_classes]) and the true labels.
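How Deck.predict combines the units is not spelled out in this notebook. One common choice, assumed here purely for illustration, is an unweighted element-wise mean of each unit's probability matrix. The sketch below uses plain lists; ensemble_mean and the two toy matrices are hypothetical.

```python
def ensemble_mean(prob_matrices):
    """Average per-model [n_cells][n_classes] probability matrices."""
    n_models = len(prob_matrices)
    n_cells = len(prob_matrices[0])
    n_classes = len(prob_matrices[0][0])
    return [
        [
            sum(m[i][j] for m in prob_matrices) / n_models
            for j in range(n_classes)
        ]
        for i in range(n_cells)
    ]


# Two hypothetical units scoring three cells over two classes.
model_a = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
model_b = [[0.7, 0.3], [0.4, 0.6], [0.3, 0.7]]

ensemble = ensemble_mean([model_a, model_b])
# Equivalent of argmax(axis=-1): pick the most probable class per cell.
predicted = [row.index(max(row)) for row in ensemble]
print(predicted)  # [0, 1, 1]
```

In the binary case, column 1 of such a matrix is the probability of the class with the larger label, which is the score roc_auc_score expects and why the next cell passes `all_preds[:, 1]`.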
[7]:
all_preds, all_labels = deck.predict(query_dataset=corpus)
acc = accuracy_score(all_labels, all_preds.argmax(axis=-1))
auroc = roc_auc_score(all_labels, all_preds[:, 1])
print(f"Ensemble accuracy : {acc:.4f}")
print(f"Ensemble AUROC : {auroc:.4f}")
Ensemble accuracy : 1.0000
Ensemble AUROC : 1.0000
6. Feature importance with Deck.debrief
Deck.debrief calls each unit’s explain() method (coefficients for linear models, SHAP for XGBoost, Integrated Gradients for neural nets), aggregates the per-class scores into a single DataFrame indexed by feature name, and prints each unit’s individual scores.
A positive Class_0_Scores value means the feature pushes a cell toward class 0. In this two-class example, Class_1_Scores is the exact negative of Class_0_Scores.
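The aggregation step can be pictured as combining one score dictionary per unit into a single table. The sketch below sums scores across units and mirrors the sign for class 1; the summation rule is an assumption (the notebook does not state how debrief weights or normalizes units), and aggregate_scores, the unit names, and the gene names are hypothetical.

```python
from collections import defaultdict


def aggregate_scores(per_unit_scores):
    """Sum {feature: class-0 score} dicts across units into one table."""
    totals = defaultdict(float)
    for scores in per_unit_scores.values():
        for feature, value in scores.items():
            totals[feature] += value
    # For two classes, evidence for class 0 is evidence against class 1.
    return {
        feature: {"Class_0_Scores": total, "Class_1_Scores": -total}
        for feature, total in totals.items()
    }


# Hypothetical per-unit class-0 attributions.
per_unit = {
    "unit_a": {"gene_x": 1.5, "gene_y": -0.5},
    "unit_b": {"gene_x": 0.5, "gene_y": -0.25, "gene_z": 0.25},
}

table = aggregate_scores(per_unit)
top_class_0 = sorted(
    table, key=lambda f: table[f]["Class_0_Scores"], reverse=True
)
print(top_class_0)  # ['gene_x', 'gene_z', 'gene_y']
```

Sorting the same table by Class_1_Scores instead gives the features that push cells toward class 1.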
[8]:
importance = deck.debrief(exp_dataset=corpus, verbose=True)
print("\nAggregated importance table:")
importance
Aggregated importance table:
[8]:
| | Class_0_Scores | Class_1_Scores |
|---|---|---|
| fake_gene_0 | -1.103242 | 1.103242 |
| fake_gene_1 | 14.013287 | -14.013287 |
| fake_gene_2 | -0.503321 | 0.503321 |
| fake_gene_3 | 1.012185 | -1.012185 |
| fake_gene_4 | -1.052482 | 1.052482 |
| fake_gene_5 | -0.549032 | 0.549032 |
| fake_gene_6 | 0.704132 | -0.704132 |
| fake_gene_7 | -1.332842 | 1.332842 |
| fake_gene_8 | 0.825337 | -0.825337 |
| fake_gene_9 | 1.004996 | -1.004996 |
| fake_gene_10 | -0.020199 | 0.020199 |
| fake_gene_11 | 0.755457 | -0.755457 |
| fake_gene_12 | 10.399876 | -10.399876 |
| fake_gene_13 | -0.500709 | 0.500709 |
| fake_gene_14 | -0.491509 | 0.491509 |
| fake_gene_15 | -0.045541 | 0.045541 |
| informative_gene_0 | 0.516871 | -0.516871 |
| informative_gene_1 | 0.312430 | -0.312430 |
| redundant_gene_0 | -0.134174 | 0.134174 |
| redundant_gene_1 | -1.095171 | 1.095171 |
[9]:
# Top 5 features driving class 0
importance["Class_0_Scores"].sort_values(ascending=False).head(5)
[9]:
fake_gene_1 14.013287
fake_gene_12 10.399876
fake_gene_3 1.012185
fake_gene_9 1.004996
fake_gene_8 0.825337
Name: Class_0_Scores, dtype: float64