Tutorial 02 — Data preparation and corpus construction

Ageas consumes AnnData (.h5ad) files through its corpus classes. This tutorial explains the expected AnnData layout, demonstrates how to build and inspect a Multimodal_Corpus, and shows the built-in k-fold splitter.

Topics:

Required AnnData structure (obs labels, var gene names, X values)
Multimodal_Corpus vs Repr_Corpus
kfold_random_split: training / validation / test splits
Oversampling with 'repeat' to balance classes
Inspecting batches and slicing the corpus

[1]:

import warnings

import numpy as np
import pandas as pd

from ageas.tool import (
    Multimodal_Corpus,
    Repr_Corpus,
    kfold_random_split,
    make_fake_adata,
)

warnings.filterwarnings("ignore")

1. AnnData layout required by Ageas

Slot	Key	Content
`obs`	`label_key`	Cell-type / condition string
`var`	`'name'`	Gene symbol (needed by `use_gene_names=True`)
`X`	—	Non-negative expression matrix (cells × genes), already normalised

make_fake_adata creates a compliant toy dataset:

[2]:

adata = make_fake_adata(n_class=2, n_clusters_per_class=2)
print(adata)
print("\nobs (first 5):")
display(adata.obs.head())
print("\nvar (first 5):")
display(adata.var.head())

Seed set to 42

AnnData object with n_obs × n_vars = 100 × 20
    obs: 'celltype'
    var: 'name'

obs (first 5):

	celltype
fake_cell_0	0
fake_cell_1	0
fake_cell_2	1
fake_cell_3	1
fake_cell_4	0


var (first 5):

	name
fake_gene_0	fake_gene_0
fake_gene_1	fake_gene_1
fake_gene_2	fake_gene_2
fake_gene_3	fake_gene_3
fake_gene_4	fake_gene_4

If you are bringing your own data, make sure:

adata.X contains non-negative values (counts or log1p-normalised counts)
adata.obs[label_key] exists and is a string or integer column
(Optional) adata.var['name'] holds human-readable gene symbols
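
The checklist above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper, not part of Ageas. It works on plain Python lists that stand in for the rows of adata.X and for the adata.obs[label_key] column, so it runs without AnnData installed; the function name and messages are assumptions made for illustration.

```python
import math


def find_input_problems(rows, labels):
    """Return human-readable problems in a cells x genes table and its labels.

    Hypothetical helper: `rows` stands in for adata.X and `labels` for
    adata.obs[label_key].
    """
    problems = []
    if len(rows) != len(labels):
        problems.append(f"{len(rows)} cells but {len(labels)} labels")
    if len({len(row) for row in rows}) > 1:
        problems.append("cells have differing numbers of genes")
    for i, row in enumerate(rows):
        # NaN fails every comparison, so test for it explicitly.
        if any(math.isnan(v) or v < 0 for v in row):
            problems.append(f"cell {i} has a negative or NaN value")
    for i, label in enumerate(labels):
        # bool is a subclass of int, so reject it separately.
        if isinstance(label, bool) or not isinstance(label, (str, int)):
            problems.append(f"label {i} is neither a string nor an integer")
    return problems


good = find_input_problems([[0.0, 1.2], [3.4, 0.0]], ["a", "b"])
bad = find_input_problems([[0.0, -1.0], [2.0, 0.5]], ["a", 1.5])
print(good)
print(bad)
```

With a real AnnData object the same checks would be applied to adata.X and adata.obs[label_key] before writing the file.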

[3]:

# Save to disk — corpora are always constructed from file paths.
adata_path = "/tmp/ageas_tut02.h5ad"
adata.write_h5ad(adata_path)

2. Multimodal_Corpus

Multimodal_Corpus is the standard corpus class. It:

reads the .h5ad file (optionally memory-mapped with backed=True)
builds a label_dict mapping each integer class index → its original label (a string or an integer, as stored in adata.obs[label_key])
implements torch.utils.data.Dataset so batches can be streamed
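
To make the last two points concrete, here is a minimal sketch of a map-style dataset that builds a label dictionary and returns (data, label index) pairs. TinyCorpus is a hypothetical stand-in written in plain Python; it is not the Ageas implementation, and the sorted class order is an assumption.

```python
class TinyCorpus:
    """Hypothetical stand-in for a corpus: rows of features plus labels."""

    def __init__(self, rows, labels):
        # Assumption: class indices follow the sorted order of the labels.
        classes = sorted(set(labels))
        self.label_dict = dict(enumerate(classes))  # index -> original label
        index_of = {label: i for i, label in self.label_dict.items()}
        self.rows = rows
        self.targets = [index_of[label] for label in labels]

    def __len__(self):
        # Number of cells; a map-style dataset needs this and __getitem__.
        return len(self.rows)

    def __getitem__(self, i):
        # One sample: (features, integer class index).
        return self.rows[i], self.targets[i]


demo = TinyCorpus(
    rows=[[0.0, 1.0], [2.0, 0.5], [1.5, 0.0]],
    labels=["T cell", "B cell", "T cell"],
)
features, target = demo[0]
print(demo.label_dict)
print(len(demo), features, target, demo.label_dict[target])
```

Any object with __len__ and __getitem__ like this can be wrapped by a batching loader, which is what lets a corpus stream batches.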

[4]:

corpus = Multimodal_Corpus(
    adata_path,
    label_key="celltype",
    backed=False,   # True for large files (memory-maps the array)
)
print("Label map:", corpus.label_dict)
print("Cells:", len(corpus), "| Genes:", corpus.adata.n_vars)

Label map: {0: 0, 1: 1}
Cells: 100 | Genes: 20

[5]:

# Access a single sample (returns a (data_tensor, label_tensor) tuple)
data, label = corpus[0]
print(f"data shape: {data.shape}, label: {label} ({corpus.label_dict[label]})")

data shape: (1, 20), label: 0 (0)

3. Repr_Corpus: multiple representation layers

Repr_Corpus can stack multiple representations of the AnnData into one feature tensor, for example the expression matrix ('X') concatenated with a PCA embedding ('X_pca', stored in adata.obsm).

# Example (requires adata.obsm['X_pca'] to exist):
corpus = Repr_Corpus(adata_path, label_key='celltype', layers=['X', 'X_pca'])

When layers=['X'] (default), Repr_Corpus behaves identically to Multimodal_Corpus.
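
The stacking idea can be sketched in plain Python as a per-cell concatenation of feature blocks. stack_representations is a hypothetical function, and joining along the feature axis is an assumption based on the description above rather than a statement about how Repr_Corpus is implemented.

```python
def stack_representations(*blocks):
    """Concatenate several cells x features blocks along the feature axis."""
    n_cells = len(blocks[0])
    if any(len(block) != n_cells for block in blocks):
        raise ValueError("all representations must have the same number of cells")
    stacked = []
    for i in range(n_cells):
        row = []
        for block in blocks:
            row.extend(block[i])  # append this block's features for cell i
        stacked.append(row)
    return stacked


expression = [[1.0, 0.0, 2.0], [0.5, 3.0, 0.0]]  # 2 cells x 3 genes
embedding = [[-0.7, 0.1], [0.9, -0.2]]           # 2 cells x 2 components
features = stack_representations(expression, embedding)
print(len(features), len(features[0]))
```

Each cell ends up with one feature vector whose length is the sum of the block widths, here 3 genes plus 2 components.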

4. K-fold splitting

kfold_random_split returns three parallel lists (training, validation, test). Each list has one entry per fold, and each entry is a sub-corpus. Class stratification and oversampling are both optional.

[6]:

train_list, valid_list, test_list = kfold_random_split(
    corpus,
    n_splits=3,             # 3-fold: each fold uses ~1/3 as test
    valid_fraction=0.1,     # hold out 10% of the training portion for validation
    stratified_test=True,   # preserve class ratio in each test fold
    stratified_valid=True,
    random_seed=42,
)

for i, (tr, va, te) in enumerate(zip(train_list, valid_list, test_list)):
    print(f"Fold {i+1}: train={len(tr)}, valid={len(va)}, test={len(te)}")

Fold 1: train=59, valid=7, test=34
Fold 2: train=60, valid=7, test=33
Fold 3: train=60, valid=7, test=33
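
For intuition, the sketch below shows one way a stratified k-fold split with a validation hold-out can be computed on bare indices. It is a simplified illustration, not the kfold_random_split implementation: stratified_kfold is a hypothetical function, only the test folds are stratified, and its fold sizes need not match the output above.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits, valid_fraction, seed):
    """Return a list of (train, valid, test) index lists, one per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the test folds,
    # so every fold keeps roughly the overall class ratio.
    test_folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            test_folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        rest = [i for j, fold in enumerate(test_folds) if j != k for i in fold]
        rng.shuffle(rest)
        # Validation is carved out of the non-test portion (not stratified here).
        n_valid = round(len(rest) * valid_fraction)
        splits.append((sorted(rest[n_valid:]), sorted(rest[:n_valid]), sorted(test_folds[k])))
    return splits


toy_labels = [0] * 50 + [1] * 50
splits = stratified_kfold(toy_labels, n_splits=3, valid_fraction=0.1, seed=42)
for i, (train, valid, test) in enumerate(splits):
    print(f"Fold {i + 1}: train={len(train)}, valid={len(valid)}, test={len(test)}")
```

Every cell lands in exactly one test fold, and within a fold the three index lists never overlap.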

5. Handling class imbalance with oversampling

Pass oversample_method='repeat' to duplicate minority-class samples in the training folds. The target size is controlled by oversample_by: 'max', 'mean', or 'median'.

[7]:

# Create an imbalanced corpus: keep only the first 10 cells of class 1.
# make_fake_adata stores integer labels (0, 1), so compare against 1.
adata_imbal = make_fake_adata(n_class=2, n_clusters_per_class=1)
class1_idx = adata_imbal.obs[adata_imbal.obs["celltype"] == 1].index
keep_mask = ~adata_imbal.obs.index.isin(class1_idx[10:])
adata_imbal = adata_imbal[keep_mask].copy()
adata_imbal.write_h5ad("cache/ageas_tut02_imbal.h5ad")

imbal_corpus = Multimodal_Corpus("cache/ageas_tut02_imbal.h5ad", label_key="celltype")

labels = [imbal_corpus[i][1] for i in range(len(imbal_corpus))]
print("Counts before:", pd.Series(labels).value_counts().to_dict())

train_over, _, _ = kfold_random_split(
    imbal_corpus,
    n_splits=2,
    oversample_method="repeat",
    oversample_by="max",
    random_seed=0,
)
# Convert each label to a plain int so equal classes are counted under one key.
labels_after = [int(train_over[0][i][1]) for i in range(len(train_over[0]))]
print("Counts after (fold 0 train):", pd.Series(labels_after).value_counts().to_dict())

Seed set to 42

Counts before: {0: 50, 1: 50}
Counts after (fold 0 train): {tensor(0): 24, tensor(1): 13, tensor(1): 11}
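
The 'repeat' strategy can be sketched on bare indices as follows. oversample_repeat is a hypothetical function; rounding the target up and leaving classes that already exceed it untouched are assumptions, not documented Ageas behaviour.

```python
import math
import random
import statistics
from collections import Counter, defaultdict


def oversample_repeat(labels, by="max", seed=0):
    """Return sample indices in which small classes are padded with repeats."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    sizes = [len(indices) for indices in by_class.values()]
    reducers = {"max": max, "mean": statistics.mean, "median": statistics.median}
    target = math.ceil(reducers[by](sizes))  # assumption: round the target up

    picked = []
    for indices in by_class.values():
        picked.extend(indices)  # keep every original sample once
        shortfall = target - len(indices)
        if shortfall > 0:
            # Draw duplicates with replacement from the same class.
            picked.extend(rng.choices(indices, k=shortfall))
    return picked


toy_labels = [0] * 24 + [1] * 5
for method in ("max", "mean", "median"):
    picked = oversample_repeat(toy_labels, by=method)
    print(method, dict(Counter(toy_labels[i] for i in picked)))
```

Oversampling is applied to training indices only, so duplicated cells never leak into the validation or test folds.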

6. Corpus slicing

corpus.copy() deep-copies a corpus so you can safely subset its .adata without touching the original. The iterative operation pipelines use this internally to trim the feature axis between rounds.

[8]:

sub_corpus = corpus.copy()
sub_corpus.adata = sub_corpus.adata[:, :5]   # keep first 5 genes

print(f"Sliced corpus genes : {sub_corpus.adata.n_vars}")
print(f"Original corpus genes: {corpus.adata.n_vars}  (unchanged)")

Sliced corpus genes : 5
Original corpus genes: 20  (unchanged)
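
Why the original stays unchanged can be shown with a small stand-in. ToyCorpus below is hypothetical and stores its table as nested lists; it only illustrates the deep-copy-then-slice pattern and makes no claim about how corpus.copy() is implemented.

```python
import copy


class ToyCorpus:
    """Hypothetical corpus holding a cells x genes table as nested lists."""

    def __init__(self, table):
        self.table = table

    def copy(self):
        # Deep copy: the new object shares no rows with the original.
        return copy.deepcopy(self)


original = ToyCorpus([[1, 2, 3, 4], [5, 6, 7, 8]])
trimmed = original.copy()
trimmed.table = [row[:2] for row in trimmed.table]  # keep the first 2 genes
trimmed.table[0][0] = 99                            # in-place edit on the copy

print(len(trimmed.table[0]), len(original.table[0]))
print(original.table[0][0])
```

A shallow copy would share the inner rows, so an in-place edit on one object could show up in the other; the deep copy avoids that.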