Tutorial 02 — Data preparation and corpus construction

Ageas consumes AnnData (.h5ad) files through its corpus classes. This tutorial explains the expected AnnData layout, demonstrates how to build and inspect a Multimodal_Corpus, and shows the built-in k-fold splitter.

Topics:

  • Required AnnData structure (obs labels, var gene names, X values)

  • Multimodal_Corpus vs Repr_Corpus

  • kfold_random_split: training / validation / test splits

  • Oversampling with 'repeat' to balance classes

  • Inspecting batches and slicing the corpus

[1]:
import warnings

import numpy as np
import pandas as pd

from ageas.tool import (
    Multimodal_Corpus,
    Repr_Corpus,
    kfold_random_split,
    make_fake_adata,
)

warnings.filterwarnings("ignore")

1. AnnData layout required by Ageas

Slot   Key         Content
----   ---------   ------------------------------------------------------------------
obs    label_key   Cell-type / condition string
var    'name'      Gene symbol (needed by use_gene_names=True)
X      -           Non-negative expression matrix (cells × genes), already normalised

make_fake_adata creates a compliant toy dataset:

[2]:
adata = make_fake_adata(n_class=2, n_clusters_per_class=2)
print(adata)
print("\nobs (first 5):")
display(adata.obs.head())
print("\nvar (first 5):")
display(adata.var.head())
Seed set to 42
AnnData object with n_obs × n_vars = 100 × 20
    obs: 'celltype'
    var: 'name'

obs (first 5):
celltype
fake_cell_0 0
fake_cell_1 0
fake_cell_2 1
fake_cell_3 1
fake_cell_4 0

var (first 5):
name
fake_gene_0 fake_gene_0
fake_gene_1 fake_gene_1
fake_gene_2 fake_gene_2
fake_gene_3 fake_gene_3
fake_gene_4 fake_gene_4

If you are bringing your own data, make sure:

  • adata.X contains non-negative values (counts or log1p-normalised counts)

  • adata.obs[label_key] exists and is a string or integer column

  • (Optional) adata.var['name'] holds human-readable gene symbols
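The checklist above can be turned into a quick pre-flight check. The sketch below is only an illustration: `check_layout` is a hypothetical helper, not part of Ageas, and it assumes a dense NumPy matrix, so a sparse `adata.X` would need converting first. It takes the three pieces you would pull from your AnnData object (`adata.X`, `adata.obs`, `adata.var`) and reports anything that breaks the layout.

```python
import numpy as np
import pandas as pd


def check_layout(X, obs, var, label_key):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper. X is assumed to be a dense NumPy array;
    obs and var are pandas DataFrames.
    """
    problems = []
    if X.shape != (len(obs), len(var)):
        problems.append(f"X has shape {X.shape}, expected {(len(obs), len(var))}")
    if np.isnan(X).any():
        problems.append("X contains NaN")
    elif (X < 0).any():
        problems.append("X contains negative values")
    if label_key not in obs.columns:
        problems.append(f"obs has no column {label_key!r}")
    elif obs[label_key].isna().any():
        problems.append(f"obs[{label_key!r}] has missing labels")
    if "name" not in var.columns:
        problems.append("var has no 'name' column (optional, gene symbols)")
    return problems


# Toy stand-ins for adata.X, adata.obs and adata.var.
X = np.array([[0.0, 1.5, 2.0], [3.0, 0.0, -1.0]])
obs = pd.DataFrame({"celltype": ["T", "B"]}, index=["cell_0", "cell_1"])
var = pd.DataFrame(index=["g0", "g1", "g2"])

report = check_layout(X, obs, var, label_key="celltype")
for line in report:
    print(line)
```

On the toy input this flags the negative entry and the missing 'name' column. With real data you would call it as `check_layout(adata.X, adata.obs, adata.var, "celltype")`.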

[3]:
# Save to disk — corpora are always constructed from file paths.
adata_path = "/tmp/ageas_tut02.h5ad"
adata.write_h5ad(adata_path)

2. Multimodal_Corpus

Multimodal_Corpus is the standard corpus class. It:

  • reads the .h5ad file (optionally memory-mapped with backed=True)

  • builds a label_dict mapping integer index → original label value (string or integer, as stored in obs)

  • implements torch.utils.data.Dataset so batches can be streamed

[4]:
corpus = Multimodal_Corpus(
    adata_path,
    label_key="celltype",
    backed=False,   # True for large files (memory-maps the array)
)
print("Label map:", corpus.label_dict)
print("Cells:", len(corpus), "| Genes:", corpus.adata.n_vars)
Label map: {0: 0, 1: 1}
Cells: 100 | Genes: 20
[5]:
# Access a single sample; indexing returns a (data, label) pair
data, label = corpus[0]
print(f"data shape: {data.shape}, label: {label} ({corpus.label_dict[label]})")
data shape: (1, 20), label: 0 (0)
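To make the dataset behaviour concrete, here is a minimal sketch of a map-style dataset with a label map. It is not the Ageas implementation: `ToyCorpus` and `iter_batches` are hypothetical names, and plain NumPy stands in for torch so the example runs without extra dependencies. In practice a torch DataLoader would presumably handle the batching.

```python
import numpy as np


class ToyCorpus:
    """Hypothetical stand-in for a corpus; not the Ageas implementation."""

    def __init__(self, X, labels):
        self.X = np.asarray(X, dtype=np.float32)
        # Sorted unique labels give a stable index -> label mapping.
        self.label_dict = dict(enumerate(sorted(set(labels))))
        to_index = {label: i for i, label in self.label_dict.items()}
        self.y = np.array([to_index[label] for label in labels])

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        # Keep a leading axis of size 1, as in the (1, 20) shape printed above.
        return self.X[i : i + 1], int(self.y[i])


def iter_batches(corpus, batch_size):
    """Yield (data, labels) batches by stacking consecutive samples."""
    for start in range(0, len(corpus), batch_size):
        stop = min(start + batch_size, len(corpus))
        items = [corpus[i] for i in range(start, stop)]
        data = np.concatenate([d for d, _ in items], axis=0)
        labels = np.array([lab for _, lab in items])
        yield data, labels


toy = ToyCorpus(np.arange(12).reshape(4, 3), ["T", "B", "T", "NK"])
print(toy.label_dict)
for data, labels in iter_batches(toy, batch_size=3):
    print(data.shape, labels)
```

The two methods `__len__` and `__getitem__` are all a map-style dataset needs; everything else (shuffling, batching, collation) can be layered on top.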

3. Repr_Corpus: multiple representation layers

Repr_Corpus can stack several AnnData representations into one feature tensor, for example the expression matrix ('X') concatenated with a PCA embedding ('X_pca').

# Example (requires adata.obsm['X_pca'] to exist):
corpus = Repr_Corpus(adata_path, label_key='celltype', layers=['X', 'X_pca'])

When layers=['X'] (default), Repr_Corpus behaves identically to Multimodal_Corpus.
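As a rough picture of what stacking means, the sketch below concatenates two arrays along the feature axis. It assumes that this is how the layers are combined; `stack_layers` is a hypothetical helper, and a plain dict stands in for the AnnData slots (`adata.X`, `adata.obsm['X_pca']`).

```python
import numpy as np


def stack_layers(sources, layers):
    """Concatenate the named 2-D arrays along the feature axis.

    `sources` is a plain dict standing in for the AnnData slots,
    for example {"X": adata.X, "X_pca": adata.obsm["X_pca"]}.
    """
    blocks = [np.asarray(sources[name], dtype=np.float32) for name in layers]
    n_cells = {block.shape[0] for block in blocks}
    if len(n_cells) != 1:
        raise ValueError(f"layers disagree on the number of cells: {sorted(n_cells)}")
    return np.concatenate(blocks, axis=1)


rng = np.random.default_rng(0)
sources = {
    "X": rng.random((100, 20)),     # expression matrix, cells x genes
    "X_pca": rng.random((100, 5)),  # hypothetical 5-component embedding
}
features = stack_layers(sources, ["X", "X_pca"])
print(features.shape)  # (100, 25)
```

Each cell then carries 20 expression values followed by 5 embedding coordinates, so a downstream model sees a single 25-dimensional feature vector.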

4. K-fold splitting

kfold_random_split returns three parallel lists (training, validation, test), each holding one sub-corpus per fold. It can optionally stratify the splits by class and oversample the training folds.

[6]:
train_list, valid_list, test_list = kfold_random_split(
    corpus,
    n_splits=3,             # 3-fold: each fold uses ~1/3 as test
    valid_fraction=0.1,     # hold out 10% of the training portion for validation
    stratified_test=True,   # preserve class ratio in each test fold
    stratified_valid=True,
    random_seed=42,
)

for i, (tr, va, te) in enumerate(zip(train_list, valid_list, test_list)):
    print(f"Fold {i+1}: train={len(tr)}, valid={len(va)}, test={len(te)}")
Fold 1: train=59, valid=7, test=34
Fold 2: train=60, valid=7, test=33
Fold 3: train=60, valid=7, test=33
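The splitting logic can be sketched on plain index arrays. This is an illustration under stated assumptions, not the code behind kfold_random_split: `kfold_indices` is a hypothetical function, only the test folds are stratified, and the exact fold sizes may differ by a sample or two from the output above.

```python
import numpy as np


def kfold_indices(labels, n_splits, valid_fraction, seed):
    """Return a list of (train, valid, test) index arrays, one per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    # Stratified test folds: shuffle each class, then deal it into n_splits chunks.
    test_folds = [[] for _ in range(n_splits)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for k, chunk in enumerate(np.array_split(idx, n_splits)):
            test_folds[k].extend(chunk.tolist())

    folds = []
    all_idx = np.arange(len(labels))
    for test in test_folds:
        test = np.array(sorted(test))
        rest = rng.permutation(np.setdiff1d(all_idx, test))
        # Validation is a plain random slice here (not stratified, for brevity).
        n_valid = int(round(valid_fraction * len(rest)))
        folds.append((np.sort(rest[n_valid:]), np.sort(rest[:n_valid]), test))
    return folds


labels = [0] * 50 + [1] * 50
folds = kfold_indices(labels, n_splits=3, valid_fraction=0.1, seed=42)
for i, (tr, va, te) in enumerate(folds):
    print(f"Fold {i + 1}: train={len(tr)}, valid={len(va)}, test={len(te)}")
```

Every cell lands in exactly one test fold, and within a fold the training, validation, and test indices never overlap, which is the property to verify whenever you write or swap in your own splitter.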

5. Handling class imbalance with oversampling

Pass oversample_method='repeat' to duplicate minority-class samples in the training folds. The target size is controlled by oversample_by: 'max', 'mean', or 'median'.

[7]:
# Create an imbalanced corpus (class 1 keeps only 10 samples)
adata_imbal = make_fake_adata(n_class=2, n_clusters_per_class=1)
# Compare as strings so the mask matches whether labels are stored as 1 or "1".
is_class1 = adata_imbal.obs["celltype"].astype(str) == "1"
class1_idx = adata_imbal.obs[is_class1].index
keep_mask = ~adata_imbal.obs.index.isin(class1_idx[10:])
adata_imbal = adata_imbal[keep_mask].copy()
adata_imbal.write_h5ad("cache/ageas_tut02_imbal.h5ad")

imbal_corpus = Multimodal_Corpus("cache/ageas_tut02_imbal.h5ad", label_key="celltype")

# int() turns tensor labels into plain ints so that equal labels share one key.
labels = [int(imbal_corpus[i][1]) for i in range(len(imbal_corpus))]
# Class 1 should now show only 10 samples.
print("Counts before:", pd.Series(labels).value_counts().to_dict())

train_over, _, _ = kfold_random_split(
    imbal_corpus,
    n_splits=2,
    oversample_method="repeat",
    oversample_by="max",
    random_seed=0,
)
labels_after = [int(train_over[0][i][1]) for i in range(len(train_over[0]))]
# With oversample_by="max", the minority class is repeated up to the majority count.
print("Counts after (fold 0 train):", pd.Series(labels_after).value_counts().to_dict())
Seed set to 42
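The 'repeat' strategy itself is easy to sketch on a label array. The function below is a hypothetical illustration, not the Ageas code: it assumes the target size is the max, mean, or median of the per-class counts, and that classes already at or above the target are left untouched.

```python
import numpy as np


def oversample_repeat(labels, by="max"):
    """Return sample indices in which small classes are repeated up to a target size."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    reducers = {"max": np.max, "mean": np.mean, "median": np.median}
    target = int(round(float(reducers[by](counts))))

    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        if count < target:
            # np.resize cycles through idx until `target` entries are filled.
            idx = np.resize(idx, target)
        picked.append(idx)
    return np.concatenate(picked)


labels = np.array([0] * 50 + [1] * 10)
for by in ("max", "mean", "median"):
    idx = oversample_repeat(labels, by=by)
    classes, counts = np.unique(labels[idx], return_counts=True)
    print(by, dict(zip(classes.tolist(), counts.tolist())))
# max    {0: 50, 1: 50}
# mean   {0: 50, 1: 30}
# median {0: 50, 1: 30}
```

Because the duplicates are exact copies, oversampling should only ever be applied to the training portion; repeated cells in a validation or test fold would inflate the metrics.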

6. Corpus slicing

corpus.copy() deep-copies a corpus so you can safely subset its .adata without touching the original. The iterative operation pipelines use this internally to trim the feature axis between rounds.

[8]:
sub_corpus = corpus.copy()
sub_corpus.adata = sub_corpus.adata[:, :5]   # keep first 5 genes

print(f"Sliced corpus genes : {sub_corpus.adata.n_vars}")
print(f"Original corpus genes: {corpus.adata.n_vars}  (unchanged)")
Sliced corpus genes : 5
Original corpus genes: 20  (unchanged)
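Why the copy matters can be shown with a generic Python object. The sketch below uses a hypothetical `ToyCorpus` container and `copy.deepcopy`; it assumes corpus.copy() behaves like a deep copy, as described above, and contrasts it with a plain assignment, which only creates a second name for the same object.

```python
import copy

import numpy as np


class ToyCorpus:
    """Hypothetical container with a single array attribute."""

    def __init__(self, matrix):
        self.matrix = matrix


original = ToyCorpus(np.ones((100, 20)))

alias = original                 # same object, no copy
clone = copy.deepcopy(original)  # independent object and array

clone.matrix = clone.matrix[:, :5]  # trim the feature axis on the clone only
shape_after_clone_edit = original.matrix.shape
print(shape_after_clone_edit, clone.matrix.shape)  # (100, 20) (100, 5)

alias.matrix = alias.matrix[:, :5]  # this also changes `original`
print(original.matrix.shape)  # (100, 5)
```

Slicing through the alias silently trims the original, which is the failure mode a copy guards against when features are pruned between rounds.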