{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 02 — Data preparation and corpus construction\n", "\n", "Ageas consumes AnnData (`.h5ad`) files through its corpus classes. This\n", "tutorial explains the expected AnnData layout, demonstrates how to build and\n", "inspect a `Multimodal_Corpus`, and shows the built-in k-fold splitter.\n", "\n", "**Topics:**\n", "- Required AnnData structure (obs labels, var gene names, X values)\n", "- `Multimodal_Corpus` vs `Repr_Corpus`\n", "- `kfold_random_split`: training / validation / test splits\n", "- Oversampling with `'repeat'` to balance classes\n", "- Inspecting batches and slicing the corpus" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from ageas.tool import (\n", " Multimodal_Corpus,\n", " Repr_Corpus,\n", " kfold_random_split,\n", " make_fake_adata,\n", ")\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. AnnData layout required by Ageas\n", "\n", "| Slot | Key | Content |\n", "|------|-----|---------|\n", "| `obs` | `label_key` | Cell-type / condition string |\n", "| `var` | `'name'` | Gene symbol (needed by `use_gene_names=True`) |\n", "| `X` | — | Non-negative expression matrix (cells × genes), already normalised |\n", "\n", "`make_fake_adata` creates a compliant toy dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Seed set to 42\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "AnnData object with n_obs × n_vars = 100 × 20\n", " obs: 'celltype'\n", " var: 'name'\n", "\n", "obs (first 5):\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>celltype</th></tr></thead>\n", "<tbody>\n", "<tr><th>fake_cell_0</th><td>0</td></tr>\n", "<tr><th>fake_cell_1</th><td>0</td></tr>\n", "<tr><th>fake_cell_2</th><td>1</td></tr>\n", "<tr><th>fake_cell_3</th><td>1</td></tr>\n", "<tr><th>fake_cell_4</th><td>0</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " celltype\n", "fake_cell_0 0\n", "fake_cell_1 0\n", "fake_cell_2 1\n", "fake_cell_3 1\n", "fake_cell_4 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "var (first 5):\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>name</th></tr></thead>\n", "<tbody>\n", "<tr><th>fake_gene_0</th><td>fake_gene_0</td></tr>\n", "<tr><th>fake_gene_1</th><td>fake_gene_1</td></tr>\n", "<tr><th>fake_gene_2</th><td>fake_gene_2</td></tr>\n", "<tr><th>fake_gene_3</th><td>fake_gene_3</td></tr>\n", "<tr><th>fake_gene_4</th><td>fake_gene_4</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " name\n", "fake_gene_0 fake_gene_0\n", "fake_gene_1 fake_gene_1\n", "fake_gene_2 fake_gene_2\n", "fake_gene_3 fake_gene_3\n", "fake_gene_4 fake_gene_4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "adata = make_fake_adata(n_class=2, n_clusters_per_class=2)\n", "print(adata)\n", "print(\"\\nobs (first 5):\")\n", "display(adata.obs.head())\n", "print(\"\\nvar (first 5):\")\n", "display(adata.var.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are bringing your own data, make sure:\n", "- `adata.X` contains non-negative values (counts or log1p-normalised counts)\n", "- `adata.obs[label_key]` exists and is a string or integer column\n", "- (Optional) `adata.var['name']` holds human-readable gene symbols" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Save to disk — corpora are always constructed from file paths.\n", "adata_path = \"/tmp/ageas_tut02.h5ad\"\n", "adata.write_h5ad(adata_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Multimodal_Corpus\n", "\n", "`Multimodal_Corpus` is the standard corpus class. 
It:\n", "- reads the `.h5ad` file (optionally memory-mapped with `backed=True`)\n", "- builds a `label_dict` mapping integer index → string label\n", "- implements `torch.utils.data.Dataset` so batches can be streamed" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Label map: {0: 0, 1: 1}\n", "Cells: 100 | Genes: 20\n" ] } ], "source": [ "corpus = Multimodal_Corpus(\n", " adata_path,\n", " label_key=\"celltype\",\n", " backed=False, # True for large files (memory-maps the array)\n", ")\n", "print(\"Label map:\", corpus.label_dict)\n", "print(\"Cells:\", len(corpus), \"| Genes:\", corpus.adata.n_vars)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data shape: (1, 20), label: 0 (0)\n" ] } ], "source": [ "# Access a single sample (returns a (data_tensor, label_tensor) tuple)\n", "data, label = corpus[0]\n", "print(f\"data shape: {data.shape}, label: {label} ({corpus.label_dict[label]})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Repr_Corpus: multiple representation layers\n", "\n", "`Repr_Corpus` can stack multiple layers of the AnnData into one feature\n", "tensor — for example raw counts (`'X'`) concatenated with a PCA embedding\n", "(`'X_pca'`).\n", "\n", "```python\n", "# Example (requires adata.obsm['X_pca'] to exist):\n", "corpus = Repr_Corpus(adata_path, label_key='celltype', layers=['X', 'X_pca'])\n", "```\n", "\n", "When `layers=['X']` (default), `Repr_Corpus` behaves identically to\n", "`Multimodal_Corpus`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. K-fold splitting\n", "\n", "`kfold_random_split` returns three parallel lists — one entry per fold —\n", "each containing a sub-corpus. It handles optional class stratification and\n", "oversampling." 
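, "\n", "\n", "To make the fold arithmetic concrete, the following plain-Python sketch builds a stratified k-fold split with a validation hold-out. It is an illustration only, not Ageas code: the helper name `sketch_stratified_kfold` is hypothetical, it works on bare index lists rather than corpora, and the real `kfold_random_split` may assign samples differently, so fold sizes need not match exactly.

```python
import random
from collections import defaultdict


def sketch_stratified_kfold(labels, n_splits=3, valid_fraction=0.1, seed=42):
    '''Return one (train, valid, test) tuple of index lists per fold.'''
    rng = random.Random(seed)

    # Group sample indices by class, then shuffle within each class.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    for members in by_class.values():
        rng.shuffle(members)

    folds = []
    for k in range(n_splits):
        train, valid, test = [], [], []
        for members in by_class.values():
            # Every n_splits-th member of the class goes to this fold's test set,
            # so each test fold keeps roughly the overall class ratio.
            test_part = members[k::n_splits]
            held_out = set(test_part)
            rest = [i for i in members if i not in held_out]
            # A fraction of the remaining samples is held out for validation.
            n_valid = round(len(rest) * valid_fraction)
            test.extend(test_part)
            valid.extend(rest[:n_valid])
            train.extend(rest[n_valid:])
        folds.append((train, valid, test))
    return folds


labels = [0] * 50 + [1] * 50
folds = sketch_stratified_kfold(labels)
for i, (tr, va, te) in enumerate(folds):
    print(f'Fold {i + 1}: train={len(tr)}, valid={len(va)}, test={len(te)}')
```

Within a fold the three index lists are disjoint and together cover every sample, and across folds each sample lands in exactly one test set."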
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold 1: train=59, valid=7, test=34\n", "Fold 2: train=60, valid=7, test=33\n", "Fold 3: train=60, valid=7, test=33\n" ] } ], "source": [ "train_list, valid_list, test_list = kfold_random_split(\n", " corpus,\n", " n_splits=3, # 3-fold: each fold uses ~1/3 as test\n", " valid_fraction=0.1, # hold out 10% of the training portion for validation\n", " stratified_test=True, # preserve class ratio in each test fold\n", " stratified_valid=True,\n", " random_seed=42,\n", ")\n", "\n", "for i, (tr, va, te) in enumerate(zip(train_list, valid_list, test_list)):\n", " print(f\"Fold {i+1}: train={len(tr)}, valid={len(va)}, test={len(te)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Handling class imbalance with oversampling\n", "\n", "Pass `oversample_method='repeat'` to duplicate minority-class samples in\n", "the training folds. The target size is controlled by `oversample_by`:\n", "`'max'`, `'mean'`, or `'median'`." 
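, "\n", "\n", "The idea behind repeat-style oversampling can be sketched in plain Python. This is an illustration under stated assumptions, not the Ageas implementation: the helper name `sketch_repeat_oversample` is hypothetical, and it assumes that `'max'`, `'mean'` and `'median'` refer to the largest, mean and median class size and that larger classes are left untouched.

```python
import random
import statistics
from collections import Counter, defaultdict


def sketch_repeat_oversample(labels, by='max', seed=0):
    '''Return sample indices in which small classes are repeated up to a target size.'''
    rng = random.Random(seed)

    # Group sample indices by class.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    sizes = [len(members) for members in by_class.values()]
    # Assumed meaning of the three options: largest, mean, or median class size.
    target = {
        'max': max(sizes),
        'mean': round(statistics.mean(sizes)),
        'median': round(statistics.median(sizes)),
    }[by]

    picked = []
    for members in by_class.values():
        picked.extend(members)
        # Draw extra copies (with replacement) only for classes below the target.
        n_extra = max(0, target - len(members))
        picked.extend(rng.choices(members, k=n_extra))
    return picked


labels = [0] * 50 + [1] * 10
picked = sketch_repeat_oversample(labels, by='max')
print(Counter(labels[i] for i in picked))
```

With `by='max'` both classes end up with 50 entries. Oversampling belongs on the training portion only, because repeated samples in a validation or test set would inflate the scores."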
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Seed set to 42\n" ] } ], "source": [ "# Create an imbalanced corpus (keep only 10 samples of class 1).\n", "# The toy labels are the integers 0 and 1, so compare against 1, not a string.\n", "adata_imbal = make_fake_adata(n_class=2, n_clusters_per_class=1)\n", "class1_idx = adata_imbal.obs[adata_imbal.obs[\"celltype\"] == 1].index\n", "keep_mask = ~adata_imbal.obs.index.isin(class1_idx[10:])\n", "adata_imbal = adata_imbal[keep_mask].copy()\n", "adata_imbal.write_h5ad(\"cache/ageas_tut02_imbal.h5ad\")\n", "\n", "imbal_corpus = Multimodal_Corpus(\"cache/ageas_tut02_imbal.h5ad\", label_key=\"celltype\")\n", "\n", "# Cast labels to int so tensor labels are counted by value, not by object identity.\n", "labels = [int(imbal_corpus[i][1]) for i in range(len(imbal_corpus))]\n", "print(\"Counts before:\", pd.Series(labels).value_counts().to_dict())\n", "\n", "train_over, _, _ = kfold_random_split(\n", "    imbal_corpus,\n", "    n_splits=2,\n", "    oversample_method=\"repeat\",\n", "    oversample_by=\"max\",\n", "    random_seed=0,\n", ")\n", "labels_after = [int(train_over[0][i][1]) for i in range(len(train_over[0]))]\n", "print(\"Counts after (fold 0 train):\", pd.Series(labels_after).value_counts().to_dict())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Corpus slicing\n", "\n", "`corpus.copy()` deep-copies a corpus so you can safely subset its `.adata`\n", "without touching the original. The iterative operation pipelines use this\n", "internally to trim the feature axis between rounds." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sliced corpus genes : 5\n", "Original corpus genes: 20 (unchanged)\n" ] } ], "source": [ "sub_corpus = corpus.copy()\n", "sub_corpus.adata = sub_corpus.adata[:, :5] # keep first 5 genes\n", "\n", "print(f\"Sliced corpus genes : {sub_corpus.adata.n_vars}\")\n", "print(f\"Original corpus genes: {corpus.adata.n_vars} (unchanged)\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }