Using complex matrix interpolation techniques from multi-manifold learning to analyze cross-tissue transcriptomic covariance in Multiple Sclerosis will uncover shared regulatory axes driving immune cell infiltration into the CNS.
Adversarial Debate Score
60% survival rate under critique
Model Critiques
Supporting Research Papers
- Machine Learning for analysis of Multiple Sclerosis cross-tissue bulk and single-cell transcriptomics data
Multiple Sclerosis (MS) is a chronic autoimmune disease of the central nervous system whose molecular mechanisms remain incompletely understood. In this study, we developed an end-to-end machine learn...
- Cross-Species Transfer Learning for Electrophysiology-to-Transcriptomics Mapping in Cortical GABAergic Interneurons
Single-cell electrophysiological recordings provide a powerful window into neuronal functional diversity and offer an interpretable route for linking intrinsic physiology to transcriptomic identity. H...
- Complex Interpolation of Matrices with an application to Multi-Manifold Learning
Given two symmetric positive-definite matrices A, B \in \mathbb{R}^{n \times n}, we study the spectral properties of the interpolation A^{1-x} B^x for 0 \leq x \leq 1. The presence of `common structur...
Formal Verification
Z3 checks whether the hypothesis is internally consistent, not whether it is empirically true.
This discovery has a Claude-generated validation package with a full experimental design.
Precise Hypothesis
Multi-manifold learning with complex matrix interpolation applied to paired peripheral blood and CNS tissue transcriptomic data from MS patients will identify ≥3 statistically significant shared regulatory axes (gene co-expression modules) that are enriched for immune cell infiltration signatures, with these axes explaining >15% of cross-tissue transcriptomic covariance and showing differential activity (FDR < 0.05) between MS patients and healthy controls in at least two independent cohorts.
- Fewer than 3 cross-tissue co-expression modules survive FDR correction (q > 0.05) after multi-manifold decomposition.
- Identified regulatory axes explain <5% of cross-tissue transcriptomic covariance (R² < 0.05 in held-out test set).
- No significant enrichment (hypergeometric p > 0.05) of immune infiltration gene sets (MSigDB C7, GO:0002250) in any identified axis.
- Axes identified in discovery cohort fail to replicate in ≥1 independent MS cohort (Pearson r < 0.3 for module eigengene correlation across tissues).
- Complex matrix interpolation performs no better than standard PCA or canonical correlation analysis (CCA) on the same data (ΔAUC < 0.05 on immune infiltration prediction task).
- Identified axes are confounded by batch effects, tissue dissection artifacts, or cell-type composition differences rather than regulatory biology (confirmed by sensitivity analysis).
- Permutation testing (n=1,000 permutations) shows that observed cross-tissue covariance structure is consistent with random expectation (p > 0.05).
Experimental Protocol
Phase 1 — Data Assembly and Preprocessing (Days 1–30): Acquire and harmonize multi-tissue MS transcriptomic datasets. Apply rigorous QC, batch correction (ComBat-seq), and cell-type deconvolution. Phase 2 — Manifold Construction (Days 31–60): Implement multi-manifold learning framework with complex matrix interpolation. Construct tissue-specific manifolds and compute cross-tissue covariance operators. Phase 3 — Regulatory Axis Identification (Days 61–90): Extract shared regulatory axes via joint spectral decomposition. Annotate axes with TF binding enrichment, immune gene sets, and pathway analysis. Phase 4 — Validation (Days 91–130): Validate axes in independent cohort. Compare performance against CCA, MOFA+, and standard PCA baselines. Functional validation via perturbation experiments. Phase 5 — Biological Interpretation (Days 131–150): Causal inference on top axes using Mendelian randomization with MS GWAS data. Prioritize druggable targets.
- MSBioScreen cohort (GEO: GSE193770) — paired PBMC + CSF transcriptomics, n=120 MS patients, n=40 controls.
- UK MS Brain Bank RNA-seq (ArrayExpress: E-MTAB-8316) — CNS white matter lesion bulk RNA-seq, n=215 samples.
- Human Cell Atlas MS dataset (CZI CELLxGENE) — scRNA-seq from CNS and blood, ~500K cells.
- GTEx v8 brain + whole blood paired samples (dbGaP: phs000424) — healthy reference manifold construction, n=838 donors.
- MS GWAS summary statistics (IMSGC 2019, n=47,429 cases) — for Mendelian randomization validation.
- JASPAR 2022 TF binding database — regulatory axis annotation.
- MSigDB v2023.1 (C2, C7, H collections) — gene set enrichment.
- Replication cohort: GEMS study (GEO: GSE41850) — independent blood transcriptomics, n=141 MS patients.
- Reference immune infiltration signatures: LM22 matrix (CIBERSORT), CellChat ligand-receptor database.
- Computational environment: Python 3.10+, R 4.3+, PyTorch 2.0+, scanpy 1.9+, MOFA+ 1.6+.
- ≥3 regulatory axes survive permutation testing (p < 0.01, Bonferroni corrected) — primary endpoint.
- Cross-tissue covariance R² > 0.15 for top axis in held-out test set (5-fold CV).
- ≥1 axis shows significant GSEA enrichment for immune infiltration gene sets (NES > 1.5, FDR q < 0.05).
- Multi-manifold method outperforms best baseline (MOFA+) by ΔAUC > 0.05 on immune infiltration prediction.
- Top axis replicates in GEMS cohort (Spearman r > 0.3, p < 0.05 for axis score vs. EDSS correlation).
- ≥1 axis TF enrichment overlaps with known MS genetic risk loci (IMSGC GWAS p < 5×10⁻⁸) within 500kb.
- Mendelian randomization shows nominally significant causal signal (IVW p < 0.05) for ≥1 axis.
- <3 axes survive permutation testing after Bonferroni correction.
- Best axis explains <5% cross-tissue covariance (R² < 0.05).
- No immune gene set enrichment in any axis (all FDR q > 0.20).
- Multi-manifold method performs equivalently to PCA (ΔAUC < 0.02).
- Zero replication of axes in independent cohort (all Spearman r < 0.15).
-
40% of variance in axes explained by known confounders (age, sex, treatment status) in linear regression.
- Complex matrix interpolation fails to converge or produces non-positive-definite operators for >50% of interpolation values t.
420
GPU hours
150d
Time to result
$4,200
Min cost
$28,500
Full cost
ROI Projection
- Diagnostic: Cross-tissue regulatory axis scores as blood biomarkers for MS disease activity monitoring — addressable market ~$800M globally (MS diagnostics segment growing at 8.2% CAGR).
- Therapeutic target identification: TFs and upstream regulators identified as axes could be targeted by small molecules or ASOs; 3–5 year path to IND filing for top candidate.
- Platform technology: Multi-manifold complex interpolation framework applicable to other neuroinflammatory diseases (NMO, ALS, Alzheimer's) — platform licensing potential $5–20M.
- Computational tool: Open-source release of manifold learning pipeline with MS application could attract pharma partnerships (precedent: Seurat, $0 license but $50M+ in partnership value generated for Satija lab).
- Personalized medicine: Axis-based patient stratification for existing MS therapies (natalizumab, ocrelizumab) — companion diagnostic potential worth $30–80M in co-development deals.
- Data asset: Harmonized multi-tissue MS transcriptomic database created during validation has standalone value for licensing to pharma ($500K–2M one-time or subscription).
🔓 If proven, this unlocks
Proving this hypothesis is a prerequisite for the following downstream discoveries and applications:
- 1MS_DRUGGABLE_TARGET_PRIORITIZATION_010
- 2CNS_IMMUNE_AXIS_PERTURBATION_011
- 3CROSS_DISEASE_MANIFOLD_TRANSFER_012
- 4MS_BIOMARKER_BLOOD_PANEL_013
- 5REGULATORY_NETWORK_CAUSAL_INFERENCE_014
Prerequisites
These must be validated before this hypothesis can be confirmed:
- MS_PAIRED_TRANSCRIPTOMICS_QC_001
- MULTIMANIFOLD_LEARNING_BENCHMARK_002
- CIBERSORT_CNS_VALIDATION_003
- CROSS_TISSUE_COVARIANCE_BASELINE_004
Implementation Sketch
# Multi-Manifold Complex Matrix Interpolation for MS Cross-Tissue Analysis # Architecture Overview import numpy as np import scipy.linalg as la from scipy.sparse import csr_matrix import torch import scanpy as sc from sklearn.preprocessing import StandardScaler # ── STEP 1: DATA LOADING & PREPROCESSING ────────────────────────────────────── class MultiTissueDataLoader: def __init__(self, blood_adata_path, cns_adata_path, paired_sample_ids): self.blood = sc.read_h5ad(blood_adata_path) self.cns = sc.read_h5ad(cns_adata_path) self.paired_ids = paired_sample_ids # list of matched sample IDs def preprocess(self): for adata in [self.blood, self.cns]: sc.pp.filter_genes(adata, min_cells=10) sc.pp.normalize_total(adata, target_sum=1e6) # CPM sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=5000) # Subset to paired samples self.X_blood = self.blood[self.paired_ids].X.toarray() # shape: (n_samples, p_blood) self.X_cns = self.cns[self.paired_ids].X.toarray() # shape: (n_samples, p_cns) return self.X_blood, self.X_cns # ── STEP 2: MANIFOLD CONSTRUCTION ───────────────────────────────────────────── class TissueManifold: def __init__(self, X, n_neighbors=15, sigma=None): self.X = X self.n = X.shape[0] self.k = n_neighbors self.sigma = sigma or self._estimate_bandwidth() def _estimate_bandwidth(self): # 5-NN bandwidth estimation from sklearn.neighbors import NearestNeighbors nbrs = NearestNeighbors(n_neighbors=5).fit(self.X) distances, _ = nbrs.kneighbors(self.X) return np.median(distances[:, -1]) def compute_affinity_matrix(self): from sklearn.metrics import pairwise_distances D = pairwise_distances(self.X) W = np.exp(-D**2 / (2 * self.sigma**2)) np.fill_diagonal(W, 0) return W def compute_diffusion_operator(self): W = self.compute_affinity_matrix() D_inv = np.diag(1.0 / W.sum(axis=1)) P = D_inv @ W # row-stochastic diffusion operator return P # shape: (n, n) # ── STEP 3: COMPLEX MATRIX INTERPOLATION ────────────────────────────────────── class ComplexManifoldInterpolator: """ Interpolates between two diffusion operators using matrix logarithm/exponential in the complex domain to handle non-commuting operators. """ def __init__(self, P_blood, P_cns): self.P_blood = P_blood.astype(complex) self.P_cns = P_cns.astype(complex) self.log_P_blood = la.logm(P_blood + 1e-8 * np.eye(P_blood.shape[0])) self.log_P_cns = la.logm(P_cns + 1e-8 * np.eye(P_cns.shape[0])) def interpolate(self, t): """Geodesic interpolation: P(t) = exp((1-t)*log(P_blood) + t*log(P_cns))""" assert 0 <= t <= 1 log_Pt = (1 - t) * self.log_P_blood + t * self.log_P_cns Pt = la.expm(log_Pt) return np.real(Pt) # take real part; imaginary should be ~0 for valid operators def find_optimal_t(self, t_grid=None, cv_folds=5): """Select t* minimizing cross-tissue reconstruction error via CV""" from sklearn.model_selection import KFold if t_grid is None: t_grid = np.linspace(0.1, 0.9, 9) errors = [] kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42) for t in t_grid: Pt = self.interpolate(t) fold_errors = [] for train_idx, test_idx in kf.split(Pt): # Reconstruction error: how well manifold coords predict cross-tissue expression Pt_train = Pt[np.ix_(train_idx, train_idx)] Pt_test = Pt[np.ix_(test_idx, train_idx)] # Simplified: use Frobenius norm of off-diagonal cross-tissue block fold_errors.append(np.linalg.norm(Pt_test - Pt_test.mean())) errors.append(np.mean(fold_errors)) t_star = t_grid[np.argmin(errors)] return t_star, errors # ── STEP 4: JOINT SPECTRAL DECOMPOSITION ────────────────────────────────────── class RegulatoryAxisExtractor: def __init__(self, Pt_star, X_blood, X_cns, n_axes=25): self.Pt = Pt_star self.X_blood = X_blood self.X_cns = X_cns self.K = n_axes def extract_eigenvectors(self): """Spectral decomposition; select K by eigengap criterion""" eigenvalues, eigenvectors = np.linalg.eigh(self.Pt) # Sort descending idx = np.argsort(eigenvalues)[::-1] eigenvalues = eigenvalues[idx] eigenvectors = eigenvectors[:, idx] # Eigengap criterion gaps = np.diff(eigenvalues[:50]) K_auto = np.argmin(gaps[:30]) + 1 # first large gap K = min(self.K, K_auto) self.manifold_coords = eigenvectors[:, :K] # shape: (n_samples, K) self.eigenvalues = eigenvalues[:K] return self.manifold_coords def map_to_gene_space(self, alpha=0.01): """Ridge regression: gene_loadings = (Z^T Z + alpha I)^{-1} Z^T X""" from sklearn.linear_model import Ridge Z = self.manifold_coords self.blood_loadings = Ridge(alpha=alpha).fit(Z, self.X_blood).coef_ # (K, p_blood) self.cns_loadings = Ridge(alpha=alpha).fit(Z, self.X_cns).coef_ # (K, p_cns) return self.blood_loadings, self.cns_loadings def permutation_test(self, n_permutations=1000): """Test significance of each axis via label permutation""" observed_var = np.var(self.manifold_coords, axis=0) null_vars = np.zeros((n_permutations, self.manifold_coords.shape[1])) for i in range(n_permutations): perm_idx = np.random.permutation(self.Pt.shape[0]) Pt_perm = self.Pt[perm_idx][:, perm_idx] _, evecs_perm = np.linalg.eigh(Pt_perm) null_vars[i] = np.var(evecs_perm[:, -self.manifold_coords.shape[1]:], axis=0) p_values = (null_vars >= observed_var).mean(axis=0) return p_values # Bonferroni correct externally # ── STEP 5: ENRICHMENT & ANNOTATION ─────────────────────────────────────────── class AxisAnnotator: def __init__(self, gene_loadings, gene_names, msigdb_path, jaspar_path): self.loadings = gene_loadings self.genes = gene_names self.msigdb = self._load_msigdb(msigdb_path) self.jaspar = jaspar_path def _load_msigdb(self, path): import json with open(path) as f: return json.load(f) def gsea_per_axis(self, axis_idx, top_n=200): """Run fgsea via rpy2 interface""" import rpy2.robjects as ro ranked_genes = dict(zip(self.genes, self.loadings[axis_idx])) # Call fgsea in R ro.r(f''' library(fgsea) ranks <- c({",".join([f'"{g}"={v}' for g,v in ranked_genes.items()])}) result <- fgsea(pathways=msigdb_genesets, stats=ranks, nperm=10000, minSize=15, maxSize=500) result[padj < 0.05] ''') def compute_cross_tissue_r2(self, blood_scores, cns_scores): """R² between axis scores across tissues""" from sklearn.metrics import r2_score return r2_score(cns_scores, blood_scores) # ── STEP 6: BASELINE COMPARISON ─────────────────────────────────────────────── class BaselineComparison: def run_mofa_plus(self, X_blood, X_cns, n_factors=25): """Run MOFA+ via mofapy2""" from mofapy2.run.entry_point import entry_point ent = entry_point() ent.set_data_matrix([[X_blood, X_cns]]) ent.set_model_options(factors=n_factors) ent.build() ent.run() return ent.model.get_factors() def run_cca(self, X_blood, X_cns, n_components=25): from sklearn.cross_decomposition import CCA cca = CCA(n_components=n_components) cca.fit(X_blood, X_cns) return cca.transform(X_blood, X_cns) def compare_auc(self, method_scores, infiltration_labels): from sklearn.linear_model import LogisticRegressionCV from sklearn.model_selection import cross_val_score clf = LogisticRegressionCV(cv=5, max_iter=1000) auc = cross_val_score(clf, method_scores, infiltration_labels, cv=5, scoring='roc_auc').mean() return auc # ── MAIN PIPELINE ────────────────────────────────────────────────────────────── if __name__ == "__main__": # 1. Load data loader = MultiTissueDataLoader("blood.h5ad", "cns.h5ad", paired_ids) X_blood, X_cns = loader.preprocess() # 2. Build manifolds blood_manifold = TissueManifold(X_blood, n_neighbors=15) cns_manifold = TissueManifold(X_cns, n_neighbors=15) P_blood = blood_manifold.compute_diffusion_operator() P_cns = cns_manifold.compute_diffusion_operator() # 3. Complex interpolation interpolator = ComplexManifoldInterpolator(P_blood, P_cns) t_star, cv_errors = interpolator.find_optimal_t() Pt_star = interpolator.interpolate(t_star) # 4. Extract regulatory axes extractor = RegulatoryAxisExtractor(Pt_star, X_blood, X_cns) manifold_coords = extractor.extract_eigenvectors() blood_loadings, cns_loadings = extractor.map_to_gene_space() p_values = extractor.permutation_test(n_permutations=1000) # 5. Filter significant axes (Bonferroni) K = manifold_coords.shape[1] sig_axes = np.where(p_values < 0.01 / K)[0] print(f"Significant axes: {len(sig_axes)} (threshold: {0.01/K:.4f})") # 6. Annotate annotator = AxisAnnotator(blood_loadings[sig_axes], gene_names, "msigdb_v2023.json", "jaspar2022.meme") for ax in sig_axes: annotator.gsea_per_axis(ax) # 7. Baseline comparison baseline = BaselineComparison() mofa_factors = baseline.run_mofa_plus(X_blood, X_cns) manifold_auc = baseline.compare_auc(manifold_coords[:, sig_axes], infiltration_labels) mofa_auc = baseline.compare_auc(mofa_factors, infiltration_labels) print(f"Manifold AUC: {manifold_auc:.3f} | MOFA+ AUC: {mofa_auc:.3f} | Delta: {manifold_auc-mofa_auc:.3f}") # SUCCESS CHECK assert len(sig_axes) >= 3, "FAIL: <3 significant axes" assert (manifold_auc - mofa_auc) > 0.05, "FAIL: No improvement over MOFA+"
- Day 15 — Data QC checkpoint: If <40 paired MS samples pass QC across all datasets combined, abort and seek additional data sources before proceeding. Threshold: n_paired < 40.
- Day 35 — Manifold quality checkpoint: If KBET batch correction acceptance rate < 0.5 or if manifold intrinsic dimensionality estimate > 50 (suggesting noise dominance), abort manifold approach and pivot to linear methods. Threshold: KBET < 0.
📡 New evidence since EVP generation
Discoveries published after this EVP was written that relate to its hypothesis or downstream unlocks.
- Related