FlashOptim's memory-efficient mixed-precision training can be extended to surrogate models used in amortized optimization, enabling larger surrogate networks on memory-constrained accelerators.
Adversarial Debate Score: 72% survival rate under critique
Model Critiques
Supporting Research Papers
- Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels
To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used a...
- FlashOptim: Optimizers for Memory Efficient Training
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and on...
- Universal Persistent Brownian Motions in Confluent Tissues
Biological tissues are active materials whose non-equilibrium dynamics emerge from distinct cellular force-generating mechanisms. Using a two-dimensional active foam model, we compare the effects of t...
- Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks
The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and ma...
Formal Verification
Z3 checks whether the hypothesis is internally consistent, not whether it is empirically true.
This discovery has a Claude-generated validation package with a full experimental design.
Precise Hypothesis
Applying FlashOptim's mixed-precision training pipeline (combining FP16/BF16 forward passes with FP32 master weights and gradient accumulation) to surrogate neural networks within amortized optimization frameworks will enable training of surrogate models that are ≥2× larger (by parameter count) on a fixed GPU memory budget (e.g., 24 GB VRAM), while maintaining surrogate prediction accuracy within 5% relative error compared to full-precision baselines, and without degrading the quality of the amortized optimizer's output solutions by more than 10% on standard benchmark tasks.
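The memory side of this hypothesis can be sanity-checked with back-of-envelope arithmetic. Vanilla mixed precision does not shrink per-parameter optimizer state (a half-precision weight copy plus an FP32 master and two FP32 Adam moments still totals ~16 bytes/param, same as FP32 AdamW), so the ≥2× headroom must come from halved activations and/or FlashOptim-style compressed optimizer state. The byte layouts below are illustrative assumptions, not figures from the FlashOptim paper:

```python
# Per-parameter memory layouts (bytes). These are assumed layouts for a
# sanity check, not measurements; FlashOptim's actual layout may differ.
FP32_ADAMW = {"weight": 4, "grad": 4, "adam_m": 4, "adam_v": 4}        # 16 B/param
MIXED_STD = {"weight16": 2, "grad16": 2, "master32": 4,
             "adam_m": 4, "adam_v": 4}                                 # 16 B/param
MIXED_8BIT_STATE = {"weight16": 2, "grad16": 2, "master32": 4,
                    "adam_m": 1, "adam_v": 1}                          # 10 B/param

def bytes_per_param(layout):
    return sum(layout.values())

def params_in_budget(budget_gb, layout, activation_overhead=0.5):
    """Parameters that fit if a fixed fraction of the budget goes to activations."""
    usable = budget_gb * (1 - activation_overhead) * 1e9
    return usable / bytes_per_param(layout)

print(bytes_per_param(FP32_ADAMW))        # 16
print(bytes_per_param(MIXED_STD))         # 16 -- vanilla AMP alone saves nothing here
print(bytes_per_param(MIXED_8BIT_STATE))  # 10
```

On this accounting, compressed optimizer state buys a 1.6× parameter budget; the rest of the claimed 2× must come from the activation side, which is exactly what the memory-profiling checkpoints below are designed to verify.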
Disproof Criteria
- ACCURACY DISPROOF: Mixed-precision surrogate achieves >5% higher relative prediction error (RMSE or MAE) compared to FP32 baseline on held-out test sets across ≥3 benchmark tasks.
- SOLUTION QUALITY DISPROOF: Amortized optimizer using mixed-precision surrogate produces solutions with >10% worse objective value (averaged over 100 optimization runs) versus FP32 surrogate baseline.
- MEMORY DISPROOF: Peak GPU memory reduction is <20% compared to FP32 training, making the "larger model" claim untenable (i.e., cannot fit a model ≥1.5× larger).
- STABILITY DISPROOF: Mixed-precision training diverges (loss NaN/Inf) in >30% of training runs across benchmark tasks without recoverable loss scaling.
- OVERHEAD DISPROOF: Wall-clock training time per epoch increases by >15% due to mixed-precision overhead (loss scaling, dtype casting), negating practical benefit.
- GENERALIZATION DISPROOF: The approach fails on ≥2 of 3 tested physics/CS domains (e.g., works for molecular property prediction but fails for PDE surrogate and combinatorial optimization surrogate).
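These six criteria can be mechanized into a single gate over measured metrics, so a run script flags violations automatically. A minimal sketch (all metric key names are hypothetical; "higher is better" is assumed for the optimization objective):

```python
def evaluate_disproofs(m):
    """Return the names of any disproof criteria fired by measured metrics.

    Thresholds mirror the list above; all dict keys are hypothetical names.
    """
    fired = []
    if m["mp_mae"] > 1.05 * m["fp32_mae"]:                        # >5% relative error
        fired.append("ACCURACY")
    if m["mp_objective"] < 0.90 * m["fp32_objective"]:            # >10% worse objective
        fired.append("SOLUTION_QUALITY")
    if 1.0 - m["mp_peak_mem_gb"] / m["fp32_peak_mem_gb"] < 0.20:  # <20% memory saved
        fired.append("MEMORY")
    if m["diverged_runs"] / m["total_runs"] > 0.30:               # >30% NaN/Inf runs
        fired.append("STABILITY")
    if m["mp_epoch_sec"] > 1.15 * m["fp32_epoch_sec"]:            # >15% slower epochs
        fired.append("OVERHEAD")
    if m["failed_domains"] >= 2:                                  # fails 2 of 3 domains
        fired.append("GENERALIZATION")
    return fired
```

The hypothesis survives the MVT only if this returns an empty list.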
Experimental Protocol
Minimum Viable Test (MVT): Train a surrogate model for one amortized optimization benchmark (e.g., molecular property optimization using a GNN surrogate on QM9 or GuacaMol) in three conditions: (A) FP32 baseline, (B) FlashOptim mixed-precision on same model size, (C) FlashOptim mixed-precision on 2× larger model. Compare prediction accuracy, solution quality, memory usage, and training speed. Full validation extends to 3 domains and ablates loss scaling strategies.
Data & Resources
- QM9 molecular property dataset (134k molecules, 19 quantum chemical properties) — freely available via PyTorch Geometric; used for molecular surrogate training.
- GuacaMol or ZINC250k for amortized molecular optimization benchmarks — freely available.
- OpenFOAM or AirfRANS dataset (aerodynamic surrogate, ~10k CFD simulations) — publicly available; used for physics domain validation.
- SATLIB or MIS benchmark instances (combinatorial optimization surrogate) — freely available; used for CS domain validation.
- Pre-trained FlashOptim codebase/weights — available via the FlashOptim GitHub repository (assumed open-source based on naming convention; if proprietary, a clean reimplementation of flash attention + mixed-precision training loop is required, estimated 2 weeks engineering).
- Baseline amortized optimization framework: BayesOpt with neural surrogate, or REINFORCE-based amortized optimizer (e.g., from Bengio et al. 2021 GFlowNet or similar).
Success Criteria
- Peak GPU memory reduction ≥30% (FP32 → mixed-precision, same model size), enabling ≥1.8× larger model on identical hardware. Target: 24 GB GPU fits ≥18M param surrogate vs. ≤10M in FP32.
- Surrogate prediction MAE within 5% relative of FP32 baseline (e.g., if FP32 MAE = 0.100 eV, mixed-precision MAE ≤ 0.105 eV on QM9 HOMO-LUMO gap).
- Amortized optimizer solution quality: best objective within 10% of FP32-surrogate-based optimizer over 100 runs (e.g., if FP32 finds molecules with affinity score 7.2, mixed-precision finds ≥6.5).
- Training stability: loss scale remains >1.0 in ≥90% of training steps across all runs and domains.
- Training throughput: mixed-precision training ≥0.9× wall-clock speed of FP32 (no more than 10% slower; ideally 1.3–1.8× faster due to Tensor Core utilization).
- Results replicate across ≥2 of 3 domains (molecular, aerodynamic, combinatorial).
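The loss-scale stability criterion can be computed directly from a log of loss-scale values, such as the `memory_log` entries that the training loop in the implementation sketch below accumulates. A minimal helper:

```python
def loss_scale_stability(log, threshold=1.0):
    """Fraction of logged entries whose loss scale stayed above `threshold`.

    `log` is a list of dicts each carrying a "loss_scale" key, as produced by
    the training-loop sketch. Success criterion: result >= 0.90 per run.
    """
    if not log:
        return 0.0
    return sum(e["loss_scale"] > threshold for e in log) / len(log)
```

For example, nine healthy entries at scale 65536 and one collapsed entry at 0.5 yield exactly 0.9, the boundary of the criterion.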
Abort Criteria
- Surrogate MAE degrades >5% relative vs. FP32 on any primary benchmark domain.
- Loss scale collapses to <1.0 and training diverges in >2 of 5 seeds for any domain.
- Memory savings <20% (mixed-precision vs. FP32 at same model size), making scale-up impractical.
- 2× larger mixed-precision surrogate performs worse than 1× FP32 surrogate (accuracy regression from scale-up).
- Amortized optimizer solution quality degrades >10% vs. FP32 surrogate baseline.
- Wall-clock time per epoch increases >15% (mixed-precision overhead exceeds benefit).
ROI Projection
- GPU hours: 420
- Time to result: 50 days
- Min cost: $1,200
- Full cost: $4,800
Implementation Sketch
```python
# ============================================================
# FlashOptim Mixed-Precision Surrogate Training — Core Sketch
# ============================================================
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torch_geometric.nn import DimeNetPlusPlus  # example surrogate arch

# --- Configuration ---
CONFIG = {
    "model_size": "2x",          # "1x" (baseline) or "2x" (scale-up test)
    "precision": "mixed_bf16",   # "fp32", "mixed_fp16", "mixed_bf16"
    "loss_scale_init": 65536.0,
    "grad_clip_norm": 1.0,
    "batch_size": 64,
    "lr": 3e-4,
    "epochs": 100,
}

# --- Surrogate Model (GNN for molecular property prediction) ---
class ScalableSurrogate(nn.Module):
    def __init__(self, hidden_dim=256, num_layers=6, scale_factor=1):
        super().__init__()
        # scale_factor=2 doubles hidden_dim -> roughly 4x params (width
        # scaling); use scale_factor ~ sqrt(2) for a strict 2x param count.
        h = hidden_dim * scale_factor
        self.encoder = DimeNetPlusPlus(
            hidden_channels=h,
            out_channels=1,
            num_blocks=num_layers,
            int_emb_size=64,
            basis_emb_size=8,
            out_emb_channels=h,
            num_spherical=7,
            num_radial=6,
        )

    def forward(self, batch):
        return self.encoder(batch.z, batch.pos, batch.batch)

# --- Mixed-Precision Training Loop ---
def train_surrogate(model, train_loader, val_loader, config):
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    # BF16 shares FP32's exponent range, so loss scaling is only needed for FP16.
    scaler = GradScaler(init_scale=config["loss_scale_init"],
                        enabled=(config["precision"] == "mixed_fp16"))
    dtype_map = {
        "fp32": torch.float32,
        "mixed_fp16": torch.float16,
        "mixed_bf16": torch.bfloat16,
    }
    amp_dtype = dtype_map[config["precision"]]
    memory_log = []

    for epoch in range(config["epochs"]):
        model.train()
        for batch in train_loader:
            batch = batch.to("cuda")
            optimizer.zero_grad(set_to_none=True)

            # --- Mixed-precision forward pass ---
            with autocast(dtype=amp_dtype,
                          enabled=(config["precision"] != "fp32")):
                pred = model(batch)
                loss = nn.functional.mse_loss(pred.squeeze(), batch.y)

            # --- Scaled backward pass ---
            scaler.scale(loss).backward()

            # --- Gradient clipping (unscale first) ---
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(),
                                           config["grad_clip_norm"])

            # --- Optimizer step with loss-scale update ---
            scaler.step(optimizer)
            scaler.update()

        # --- Memory profiling (per epoch) ---
        memory_log.append({
            "epoch": epoch,
            "allocated_gb": torch.cuda.memory_allocated() / 1e9,
            "reserved_gb": torch.cuda.memory_reserved() / 1e9,
            "loss_scale": scaler.get_scale(),
        })

        # --- Validation ---
        val_mae = evaluate(model, val_loader, amp_dtype)

        # --- Abort checkpoint: loss scale collapse ---
        if scaler.get_scale() < 1.0:
            print(f"ABORT: Loss scale collapsed at epoch {epoch}")
            return None, memory_log

        print(f"Epoch {epoch}: MAE={val_mae:.4f}, "
              f"Mem={memory_log[-1]['allocated_gb']:.2f}GB, "
              f"Scale={scaler.get_scale():.0f}")

    return model, memory_log

# --- Amortized Optimization Integration ---
class AmortizedOptimizer(nn.Module):
    """Learned proposal network using the surrogate as a reward model."""

    def __init__(self, surrogate, proposal_net, lr=1e-4):
        super().__init__()
        self.surrogate = surrogate
        self.proposal_net = proposal_net
        self.opt = torch.optim.Adam(proposal_net.parameters(), lr=lr)

    def optimize(self, n_steps=100, n_samples=64):
        solutions = []
        for step in range(n_steps):
            # Sample candidates from the proposal network; it is assumed to
            # also return the log-probabilities of the samples.
            candidates, log_probs = self.proposal_net.sample(n_samples)

            # Evaluate with the surrogate (mixed-precision, no grad needed)
            with torch.no_grad(), autocast(dtype=torch.bfloat16):
                scores = self.surrogate(candidates).squeeze()

            # REINFORCE update: gradients flow through log_probs, with the
            # surrogate scores acting as detached rewards (backward through
            # the no_grad scores themselves would fail).
            self.opt.zero_grad(set_to_none=True)
            loss = -(log_probs * scores.float()).mean()  # maximize surrogate score
            loss.backward()
            self.opt.step()

            best_idx = scores.argmax()
            solutions.append((candidates[best_idx], scores[best_idx].item()))
        return solutions

# --- Evaluation ---
@torch.no_grad()
def evaluate(model, loader, amp_dtype):
    model.eval()
    total_mae, n = 0.0, 0
    for batch in loader:
        batch = batch.to("cuda")
        with autocast(dtype=amp_dtype, enabled=(amp_dtype != torch.float32)):
            pred = model(batch).squeeze()
        total_mae += (pred - batch.y).abs().sum().item()
        n += batch.num_graphs
    return total_mae / n

# --- Memory Comparison Experiment ---
def run_memory_comparison(train_loader, val_loader):
    results = {}
    for precision in ["fp32", "mixed_bf16"]:
        for scale in [1, 2]:
            if precision == "fp32" and scale == 2:
                # Expected OOM — confirm and skip
                try:
                    model = ScalableSurrogate(scale_factor=scale).cuda()
                    # ... attempt training ...
                    results[f"{precision}_scale{scale}"] = "OOM"
                except torch.cuda.OutOfMemoryError:
                    results[f"{precision}_scale{scale}"] = "OOM_confirmed"
                continue
            model = ScalableSurrogate(scale_factor=scale).cuda()
            trained_model, mem_log = train_surrogate(
                model, train_loader, val_loader,
                {**CONFIG, "precision": precision},
            )
            peak_mem = max(e["allocated_gb"] for e in mem_log)
            results[f"{precision}_scale{scale}"] = {
                "peak_mem_gb": peak_mem,
                "final_mae": evaluate(
                    trained_model, val_loader,
                    torch.bfloat16 if "bf16" in precision else torch.float32,
                ),
            }
    return results
```
Checkpoints
- CHECKPOINT A (Day 7, end of baseline implementation): If FP32 surrogate baseline does not achieve published QM9 MAE within 15% (e.g., HOMO-LUMO gap MAE >0.5 eV for DimeNet++), abort and debug data pipeline before proceeding. Expected baseline: ~0.044 eV MAE.
- CHECKPOINT B (Day 12, end of mixed-precision integration): If loss scale collapses below 1.0 in >3 of 5 seeds during initial 20-epoch test run, abort mixed-precision approach and investigate BF16 alternative or gradient clipping tuning. Do not proceed to scale-up.
- CHECKPOINT C (Day 14, memory profiling): If measured memory reduction is <15% (mixed-precision vs. FP32, same model), the core memory efficiency claim is likely false. Abort scale-up experiment; investigate whether surrogate architecture has memory bottleneck outside of weight storage (e.g., large activation maps).
- CHECKPOINT D (Day 20, scale-up experiment): If 2× mixed-precision surrogate MAE is >10% worse than 1× FP32 baseline (not just 5%), abort domain replication and focus on diagnosing capacity vs. precision tradeoff. Do not generalize claim.
- CHECKPOINT E (Day 28, amortized optimization integration): If solution quality from mixed-precision surrogate is >15% worse than FP32 surrogate baseline on molecular optimization (measured by top-10% average score over 100 runs), abort domain replication. The surrogate accuracy degradation is propagating to optimization quality.
- CHECKPOINT F (Day 40, domain replication): If the approach fails on 2 of 3 domains (molecular, aerodynamic, combinatorial), downgrade hypothesis from "general extension" to "domain-specific result" and revise claims accordingly before publication.
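The six checkpoints above can be encoded as an ordered gate so that a run script fails fast at the first violated condition. A minimal sketch; the metric key names are hypothetical, and thresholds mirror the checkpoint list:

```python
# (name, day, pass-predicate, action on failure). Keys and actions are
# hypothetical shorthand for the checkpoint descriptions above.
CHECKPOINTS = [
    ("A", 7, lambda m: m["fp32_mae_ev"] <= 0.044 * 1.15, "debug data pipeline"),
    ("B", 12, lambda m: m["collapsed_seeds"] <= 3, "try BF16 / tune grad clipping"),
    ("C", 14, lambda m: m["mem_reduction"] >= 0.15, "profile activation memory"),
    ("D", 20, lambda m: m["mae_2x_over_fp32"] <= 1.10, "diagnose capacity vs. precision"),
    ("E", 28, lambda m: m["sol_quality_ratio"] >= 0.85, "stop: surrogate error propagates"),
    ("F", 40, lambda m: m["domains_passed"] >= 2, "downgrade to domain-specific claim"),
]

def first_failed_checkpoint(metrics):
    """Return (name, day, action) for the first failed checkpoint, else None."""
    for name, day, passes, action in CHECKPOINTS:
        if not passes(metrics):
            return name, day, action
    return None
```

Evaluating the gate after each milestone keeps the abort logic explicit and auditable rather than buried in per-experiment judgment calls.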