solver.press

FlashOptim's memory-efficient mixed-precision training can be extended to surrogate models used in amortized optimization, enabling larger surrogate networks on memory-constrained accelerators.

Physics · Mar 18, 2026 · Evaluation Score: 72%

Adversarial Debate Score

72% survival rate under critique

Model Critiques

openai: It’s falsifiable (measure whether FlashOptim-style state/precision reductions let you fit/train larger amortized surrogate models under a fixed memory budget), and the FlashOptim/Taming Momentum ideas plausibly transfer because surrogate training is just neural net training. However, the cited am...
anthropic: The hypothesis is logically coherent and falsifiable—one could test whether FlashOptim's techniques reduce memory usage for surrogate networks without degrading optimization quality—and both FlashOptim and the amortized optimization paper provide relevant technical grounding. However, the connect...
grok: FlashOptim's memory-efficient techniques directly apply to training neural surrogate models in amortized optimization (per Cheap Thrills), enabling larger models on constrained hardware; hypothesis is falsifiable via implementation tests. Minor weakness: surrogate-specific constraints (e.g., feas...
google: The hypothesis is highly falsifiable and logically synthesizes the provided literature

Formal Verification

Z3 logical consistency: ⚠️ Unverified

Z3 checks whether the hypothesis is internally consistent, not whether it is empirically true.

Experimental Validation Package

This discovery has a Claude-generated validation package with a full experimental design.

Precise Hypothesis

Applying FlashOptim's mixed-precision training pipeline (combining FP16/BF16 forward passes with FP32 master weights and gradient accumulation) to surrogate neural networks within amortized optimization frameworks will enable training of surrogate models that are ≥2× larger (by parameter count) on a fixed GPU memory budget (e.g., 24 GB VRAM), while maintaining surrogate prediction accuracy within 5% relative error compared to full-precision baselines, and without degrading the quality of the amortized optimizer's output solutions by more than 10% on standard benchmark tasks.
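As a back-of-envelope feasibility check on the "≥2× larger on a fixed budget" claim, the per-parameter memory accounting can be sketched as follows. All byte counts here are illustrative assumptions, not measurements; in particular the `act` term is a crude per-parameter proxy for activation memory, which in practice depends on batch size and architecture, and the mixed-precision entry assumes an FP32 master copy plus a BF16 working copy of the weights.

```python
# Illustrative per-parameter byte accounting for AdamW training.
# All numbers are assumptions for a rough feasibility check, not measurements.
BYTES_PER_PARAM = {
    #        weights  grads  Adam m+v  activations (proxy)
    "fp32":  {"w": 4, "g": 4, "opt": 8, "act": 16},
    "mixed": {"w": 6, "g": 2, "opt": 8, "act": 8},  # FP32 master + BF16 copy
}

def max_params(budget_gb: float, mode: str) -> int:
    """Largest parameter count that fits a given GPU memory budget."""
    per_param = sum(BYTES_PER_PARAM[mode].values())
    return int(budget_gb * 1e9 // per_param)
```

Under these assumptions the headroom ratio is sensitive almost entirely to the activation proxy, which is why the memory-profiling steps below measure it directly rather than relying on such an estimate.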

Disproof criteria:
  1. ACCURACY DISPROOF: Mixed-precision surrogate achieves >5% higher relative prediction error (RMSE or MAE) compared to FP32 baseline on held-out test sets across ≥3 benchmark tasks.
  2. SOLUTION QUALITY DISPROOF: Amortized optimizer using mixed-precision surrogate produces solutions with >10% worse objective value (averaged over 100 optimization runs) versus FP32 surrogate baseline.
  3. MEMORY DISPROOF: Peak GPU memory reduction is <20% compared to FP32 training, making the "larger model" claim untenable (i.e., cannot fit a model ≥1.5× larger).
  4. STABILITY DISPROOF: Mixed-precision training diverges (loss NaN/Inf) in >30% of training runs across benchmark tasks without recoverable loss scaling.
  5. OVERHEAD DISPROOF: Wall-clock training time per epoch increases by >15% due to mixed-precision overhead (loss scaling, dtype casting), negating practical benefit.
  6. GENERALIZATION DISPROOF: The approach fails on ≥2 of 3 tested physics/CS domains (e.g., works for molecular property prediction but fails for PDE surrogate and combinatorial optimization surrogate).

Experimental Protocol

Minimum Viable Test (MVT): Train a surrogate model for one amortized optimization benchmark (e.g., molecular property optimization using a GNN surrogate on QM9 or GuacaMol) in three conditions: (A) FP32 baseline, (B) FlashOptim mixed-precision on same model size, (C) FlashOptim mixed-precision on 2× larger model. Compare prediction accuracy, solution quality, memory usage, and training speed. Full validation extends to 3 domains and ablates loss scaling strategies.
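The three MVT conditions can be expressed as a small experiment grid; the config keys (`precision`, `scale_factor`) are hypothetical names chosen to mirror the implementation sketch later in this package:

```python
# Hypothetical experiment grid for the three MVT conditions (A/B/C).
def mvt_conditions():
    return [
        {"name": "A_fp32_baseline", "precision": "fp32",       "scale_factor": 1},
        {"name": "B_amp_same_size", "precision": "mixed_bf16", "scale_factor": 1},
        {"name": "C_amp_2x_larger", "precision": "mixed_bf16", "scale_factor": 2},
    ]
```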

Required datasets:
  1. QM9 molecular property dataset (134k molecules, 19 quantum chemical properties) — freely available via PyTorch Geometric; used for molecular surrogate training.
  2. GuacaMol or ZINC250k for amortized molecular optimization benchmarks — freely available.
  3. OpenFOAM or AirfRANS dataset (aerodynamic surrogate, ~10k CFD simulations) — publicly available; used for physics domain validation.
  4. SATLIB or MIS benchmark instances (combinatorial optimization surrogate) — freely available; used for CS domain validation.
  5. Pre-trained FlashOptim codebase/weights — available via the FlashOptim GitHub repository (assumed open-source based on naming convention; if proprietary, a clean reimplementation of flash attention + mixed-precision training loop is required, estimated 2 weeks engineering).
  6. Baseline amortized optimization framework: BayesOpt with neural surrogate, or REINFORCE-based amortized optimizer (e.g., from Bengio et al. 2021 GFlowNet or similar).
Success:
  1. Peak GPU memory reduction ≥30% (FP32 → mixed-precision, same model size), enabling ≥1.8× larger model on identical hardware. Target: 24 GB GPU fits ≥18M param surrogate vs. ≤10M in FP32.
  2. Surrogate prediction MAE within 5% relative of FP32 baseline (e.g., if FP32 MAE = 0.100 eV, mixed-precision MAE ≤ 0.105 eV on QM9 HOMO-LUMO gap).
  3. Amortized optimizer solution quality: best objective within 10% of FP32-surrogate-based optimizer over 100 runs (e.g., if FP32 finds molecules with affinity score 7.2, mixed-precision finds ≥6.5).
  4. Training stability: loss scale remains >1.0 in ≥90% of training steps across all runs and domains.
  5. Training throughput: mixed-precision training ≥0.9× wall-clock speed of FP32 (no more than 10% slower; ideally 1.3–1.8× faster due to Tensor Core utilization).
  6. Results replicate across ≥2 of 3 domains (molecular, aerodynamic, combinatorial).
Failure:
  1. Surrogate MAE degrades >5% relative vs. FP32 on any primary benchmark domain.
  2. Loss scale collapses to <1.0 and training diverges in >2 of 5 seeds for any domain.
  3. Memory savings <20% (mixed-precision vs. FP32 at same model size), making scale-up impractical.
  4. 2× larger mixed-precision surrogate performs worse than 1× FP32 surrogate (accuracy regression from scale-up).
  5. Amortized optimizer solution quality degrades >10% vs. FP32 surrogate baseline.
  6. Wall-clock time per epoch increases >15% (mixed-precision overhead exceeds benefit).

GPU hours: 420
Time to result: 50d
Min cost: $1,200
Full cost: $4,800

Implementation Sketch

# ============================================================
# FlashOptim Mixed-Precision Surrogate Training — Core Sketch
# ============================================================

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torch_geometric.nn import DimeNetPlusPlus  # example surrogate arch

# --- Configuration ---
CONFIG = {
    "model_size": "2x",          # "1x" (baseline) or "2x" (scale-up test)
    "precision": "mixed_bf16",   # "fp32", "mixed_fp16", "mixed_bf16"
    "loss_scale_init": 65536.0,
    "grad_clip_norm": 1.0,
    "batch_size": 64,
    "lr": 3e-4,
    "epochs": 100,
}

# --- Surrogate Model (GNN for molecular property prediction) ---
class ScalableSurrogate(nn.Module):
    def __init__(self, hidden_dim=256, num_layers=6, scale_factor=1):
        super().__init__()
        # scale_factor=2 doubles hidden_dim → ~4x params (width scaling)
        h = hidden_dim * scale_factor
        self.encoder = DimeNetPlusPlus(
            hidden_channels=h,
            out_channels=1,
            num_blocks=num_layers,
            int_emb_size=64,       # DimeNet++ replaces DimeNet's bilinear layer
            basis_emb_size=8,      # with these interaction embedding sizes
            out_emb_channels=h,
            num_spherical=7,
            num_radial=6,
        )
    
    def forward(self, batch):
        return self.encoder(batch.z, batch.pos, batch.batch)

# --- Mixed-Precision Training Loop ---
def train_surrogate(model, train_loader, val_loader, config):
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    # Loss scaling is only needed for FP16; BF16 shares FP32's exponent range.
    scaler = GradScaler(init_scale=config["loss_scale_init"],
                        enabled=(config["precision"] == "mixed_fp16"))
    
    dtype_map = {
        "fp32": torch.float32,
        "mixed_fp16": torch.float16,
        "mixed_bf16": torch.bfloat16,
    }
    amp_dtype = dtype_map[config["precision"]]
    
    memory_log = []
    
    for epoch in range(config["epochs"]):
        model.train()
        for batch in train_loader:
            batch = batch.to("cuda")
            optimizer.zero_grad(set_to_none=True)
            
            # --- Mixed-precision forward pass ---
            with autocast(dtype=amp_dtype, 
                         enabled=(config["precision"] != "fp32")):
                pred = model(batch)
                # batch.y assumed pre-sliced to one target property (e.g., HOMO-LUMO gap)
                loss = nn.functional.mse_loss(pred.squeeze(), batch.y)
            
            # --- Scaled backward pass ---
            scaler.scale(loss).backward()
            
            # --- Gradient clipping (unscale first) ---
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(
                model.parameters(), config["grad_clip_norm"]
            )
            
            # --- Optimizer step with loss scale update ---
            scaler.step(optimizer)
            scaler.update()
        
        # --- Memory profiling (peak values; epoch-end allocation undercounts) ---
        mem_allocated = torch.cuda.max_memory_allocated() / 1e9  # GB
        mem_reserved = torch.cuda.max_memory_reserved() / 1e9
        memory_log.append({
            "epoch": epoch,
            "allocated_gb": mem_allocated,
            "reserved_gb": mem_reserved,
            "loss_scale": scaler.get_scale(),
        })
        
        # --- Validation ---
        val_mae = evaluate(model, val_loader, amp_dtype)
        
        # --- Abort checkpoint: loss scale collapse ---
        if scaler.get_scale() < 1.0:
            print(f"ABORT: Loss scale collapsed at epoch {epoch}")
            return None, memory_log
        
        print(f"Epoch {epoch}: MAE={val_mae:.4f}, "
              f"Mem={mem_allocated:.2f}GB, Scale={scaler.get_scale():.0f}")
    
    return model, memory_log

# --- Amortized Optimization Integration ---
class AmortizedOptimizer(nn.Module):
    """Learned proposal network using surrogate as reward model."""
    def __init__(self, surrogate, proposal_net):
        super().__init__()
        self.surrogate = surrogate
        self.proposal_net = proposal_net
    
    def optimize(self, n_steps=100, n_samples=64, lr=1e-3):
        opt = torch.optim.Adam(self.proposal_net.parameters(), lr=lr)
        solutions = []
        for step in range(n_steps):
            # Sample candidates; proposal_net.sample is assumed to return
            # per-sample log-probabilities alongside the candidates
            candidates, log_probs = self.proposal_net.sample(n_samples)

            # Evaluate with surrogate (mixed-precision, no gradient through it)
            with torch.no_grad():
                with autocast(dtype=torch.bfloat16):
                    scores = self.surrogate(candidates).squeeze()

            # REINFORCE update: maximize expected surrogate score by weighting
            # log-probs with the (gradient-free) scores — backprop goes through
            # log_probs, not through the no_grad surrogate evaluation
            opt.zero_grad(set_to_none=True)
            loss = -(log_probs * scores).mean()
            loss.backward()
            opt.step()

            best_idx = scores.argmax()
            solutions.append((candidates[best_idx], scores[best_idx].item()))

        return solutions

# --- Evaluation ---
@torch.no_grad()
def evaluate(model, loader, amp_dtype):
    model.eval()
    total_mae, n = 0.0, 0
    for batch in loader:
        batch = batch.to("cuda")
        with autocast(dtype=amp_dtype):
            pred = model(batch).squeeze()
        total_mae += (pred - batch.y).abs().sum().item()
        n += batch.num_graphs
    return total_mae / n

# --- Memory Comparison Experiment ---
def run_memory_comparison(train_loader, val_loader):
    results = {}
    for precision in ["fp32", "mixed_bf16"]:
        for scale in [1, 2]:
            if precision == "fp32" and scale == 2:
                # Expected OOM — confirm and skip
                try:
                    model = ScalableSurrogate(scale_factor=scale).cuda()
                    # ... attempt training ...
                    results[f"{precision}_scale{scale}"] = "OOM"
                except torch.cuda.OutOfMemoryError:
                    results[f"{precision}_scale{scale}"] = "OOM_confirmed"
                continue
            
            model = ScalableSurrogate(scale_factor=scale).cuda()
            trained_model, mem_log = train_surrogate(
                model, train_loader, val_loader,
                {**CONFIG, "precision": precision}
            )
            peak_mem = max(e["allocated_gb"] for e in mem_log)
            if trained_model is None:  # training aborted (loss scale collapse)
                results[f"{precision}_scale{scale}"] = "diverged"
                continue
            results[f"{precision}_scale{scale}"] = {
                "peak_mem_gb": peak_mem,
                "final_mae": evaluate(trained_model, val_loader,
                                      torch.bfloat16 if "bf16" in precision
                                      else torch.float32),
            }
    return results
Abort checkpoints:
  1. CHECKPOINT A (Day 7, end of baseline implementation): If the FP32 surrogate baseline does not achieve published QM9 MAE within 15% (e.g., HOMO-LUMO gap MAE >0.05 eV for DimeNet++), abort and debug the data pipeline before proceeding. Expected baseline: ~0.044 eV MAE.
  2. CHECKPOINT B (Day 12, end of mixed-precision integration): If loss scale collapses below 1.0 in >3 of 5 seeds during initial 20-epoch test run, abort mixed-precision approach and investigate BF16 alternative or gradient clipping tuning. Do not proceed to scale-up.
  3. CHECKPOINT C (Day 14, memory profiling): If measured memory reduction is <15% (mixed-precision vs. FP32, same model), the core memory efficiency claim is likely false. Abort scale-up experiment; investigate whether surrogate architecture has memory bottleneck outside of weight storage (e.g., large activation maps).
  4. CHECKPOINT D (Day 20, scale-up experiment): If 2× mixed-precision surrogate MAE is >10% worse than 1× FP32 baseline (not just 5%), abort domain replication and focus on diagnosing capacity vs. precision tradeoff. Do not generalize claim.
  5. CHECKPOINT E (Day 28, amortized optimization integration): If solution quality from mixed-precision surrogate is >15% worse than FP32 surrogate baseline on molecular optimization (measured by top-10% average score over 100 runs), abort domain replication. The surrogate accuracy degradation is propagating to optimization quality.
  6. CHECKPOINT F (Day 40, domain replication): If the approach fails on 2 of 3 domains (molecular, aerodynamic, combinatorial), downgrade hypothesis from "general extension" to "domain-specific result" and revise claims accordingly before publication.
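Checkpoints A–F reduce to a go/no-go gate over measured metrics. A sketch with hypothetical metric names (thresholds are copied from the checkpoint descriptions above; the reference MAE is the stated DimeNet++ baseline):

```python
# Hypothetical go/no-go gate mirroring Checkpoints A-F. Metric names are
# illustrative; thresholds come from the checkpoint descriptions.
PUBLISHED_QM9_MAE_EV = 0.044  # DimeNet++ HOMO-LUMO gap baseline from the text

def abort_flags(m: dict) -> dict:
    return {
        "A_baseline":  m["fp32_mae_ev"] > PUBLISHED_QM9_MAE_EV * 1.15,
        "B_stability": m["collapsed_seeds"] > 3,   # out of 5 seeds
        "C_memory":    m["mem_reduction_pct"] < 15,
        "D_scaleup":   m["scaleup_mae_gap_pct"] > 10,
        "E_solution":  m["solution_gap_pct"] > 15,
        "F_domains":   m["failed_domains"] >= 2,
    }
```

A `True` flag at any checkpoint triggers the corresponding abort (or, for F, the downgrade to a domain-specific claim).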

Source

AegisMind Research