Adversarial Debate Score
28% survival rate under critique
Model Critiques
Supporting Research Papers
- When Your Model Stops Working: Anytime-Valid Calibration Monitoring
Practitioners monitoring deployed probabilistic models face a fundamental trap: any fixed-sample test applied repeatedly over an unbounded stream will eventually raise a false alarm, even when the mod...
- OPT: Open Pre-trained Transformer Language Models
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these mode...
- Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1...
- Frontier Models Can Take Actions at Low Probabilities
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no...
- Posterior Predictive Checks for Gravitational-wave Populations: Limitations and Improvements
When selecting a model to characterize an astrophysical population, it is crucial to assess whether that model fits the data and, if not, how it can be improved. To this end, posterior predictive chec...
Formal Verification
Z3 checks whether the hypothesis is internally consistent, not whether it is empirically true.
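
A minimal sketch of what "internally consistent" means here, assuming the hypothesis can be encoded as Boolean constraints. It uses a brute-force search in plain Python as a stand-in for a solver and does not call Z3; the function `is_consistent`, the variable names, and the constraints are all hypothetical illustrations, not taken from the report.

```python
from itertools import product


def is_consistent(constraints, names):
    """Return the first satisfying assignment found, or None if none exists."""
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(constraint(assignment) for constraint in constraints):
            return assignment
    return None


# Hypothetical Boolean variables standing in for parts of a hypothesis.
names = ["calibrated", "alarm"]

# "The model is calibrated" and "calibrated implies no alarm".
consistent = [
    lambda a: a["calibrated"],
    lambda a: (not a["calibrated"]) or (not a["alarm"]),
]

# Adding "an alarm fires" makes the set of claims contradictory.
contradictory = consistent + [lambda a: a["alarm"]]

print(is_consistent(consistent, names))     # a satisfying assignment exists
print(is_consistent(contradictory, names))  # None: the claims cannot all hold
```

A satisfying assignment only shows that the claims can hold together. Whether they hold for a real deployed model is an empirical question that no consistency check answers.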