智能体护卫：利用形式化证明提升模型安全性

To mitigate against models going off the rails during inference, people often use so-called guardrails to dynamically monitor, filter, and control model responses for problematic content. Guardrails, however, come with their own set of problems such as false positives caused by pattern matching against a fixed set of forbidden words. This mathematical proof-based approach addresses these limitations by providing deterministic and verifiable assurances of safety without the need to trust the AI nor any of the artifacts it produces.