Agent Guardianship: Enhancing Model Security through Formal Verification

To keep models from going off the rails during inference, practitioners often use so-called guardrails, which dynamically monitor, filter, and control model responses for problematic content. Guardrails, however, come with their own problems, such as false positives caused by pattern matching against a fixed set of forbidden words. The mathematical, proof-based approach presented here addresses these limitations by providing deterministic and verifiable assurances of safety, without the need to trust the AI or any of the artifacts it produces.
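
To make the false-positive problem concrete, here is a minimal sketch of a keyword-based guardrail. It is not taken from any particular guardrail product: the names `FORBIDDEN_WORDS` and `keyword_guardrail` and the blocklist contents are hypothetical, chosen only to illustrate how matching against a fixed word list can reject a harmless response.

```python
import re

# Hypothetical blocklist, assumed here purely for illustration.
FORBIDDEN_WORDS = {"kill", "attack"}


def keyword_guardrail(response: str) -> bool:
    """Return True if the response would be blocked by the word list."""
    # Split the response into lowercase alphabetic tokens.
    tokens = re.findall(r"[a-z]+", response.lower())
    # Block as soon as any token appears in the fixed list.
    return any(token in FORBIDDEN_WORDS for token in tokens)


# A harmless systems-administration answer that happens to contain "kill".
benign_response = "Run `kill -9 1234` to stop the stuck process."
print(keyword_guardrail(benign_response))  # blocked although the content is benign
```

The filter sees only surface tokens, not meaning, so the Unix command name is treated the same as a threat. Its verdict depends entirely on which words were put on the list.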