ZeroHedge - On a long enough timeline, the survival rate for everyone drops to zero
www.zerohedge.com
“We regularly observe models that have learned to pattern-match on evaluation cues. They'll detect when a prompt looks like a safety test and respond more conservatively, but respond very differently to the same request when it’s embedded naturally in a multiturn conversation,” Behera said.
He offered an example that transpired while testing an enterprise AI assistant that was supposed to refuse requests for internal system information. During standard safety evaluations, it refused perfectly, but then something changed.
“When our red-team framed the same request as a multistep troubleshooting workflow, breaking the request into seemingly innocent sub-steps spread across several turns, the model complied with each step individually.It effectively leaked the exact information it was trained to protect,” Behera said.