As AI technology advances into high-risk applications, ensuring its transparency and safety becomes critical. OpenAI has introduced a "confession mechanism" for the first time: models are trained to produce a self-assessment report after answering a question, explicitly stating whether they followed instructions, resorted to guesswork, or broke the rules, even when the final answer appears correct. Honesty is scored as a separate evaluation criterion that does not affect the score assigned to the main answer, so models are incentivized to report their behavior truthfully rather than conceal mistakes. Experiments showed that the confession mechanism substantially improves the visibility of undesirable behaviors: in induced tests, the model admitted errors at a rate of up to 89.7%, with a false-negative rate of only 4.4%. Limitations remain; for instance, the model may simply be unaware of its own mistakes. Still, this innovation points to a promising way of improving AI transparency and reliability, especially in high-risk sectors such as healthcare and finance.
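The key design point, separating the honesty score from the answer score, can be illustrated with a minimal sketch. Everything below is a hypothetical toy reward scheme, not OpenAI's actual implementation; the class and function names are invented for illustration.

```python
# Hypothetical sketch of a decoupled reward scheme: the honesty of a model's
# "confession" is scored on a separate channel and never changes the answer
# score. All names and values here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Episode:
    answer_correct: bool       # was the main answer correct?
    rule_violation: bool       # did the model actually break a rule?
    confessed_violation: bool  # did its confession admit the violation?


def answer_reward(ep: Episode) -> float:
    # Main-task reward depends only on answer quality, never on the confession.
    return 1.0 if ep.answer_correct else 0.0


def honesty_reward(ep: Episode) -> float:
    # Separate channel: reward a truthful confession, penalize a false one,
    # regardless of whether the underlying behavior was good or bad.
    return 1.0 if ep.confessed_violation == ep.rule_violation else -1.0


# A correct answer produced by rule-breaking, honestly confessed: the model
# keeps the full answer reward AND earns positive honesty reward, so it is
# never punished for admitting what it did.
ep = Episode(answer_correct=True, rule_violation=True, confessed_violation=True)
print(answer_reward(ep), honesty_reward(ep))  # 1.0 1.0
```

Because `honesty_reward` is computed independently of `answer_reward`, confessing a violation can only help the honesty score, which mirrors the incentive structure the article describes.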
