IBM Unveils Innovative Framework for "Black Box" Evaluation of Large Model Outputs

Creati.ai AI News - July 2, 2024

In the realm of artificial intelligence, the accuracy, safety, and explainability of large model outputs are becoming increasingly critical, often more so than performance benchmarks and leaderboard rankings. Recognizing this, IBM researchers have developed a novel framework that evaluates the outputs of large models through a "black box" approach, without requiring access to the models' internal structure, parameters, or training data.

The framework, detailed in a paper available on arXiv, introduces six distinct prompt perturbation strategies to elicit variations in model outputs (a minimal sketch of a few of these appears after the list):

  1. Random Decoding: By employing different decoding techniques such as greedy search, beam search, and nucleus sampling, the model generates multiple outputs, revealing its response uncertainty.

  2. Paraphrasing: This strategy involves rephrasing the context of the prompt, for example, using back-translation techniques to translate the text into another language and back again, to observe output changes. Consistent semantic output indicates high model confidence.

  3. Sentence Rearrangement: Changing the order of named entities in the input tests the consistency of model outputs. A confident model should maintain consistent outputs despite entity order changes.

  4. Entity Frequency Amplification: Repeating sentences containing named entities tests whether the model changes its output due to information repetition.

  5. Stopword Removal: Removing common stopwords examines whether these typically low-information words influence the model’s response.

  6. Split Response Consistency: The model’s output is randomly split into two parts, and a Natural Language Inference (NLI) model measures the semantic consistency between them.
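
To make the perturbation idea concrete, here is a minimal Python sketch of three of the strategies applied to a QA-style prompt. The function names, the stopword list, and the placeholder-based entity swapping are illustrative choices, not the paper's implementation; back-translation and decoding variation are omitted because they require external models.

```python
import random

# Illustrative stopword list; the paper does not specify which list is used.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or", "that"}

def remove_stopwords(prompt: str) -> str:
    """Stopword Removal: drop common low-information words from the prompt."""
    return " ".join(w for w in prompt.split() if w.lower() not in STOPWORDS)

def amplify_entity_sentences(prompt: str, entities: list[str], repeats: int = 2) -> str:
    """Entity Frequency Amplification: repeat sentences that mention a named entity."""
    out = []
    for sentence in prompt.split(". "):
        out.append(sentence)
        if any(e in sentence for e in entities):
            out.extend([sentence] * (repeats - 1))
    return ". ".join(out)

def rearrange_entities(prompt: str, entities: list[str], seed: int = 0) -> str:
    """Sentence Rearrangement: change the order in which named entities appear."""
    shuffled = entities[:]
    random.Random(seed).shuffle(shuffled)
    # Replace each entity with a placeholder, then fill placeholders in shuffled order.
    for i, e in enumerate(entities):
        prompt = prompt.replace(e, f"<ENT{i}>")
    for i, e in enumerate(shuffled):
        prompt = prompt.replace(f"<ENT{i}>", e)
    return prompt

context = "Marie Curie worked with Pierre Curie in Paris. She won the Nobel Prize."
entities = ["Marie Curie", "Pierre Curie"]
print(remove_stopwords(context))
print(amplify_entity_sentences(context, entities))
print(rearrange_entities(context, entities, seed=1))
```

Each perturbed prompt is fed back to the model, and the resulting outputs are compared with the original response to gauge how stable the model's answer is.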

Based on these strategies, the researchers derived semantic and syntactic features to train a confidence model. Semantic features capture the number of semantically equivalent sets among the generated outputs: the fewer the distinct sets, the higher the confidence. Syntactic features assess confidence by measuring syntactic similarity between outputs; higher similarity implies higher confidence.
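
The following sketch shows one way such features could be computed from a set of sampled outputs. It uses surface-level string similarity (difflib) as a stand-in for the semantic grouping, which in the paper would rely on semantic equivalence judgments (e.g., from an NLI model); the threshold and feature names are assumptions.

```python
from difflib import SequenceMatcher

def syntactic_similarity(a: str, b: str) -> float:
    """Surface-level similarity between two outputs (0 = disjoint, 1 = identical)."""
    return SequenceMatcher(None, a, b).ratio()

def count_equivalence_sets(outputs: list[str], threshold: float = 0.8) -> int:
    """Greedily group outputs whose pairwise similarity exceeds the threshold.
    Stand-in for the semantic grouping described in the paper."""
    representatives: list[str] = []  # one representative per group
    for out in outputs:
        if not any(syntactic_similarity(out, rep) >= threshold for rep in representatives):
            representatives.append(out)
    return len(representatives)

def confidence_features(outputs: list[str]) -> dict:
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    mean_sim = sum(syntactic_similarity(a, b) for a, b in pairs) / max(len(pairs), 1)
    return {
        "num_equivalence_sets": count_equivalence_sets(outputs),  # fewer sets = more confident
        "mean_pairwise_similarity": mean_sim,                     # higher = more confident
    }

samples = ["Paris is the capital of France.",
           "The capital of France is Paris.",
           "It might be Lyon."]
print(confidence_features(samples))
```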

To train the confidence model, the researchers paired these features with labels derived from the degree of match between model outputs and reference answers, using a simple supervised learning process. Labels are assigned by a straightforward rule: if the model's output has a ROUGE score above a certain threshold (e.g., 0.3) when compared with the correct answer, the response is labeled correct (1); otherwise, it is labeled incorrect (0). This simple scheme is enough to differentiate model performance across a wide range of questions.
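
A rough sketch of this labeling and training step is shown below, assuming ROUGE-L (the article does not specify the variant) via the rouge-score package and a logistic regression as the confidence model; the feature rows and question–answer pairs are toy examples.

```python
from rouge_score import rouge_scorer
from sklearn.linear_model import LogisticRegression

# ROUGE-L is an assumption; the article only says "ROUGE score above a threshold (e.g., 0.3)".
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def correctness_label(model_output: str, reference: str, threshold: float = 0.3) -> int:
    """Label 1 if the output matches the reference answer closely enough, else 0."""
    score = scorer.score(reference, model_output)["rougeL"].fmeasure
    return int(score >= threshold)

# Toy feature rows (e.g., [num_equivalence_sets, mean_pairwise_similarity]) and outputs.
X = [[1, 0.92], [3, 0.41], [1, 0.88], [4, 0.30]]
y = [correctness_label(out, ref) for out, ref in [
    ("Paris", "Paris"),
    ("Lyon", "Paris"),
    ("1969", "1969"),
    ("1971", "1969"),
]]

confidence_model = LogisticRegression().fit(X, y)
print(confidence_model.predict_proba([[2, 0.6]])[0, 1])  # estimated confidence for a new answer
```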

The framework’s performance was evaluated on datasets including TriviaQA, SQuAD, CoQA, and Natural Questions, using well-known open-source models such as Flan-UL2, Llama-13B, and Mistral-7B. Results showed a significant improvement over existing black-box confidence estimation methods, with gains of more than 10% in AUROC across multiple datasets.
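
For context, AUROC measures how well the predicted confidence scores separate correct answers from incorrect ones; a toy computation with scikit-learn looks like this (the numbers are illustrative, not from the paper):

```python
from sklearn.metrics import roc_auc_score

# Confidence scores from the trained model vs. correctness labels (1 = correct, 0 = incorrect).
confidences = [0.91, 0.15, 0.78, 0.40, 0.66]
labels      = [1,    0,    1,    0,    1]
print(roc_auc_score(labels, confidences))  # 1.0 here; real evaluation uses held-out QA data
```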

IBM researchers highlighted the framework’s strong scalability and broad applicability: new perturbation strategies can be added to probe and adapt to different large models. Moreover, a confidence model trained on one large model can often be applied to similar models.

This innovative approach marks a significant step forward in evaluating and enhancing the reliability of large model outputs, paving the way for safer and more explainable AI applications.
