Tuning “I Don’t Know” at Inference Time - What β Really Buys You (and What It Costs)

A mini-project for the Neel Nanda MATS track

At the beginning of September, a new OpenAI paper, Why Language Models Hallucinate, went viral. Its most widely shared claim was essentially that our evaluation protocols, with binary grading and no credit for uncertainty, incentivize hallucinations: under such scoring it is better for a model to guess than to say "I don't know." That sparked a good question for practitioners: if we push models to say "I don't know" more often, do we actually get safer systems without sacrificing useful truth?

For this ~20-hour mini-project for Neel Nanda's MATS stream, I evaluated interventions on TruthfulQA using an implementation of Inference-Time Intervention (ITI) with Mass Mean Shift on 48 attention heads, varying the IDK weight β. I quantified the safety–coverage trade-off with paired statistics over the same questions.


TL;DR

Takeaway: raising β buys safety (a significant downward trend in hallucinations, clearest at β=10), but you pay for it in coverage/helpfulness; TI does not improve.


Why this matters

When we reward “I don’t know,” we aim to reduce confident falsehoods (hallucinations). The key question for deployment is: Do we reduce hallucinations without harming the rate of useful, correct answers (TI)? And if there’s a trade-off, how steep is it?

What is β?

In ITI Mass Mean Shift (using center-of-mass directions), we compute a “positive class” mean μ₊ from TRUE and IDK examples. We weight IDK by β (TRUE=1, IDK=β). Larger β pushes the inference-time direction toward cautious, IDK-like behavior.
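To make the weighting concrete, here is a minimal sketch of how β could enter the per-head mass-mean-shift direction. This is not the honest_llama code; the label names, array shapes, and normalization are my assumptions.

```python
import numpy as np

def mass_mean_shift_direction(head_acts, labels, beta):
    """Compute the intervention direction for one attention head.

    head_acts: (n_examples, head_dim) activations collected at that head
    labels:    array of {"TRUE", "IDK", "FALSE"} per example (assumed labeling)
    beta:      weight given to IDK examples in the positive-class mean
    """
    labels = np.asarray(labels)
    # Positive-class mean: TRUE examples weighted 1, IDK examples weighted beta.
    w = np.where(labels == "TRUE", 1.0, np.where(labels == "IDK", float(beta), 0.0))
    mu_pos = (w[:, None] * head_acts).sum(axis=0) / w.sum()
    # Negative-class mean: the false (hallucinated) answers.
    mu_neg = head_acts[labels == "FALSE"].mean(axis=0)
    theta = mu_pos - mu_neg  # center-of-mass (mass mean shift) direction
    return theta / np.linalg.norm(theta)
```

With β=0 the IDK examples drop out of the positive centroid entirely; with large β they dominate it, which is what pushes generations toward abstention.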

A quirk of TruthfulQA’s True x Informativeness (TI) metric

TruthfulQA rates Truthfulness (T) and Informativeness (I); the generation-track score is the conjunction TI (true and informative). Generic abstentions ("I don't know") are judged True (no false claim) but Not-informative. This design prevents gaming the benchmark via universal refusal. One might expect that flipping a hallucination (¬T&I) to an abstention (T&¬I) would raise the score, but it does not: both buckets score TI = 0 on that question, so only flips into T&I move the metric. That nuance matters when evaluating abstention-oriented methods like β-tuning.
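A minimal sketch of that accounting, assuming per-question binary truthfulness and informativeness judgments (the judge and variable names are placeholders, not the benchmark's official scoring code):

```python
import numpy as np

def ti_and_buckets(truthful, informative):
    """Per-question TI score and the four-bucket breakdown used in the figures."""
    t = np.asarray(truthful, dtype=bool)
    i = np.asarray(informative, dtype=bool)
    buckets = {
        "T&I":   float(np.mean(t & i)),    # truthful and informative
        "T&~I":  float(np.mean(t & ~i)),   # e.g. "I don't know": true but uninformative
        "~T&I":  float(np.mean(~t & i)),   # confident falsehoods (hallucinations, H)
        "~T&~I": float(np.mean(~t & ~i)),
    }
    ti = float(np.mean(t & i))  # a hallucination -> IDK flip moves ~T&I to T&~I
    return ti, buckets          # ...and leaves this TI score unchanged
```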


Method

We rely on the original ITI repo implementation (https://github.com/likenneth/honest_llama), adding the IDK weight β when computing the mass-mean-shift directions; a rough sketch of the inference-time edit follows below.
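For readers who want the gist without digging into the repo: at generation time, ITI adds the (scaled) direction to the outputs of the selected heads at every decoding step. Below is a rough sketch using a PyTorch forward pre-hook on a LLaMA-style Hugging Face model; the module path, head layout, and scaling are my assumptions, and the actual repo additionally scales each direction by the standard deviation of activations along it.

```python
import torch

def make_iti_hook(head_indices, directions, alpha, head_dim):
    """Pre-hook for self_attn.o_proj: shift selected heads' outputs along their
    intervention directions before the output projection (simplified ITI)."""
    def hook(module, inputs):
        hidden = inputs[0].clone()  # (batch, seq, num_heads * head_dim), heads concatenated
        for h, theta in zip(head_indices, directions):
            sl = slice(h * head_dim, (h + 1) * head_dim)
            hidden[..., sl] += alpha * theta.to(hidden.dtype)
        return (hidden,) + tuple(inputs[1:])
    return hook

# Hypothetical usage on a Hugging Face LLaMA-style model:
# proj = model.model.layers[12].self_attn.o_proj
# handle = proj.register_forward_pre_hook(
#     make_iti_hook(head_indices=[3, 7], directions=thetas, alpha=15.0, head_dim=128)
# )
```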


Results

Figure 1. Dose–response vs β.

Figure 2. Four-bucket stacks per β.
As β increases, ¬T&I (hallucinations) shrinks and T&¬I (truthful but terse) grows: the safer-but-less-informative shift.

Figure 3. Transition “butterflies” (β=1 → β).
At β=10, many ¬T&I → T&I flips (wins), but also T&I → T&¬I flips (cost).

Pairwise tests (vs β=1) & trend

How to read this: McNemar's test compares each β to β=1 on the same questions (a paired design). ΔTI and ΔH are risk differences (RD) relative to β=1, where H is the hallucination rate (¬T&I); n01 / n10 are the counts of discordant question pairs in the two flip directions; p-values are Holm-corrected across the four comparisons.

ΔTI and ΔH vs β=1

| β vs 1 | ΔTI (RD) | ΔTI 95% CI | ΔTI n01 / n10 | ΔTI McNemar p (Holm) | ΔH (RD) | ΔH 95% CI | ΔH n01 / n10 | ΔH McNemar p (Holm) |
|---|---|---|---|---|---|---|---|---|
| 0 | −0.0127 | [−0.0278, +0.0025] | 13 / 23 | 0.53 | +0.0013 | [−0.0101, +0.0127] | 12 / 11 | 1.00 |
| 2 | +0.0038 | [−0.0114, +0.0203] | 22 / 19 | 1.00 | 0.0000 | [−0.0114, +0.0114] | 10 / 10 | 1.00 |
| 5 | +0.0063 | [−0.0203, +0.0329] | 57 / 52 | 1.00 | −0.0101 | [−0.0266, +0.0063] | 19 / 27 | 0.906 |
| 10 | −0.0114 | [−0.0405, +0.0177] | 63 / 72 | 1.00 | −0.0266 | [−0.0456, −0.0076] | 19 / 40 | 0.035 |
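The paired comparisons above can be reproduced roughly as follows. This is a sketch, not the repo's exact analysis script: I assume per-question binary outcomes for each β and use statsmodels' exact McNemar test; Holm correction across the four comparisons can then be applied with statsmodels.stats.multitest.multipletests.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_vs_baseline(outcome_base, outcome_beta):
    """Paired comparison of a binary per-question outcome (TI or H)
    between beta=1 and another beta, over the same questions."""
    a = np.asarray(outcome_base, dtype=bool)   # outcome at beta = 1
    b = np.asarray(outcome_beta, dtype=bool)   # outcome at the compared beta
    n01 = int(np.sum(~a & b))                  # discordant flips 0 -> 1
    n10 = int(np.sum(a & ~b))                  # discordant flips 1 -> 0
    rd = float(b.mean() - a.mean())            # risk difference vs beta = 1
    table = [[int(np.sum(a & b)), n10],
             [n01, int(np.sum(~a & ~b))]]
    p = mcnemar(table, exact=True).pvalue      # exact binomial McNemar test
    return rd, n01, n10, p
```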

Trend across β (Cochran–Armitage)

| Metric | z | p (two-sided) |
|---|---|---|
| TI | −0.171 | 0.864 |
| H | −2.533 | 0.011 |
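The trend test is the standard Cochran–Armitage statistic. A self-contained sketch is below, assuming the β values themselves serve as the ordinal group scores (the repo's choice of scores may differ):

```python
import numpy as np
from scipy.stats import norm

def cochran_armitage(successes, totals, scores):
    """Two-sided Cochran-Armitage trend test for a binary outcome (e.g. H)
    across ordered groups (the beta levels)."""
    r = np.asarray(successes, dtype=float)  # e.g. hallucination counts per beta
    n = np.asarray(totals, dtype=float)     # number of questions per beta
    t = np.asarray(scores, dtype=float)     # ordinal scores, e.g. [0, 1, 2, 5, 10]
    p_bar = r.sum() / n.sum()
    stat = np.sum(t * (r - n * p_bar))
    var = p_bar * (1 - p_bar) * (np.sum(n * t**2) - np.sum(n * t) ** 2 / n.sum())
    z = stat / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))           # z statistic and two-sided p-value
```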

Takeaway: No statistically significant changes in TI vs β=1; H shows a significant downward trend, with β=10 significantly below β=1 after Holm correction.


Limitations & caveats (what this study can’t answer yet)


Next steps

1) Build uncertainty-aware benchmarks

2) Design better metrics than raw TI


Repro & code

All code and details are available at https://github.com/jfpio/iti-idk-weighting