A mini-project for the Neel Nanda MATS track
At the beginning of September, OpenAI released a new paper, Why Language Models Hallucinate.
For the 20-hour mini-project of the Neel Nanda MATS stream, I evaluated abstention-weighting interventions on TruthfulQA.
Takeaway: β helps safety, but you pay in coverage/helpfulness.
When we reward “I don’t know,” we aim to reduce confident falsehoods (hallucinations). The key question for deployment is: Do we reduce hallucinations without harming the rate of useful, correct answers (TI)? And if there’s a trade-off, how steep is it?
In ITI's Mass Mean Shift variant (center-of-mass directions), we compute a "positive class" mean μ₊ from TRUE and IDK examples, weighting IDK by β (TRUE=1, IDK=β). Larger β pushes the inference-time direction toward cautious, IDK-like behavior.
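The weighted positive-class mean can be sketched as follows (helper names here are hypothetical, not the repo's API; the actual implementation lives in the ITI codebase):

```python
import numpy as np

def weighted_positive_mean(acts_true, acts_idk, beta):
    """Center-of-mass mu_plus with IDK activations weighted by beta (TRUE weight = 1)."""
    acts = np.vstack([acts_true, acts_idk])
    w = np.concatenate([np.ones(len(acts_true)), np.full(len(acts_idk), beta)])
    return (w[:, None] * acts).sum(axis=0) / w.sum()

def mass_mean_shift(acts_true, acts_idk, acts_false, beta):
    """Inference-time direction: weighted positive mean minus the FALSE-class mean."""
    mu_plus = weighted_positive_mean(acts_true, acts_idk, beta)
    mu_minus = np.asarray(acts_false).mean(axis=0)
    return mu_plus - mu_minus
```

At β=1 this reduces to the plain center-of-mass direction; larger β drags μ₊ toward the IDK cluster.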
TruthfulQA rates Truthfulness (T) and Informativeness (I); the generation-track score is the product TI. Generic abstentions ("I don't know") are scored as True (no false claim) but Not-informative. This design prevents gaming the benchmark via universal refusal. One might expect that flipping an answer from ¬T&I to T&¬I (from hallucination to an "I don't know" response) would raise the metric, but it does not: both buckets score TI = 0, so reducing confident falsehoods via abstention leaves the per-question product unchanged. That nuance matters when evaluating abstention-oriented methods like β-tuning.
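A two-line sketch of the product score makes the zero-to-zero flip concrete (illustrative only; the real benchmark assigns T and I with judge models rather than booleans):

```python
def ti_score(truthful: bool, informative: bool) -> int:
    """TruthfulQA generation-track score for one answer: the product T * I."""
    return int(truthful) * int(informative)

# A confident falsehood (T=0, I=1) and a generic "I don't know" (T=1, I=0)
# both score zero, so the flip leaves per-question TI unchanged.
hallucination = ti_score(truthful=False, informative=True)  # 0
abstention = ti_score(truthful=True, informative=False)     # 0
```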
We largely rely on the original ITI repo implementation: https://github.com/likenneth/honest_llama
Pairing: Only items that appear under every β are analyzed → $N = 790$ paired questions.
For the $U$ metric, the denominator is the subset that is $TI$ at $\beta=1$ → $U_n = 407 \approx 0.515 \times 790$.
Figure 1. Dose–response vs β.
Figure 2. Four-bucket stacks per β.
As β increases, ¬T&I (hallucinations) shrinks and T&¬I (truthful but terse) grows: the safer-but-less-informative shift.
Figure 3. Transition “butterflies” (β=1 → β).
At β=10, there are many ¬T&I → T&I flips (wins) but also T&I → T&¬I flips (costs).
How to read this: McNemar compares each β to β=1 on the same questions.
ΔTI and ΔH vs β=1
| β vs 1 | ΔTI (RD) | 95% CI | n01 / n10 | McNemar p (Holm) | ΔH (RD) | 95% CI | n01 / n10 | McNemar p (Holm) |
|---|---|---|---|---|---|---|---|---|
| 0 | −0.0127 | [−0.0278, +0.0025] | 13 / 23 | 0.53 | +0.0013 | [−0.0101, +0.0127] | 12 / 11 | 1.00 |
| 2 | +0.0038 | [−0.0114, +0.0203] | 22 / 19 | 1.00 | 0.0000 | [−0.0114, +0.0114] | 10 / 10 | 1.00 |
| 5 | +0.0063 | [−0.0203, +0.0329] | 57 / 52 | 1.00 | −0.0101 | [−0.0266, +0.0063] | 19 / 27 | 0.906 |
| 10 | −0.0114 | [−0.0405, +0.0177] | 63 / 72 | 1.00 | −0.0266 | [−0.0456, −0.0076] | 19 / 40 | 0.035 |
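Each paired comparison in the table can be reproduced from the discordant counts alone. A minimal stdlib sketch of the exact McNemar test and the paired risk difference (Holm correction omitted):

```python
from math import comb

def mcnemar_exact(n01: int, n10: int) -> float:
    """Exact two-sided McNemar p-value: binomial test of the discordant pairs at p = 0.5."""
    n = n01 + n10
    tail = sum(comb(n, i) for i in range(min(n01, n10) + 1))
    return min(2 * tail / 2**n, 1.0)

def risk_difference(n01: int, n10: int, n_pairs: int) -> float:
    """Net flips divided by the total number of paired questions."""
    return (n01 - n10) / n_pairs

# Delta-H at beta=10 vs beta=1, using the discordant counts from the table above
rd = risk_difference(19, 40, 790)  # ~ -0.0266
p = mcnemar_exact(19, 40)
```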
Trend across β (Cochran–Armitage)
| Metric | z | p (two-sided) |
|---|---|---|
| TI | −0.171 | 0.864 |
| H | −2.533 | 0.011 |
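For reference, the Cochran–Armitage statistic is a score-weighted comparison of proportions across ordered doses. A minimal sketch with illustrative counts (not the study's data):

```python
from math import erfc, sqrt

def cochran_armitage(successes, totals, scores):
    """Cochran-Armitage trend test: z statistic and two-sided normal p-value."""
    N = sum(totals)
    p_bar = sum(successes) / N
    t = sum(s * (x - n * p_bar) for s, x, n in zip(scores, successes, totals))
    var = p_bar * (1 - p_bar) * (
        sum(n * s * s for s, n in zip(scores, totals))
        - sum(n * s for s, n in zip(scores, totals)) ** 2 / N
    )
    z = t / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))

# Illustrative: a monotone rise in the success rate across four ordered doses
z, p = cochran_armitage([10, 20, 30, 40], [100] * 4, scores=[0, 1, 2, 3])
```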
Takeaway: No statistically significant changes in TI vs β=1; H shows a significant downward trend, with β=10 significantly below β=1 after Holm correction.
Statistical power is bottlenecked by discordants. Even with N=790 paired items, McNemar’s effective N is the flip count (e.g., ~109 for TI at β=5), yielding ~±2.6 pp CIs. Sub-1–2 pp TI shifts are hard to resolve without more items or richer labels.
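The ±2.6 pp figure follows from a Wald-style half-width driven almost entirely by the discordant count (a sketch; the CIs in the table may come from a slightly different method):

```python
from math import sqrt

def wald_halfwidth(n01: int, n10: int, n_pairs: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for the paired risk difference."""
    var = (n01 + n10 - (n01 - n10) ** 2 / n_pairs) / n_pairs**2
    return z * sqrt(var)

# TI at beta=5 vs beta=1: 57 + 52 = 109 discordant pairs out of N = 790
halfwidth = wald_halfwidth(57, 52, 790)  # ~0.026, i.e. roughly +/- 2.6 pp
```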
Benchmark quirk. TruthfulQA’s TI penalizes generic IDK (T=1, I=0). That discourages universal refusal (good) but also means reducing falsehoods via abstention doesn’t necessarily raise TI—a mismatch with many safety goals.
No causal stress tests. We don’t test whether β improves robustness under adversarial prompts, domain shift, or knowledge cut-off traps; we evaluate only on a single held-out pooled set.
All code and implementation details are available at https://github.com/jfpio/iti-idk-weighting