Boosted by kornel ("Kornel"):
pamelafox@fosstodon.org ("Pamela Fox") wrote:
BullshitBench: a benchmark that measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
https://github.com/petergpt/bullshit-benchmark
