Mastodon Feed: Post

Boosted by kornel ("Kornel"):
pamelafox@fosstodon.org ("Pamela Fox") wrote:

BullshitBench: a benchmark that measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
https://github.com/petergpt/bullshit-benchmark

First rows of bullshitbench with results