Identifiability

The limits of black-box evaluations: two hypotheticals

Apr 11, 2025 · 6 min read · Evaluations

A prominent approach to AI safety goes by the name of "evals" or "evaluations". These are a critical component of the plans of various major labs, such as Anthropic…

Is Agency Identifiable?

Apr 19, 2023 · 10 min read · AI Safety

Identifiability in IRL: One of my favorite papers is this one, titled "Occam's razor is insufficient to infer the preferences of irrational agents". It relates to an area of…