One of the hardest challenges for large language models (LLMs) is factuality: ensuring that the information they generate is faithful to both the given context and reliable external sources. This session explores the challenges of evaluating factuality in LLMs, with a focus on practical applications for education leaders. We'll cover key metrics like faithfulness and accuracy, discuss how to detect bias, and examine the difficulties of evaluating diverse outputs such as text, images, and code. Additionally, we'll introduce the FACTS Grounding benchmark, a valuable new resource for assessing the ability of LLMs to generate factually accurate responses, particularly when dealing with complex or lengthy source material. Join us to learn how to critically evaluate the factuality and reliability of LLMs, so you can make informed decisions about their potential use in educational settings.
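To make the idea of a faithfulness metric concrete, here is a minimal illustrative sketch of the kind of check the session will discuss. It is not the FACTS Grounding methodology (which relies on LLM judges rather than word overlap); the function names, the overlap heuristic, and the 0.6 threshold are all hypothetical choices for this example.

```python
import re

def sentence_supported(sentence: str, context: str, threshold: float = 0.6) -> bool:
    """Crude proxy for 'grounded': the fraction of a sentence's words
    that also appear in the provided context document.
    (Illustrative only; real benchmarks use far stronger judges.)"""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    if not words:
        return True
    overlap = len(words & context_words) / len(words)
    return overlap >= threshold

def faithfulness_score(response: str, context: str) -> float:
    """Share of response sentences that appear supported by the context."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", response) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(sentence_supported(s, context) for s in sentences)
    return supported / len(sentences)

if __name__ == "__main__":
    context = "The district enrolled 4,200 students in 2023. Math scores rose by 5 percent."
    response = "The district enrolled 4,200 students in 2023. Reading scores doubled."
    print(f"Faithfulness: {faithfulness_score(response, context):.2f}")  # prints ~0.50
```

In this toy example, the first response sentence is fully supported by the context while the second ("Reading scores doubled") is not, so the score is 0.5. Benchmarks like FACTS Grounding pursue the same goal with much more robust judging of whether each claim is entailed by the source material.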