Understanding and Running LLM Benchmarks πŸ“Š

Learn how to interpret and conduct language model benchmarks for general and task-specific performance evaluation.

By Adam Lucek

About this video

A walkthrough of interpreting and running standardized language model benchmarks and evaluation datasets, covering both general and task-specific performance assessment.

Resources:
lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
OpenLLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
YALL Leaderboard: https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard
MMLU Paper: https://arxiv.org/pdf/2009.03300
ARC Paper: https://arxiv.org/pdf/1803.05457
Orpo-Llama-3.2-1B-40k Model: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k
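
The lm-evaluation-harness linked above can be driven from the command line (as shown in the video) or through its Python API. As a rough sketch, assuming the v0.4-style `simple_evaluate` interface and the Hugging Face backend, an ARC-Challenge run against the model linked above looks roughly like this (exact metric key names vary between harness versions):

```python
# Rough sketch: evaluating a Hugging Face model on ARC-Challenge with the
# lm-evaluation-harness Python API (assumes a v0.4-style interface; install via
# `pip install lm-eval` or from the GitHub repo linked above).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                               # transformers backend
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k",  # model from the video
    tasks=["arc_challenge"],
    num_fewshot=25,      # 25-shot is the Open LLM Leaderboard setting for ARC
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics live under results["results"]; key names (e.g. "acc,none",
# "acc_norm,none") differ across harness versions, so inspect the dict directly.
print(results["results"]["arc_challenge"])
```

The roughly equivalent CLI invocation is `lm_eval --model hf --model_args pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k --tasks arc_challenge --num_fewshot 25 --batch_size 8 --device cuda:0`.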

Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: ARC-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts
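
For the MMLU and ARC chapters (01:21 and 09:03), it helps to know how multiple-choice benchmarks are typically scored: rather than grading free-form generations, the evaluator compares the model's log-likelihood of each answer choice and takes the highest-scoring one as the prediction. The snippet below is a simplified illustration of that idea using transformers; the question is a made-up placeholder, and the prompt/continuation boundary handling is cruder than what the harness actually does.

```python
# Simplified sketch of log-likelihood scoring for a multiple-choice benchmark item.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdamLucek/Orpo-Llama-3.2-1B-40k"  # model evaluated in the video
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder question, not taken from any benchmark
question = "Question: Which planet is known as the Red Planet?\nAnswer:"
choices = [" Venus", " Mars", " Jupiter", " Saturn"]

def choice_loglikelihood(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted logits predicts token i+1 of the input
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    cont_targets = full_ids[:, -cont_len:]
    cont_log_probs = log_probs[:, -cont_len:, :].gather(2, cont_targets.unsqueeze(-1))
    return cont_log_probs.sum().item()

scores = [choice_loglikelihood(question, c) for c in choices]
prediction = choices[scores.index(max(scores))]
print(f"Predicted answer:{prediction}")
```

Accuracy on the benchmark is then simply the fraction of items where the top-scoring choice matches the labeled answer; normalized variants (e.g. acc_norm) adjust the per-choice scores for answer length.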

#ai #datascience #programming

Video Information

Views: 7.1K
Likes: 252
Duration: 30:56
Published: Dec 2, 2024
