Understanding and Running LLM Benchmarks
Learn how to interpret and conduct language model benchmarks for general and task-specific performance evaluation.

Adam Lucek
7.1K views • Dec 2, 2024

About this video
Interpreting and running standardized language model benchmarks and evaluation datasets for both generalized and task-specific performance assessments!
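For reference, a minimal sketch of installing the harness, following the lm-evaluation-harness README linked below (the setup script under Resources wraps the same steps and may differ in detail):

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .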
Resources:
lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
OpenLLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
YALL Leaderboard: https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard
MMLU Paper: https://arxiv.org/pdf/2009.03300
ARC Paper: https://arxiv.org/pdf/1803.05457
Orpo-Llama-3.2-1B-40k Model: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k
Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: ARC-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts
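As a rough sketch of the eval run walked through in the Running Evaluations chapters above, a CLI call along these lines scores the linked Orpo-Llama-3.2-1B-40k model on ARC-Challenge (flags follow the harness README; the exact few-shot count, batch size, and device used in the video may differ, and 25-shot here simply mirrors the original Open LLM Leaderboard setting):

lm_eval --model hf \
  --model_args pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --device cuda:0 \
  --batch_size 8

The run prints a table of per-task metrics (e.g. acc and acc_norm for arc_challenge), which is what the Interpreting Results chapter walks through.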
#ai #datascience #programming