eval_harness_v043_updates
Change 1
WHAT: Updates requirements.txt to the newest lm_eval version, 0.4.3. This also requires accelerate>=0.26.0.
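For illustration, the relevant pins might look like this in requirements.txt (the exact pin syntax is an assumption; only the version constraints come from this change):

```
# requirements.txt (illustrative excerpt)
lm_eval==0.4.3
accelerate>=0.26.0
```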
Change 2
WHAT: Removes the no_cache argument from the lm_eval simple_evaluate function.
WHY: no_cache (bool) was replaced with use_cache (str), a path to a sqlite db file for caching model responses, and None if not caching; see https://github.com/EleutherAI/lm-evaluation-harness/commit/fbd712f723d39e60949abeabd588f1a6f7fb8dcb#diff-6cc182ce4ebf9431fdf0ef577412f518d45396d4153a3825496304fa0f857c2d
FILES AFFECTED:
- src/backend/run_eval_suite_harness.py
- main_backend_harness.py
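As a minimal sketch of the new caching interface (the model and task names below are placeholders, not the template's real configuration):

```python
from lm_eval import simple_evaluate

# In lm_eval 0.4.3 the old no_cache (bool) flag is gone; caching is controlled
# by use_cache, which takes a path to a sqlite db file, or None to disable
# caching entirely.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",    # placeholder model for illustration
    tasks=["hellaswag"],             # placeholder task list
    use_cache=None,                  # None = no caching (replaces no_cache=True)
    # use_cache="lm_cache/model.db", # or a sqlite db path to enable caching
)
```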
Change 3
WHAT: Changes the import for run_auto_eval so that it calls the lm_eval task Harness, not lighteval.
WHY: The description of the templates specifies that the Harness is being used: "launches evaluations through the main_backend.py file, using the Eleuther AI Harness."
FILES AFFECTED:
- main_backend_harness.py
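A hypothetical sketch of the import switch (the lighteval module and function names below are assumptions inferred from the file list, not verified against the template):

```python
# main_backend_harness.py (illustrative excerpt)

# Before: the evaluation entry point was imported from the lighteval backend,
# e.g. something like:
#   from src.backend.run_eval_suite_lighteval import run_evaluation

# After: the entry point comes from the Eleuther AI Harness backend instead.
from src.backend.run_eval_suite_harness import run_evaluation
```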
Change 4
WHAT: Sets batch_size to "auto".
WHY: The Harness will automatically determine the batch size, based on the compute the user has set up.
FILES AFFECTED:
- main_backend_harness.py
- src/backend/run_eval_suite_harness.py (a typing change to accept "auto" string)
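The typing change might look roughly like the sketch below (the real run_evaluation takes more parameters; the model and tasks shown are placeholders):

```python
from typing import Union

from lm_eval import simple_evaluate

# batch_size is typed as Union[int, str] so the string "auto" is accepted;
# with "auto", the Harness picks a batch size that fits the available compute.
def run_evaluation(tasks: list, batch_size: Union[int, str] = "auto") -> dict:
    return simple_evaluate(
        model="hf",
        model_args="pretrained=gpt2",  # placeholder model for illustration
        tasks=tasks,
        batch_size=batch_size,         # "auto" or an explicit integer
    )
```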
Change 5
WHAT: Additional updates to src/backend/run_eval_suite_harness.py for running the Harness code in v0.4.3:
- The ALL_TASKS constant as previously defined is deprecated. This commit introduces another way to get those same values, using TaskManager(). NB: there appears to be another alternative option that I have not tested: from lm_eval.api.registry import ALL_TASKS.
- Specifies "hf" as the model value, which is the recommended default. The previously defined "hf-causal-experimental" has been deprecated. See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1235#issuecomment-1873940238
- Removes the output_path argument, which is no longer supported in lm_eval simple_evaluate. See: https://github.com/EleutherAI/lm-evaluation-harness/commit/6a2620ade383b8d30592fc2342eb1d213ad4b4cb NB: There may be an option to add something similar or comparable in another way, which I'm not experimenting with here. The argument log_samples, for example, might be added here and set to True.
- Additional minor: The definition of device uses the term "gpu:0" -- I think "cuda:0" is meant. (See the sketch after the file list below.)
FILES AFFECTED:
- src/backend/run_eval_suite_harness.py
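Putting these points together, a rough sketch of what the updated call could look like under lm_eval 0.4.3 (the model name, task list, and device are placeholders, not the template's actual configuration):

```python
from lm_eval import simple_evaluate
from lm_eval.tasks import TaskManager

# TaskManager replaces the deprecated ALL_TASKS constant as the way to list
# the registered task names.
task_manager = TaskManager()
all_tasks = task_manager.all_tasks

results = simple_evaluate(
    model="hf",                    # replaces the deprecated "hf-causal-experimental"
    model_args="pretrained=gpt2",  # placeholder model for illustration
    tasks=["hellaswag"],           # placeholder task list
    batch_size="auto",
    device="cuda:0",               # "cuda:0" rather than "gpu:0"
    use_cache=None,
    log_samples=True,              # optional: keeps per-sample outputs now that
                                   # output_path is not a simple_evaluate argument
    task_manager=task_manager,
)
```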
LGTM, thanks!