How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
Abstract
Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models fine-tuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, providing new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder
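The abstract describes pruning along three dimensions: instruction complexity, response quality, and instruction diversity. A minimal sketch of such a selection step might look like the following; the scoring values and the unweighted averaging are illustrative stand-ins, not the paper's actual scorers (the paper uses trained models such as XCoder-Complexity-Scorer and a unit test model for scoring):

```python
# Hypothetical sketch of three-dimension data pruning: each training sample
# gets a (complexity, quality, diversity) score in [0, 1]; we keep the
# top-k samples by a simple combined score. The scores below are made up.

def prune(samples, scores, k):
    """Select the k samples with the highest combined score.

    `samples` is a list of instruction/response pairs; `scores` is a
    parallel list of (complexity, quality, diversity) tuples.
    """
    combined = [sum(s) / len(s) for s in scores]  # unweighted average
    ranked = sorted(range(len(samples)), key=lambda i: combined[i], reverse=True)
    return [samples[i] for i in ranked[:k]]

data = [
    {"instruction": "Write a merge sort", "response": "..."},
    {"instruction": "Print hello world", "response": "..."},
    {"instruction": "Implement an LRU cache", "response": "..."},
]
scores = [(0.6, 0.8, 0.5), (0.2, 0.9, 0.3), (0.9, 0.7, 0.8)]

top2 = prune(data, scores, 2)
print([s["instruction"] for s in top2])
```

In practice, the three dimensions would be produced by learned scorers rather than hand-assigned numbers, and the combination could be weighted or rank-based instead of a plain average.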
Community
Hi @dongguanting, congrats on your work!
Would be great to link the models and dataset to this paper, see here on how to do that: https://huggingface.co/docs/hub/en/paper-pages#linking-a-paper-to-a-model-dataset-or-space.
Cheers!
Our resources are listed below:
Xcoder-80K instruction tuning dataset: https://huggingface.co/datasets/banksy235/XCoder-80K
XCoder-8B checkpoint: https://modelscope.cn/models/banksy235/XCoder-8B
XCoder-70B checkpoint: https://modelscope.cn/models/banksy235/XCoder-70B
XCoder-Complexity-Scorer: https://modelscope.cn/models/banksy235/XCoder-Complexity-Scorer
Unit test model: https://modelscope.cn/models/banksy235/Unit_Test_Model
Really cool work 🔥 It would be nice to add more detailed information to the model card : )
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs (2024)
- CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization (2024)
- FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only (2024)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models (2024)
- ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend