walterShen's Collections
Code LMs Evaluation
A Survey on Language Models for Code (arXiv:2311.07989) • Published • 21 upvotes
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770) • Published • 4 upvotes
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (arXiv:2401.03065) • Published • 10 upvotes
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming (arXiv:2402.14261) • Published • 10 upvotes
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code (arXiv:2302.05527) • Published • 1 upvote
Copilot Refinement: Addressing Code Smells in Copilot-Generated Python Code (arXiv:2401.14176) • Published
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model (arXiv:2310.06266) • Published • 1 upvote
TACO: Topics in Algorithmic COde generation dataset (arXiv:2312.14852) • Published • 4 upvotes
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (arXiv:2310.11248) • Published • 3 upvotes
DevEval: Evaluating Code Generation in Practical Software Projects (arXiv:2401.06401) • Published
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models (arXiv:2309.01940) • Published • 1 upvote
Improving Natural Language Capability of Code Large Language Model (arXiv:2401.14242) • Published • 1 upvote
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (arXiv:2305.01210) • Published • 4 upvotes
A Static Evaluation of Code Completion by Large Language Models (arXiv:2306.03203) • Published • 3 upvotes
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems (arXiv:2306.03091) • Published • 1 upvote
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (arXiv:2208.08227) • Published • 1 upvote
Large Language Models Are State-of-the-Art Evaluators of Code Generation (arXiv:2304.14317) • Published • 2 upvotes
Textbooks Are All You Need II: phi-1.5 technical report (arXiv:2309.05463) • Published • 87 upvotes
Textbooks Are All You Need (arXiv:2306.11644) • Published • 142 upvotes
Evaluating Large Language Models Trained on Code (arXiv:2107.03374) • Published • 6 upvotes
Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163) • Published • 93 upvotes
Large Language Models Meet NL2Code: A Survey (arXiv:2212.09420) • Published • 1 upvote
Large Language Models for Software Engineering: A Systematic Literature Review (arXiv:2308.10620) • Published • 1 upvote
Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey (arXiv:2310.17903) • Published • 1 upvote
A Survey on Pretrained Language Models for Neural Code Intelligence (arXiv:2212.10079) • Published • 1 upvote
An Empirical Comparison of Pre-Trained Models of Source Code (arXiv:2302.04026) • Published • 1 upvote
Towards an Understanding of Large Language Models in Software Engineering Tasks (arXiv:2308.11396) • Published • 1 upvote
StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173) • Published • 134 upvotes
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (arXiv:2401.14196) • Published • 46 upvotes
Unsupervised Evaluation of Code LLMs with Round-Trip Correctness (arXiv:2402.08699) • Published • 1 upvote
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities (arXiv:2311.16169) • Published • 1 upvote
Magicoder: Source Code Is All You Need (arXiv:2312.02120) • Published • 79 upvotes
On the Effectiveness of Large Language Models in Domain-Specific Code Generation (arXiv:2312.01639) • Published • 1 upvote
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (arXiv:2311.08588) • Published
Fusion-Eval: Integrating Evaluators with LLMs (arXiv:2311.09204) • Published • 5 upvotes
CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models (arXiv:2302.00288) • Published
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain (arXiv:2310.14053) • Published
Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation (arXiv:2308.10335) • Published
CodeScore: Evaluating Code Generation by Learning Code Execution (arXiv:2301.09043) • Published
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (arXiv:2306.14898) • Published
Language Models for Code Completion: A Practical Evaluation (arXiv:2402.16197) • Published • 1 upvote
DevBench: A Comprehensive Benchmark for Software Development (arXiv:2403.08604) • Published • 2 upvotes
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models (arXiv:2404.03543) • Published • 15 upvotes
CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring (arXiv:2305.12050) • Published • 1 upvote
Stable Code Technical Report (arXiv:2404.01226) • Published • 1 upvote
CodeShell Technical Report (arXiv:2403.15747) • Published • 1 upvote
A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond (arXiv:2403.14734) • Published • 22 upvotes
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (arXiv:2307.02762) • Published
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535) • Published • 116 upvotes
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (arXiv:2404.18796) • Published • 68 upvotes