All LLMs Write Great Code, But Some Make (A Lot) Fewer Mistakes

Community Article · Published September 12, 2024

A hypothetical scenario

If you are a seasoned software engineer who has been through many coding interviews, you have probably run into candidates like this. They understand the problem on the spot and dive into action immediately. They code fast. Reading their final code is a simple pleasure: excellent structure, appropriate indentation, easy-to-understand naming, etc. Except there is one non-fatal BUG, presumably an oversight. You nudge the candidate a few times, but they still don't get it. So, with some regret in your reviewer comments, you recommend hiring them as a junior engineer with enormous potential. So far so good, ... UNTIL you meet the next candidate.

Besides speed, readability, commenting, etc., this candidate's code is spotless, with NO BUG. Your first reaction is "I myself could not pull this off". Your next thought is "we must have this person on our team", assuming you are not concerned about your own job security.

Motivation

The above is the mental picture that formed in my mind after I finished the paper Insights from Benchmarking Frontier Language Models on Web App Code Generation. I designed the WebApp1K benchmark to be easy to run and fair among competitors. The rule is simple: make your code pass the given unit tests. I also reused the pass@k metric proposed by HumanEval.
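For readers unfamiliar with it, pass@k is the unbiased estimator from the HumanEval paper: generate n code samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one out of k drawn samples passes. Here is a minimal sketch of that estimator in TypeScript (my own illustration, not code from the benchmark):

```typescript
// Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
// n: samples generated per problem, c: samples that pass all unit tests,
// k: number of samples we are allowed to draw.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // too few failing samples to ever miss
  // 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= 1.0 - k / i;
  }
  return 1.0 - prod;
}

// Example: 20 samples per problem, 12 of them passing -> pass@1 = 0.6
console.log(passAtK(20, 12, 1)); // 0.6
```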

The outcome is surprising: the performance gap among models is much wider than I expected. The problems are not very hard (the average lines of code is ≤ 50), and all the code looks great by eyeballing. Using the analogy from the coding-interview scenario, GPT and Claude are the perfect candidates, and the top open-source models are the candidates with great potential.

On the other hand, the small code output presents an opportunity: we now have a case study in a controlled environment, and the complexity is manageable. So I decided to dig into the logs hoping to learn something; hence the paper.

They make the same bugs

There are seven types of bugs (details in the paper). Even the best-performing GPT-4o models make all of them. The differentiator is that the top models make 10x fewer bugs.

How helpful is prompt engineering?

Now that we know what kinds of bugs a model can make, can we prompt it to avoid them? I ran lots of experiments, and my only success was reminding the model not to call useHistory, a deprecated hook from the React Router library. All other attempts failed.
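For context, useHistory was removed in React Router v6 in favor of useNavigate, so code that still calls it fails against a v6 test setup. A hypothetical before/after sketch (the component and route are my own illustration):

```tsx
// React Router v5 (deprecated, removed in v6):
//   const history = useHistory();
//   history.push('/orders');

// React Router v6:
import { useNavigate } from 'react-router-dom';

function CheckoutButton() {
  const navigate = useNavigate();
  return <button onClick={() => navigate('/orders')}>Checkout</button>;
}
```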

In fact, all bugs come down to failures to meet expectations set by the unit tests, which are already in the prompt. For example, a type B bug is mismatched text. The tests expect the rendered HTML to carry certain UI elements showing text like "Submit", "Share", etc. The successful code meets all of these expectations, of course. The failed code meets them too, except it gets the text slightly wrong, like "submit" or "share".
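To make this concrete, here is a hypothetical test in the style the prompts contain (the component and label are my own illustration, not an actual WebApp1K test). React Testing Library's getByText matches the exact string by default, so a model that renders "share" instead of "Share" fails the assertion:

```tsx
import '@testing-library/jest-dom';
import { render, screen } from '@testing-library/react';
import ShareButton from './ShareButton'; // hypothetical component under test

test('renders the Share button', () => {
  render(<ShareButton />);
  // Exact, case-sensitive match: a component that renders "share"
  // instead of "Share" fails here -- a type B (mismatched text) bug.
  expect(screen.getByText('Share')).toBeInTheDocument();
});
```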

Right code is different from wrong code

This is obviously true in terms of correctness, but there are also statistical differences between the two bodies of code. Below is the success-vs-failure lines-of-code (LOC) distribution of one application: the success distribution is bimodal while the failure distribution is unimodal. The paper has many more such charts.

What is next

The first thing is to enhance the benchmark with more tasks and more programming frameworks. If we raise the LOC to 200 or 500, will the same observations still hold? There will also be new observations to catch.

The second is to continue digging through the logs for answers. There are lots of open questions. It is definitely not a knowledge gap that differentiates the models. You can argue that all the other factors (post-training, alignment, instruction following, etc.) are influential, but I believe concrete insights can be extracted, and with them, eventually, a formula to level model performance.

A huge thanks to 🤗HuggingFace🤗

When I decided to share the benchmark (dataset and leaderboard), choosing HuggingFace was a no-brainer. Onboarding is easy and self-explanatory, and the tooling is exactly what you need.

But I was just blown away by their speed of execution. They featured my paper the moment it showed up on arXiv, before I even plugged it!

So, my sincere thanks to HF. They truly understand the community and each individual in it, from the grand scheme of things to the daily grind.