Empirical Study of Test Generation with LLM's

By nickpsecurity at 8:10 AM

fovc | 6 comments | 3 days ago

I still don’t understand how people are getting value out of AI coders. I’ve tried really hard and the commits produced are just a step up from garbage. Writing code from scratch is generally decent. But after a few rounds of edits the assistant just starts piling in conditionals into existing functions until it’s a rats nest 4 layers deep and 100+ lines long. The other day it got into a loop trying to resolve a type error, where it would make a change, then revert it, then make it again

ETA: Sorry forgot about the relevancy in my rant! The one area where I’ve found the AIs helpful is enumerating and then creating test cases

TacticalCoder | 2 comments | 3 days ago

> I still don’t understand how people are getting value out of AI coders. I’ve tried really hard and the commits produced are just a step up from garbage.

These aren't mutually exclusive. I pay for ChatGPT. It sucks fat balls at coding but it's okay to do things like "Bash: ensure exactly two params are passed, 1st one is a dir, 2nd is a file". This is slightly faster than writing it myself so it's worth $20 a month but that's about it.

"from now on no explanation, code only" also helps.

Does is still stuck? Definitely. But it's one more tool. I can understand why one wouldn't even bother though.

Lerc | 0 comments | 3 days ago

>"from now on no explanation, code only" also helps.

Does it? Without using a model with a internal monologue interface, the explanation is the only way for the model to do any long form thinking. I would have thought that requesting an explanation of the code it is about to write would be better. An explanation after it has written the code would be counterproductive, because it would be flavouring the explanation to what it actually wrote instead of what it was wanting to achieve.

godelski | 1 comment | 3 days ago

At that point, why not learn bash?

I often hear how great GPT is at bash but imo it is terrible. Granted, most bash code is terrible, though I'm not sure why. It's like people don't even know how to use functions. It's petty quick to get to an okay level too! (I suspect few people sit down to learn it and instead learn it a line at a time over a very sparse timeframe)

The other part is compounding returns. This is extra obvious with bash. Getting good at shell scripting also helps you be really good at using the shell and vise versa. The returns aren't always obvious though, but you'll quickly find yourself piping into sed or xargs or writing little for loops or feeling like find actually makes sense. Pretty soon you'll be living in the terminal and questioning why others don't.

Bash scripting is an insanely underrated skill.

In general this is something I find problematic with AI code generation. The struggle is part of the learning process. Training wheels are great and at face value AI should help you learn. But it's like having a solution manual to your math homework. We both know 9/10 people go for the answer and not use it to get unstuck. It's also a bit hard with LLMs because they aren't great doing one line at a time and not spoiling the next steps. But I'm sure you can prompt engineer this to a decent degree of success.

jitl | 1 comment | 2 days ago

I am fine with bash as a programming language but find Cursor cmd-k very nice:

For boilerplate like the grandparent comment poster - pop 2 args off $@ and check each matches some condition in an if, write error to stderr, return 2, check $@ has no args left: i can write all this code, but it's much faster to type "cmd-k, Bash: ensure exactly two params are passed, 1st one is a dir, 2nd is a file, enter" than to type the code directly. Here AI takes this task from "30 seconds" to "1 second".

For awk | sed | cut style transform pipelines, I can provide an example input text and describe my desired output, and AI does a great job writing the pipeline. Again, I can write this code (although usually requires multiple rounds of trail and error) but using AI takes it from "few minutes" to "few seconds" of time.

godelski | 0 comments | 14 hours ago

  > Bash: ensure exactly two params are passed, 1st one is a dir, 2nd is a file, enter" than to type the code directly

This seems much simpler than you made it out to be. Why not use ${#FOO[@]} for the number of args.

If you have strict ordering, your program is just as fast to type as it is to ask cursor (imo)

  [[ $# == 2 && -d "$1" && -a "$2" ]] || exit 2

  alternatively conditionally assigning

  [[ $# == 2 && -d "$1" && -a "$2" ]] && PREFIX="$1" && FILE="$2" || exit 2

I'll give you that it is probably faster if you are wanting to write it more correct and be order invariant (not what you asked), but with boiler plate I often just use macros.

I'm curious, what is Cursor's solution? Here's a pretty quick way that leaves some room for expandablity

  #!/usr/bin/env bash
  
  declare -ri MAXARGS=2
  declare -rA ECODES=(
      [GENERAL]=1
      [BADARGS]=2
  )

  FILE=
  PREFIX=

  fail_with_log() {
      echo "$1"
      exit ${ECODES[$2]}
  }

  check_args() {
      [[ $# == $MAXARGS ]] \
          || fail_with_log "Wrong number of arguments, expected ${MAXARGS} got $#" "BADARGS"
  
      # We might want to use -f instead of -a 
      # Check if PREFIX + FILE or FILE + PREFIX
      if [[ -d "$1" && -a "$2" ]];
      then
          PREFIX="$1"
          FILE="$2"
      elif [[ -a "$1" && -d "$2" ]];
      then
          FILE="$1"
          PREFIX="$2"
      else
          fail_with_log "Must provide both a file and a directory" "BADARGS"
      fi 
  } 

  main() {
      check_args "$@"
      ...
  }

  main "$@" || exit ${ECODES["GENERAL"]}

I mean a bunch of stuff above are superfluous but it doesn't take long to write (can be macro'd) and that little bit adds some flexibility if we want to extend down the line. There's of course, many ways to write this too. We could use a loop to provide more extensibility or even more conditional assignments. But this feels clearer, even if not very flexible.

nzach | 2 comments | 3 days ago

> the commits produced

Maybe this is the problem ? I quite like using LLMs for coding, but I don't think we are in a position where a LLM is able to create a reasonable commit.

For me using LLMs for coding is like a pair programming session where YOU are the co-pilot. The AI will happily fill you screen with a lot of text, but you have the responsibility to steer the session. Recently I've been using Supermaven in my editor. I like to think of it as 'LSP on steroids', it's not that smart but is pretty fast and for me this is important.

Another way I use LLMs to help me is by asking open-ended questions to a more capable but slower LLM. Something like "What happens when I read a message from a deleted offset in a Kafka topic?" to o1. Most of the time it doesn't give great answers, but it generally gives good keywords to start a more focused Google search.

godelski | 2 comments | 3 days ago

Do you think this is actually faster than reading the docs? Every attempt I've had with LLM pair programming (I do it at least once a month trying to measure performance) ends up taking more time than if I turned to Google out the docs, even with how bad Google has gotten. Though it _feels_ faster because it _feels_ like you're making continual progress where reading docs doesn't have the same feeling. I suspect this is a confounder but I'm open to just being bad at AI programming (though isn't it meant to be universal? I mean I'm a ML researcher and that's what the papers promise).

I'm also curious if you think it helps you improve. Docs tend to give extra information that turns out to be useful now and many times later.

I still like and use LLMs a lot though. I find them useful in a similar way to your last paragraph. My favorite usage is to ask it domain topics where I'm not a domain expert. I never trust the response (it's commonly at best oversimplified/misleading but often wrong), but since it will use similar language to those in the field I can pick out keywords to improve a google search, especially when caught in term collision hell (i.e. Google overfitting and ignoring words/quotes/etc).

I do also find it helpful in validating what I think some code does. But same as above, low trust and use as a launching off point for deeper understanding.

Basically, I'm using LLMs as a fuzzy database optimized towards the data median with a human language interface. That is, after all, what they are.

jitl | 1 comment | 2 days ago

Often the docs for a library suck or lack suggestions about suggested structure, but LLM has ingested many open-source example uses from Github etc and so can offer code examples or explanations totally absent from the docs. I find this if a library has lots of comments for individual functions or struct types, but little tutorial-style and module-level docs.

Other times the docs are hundreds of pages, "read all the docs" is too much reading for a simple task, and so asking AI for just the code please is the right move to get started.

godelski | 0 comments | 2 days ago

  > Often the docs for a library suck or lack suggestions about suggested structure

I don't disagree here. I do preach that one should document as you code. This gets a lot of pushback. But tbh, I think the one who benefits the most is yourself. What's the old joke? "What idiot wrote this code? Oh... that idiot was me." Give it a week and there's a good chance I forgot my though process, what I was doing, and why. Depends how many projects I have and how attention must be split.

But there's a lot of ways that documentation happens. One of my favorites is unit tests. Before I turn to reading source code to figure out how something works, I go look at unit tests. That way I prime myself for what to look for when I do need to look at source.

FWIW, even with highly documented things I go look at source pretty regularly. I'm a ML researcher and I'll tell you that I go poking around torch's code at least once a month. If you're using libraries quite frequently, it is probably worth doing this.

I also want to say that I remember the fear of doing this when I was much more junior. That it all looked like gibberish and inparseable. But this is actually the same thing that happens to any fledgling PhD student. Where you're required to read a bunch of papers that don't make sense. The magic is that after reading enough, they do start to make sense. Same is true for code. There is a certain "critical mass" needed to pull all the ideas together. Yes, LLMs can help make this process easier and probably reduces the requisite mass, but I also HIGHLY suggest that you still read source (not for everything! Of course! But anything you use regularly). It is scarier than it seems and you'll honestly be able to parse it before you know it. Learning requires struggling and as humans we like to avoid struggle. But the investment pays off big time and quickly compounds. Were I to go back in time and advise myself, I'd tell myself to do this earlier. I was just fortunate enough that I had a few projects where this ended up being requires and (un)fortunately extremely difficult to parse code (I was in template metaprogramming hell).

There's a bad habit in CS: move fast and break things; or alternatively, just ship. This is great in the learning process. You learn from breaking things and getting your minimal viable product is hard because you don't even know everything you need till you are knee deep into the problem. But the bad habit is to not go back and fix things. Little optimizations also compound quickly. You can never escape tech debt, but boy does it pile on invisibly and quickly. There's a saying that a lot of tradesmen and engineers use: "do it right, or do it twice." (alternatively: "why is there always time to do things twice but never enough to do it right?") I think we could learn from this. Software is so prolific these days that we need to move from living in the wild west. Truth be told, the better LLMs get, the more important this will be for you.

nzach | 1 comment | 3 days ago

> Do you think this is actually faster than reading the docs?

It depends on your goals, I guess. Do I really need to read the whole D3.js docs just to transform a csv into a pretty map? I'm not arguing against the docs, I genuinely think that by reading the D3.js docs I would become a better professional. But what is the ROI for this effort ?

Nowadays learning about a topic is a choice we can make, we can create things that solve real problems without the need to fully understand the solution. This wasn't feasible a couple years back. And choosing 'not to learn' too many times is a great recipe for disaster, so I understand why a lot of people are worried about giving this option to people.

Besides that the "always read the docs" theory makes an assumption that isn't always true. This assumes you know what are you looking for and where you can find it. When I was younger I was assigned to a task that required me to put a new feature in a jenga tower of custom bash scripts and I've found a bug that completely stumped me, it took me an entire week to figure out I was missing some quotes around $@ when passing arguments from one script to the next. I've spent several hours trowing random combinations of keywords to try finding something relevant to my problem. Now I know this is a bash-related problem, but at the time this wasn't clear. It might have been something in my logic, or something to do with not being an TTY, or something to do with the version of the tools I was using.... Having a LLM would have saved a week of frustration because I could just vaguely describe my problem and ask questions about my issue to point me to right direction faster.

> I never trust the response (it's commonly at best oversimplified/misleading but often wrong)

This reminds me of my middle school. When I was there technology and especially Wikipedia was just starting to get popular and almost every week some teacher would lecture us about the dangers of these new technologies and how you should never trust anything on the internet. As time passed the quality of Wikipedia content increased, but this idea of never blindly trusting something you found on the internet really stuck with me. And now LLMs are just another thing on the internet you should never blindly trust for me. Maybe that is part of the reason why I don't get too angry when the LLM tell me something that is wrong.

> I'm using LLMs as a fuzzy database

That is a really good way of putting it. And I agree this is a great use-case for LMMs. But I think we still have a lot to learn about how to effectively use LLMs. I just hope people don't get too caught up in all the marketing surrounding this AI hype cycle we are living right now.

godelski | 0 comments | 2 days ago

Sure, I agree.

When in the "move fast and break things" phase, yeah, LLMs can be useful. But do you ever move out to clear the tech debt or do you just always break things and leave a mess behind?

The issue is that it all depends. If something doesn't matter, then yeah, who fucking cares how inefficient or broken it is. But if it does, it's probably better to go the slow way because that knowledge will compound and be important later. You are going to miss lessons. Maybe you don't need them, maybe you don't need them now. It depends and is hard so say. But I would say that for juniors, it is bad to become reliant upon LLMs.

It's dangerous in exactly the same way as having a solution manual. They can be quite useful and make you learn a lot faster. OR they can completely rob you of the learning experience. It all depends how you use it, right? And for what purpose. That's the point I'm trying to get across.

  > it took me an entire week to figure out I was missing some quotes around $@ when passing arguments from one script to the next

This is a common issue, and one I faced too (though not that long). But it also led to more understanding. I learned a lot more about the quotes and variable passing. If someone had just given me the answer, I wouldn't have gotten the rest. Don't undermine the utility of second order effects.

fovc | 2 comments | 3 days ago

The “pair programming” approach with good models is just slow enough that I lose focus on each step. The faster models I’ve tried are not good enough except for straightforward things where it’s faster to just use emacs/LSP refactoring and editing tools. Maybe supermaven manages to beat the “good enough, fast enough” bar; I’ll have to try it!

nzach | 0 comments | 3 days ago

One thing I've realized after using a really fast model is that the time it takes the model to generate a suggestion is proportional to the size of suggestion I'm willing to accept. And in my experience the quality of suggestions decreases when the suggestion size increases.

If the model takes a couple seconds to generate a suggestion I get inclined to accept several lines of code. But if the suggestion takes just 300ms to generate I don't feel the "need" to accept the suggested code.

I'm not really sure why that happens, maybe that's just the sunk cost fallacy happening right in my editor? If I wait 5 seconds for a suggestion and don't use the suggestion did I effectively just wasted 5 seconds of my life for no good reason?

jitl | 0 comments | 2 days ago

I think having supermaven or cursor-style IDE integration is really key to making LSP worth it, otherwise friction around the workflow overwhelms the gains for many tasks. Like, I would never use AI to generate code or tests if it involved copy-pasting code in and out of webpage text boxes. But with IDE integration, I often write a function signature and doc comment with no plans to ask AI for anything, but the AI happens to offer a correct tab completion of the whole function. That's great, much less typing for me.

lumost | 0 comments | 3 days ago

I use it for the boiler plate, uber google, automated reviewer, rubber duck design reviewer, and junior engineer given extremely precise instructions.

The latest models (o1, Claude sonnet new) are decent at generating code for up to 1k lines. More than that, they start to struggle. On large code bases they lose the plot quickly and generate gibberish. I only use them as code summarizers and as a Google replacement in that context.

Art9681 | 0 comments | 2 days ago

The best developers of the future arent the ones who mastered Python or Rust. It will be people who can describe complex things using a new permutation of English. That new English is evolving right now.

So using Ai isnt just about learning some stack or tinkering with a model. Its learning how to communicate with the AI under the current constraints.

Perhaps this goes against the ultimate goal. Speak plainly and AI understands. That's AGI. Today, we need to speak AI English, and that's a new knowledge domain all its own. Hopefully it will be shortlived.

weitendorf | 0 comments | 3 days ago

Wielding GenAI effectively is genuinely a skill. I’m working on an AI developer tool product (a more agentic Cursor, but not fully agentic because the underlying LLM tech/hardware is simply not there yet) and have seen a lot of different techniques, good and bad, used by developers including myself. I could go on at length but relevant to what you commented:

1. Generally you want to give the LLM one well-specified task at a time. If you weren’t specific enough initially try clarifying a bit maybe, and if it makes a mistake maybe try one round of fixing it in the same conversation. Otherwise I always recommend putting followups and separate microtasks in a separate, new conversation (with some context carried over and some no-longer-relevant context pruned).

Every time you call an LLM it takes the entire conversation history as a parameter and generates the most likely response, which at least last I checked was O(n^2) for leading models. Long conversations force it to sift through tons of junk, bias responses towards what the model did previously, and often confuse it regarding objectives and instructions.

2. Don’t let the model make you forget that you know how to write software, and don’t believe everything it says. Make sure you actually read and understand the code it spits out, and try to think at least a little about any errors it causes. If it got most of the way there it’s usually easier to just do the last bit of code yourself IME, and you can still Google your errors as engineers have done for decades now.

3. Treat the model like the “doer” and don’t let it do the thinking. It’s great at converting instructions and code to more code, and for knowing lots of stuff about most things on the Internet, and to use as a sounding board. Anything more intellectually challenging than that you probably want to scope down into simpler stuff.

TLDR is you need to build a Theory of Mind for how to interact with LLM coding tools and know when to take over.

godelski | 0 comments | 3 days ago

I have a simple answer for you: most people write garbage code[00]. I know this because I write garbage code but usually less garbage.

A little background...

I got my undergrad in physics (where I fell in love with math), spent some years working, and got really interested in coding and especially ML (especially the math). So I went to grad school. Unsurprising I had major impostor syndrome being surrounded by CS people[0], so I spent a huge amount of time trying to fill in the gap. The real problem was that many of my friends were PL people, so they're math heavy and great programmers. But after teaching a bunch of high level CS classes I realized I wasn't behind. After having to fix a lot of autograders made by my peers, I didn't feel so behind. When my lab grew and I got to meet a lot more ML people, I felt ahead, and confused. I realized the problem: I was trying to be a physicist in CS. Trying to understand things at very fundamental levels and using that to build up, not feeling like I "knew" a topic until I knew that chain. I realized people were just saying they knew at a different threshold.

Back to ML:

Working and researching in ML I've noticed one common flaw. People are ignoring details. I thought HuggingFace would be "my savior" where people would see that their generation outputs weren't nearly the quality you see in papers. But this didn't happen. We cherry picked results and ignored failures. It feels like writing proofs but people only look at the last line (I'd argue this is analogous to code! It's about so much more than the output! The steps are the thing that matters).

So there's two camps of ML people now: the hype people and "the skeptics" (interestingly there's a large population of people with physics and math backgrounds here). I put the latter in quotes because we're not trying to stop ML progress. I'd argue we're trying to make it! The argument is we need to recognize flaws so we know what needs to be fixed. This is why Francois Chollet made the claim that GPT has delayed progress towards AGI. Because we are doing the same thing that caused the last AI winter: putting all our eggs in one basket. We've made it hard to pursue other ideas and models because to get published you need to beat benchmarks (good luck doing so out of the gate and without thousands of GPUs). Because we don't look at the limitations in benchmarks. Because we don't even check for God damn information spoilage anymore. Even HumanEval is littered with spoilage, and obviously so...

There's tons of uses for LLMs and ML systems. My "rage" (as with many others) is more about over promising. Because we know if you don't fulfill those promises quickly, sentiment turns against you and funding quickly goes away. Just look at how even HN went from extremely positive on AI to a similar dichotomy (though the "skeptics" are probably more skeptical than researchers.[1]). Is playing with fire. Prometheus gave it to man to enlighten themselves but they also burned themselves quite frequently.

The answer is:

you evaluate in more detail than others.

[00] of course it is. LLMs replicate average human code. They're optimizers. They optimize fitting data, not fitting optimal symbolic manipulation. If everyone was far better at code, LLMs would be too. That's how they work

[0] boy, us physicists have big egos but CS people give us a run for the money

[1] I have no doubt that AGI can be created. I have no doubt we humans can make it. But I highly doubt LLMs will get us there and we need to look in other directions. I'm not saying we shouldn't stop perusing LLMs, I'm saying don't stop the other research from happening. It's not a zero sum game. Most things in the real world are not (but for some god damn reason we always think it is)

jitl | 2 comments | 3 days ago

My cursor workflow for getting tests is to make the test file, import the code under test, and then type cmd-k, “unit tests for <class>”, enter. Add additional cmd-k to prompt for method tests and cases as needed.

Pretty basic, would adding more shenanigans get me better results?

I try to write doc comments for methods with contracts and it seems cursor/claude does a good job reading and testing those.

arnvald | 0 comments | 3 days ago

I tried that with Cody (from Sourcegraph) using probably o1 model and I struggled. I had a long function with a number of conditions and early returns.

At first Cody generated a single test case, then I asked if to consider all possible scenarios in that function (around 10), it generated 5 cases, 3 of which were correct and 2 scenarios were made up (it tried to use nonexistent enum values).

In the end I used it to generate some mocks and a few minor things but then I copy pasted the test myself a few times and changed the values manually.

Did it save me some time in the end? Possibly, but also it caused a lot of frustration. I still hope it gets better because generating tests should be a great use case for LLMs

peterldowns | 1 comment | 3 days ago

What value are the tests giving you? Asking truly naively, I haven't tried generating tests for code before, and I'm skeptical that it would be useful for me.

jitl | 1 comment | 3 days ago

The tests provide the same value as usual: early detection of coding mistakes, (somewhat) format specification of contracts, assurance the code does what it's supposed to do and continues to do what it's supposed to do as the codebase evolves over time.

peterldowns | 1 comment | 3 days ago

Could you share an example test (if it doesn't violate your privacy)? Your answer is too vague for me to understand. I can't square my experience of LLMs (generate braindead code for well-documented situations) with the kinds of tests I write (enforcing domain-specific business logic during execution of specific scenarios) about which LLMs know nothing (not the domain-specific logic, not the scenarios, not my test helpers and mock implementations.)

jitl | 1 comment | 2 days ago

This is the open-source example I have: https://github.com/justjake/eslint-seatbelt/blob/main/src/Se...

Maybe you can declare this is braindead code for well-documented situations? not sure. My rough draft implementation of this class had bugs, I asked Cursor to write tests - prompt was something like "can you set up tests for SeatbeltFile.ts using node:test framework", it added "test" script to package.json and wrote 90% of the test file, then I fixed the bugs, added another case myself, committed the tests.

peterldowns | 0 comments | 2 days ago

Interesting, thank you very much for sharing. I wouldn't have expected an LLM to write those tests. This makes me want to try again for some of the work I've been doing. Thank you!

hitchstory | 2 comments | 3 days ago

Using LLMs to generate tests is bizarre. If anything, humans should write the tests and the LLMs should make it pass.

Tests lose value the further away from the spec they drift. Writing a test after the code causes drift. Writing with an LLM causes even more drift.

It ends up being a ritual to appease the unit testing gods rather than an investment that is supposed to pay dividends.

weitendorf | 1 comment | 3 days ago

The paper is pretty vacuous IMO but there are at least a few reasons I think LLM testing is pretty nice:

* It’s actually easier to do TDD or black box testing with LLMs. Yes, the lazy approach is to feed it a function implementation and tell it to make a unit test. But you can instead feed it the function definition and a description of its behavior (which may be what you used to generate the implementation too!) and have it generate a unit test with no visibility to the spec.

* Unit tests tend to have a lot of boilerplate sometimes, often not copy-pastable (eg Go table test cases) and LLMs can knock that out super quickly.

* Sometimes you do actually want to add a ton of unit tests even if they’re a little too implementation-focused. It’s a nice step towards later having actually-good tests, and some projects are so poorly tested and plagued with basic breakages/bugs that it’s worth slowing down feature development to keep things stable.

Personally I hate when people try to automate this stuff though, because it does trend towards junk. I find it better to treat writing tests with LLMs tactically, basically the same way you use them to write code.

hitchstory | 1 comment | 3 days ago

>Unit tests tend to have a lot of boilerplate sometimes, often not copy-pastable

When people use LLMs to write code and they find if helpful, invariably it is because they are spewing boilerplate.

If you dont systematically eliminate boilerplate the codebase eventually turns into an unmaintainable mess.

>Sometimes you do actually want to add a ton of unit tests even if they’re a little too implementation-focused.

Really? Id consider this an antipattern.

>I find it better to treat writing tests with LLMs tactically

I find the prospect of using them to write production code / tests pretty depressing.

The best thing that can be said is that they will create lots of jobs with the mess they make.

jitl | 0 comments | 2 days ago

Unit tests and implementation for something like "parse this well-defined file format" are perfect for AI, low-scope, clear success criteria. Plenty of production code I write is more like "parse this well-defined file format".

Etheryte | 0 comments | 3 days ago

Agreed, tacking on an arbitrary set of tests after the fact is a great way to pour glue over the code, nothing more.

bediger4000 | 1 comment | 5 days ago

Doesn't seem to compare to human generated tests. I'm guessing that comparison is mine too favorable.

nickpsecurity | 0 comments | 5 days ago

I think humans are probably still better (a) *when they write tests and (b) when they know how to. Neither is true for a lot of programmers but might happen if automated cheaply.

I also think traditional tools are better for this. The test, generation methods include path-based, combinatorial, concolic, and adaptive fuzzing. Fire and forget tools that do each one, suppressing duplicates, would be helpful.

What would be easier to train developers on are contracts (or annotations). Then, add test generation either from those or aided by static analysis that leverages them. Then, software that analyzes the code, annotates what it can, and then asks the user to fill in the holes. The LLM’s could turn their answers into formal contracts or specs.

That’s how I see it happening. Maybe add patch generation into that, too. At least two tools that predate LLM’s already find errors and suggest fixes.

nickpsecurity | 0 comments | 5 days ago

Here’s the EvoSuite that they refer to:

https://www.evosuite.org/evosuite/