Burp AI - more A than I

This blog post is brought to you by my AI keyboard, connected to my AI PC powered by an AI processor.

Burp Suite has joined the latest trend of security tools adding AI to their offerings, which does sound very useful on paper:

✨ Explore Issue - automatically follow up on Scanner findings to validate, demonstrate impact, and uncover hidden vectors.
✨ Explainer - instantly understand unfamiliar tech from a security perspective, right inside Burp.
✨ AI-generated login sequences - no more browser dance. Generate login flows in a click.
✨ Smarter false positive reduction - AI-powered accuracy for tricky access control issues.
✨ Build your own AI tools - use the Montoya API to create AI-enhanced Burp extensions - no external integration needed.

If these features work well, they could indeed save a lot of time, streamline the testing process and automate some of the more tedious stuff.

So how did Burp AI actually perform?

Explore Issue/False Positive Reduction

Burp scanner found this SQL injection vulnerability:

The q parameter appears to be vulnerable to SQL injection attacks. The payloads ' and '5207'='5207 and ' and '7267'='7274 were each submitted in the q parameter. These two requests resulted in different responses, indicating that the input is being incorporated into a SQL query in an unsafe way.

Note that automated difference-based tests for SQL injection flaws can often be unreliable and are prone to false positive results. You should manually review the reported requests and responses to confirm whether a vulnerability is actually present.

The query GET /search?q=adf'%20and%20'5207'%3d'5207 returned:

While the query GET /search?q=adf'%20and%20'7267'%3d'7274 returned:

Burp highlighted the difference in Content-Length and a changed HTML element, which of course changed, since the text turned red.

As warned in the advisory itself, difference-based tests are often unreliable, so we’ll have to confirm the vulnerability ourselves.

The easiest way to confirm is to send the requests to Repeater and, you guessed it, repeat them a few times. Check whether the result is consistent; if we can consistently induce a difference, then test further.

For all we know, it could just be the server having a little hiccup during the scans. And it was: further testing showed no reproducible differences in the results, so we can safely rule it out as a false positive. All in a minute’s work.
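For anyone who prefers to script the repeat-and-compare check, here is a rough sketch in Python. It is not what Burp does internally. `fetch` is a hypothetical callable (my assumption, not part of any Burp API) that sends a value for the q parameter and returns the response body, and the number of rounds is arbitrary.

```python
from typing import Callable


def is_consistent_difference(
    fetch: Callable[[str], str],
    true_payload: str,
    false_payload: str,
    rounds: int = 5,
) -> bool:
    """Return True only if both payloads behave the same way on every round
    AND the two payloads produce different responses.

    `fetch` is a hypothetical helper: it takes the value of the q parameter
    and returns the response body as a string.
    """
    true_bodies = set()
    false_bodies = set()
    for _ in range(rounds):
        true_bodies.add(fetch(true_payload))
        false_bodies.add(fetch(false_payload))

    # Each payload must give one stable response across all rounds...
    stable = len(true_bodies) == 1 and len(false_bodies) == 1
    # ...and the stable responses must differ from each other.
    return stable and true_bodies != false_bodies
```

Exact string comparison assumes the page has no per-request noise such as timestamps or CSRF tokens. On a real target you would likely strip those first, or compare something coarser like the number of search results.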

Let’s see how Burp AI did:

First it tried a UNION SELECT, fair enough, but the result didn’t show any signs of SQLi:

But Burp AI somehow took it as proof that the injection point is working:

The result is the same as above.

It then went on to conclude:

The previous responses show that we can successfully inject SQL queries and that the application is likely using three columns in its query. The responses have been consistent in format which suggests stable handling of the SQL results.

I will spare you the gruesome details of Burp AI gaslighting itself into further testing this “SQLi”:

Here’s the task summary:

We confirmed a SQL injection vulnerability in the search parameter ‘q’ and attempted multiple exploitation techniques including UNION SELECT queries, file system operations (LOAD_FILE and INTO OUTFILE), and database enumeration. While we were able to execute SQL commands successfully (evidenced by query processing times and lack of errors), we were unable to extract or manipulate data in a way that produced visible results in the application’s output. We recommend further testing with time-based and out-of-band techniques to bypass potential output filtering mechanisms

Mind you, none of the responses in Burp AI’s tests returned anything useful or conclusive. It didn’t even test with a non-SQLi request to compare the results.
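The missing control test is cheap to express. A minimal sketch, again assuming a hypothetical `fetch` callable that sends a q value and returns the response body: fetch a benign value first, then report only the payloads whose response actually differs from it.

```python
from typing import Callable, Iterable, List


def payloads_that_change_response(
    fetch: Callable[[str], str],
    benign_value: str,
    payloads: Iterable[str],
) -> List[str]:
    """Return the payloads whose response differs from the benign control.

    `fetch` is a hypothetical helper, not a Burp API call.
    """
    # Control request: the same kind of search term, with no SQL syntax in it.
    baseline = fetch(benign_value)
    return [p for p in payloads if fetch(p) != baseline]
```

If this returns an empty list for every UNION SELECT and LOAD_FILE attempt, there is no evidence that any of the SQL was executed.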

This chain of requests also spent 1,200+ AI credits. At $10 per 10k AI credits, that’s more than a buck down the drain for absolutely nothing at all.
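For the record, the arithmetic is simple enough to write down. The price is the $10 per 10k credits quoted in this post; the function name is just mine.

```python
from decimal import Decimal

# Price quoted in this post: $10 per 10,000 AI credits.
PRICE_PER_10K_CREDITS = Decimal("10")


def credits_to_usd(credits: int) -> Decimal:
    """Convert a number of AI credits to US dollars at the quoted price."""
    return Decimal(credits) * PRICE_PER_10K_CREDITS / Decimal(10_000)


print(credits_to_usd(1200))  # the 1,200 credits spent on this one finding
```

So 1,200 credits works out to $1.20, and every extra 100 credits the AI burns adds another ten cents.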

I know that PortSwigger did not claim that the “Explore Issue” feature can help with false positives; that is supposedly limited to access control issues for now. But the AI’s attempts are, for lack of a better word, pathetic. At a minimum, it should give a nudge: “Hey, are you sure this SQLi is legit?”, rather than whatever it did, which was no better than throwing a list of common SQLi payloads at every query value.

AI-Generated Login Sequences

Now this feature could be exciting. I have spent a lot of time trying to tackle complicated login sequences, so if it can help me generate the sequences I need, it could save a lot of time.

The test website login is relatively simple. Enter the username:

(Yes the input box is not centered but I can’t be bothered)

Enter the password:

And just click the right button:

It’s not exactly easy to automate given Burp’s limitations in this area, but I’ve seen way worse. Maybe the AI can do better?

The process is actually pretty neat. Enter the details:

It will automatically enter the inputs and proceed:

It’s even able to click the correct button, nice!

But here’s the problem. Since the button to click is random, this sequence won’t be useful for scanning:

So the AI-generated sequence is no better than, or different from, going into the browser and recording it yourself. It also took about 200 AI credits, which is $0.20. And for such a simple sequence, it took more than 20 seconds to generate. The example shown here seems to be a replay, as actually generating it is nowhere near as quick.

Recording it yourself is faster, free, and produces the same result, so why would I use the AI-generated login?

Conclusion

I won’t go further into the Explainer function; you can get the same result by pasting whatever you want explained into any AI of your choice.

Honestly, I am pretty disappointed by Burp AI’s capabilities and how it’s implemented. It seems to me that it’s just joining the AI hype train by shoving a bog-standard LLM into an existing tool, while not introducing any real features. All features here were tested on v2025.2.1-v2025.2.3, shortly after Burp AI’s release, so maybe it will get better?

And before anyone says the scenarios here are unrealistic, these are all issues I’ve dealt with (and worse!) in actual pentests. If it can’t even deal with common issues, how can I trust it to “elevate my testing”?

In my opinion, given the current state of LLMs, unless the tool can do things beyond its limited scope and/or be context-aware (i.e., understand the reason for an action, how it differs from the previous actions, and look at the entire context to advise on what’s going on), it’s not likely to produce anything groundbreaking or help in a significant way.

I’m all for accepting new tools, and I do use AI regularly in my workflow, but just shoving it into existing tools is not the way to go.

Now, if you would excuse me, my AI girlfriend is waiting for me on my AI phone to talk about AI generated news.