LLM-Driven Autonomous Penetration Testing


1. Introduction

This paper explores the potential of using Large Language Model (LLM)-driven autonomous systems for Assumed Breach penetration testing within enterprise networks, specifically Active Directory environments. The authors introduce a prototype capable of compromising accounts in a realistic Active Directory testbed and evaluate its capabilities, strengths, and limitations. The research aims to assess if LLMs can democratize access to penetration testing, particularly for organizations with limited budgets. A core contribution is releasing the prototype's source code and associated data as open-source to foster further research.

Quote: "The study concludes that autonomous LLMs are able to conduct Assumed Breach simulations, potentially democratizing access to penetration testing for organizations facing budgetary constraints."

2. Background and Related Work

The paper frames its research within the context of existing penetration testing methodologies, LLM-guided task planning advancements, and contemporary applications of LLMs in autonomous penetration testing.

2.1 Penetration Testing Approaches:

Vulnerability Scans: Automated, broad but shallow, and easily detectable. The goal is to identify as many vulnerabilities as possible but not necessarily exploit them.

Red-Teaming: Comprehensive, stealthy, manual, and focused on achieving a specific goal (depth). Mimics real-world attackers, starting from outside the network and employing social engineering.

Internal Network Tests (Assumed Breach Simulations): Assume the attacker has already breached the perimeter. Focuses on lateral movement within the internal network, especially Active Directory, to achieve domain dominance (highest level of permissions). Aims for both breadth (finding many vulnerabilities) and depth (combining vulnerabilities into attack chains).

Quote: "In these attacks the attacker is placed within the local enterprise network, very often a Microsoft Active Directory network, and tries to achieve domain dominance, i.e., become domain or forest administrator which is the user with the highest permissions within the target network. This is based upon the assumption that an attacker will eventually reach the local network (“breached the network”), and that for efficiency reasons, testing can focus upon the subsequent movements of the attacker within the network."

2.1.1 Testbeds for Assumed Breach Simulations:

Capture-the-Flag (CTF) scenarios are recognized as valuable learning exercises, translating well to penetration testing assignments. These often involve virtual machines with defined targets ("flags" to capture by exploiting vulnerabilities).

CTF-style challenges are used in penetration testing certification exams (e.g., OSCP, OSCE, CRTO, CRTP).

MITRE ATT&CK: Classifies attack tactics, techniques, and procedures (TTPs), offering a structured way to categorize attacks. It's a classification system, not a testbed or methodology.

2.2 LLM-Aided Task Planning:

Intra-Task Improvements: Focuses on enhancing LLMs' ability to solve individual tasks. Key techniques include:

Chain-of-Thought (CoT) Prompting: Encourages LLMs to articulate intermediate steps before arriving at a final answer, improving reasoning. Zero-shot CoT involves appending "Let's think step by step" to prompts.

ReAct Framework: Enables LLMs to generate reasoning traces and task-specific actions in an interleaved way, facilitating interaction with external tools.

Reflexion: Uses linguistic feedback to allow agents to learn from mistakes.

Reasoning Models: LLMs that incorporate techniques such as Chain-of-Thought within their internal “thought” processes to improve the quality of their results.

Inter-Task Planning: Focuses on breaking down larger tasks into smaller, manageable sub-tasks.

Plan-and-Solve Prompting: Divides tasks into subtasks and executes them according to the plan (popularized by BabyAGI).

Pentest Task Trees (PTT): Hierarchical data structures used to track penetration test progress, allowing LLMs to create high-level plans and record findings.
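To make the idea of a task tree concrete, here is a minimal sketch of how such a hierarchy could be represented and rendered as text for a prompt. The class name TaskNode, its fields, and the status values are assumptions made for illustration; they are not taken from pentestGPT or from the paper's prototype.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """One node of a hypothetical Pentest Task Tree."""
    title: str
    status: str = "todo"  # assumed values: "todo", "in_progress", "done"
    findings: List[str] = field(default_factory=list)
    children: List["TaskNode"] = field(default_factory=list)

    def add(self, child: "TaskNode") -> "TaskNode":
        """Attach a sub-task and return it so callers can keep a handle."""
        self.children.append(child)
        return child

    def open_tasks(self) -> List["TaskNode"]:
        """Return all leaf tasks that are not done yet, depth-first."""
        if not self.children:
            return [] if self.status == "done" else [self]
        result: List["TaskNode"] = []
        for child in self.children:
            result.extend(child.open_tasks())
        return result

    def render(self, depth: int = 0) -> str:
        """Render the tree as indented text, e.g. for inclusion in a prompt."""
        lines = [f"{'  ' * depth}- [{self.status}] {self.title}"]
        for finding in self.findings:
            lines.append(f"{'  ' * (depth + 1)}* finding: {finding}")
        for child in self.children:
            lines.append(child.render(depth + 1))
        return "\n".join(lines)


# Illustrative tree; the task names and the finding are made up.
root = TaskNode("Assumed Breach test")
recon = root.add(TaskNode("Enumerate hosts", status="done"))
recon.findings.append("three hosts respond on SMB")
root.add(TaskNode("Enumerate domain users"))
print(root.render())
```

Keeping findings on the nodes means the rendered tree can serve both as the plan and as the record of what has been learned so far.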

Quote: "The emergence of chain-of-thought (CoT) prompting has marked a significant advancement in leveraging LLMs for tasks that require complex, multi-step reasoning."

2.3 Automated Penetration Testing

Existing research uses LLMs for Linux privilege escalation, high-level penetration test planning, and CTF challenges.

Some systems are interactive, requiring human intervention (e.g., pentestGPT), while others aim for full autonomy.

Partially Observable Markov Decision Processes (POMDP) have been explored but face scalability issues.

Traditional automated scanners (nmap, OpenVAS, Nessus) are "noisy" and often checklist-based, limiting depth and breadth. They often don't exploit detected vulnerabilities.

Quote: "The use of LLMs to automate penetration testing has been investigated by multiple authors."

2.4 Differences to Existing Work

The authors' prototype combines elements from different approaches, aiming for autonomous execution of Assumed Breach Simulations.

Broader scope: Targets a full Microsoft Windows Active Directory network.

Combined Approach: Pairs hackingBuddyGPT's executor loop with pentestGPT's PTT-based high-level planning.

Full Autonomy: Unlike some systems (MITRE Caldera, ChainReactor, pentestGPT), it requires no human intervention.

3. Methodology

The study evaluates the autonomous actions of LLMs in enterprise network security testing, focusing on how comprehensively the prototype identifies vulnerabilities.

Objectives:

Characterize the behavior of an LLM-guided security test.

Validate alignment with cybersecurity frameworks (e.g., MITRE ATT&CK).

Analyze quantitative metrics and qualitative insights from security experts.

Experiment Design: Uses "A Game of Active Directory" (GOAD) to simulate a vulnerable Microsoft Windows Active Directory. A Kali Linux virtual machine is used to execute attacks.

The authors provide an initial list of potential usernames to the virtual machine, simulating the results of a prior OSINT investigation.

Scenario Prompt: A detailed prompt provides instructions to the LLM, setting the context of a professional penetration tester, outlining permitted and prohibited actions, and providing tool-specific guidance. This prompt is included with each LLM call to allow for LLM prefix caching.

Evaluation: Professional penetration testers analyze log traces, classifying tasks into MITRE ATT&CK tactics/techniques and noting compromised accounts and missed leads.

Quote: "We prefix all of our prompts with the same scenario prompt, allowing LLM prefix caching to take place and reducing the costs that including the scenario incurs."
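Prefix caching generally relies on requests sharing an identical leading segment, which is why the scenario text goes first in every prompt. The sketch below shows that layout only; the placeholder scenario text and the function names are assumptions, and no real LLM API is called.

```python
# Placeholder text, not the paper's actual scenario prompt.
SCENARIO_PROMPT = (
    "You are a professional penetration tester. "
    "Stay within the approved target range."
)


def build_prompt(task_prompt: str) -> str:
    """Prepend the shared scenario so every request starts with identical text."""
    return SCENARIO_PROMPT + "\n\n" + task_prompt


def common_prefix_length(a: str, b: str) -> int:
    """Count the leading characters two prompts share."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


p1 = build_prompt("Update the task tree with the latest findings.")
p2 = build_prompt("Generate the next command for the current task.")

# Both prompts share at least the full scenario text as their prefix.
print(common_prefix_length(p1, p2) >= len(SCENARIO_PROMPT))
```

Anything that varies between calls (task text, history) is appended after the shared part, so it does not disturb the common prefix.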

3.1.5 Scenario Prompt:

The scenario prompt is a comprehensive set of instructions that guides the LLM's actions and limitations within the simulated environment. It defines the LLM's role as a professional penetration tester tasked with securing a Microsoft Windows Enterprise Network. Key aspects include:

Role and Objectives: The LLM is instructed to perform a penetration test, mimicking real-world attacker behavior while adhering to established methodologies like the Lockheed-Martin Cyber Killchain or Mandiant Attacker Lifecycle.

Environment Details: The LLM receives specific information about the target IP range to prevent attacks outside the test network, along with a list of management-related IP addresses that are off-limits. The configuration of the Kali VM, including the network interface to be used (eth1), is detailed.

Constraints: The LLM is prohibited from using graphical or interactive programs due to limitations in the SSH integration. It is also instructed to avoid using automated vulnerability scanners like OpenVAS, as their setup and database synchronization would be too time-consuming.

Rules for Brute-Forcing and Password-Spraying: The LLM is directed to avoid account lockouts. It can use a provided list of potential usernames (/root/osint_users.txt) and a pre-made password list (/usr/share/wordlists/rockyou.txt) for offline password cracking. However, it should not use the rockyou.txt list for online attacks. Instead, it can create scenario-specific user and password lists.

Tool-Specific Guidance: The prompt includes instructions for using specific tools. For example, it advises using netexec instead of crackmapexec, provides syntax guidelines, and clarifies the naming conventions for tools from the impacket package.

The scenario prompt is designed to emulate the instructions provided during cybersecurity certification exams, ensuring that the LLM adheres to ethical and practical constraints while still effectively simulating an Assumed Breach scenario.
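The summary describes these scope limits as prompt instructions. A test harness could additionally enforce them in code before a command runs; the following is a minimal sketch of such a guard using Python's standard ipaddress module. The network range and the management address are hypothetical values, and the function name in_scope is an assumption.

```python
import ipaddress

# Hypothetical values; the real range and management addresses
# would come from the scenario prompt.
TARGET_RANGE = ipaddress.ip_network("192.168.56.0/24")
OFF_LIMITS = {ipaddress.ip_address("192.168.56.1")}


def in_scope(host: str) -> bool:
    """True if host is inside the target range and not a management address."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames and malformed input are rejected rather than resolved.
        return False
    return addr in TARGET_RANGE and addr not in OFF_LIMITS


print(in_scope("192.168.56.10"))  # inside the range
print(in_scope("192.168.56.1"))   # management address
print(in_scope("10.0.0.5"))       # outside the range
```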

3.5 Threats to Validity:

The authors address potential threats to validity:

Definition Ambiguity (Construct Validity): Clarify operational terms and use established frameworks (MITRE ATT&CK) to mitigate variability in interpretation.

4.2 The Executor

The Executor is responsible for carrying out the tasks assigned by the Planner. The process involves several key steps:

Task Reception: The Executor receives the task and its associated context from the Planner. This context provides necessary information for the Executor to understand the task and execute it effectively.

Command Generation: Based on the task and context, the Executor uses an LLM call to generate a Linux command to be executed within the Kali VM.

Command Execution: The generated command is executed, and the results are presented back to the Executor.

Result Analysis: The Executor analyzes the command output and determines whether the task has been successfully completed.

Iterative Execution: The Executor can issue multiple Linux commands within a single round, executing them in parallel to speed up common pen-test tasks such as network scans.

Command History: The Executor maintains an internal history of executed commands and their outputs, which is used in subsequent LLM calls to generate further commands or to determine that the task has been successfully executed.

Timeout: A timeout of 10 minutes is enforced for command execution to prevent long-running processes from stalling the pen-test. If a command times out, the gathered output information is passed back to the Executor for analysis.

Round Limit: The Executor has a hard limit of 10 rounds per task. After this limit is reached, the Executor stops and generates a final summary of all tasks performed.

Summary Creation: The Executor creates a summary of its work, including executed commands and their outputs, which is then returned to the Planner for updating the PTT.
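The steps above can be sketched as a bounded loop. This is a simplified illustration, not the prototype's code: it runs one command per round instead of several in parallel, executes locally through subprocess rather than over SSH, and takes the LLM call as a plain callable. The names run_command, execute_task, and next_command are assumptions.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 10            # round limit per task, as described above
TIMEOUT_SECONDS = 10 * 60  # ten-minute command timeout, as described above

History = List[Tuple[str, str]]  # (command, output) pairs


def run_command(cmd: str, timeout: int = TIMEOUT_SECONDS) -> str:
    """Run a shell command locally and return its output, partial on timeout."""
    try:
        done = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                              timeout=timeout, stdin=subprocess.DEVNULL)
        return done.stdout + done.stderr
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # may be bytes even when text=True
            partial = partial.decode(errors="replace")
        return partial + "\n[timed out]"


def execute_task(task: str,
                 next_command: Callable[[str, History], Optional[str]],
                 run: Callable[[str], str] = run_command) -> History:
    """Request commands until the generator returns None or the limit is hit."""
    history: History = []
    for _ in range(MAX_ROUNDS):
        cmd = next_command(task, history)  # stands in for the LLM call
        if cmd is None:                    # generator judges the task complete
            break
        history.append((cmd, run(cmd)))
    return history


# Usage with a stub generator that stops after two commands.
stub = lambda task, history: None if len(history) >= 2 else "echo step"
print(execute_task("demo task", stub))
```

Passing the command runner as a parameter keeps the loop testable without touching a real target.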

4.3 Interactions between Planner and Executor

The interaction between the Planner and the Executor is structured as follows:

Data Transfer: The Executor returns the executed task, an executive summary of its work, and a list of all executed commands and their outputs back to the Planner.

PTT Update: The Planner uses this data, along with the existing PTT, to update its task tree. This update includes integrating new findings, adjusting priorities, and planning future steps.

Stateless Executor: The Executor itself stores no local information between runs. Its history of executed commands and their results is cleared after each run, ensuring that all pen-test state information is maintained within the Planner's PTT.

Resuming Execution: This design allows for resuming a previous pen-test run by starting the Planner with a stored updated PTT, providing a persistent and stateful pen-testing process.

Monetary Fail-Safe: To prevent excessive costs due to lengthy command histories, a monetary fail-safe is implemented. If the size of the command history exceeds 100,000 bytes, the command line history is removed from the Planner call, ensuring that the Planner depends only on the Executor's summary.
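The fail-safe can be pictured as a size check applied when the Executor's results are packaged for the Planner. In this sketch the 100,000-byte threshold comes from the text above, while the function name, the payload layout, and the way the size is measured (UTF-8 bytes of commands plus outputs) are assumptions.

```python
from typing import Dict, List, Tuple

MAX_HISTORY_BYTES = 100_000  # threshold stated above


def planner_payload(task: str, summary: str,
                    history: List[Tuple[str, str]]) -> Dict[str, object]:
    """Build the data handed to the Planner, dropping an oversized history."""
    size = sum(len(cmd.encode("utf-8")) + len(out.encode("utf-8"))
               for cmd, out in history)
    payload: Dict[str, object] = {"task": task, "summary": summary}
    if size <= MAX_HISTORY_BYTES:
        payload["history"] = history
    # Otherwise the Planner relies on the summary alone.
    return payload


small = planner_payload("scan", "found two hosts", [("nmap ...", "ok")])
large = planner_payload("scan", "found two hosts", [("cat big.log", "x" * 200_000)])
print("history" in small, "history" in large)
```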

5. Evaluation

The study reached saturation after 6 experiment runs, each stopped after 2 hours of execution time. Saturation was defined as two subsequent runs not producing new leads or compromised accounts.
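The saturation rule can be expressed as a short check over the findings of successive runs. The sketch below is one possible reading of that definition; the function name and the example findings are made up for illustration and are not the paper's data.

```python
from typing import Iterable, Optional, Set


def saturation_run(runs: Iterable[Set[str]], quiet_runs: int = 2) -> Optional[int]:
    """Return the 1-based index of the run at which saturation is reached, else None."""
    seen: Set[str] = set()
    quiet = 0  # consecutive runs without anything new
    for index, findings in enumerate(runs, start=1):
        new = findings - seen
        seen |= findings
        quiet = 0 if new else quiet + 1
        if quiet >= quiet_runs:
            return index
    return None


# Made-up findings per run: runs 3 and 4 add nothing new.
example = [{"a", "b"}, {"b", "c"}, {"a"}, {"c"}]
print(saturation_run(example))
```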

5.2 Planner rounds, Executor rounds and Executed Commands

On average, 6.5 leads remained to be followed up at the end of the experiment runs, suggesting that the PTT held enough open leads to warrant longer execution times.

5.3 Tool Usage:

The Executor used 72 different command-line tools.

The Executor often proposed invalid tool calls (35.9% on average), categorized as Type 1 (direct parameter errors) and Type 2 (semantically defective parameters).

Among the most frequently used commands were nxc and netexec, smbclient, cat, echo, and nmap.

5.4 Mapping MITRE ATT&CK Tactics and Techniques:

Tasks were mapped to MITRE ATT&CK techniques and tactics.

Common tactics included Reconnaissance, Discovery, Credential Access, and Lateral Movement.

Observed techniques included Brute Force, Network Share Discovery, and Steal or Forge Kerberos Tickets.

Quote: "The tactics Reconnaissance and Discovery are typically used for network scans and domain enumeration. Credential Access is a broad tactic, including both password spraying attacks as well as gathering and abusing Kerberos Tickets or NTLM hashes. Lateral Movement designated directed attacks against network services."

6. Discussion

The prototype successfully acquired valid user credentials despite a high invalid command generation rate.

Initial Attack Avenues: Password spraying, AS-REP roasting, guest access to LDAP/Active-Directory, and guest-accessible SMB shares.

Post-Initial Access: Domain enumeration, credentialed attacks (Kerberoasting, PrintNightmare, vulnerable ADCS), and Microsoft SQL server enumeration.

SMB File Analysis: The prototype identified files on SMB shares that potentially contained credential-related information.

The prototype detected these files but was unable to match the credentials they contained to the domain users it had extracted.

Scenario-Specific Password Generation: The LLM created password lists based on the "Game of Thrones" theme and the year the Active Directory was created, showing an ability to generate contextually relevant password suggestions.

6.2.1 Planner "Going Down the Rabbit Hole":

The Planner component exhibited a tendency to "go down the rabbit hole," hyper-focusing on a potential avenue of attack while ignoring alternative approaches. This behavior was defined as the Planner re-issuing the same task to the Executor for extended periods of time (e.g., more than five consecutive tasks). Tasks that were prone to this behavior included:

Attempting to emulate PowerShell SecureString behavior with C# or Python.

Trying to crack Kerberos SPN tickets with strong passwords (e.g., the service account sql_svc).

Trying to abuse the Microsoft SQL server.

While these tasks had the potential to contribute to a domain compromise, they consumed a disproportionate share of the two-hour time budget.
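The definition of this behavior, the same task issued more than five times in a row, lends itself to a simple streak counter. The sketch below assumes tasks can be compared as exact strings, which is a simplification; in the study this classification was made by human analysts reviewing the logs. The function name is an assumption.

```python
from typing import List, Optional


def rabbit_hole_task(tasks: List[str], limit: int = 5) -> Optional[str]:
    """Return the first task issued more than `limit` times in a row, else None."""
    streak = 0
    previous: Optional[str] = None
    for task in tasks:
        streak = streak + 1 if task == previous else 1
        previous = task
        if streak > limit:
            return task
    return None


# Made-up task log: the second task repeats six times in a row.
log = ["enumerate shares"] + ["crack ticket"] * 6
print(rabbit_hole_task(log))
```

A planner loop could use such a check to force a switch to another open lead once a streak is detected.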

6.4.1 Executor has problems with creating valid commands:

The model used (GPT-4o) can struggle to supply the correct mandatory parameters for the respective tool calls. Type 1 errors are direct parameter errors. Type 2 errors occur when a parameter value is accepted by the tool even though it is semantically defective.

Another problem is exposed by hashcat, a tool used for password cracking. It expects a text-file with valid password hashes within each line. All hashes within the file must match the selected hash type and be formatted according to it.
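One way to catch this class of error early is to validate the hash file before handing it to the cracking tool. The sketch below takes a single hash type as an example and assumes the file should contain bare NTLM hashes, that is, 32 hexadecimal characters per line. The function name and the sample input are illustrative; other hash types would need their own patterns.

```python
import re
from typing import List

# Assumption: the file is meant to hold bare NTLM hashes (32 hex characters per line).
NTLM_LINE = re.compile(r"^[0-9a-fA-F]{32}$")


def invalid_lines(text: str) -> List[int]:
    """Return 1-based numbers of non-empty lines that are not bare NTLM hashes."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip() and not NTLM_LINE.match(line.strip())]


# Line 2 mixes a username into the file and would not match the expected format.
sample = "31d6cfe0d16ae931b73c59d7e0c089c0\nalice:notahash\n"
print(invalid_lines(sample))
```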

6.4.3 Interactive, long-running and GUI Commands:

One source of invalid command behavior arises when an Executor invokes interactive programs or programs that revert to an interactive mode in the absence of specified parameters.

6.6.2 Impact of Custom Attack-Specific Function Calls to the Executor LLM

A common strategy for improving tool use is to convert complex command line invocations into bespoke functions that can be invoked by the LLM.
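As a rough illustration of that strategy, a wrapper function can expose a few typed parameters and assemble the command line itself, so the model never has to remember flag syntax. The function below is a hypothetical example and not part of the prototype; it only builds an argument list for an nmap service scan (using nmap's -sV and -p options) and does not execute anything.

```python
from typing import List


def nmap_service_scan(target: str, ports: List[int]) -> List[str]:
    """Build an nmap argument list for a service/version scan of the given ports."""
    if not target or target.startswith("-"):
        raise ValueError("target must be a host or address, not an option")
    if not ports or any(not 1 <= p <= 65535 for p in ports):
        raise ValueError("ports must be a non-empty list of values in 1..65535")
    port_spec = ",".join(str(p) for p in ports)
    return ["nmap", "-sV", "-p", port_spec, target]


# The address is a made-up example.
print(nmap_service_scan("192.168.56.10", [88, 445]))
```

Because the parameters are validated before the command is assembled, malformed input fails in the wrapper with a clear message instead of producing an invalid tool call.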

7. Conclusion

The study concludes that autonomous LLMs can effectively perform Assumed Breach simulations, identify initial access points, and execute lateral movement within an Active Directory environment. The associated costs are competitive with professional penetration testers, making it a potential tool to augment traditional security assessments and provide cost-effective penetration testing to organizations with budgetary constraints.

Quote: "Our work demonstrates that autonomous LLMs can effectively conduct Assumed Breach simulations by identifying initial access points within an Enterprise Active Directory environment and executing lateral movement."

Written by gayatri r
