Project: WASC Threat Classification
Threat Type: Weakness
Reference ID: WASC-48
Insecure Indexing
Insecure Indexing is a threat to the data confidentiality of the website. Indexing website content with a process that has access to files which are not supposed to be publicly accessible can leak information about the existence of such files, and about their content. During indexing, this information is collected and stored by the indexing process, and it can later be retrieved (albeit not trivially) by a determined attacker, typically through a series of queries to the search engine. The attacker does not violate the security model of the search engine. As such, this attack is subtle and very hard to detect and to foil - it is not easy to distinguish the attacker's queries from a legitimate user's queries.
Background
As websites become larger and more complex, helping users find the information they need becomes more central to the site owner. This is where search engines come in handy. A search engine first "learns" the website by looking at its pages, associating keywords with them and updating its internal database (this is called indexing); then, when a user submits a query, the search engine consults its database and pulls out the list of relevant pages. The indexing process is ongoing, to ensure that the search engine stays up to date with the site (which changes periodically).

There are two kinds of indexing - remote (web/HTTP based) and local (file based). In web/HTTP based indexing, the search engine traverses the website by "crawling" it through the site's native web server, typically starting at the homepage and recursively following links from it. This process can be conducted remotely (or locally), and it is indeed used by remote (3rd party) search engines such as Google and Yahoo. In file based indexing, on the other hand, the search engine needs direct access to the web server's file system (hence it has to run locally), and it indexes the site by going over all files in the file system (with some exceptions) under the virtual root. Many local search engines use this technique. In some cases, this indexing method may open up the site to attacks, as we see below.
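To make the distinction concrete, here is a minimal sketch (in Python) of a file-based indexer; the web-root path, function names and tokenization are illustrative assumptions, not any particular product. The point to notice is that the directory walk visits every file under the virtual root, linked or not - which is exactly what an HTTP crawler cannot do, and exactly what can leak information.

# Minimal sketch of a file-based (local) indexer - illustrative only.
# Unlike an HTTP crawler, it sees every file under the web root,
# whether or not any page links to it.
import os
import re
from collections import defaultdict

def index_webroot(webroot):
    """Map each word to the set of files (relative paths) containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(webroot):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue                      # unreadable file - skip it
            rel = os.path.relpath(path, webroot)
            for word in re.findall(r"[a-z0-9.\-]+", text.lower()):
                index[word].add(rel)
    return index

# An unlinked file such as /var/www/html/X-Adv-07-3.txt (the hypothetical
# advisory of Example 1 below) gets indexed just like any public page.
index = index_webroot("/var/www/html")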
Example 1: Finding a hidden file
Suppose the attacker suspects that vendor X is about to publish a security advisory on its website. Also suppose that the attacker knows that, as part of the publishing process, the file is uploaded to the website a few days (or weeks) before the advisory is published. The file resides on the web server, yet it is not linked from anywhere. Further suppose that the file name is unpredictable. Assuming that the site operates a search engine that *locally* indexes server *files*, and that it has recently indexed the site (so it encountered the advisory file as well), the attacker can now guess a word or two that are likely to appear in an advisory (e.g. "Vendor X advisory X-Adv-07-"), and with luck, the search engine will display a URL to the unpublished advisory. And if the site is really insecure, the file at that URL will be downloadable by the attacker.
The main issue demonstrated above is that the mere indexing of the file leaked sensitive information (namely, that such a file exists).
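A hedged sketch of what such probing might look like, assuming a site-local search engine reachable at a /search endpoint with a q parameter and a "No results" marker in its output - all of which are assumptions for illustration, not any specific product's interface:

# Hypothetical probe of a site's local search engine for an unpublished
# advisory. The /search endpoint, the "q" parameter and the "No results"
# marker are assumptions for illustration.
import requests

SEARCH_URL = "http://www.example.com/search"

for phrase in ["Vendor X advisory X-Adv-07-", "Vendor X security advisory"]:
    resp = requests.get(SEARCH_URL, params={"q": '"%s"' % phrase})
    if "No results" not in resp.text:       # assumed no-hit marker
        print("possible hit for:", phrase)  # result page may reveal the URL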
Example 2: Retrieving file contents
Suppose the attacker knows that a certain file exists, yet it is not publicly available (e.g. it requires basic authentication). The attacker may know this via the technique demonstrated in Example 1, or because the file name is predictable. Now, since this file is indexed, every time the attacker queries the search engine for a word (or a sequence of words) that appears in the file, the search engine returns the URL. Some engines also provide a short "context", i.e. the surrounding words/sentences that encompass the matched query text. The attacker can reconstruct wide sections of the file (ideally, the whole file) by first guessing a word or two that appear in the file, and then widening the search. For instance, if the search engine returns contextual data, and returning to the advisory example above, the initial guess may be "buffer overflow". This will return:
... Remotely exploitable buffer overflow in server XYZ ...
Now the attacker widens the search, by querying:
"overflow in server XYZ"
The search engine returns:
... exploitable buffer overflow in server XYZ, version 0.1 for Linux.
And the attacker slides the search window to the right:
"in server XYZ, version 0.1 for Linux."
The search engine returns:
... buffer overflow in server XYZ, version 0.1 for Linux. By sending a series ...
And so forth. As long as the sliding window contains enough information for the attacker to single out the advisory text (from the other candidates presented by the search engine), this attack may succeed. The main issue demonstrated is that search engines can leak information to which they have access, yet the public does not.
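This sliding-window reconstruction is easy to automate. The sketch below assumes a hypothetical helper, snippet(query), that submits the quoted query to the engine and returns the contextual text it displays (or None on no match); the five-word window mirrors the example above:

# Sketch of sliding-window reconstruction from search-result snippets.
# snippet(query) is a hypothetical helper: it submits the quoted query
# and returns the engine's contextual text, or None if nothing matched.

def reconstruct(seed, snippet, max_rounds=50):
    text = seed                                  # e.g. "buffer overflow"
    for _ in range(max_rounds):
        window = " ".join(text.split()[-5:])     # last 5 recovered words
        context = snippet('"%s"' % window)
        if context is None:
            break
        idx = context.find(window)
        if idx < 0:
            break
        # Append whatever the snippet reveals beyond the current window.
        new = context[idx + len(window):].strip()
        if not new:
            break
        text = text + " " + new
    return text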
Example 3: Retrieving file contents, the hard way
In Example 2, we assumed that some "context" was returned by the search engine, which is very helpful for the attacker. However, some engines do not provide such data, which reduces the information received from each query to a single Boolean (bit) value - "true" (the query was found in the file) or "false" (it was not). Not all is lost though - if the attacker is willing to throw many (and we mean many!) queries at the search engine, the file (or sections thereof) may still be reconstructed. This is not as theoretical as some may think. Sometimes reconstructing a single sentence from a sensitive file can mean a lot, and may be worth bombarding the site with hundreds of thousands of requests.

The attack proceeds as follows. The attacker has an initial guess (e.g. "buffer overflow"). The attacker queries the search engine and gets back the URL for the file, or in our Boolean terms, "true". Now the attacker is out of ideas, but he may try a short version of the English dictionary, peppered with computer science terms, vendor and product names, etc. Let's say this dictionary contains 100,000 such words. Appending each such word to the already known string "buffer overflow" and querying the search engine (100,000 times!), the attacker gets "false" for each attempt, except for the word "in". So "buffer overflow in" it is. Next, with an additional 100,000 queries, the attacker can reconstruct "buffer overflow in server", and with another 100,000 - "buffer overflow in server XYZ" (assuming XYZ is a well known vendor name, hence in the extended dictionary). In short, for 700,000 queries, the attacker can reconstruct "buffer overflow in server XYZ, version 0.1 for Linux". And this can obviously be much improved by taking into account language syntax and probabilities for pairs of words (e.g. "buffer overflow" is likely to be followed by "in", hence guessing "buffer overflow in" among the first guesses will save the attacker the vast majority of the 100,000 queries in this case; likewise, "version x.y for" is likely to be followed by an O/S name, again shortening the guess list to a few dozen instead of 100,000).
The main issue is just like Example 2, except that the information leakage is more subtle here (at most one bit per query), which makes the attack less trivial (but nonetheless feasible).
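A sketch of this word-by-word reconstruction against a Boolean oracle. Here found(query) is a hypothetical helper returning True when the engine reports a hit for the quoted phrase, and the dictionary (ideally ordered by likelihood, per the observation above) is likewise an assumption:

# Sketch of reconstruction against a Boolean oracle. found(query) is a
# hypothetical helper returning True iff the engine finds the quoted
# phrase. Ordering the dictionary by likelihood (word-pair statistics,
# domain terms) drastically cuts the queries needed per recovered word.

def extend_phrase(seed, dictionary, found, max_words=10):
    phrase = seed                        # e.g. "buffer overflow"
    for _ in range(max_words):
        for word in dictionary:          # up to ~100,000 candidates
            if found('"%s %s"' % (phrase, word)):
                phrase = phrase + " " + word
                break                    # recovered one more word
        else:
            return phrase                # no candidate matched - stop
    return phrase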
References
"The Insecure Indexing Vulnerability - Attacks Against Local Search Engines" (WASC article), Amit Klein, February 28th, 2005
[1] http://www.webappsec.org/projects/articles/022805.shtml
See also 'Application Misconfiguration'
[2] http://projects.webappsec.org/Application-Misconfiguration
See also 'Information Leakage'
[3] http://projects.webappsec.org/Information-Leakage
Information Leak Through Indexing of Private Data
[4] http://cwe.mitre.org/data/definitions/612.html
Comments (1)
Colin Watson said
at 2:36 am on Dec 14, 2009
A few suggestions to add to this if you think they help.
a) Sometimes indexes are made from database queries as well (as files), and these can include "deleted" records or ones where additional permissions would otherwise be required to view.
b) Could Example 1 perhaps be expanded to "Finding a hidden, test or old file", since "hidden" suggests positive action in some way, but forgotten-about test or old files are more common?
c) Similarly, I wonder if Examples 2 and 3 could be made commonplace? I don't think the use of "buffer overflow" helps a less experienced reader, since that is of course another attack category. I realise the importance of finding that particular text, but it makes this category less clear. Maybe Example 2 could use source code exposure (e.g. Java, ASP.NET, PHP or SQL code) from an indexed script instead? And Example 3 could use guessing more business-familiar text such as "salaries will be cut by 10% in June 2010"?