HTCAP
htcap is a web application scanner able to crawl single page applications (SPA) recursively by intercepting ajax calls and DOM changes.
Key Features
- Recursive DOM crawling engine
- Discovers ajax/fetch/jsonp/websocket requests
- Supports cookies, proxy, custom headers, http auth and more
- Heuristic page deduplication engine based on text similarities
- Scriptable login sequences
- All findings are saved to a SQLite database and can be exported to an interactive html report
- The built-in fuzzers can detect SQL-Injection, XSS, Command Execution, File disclosure and many more
- Can be easily interfaced with Sqlmap, Arachni, Wapiti, Burp and many other tools
- Fuzzers are built on top of a fuzzing framework so they can be easily created/customized
- Fuzzers fully support REST and SOAP payloads (json and xml)
- Both the crawler and the fuzzers run in a multithreaded environment
- The report comes with advanced filtering capabilities and workflow tools
Briefly
Htcap is not just another vulnerability scanner: it is focused on the crawling process and aims to detect and intercept ajax/fetch calls, websockets, jsonp, etc. It uses its own fuzzers plus a set of external tools to discover vulnerabilities and it's designed to be a tool for both manual and automated penetration testing of modern web applications.
It also features a small but powerful framework to quickly develop custom fuzzers with less than 60 lines of python.
The fuzzers developed on top of htcap's framework can work with GET/POST data, XML and JSON payloads and switch between POST
and GET. Of course, fuzzers run in parallel in a multi-threaded environment.
The scan process is divided into two parts: first htcap crawls the target and collects as
many requests as possible (urls, forms, ajax calls, etc.) and saves them to a SQLite database.
When the crawling is done it is possible to launch several security scanners against the
saved requests and save the scan results to the same database.
When the database is populated (at least with crawling data), it's possible to explore it
with readily available tools such as sqlite3 or DBeaver, or to export the results in various
formats using the built-in utilities.
What's new in version 1.1
Phantomjs has been replaced by Puppeteer, a nodejs module that controls Chromium over the DevTools Protocol.
Htcap's Javascript crawling engine has been rewritten to take advantage of the new async/await features of ECMAScript and has been converted to a nodejs module built on top of Puppeteer. You can find it here.
The user script functionality has been completely removed and will (probably) be re-added in upcoming releases.
Short Video of HTCAP crawling Gmail
QUICK START
Let's assume that we have to perform a penetration test against target.local, first we crawl the site:
$ htcap/htcap.py crawl target.local target.db
Once the crawl is done, the database (target.db) will contain all the requests discovered by the crawler. To explore/export the database we can use the htcap utilities or readily available tools. For example, to list all discovered ajax calls we can use a single shell command:
$ echo "SELECT method,url,data FROM request WHERE type = 'xhr';" | sqlite3 target.db
or
$ htcap/htcap.py util lsajax target.db
Now that the site is crawled it's possible to launch several vulnerability scanners against the requests saved to the database. A scanner is an external program that can fuzz requests to spot security flaws.
Htcap uses a modular architecture to handle different scanners and execute them in a multi-threaded environment. For example we can run ten parallel instances of sqlmap against saved ajax requests with the following command:
$ htcap/htcap.py scan -r xhr -n 10 sqlmap target.db
Htcap comes with its own fuzzers plus sqlmap and arachni modules.
While the native fuzzers can discover common web vulnerabilities (SQL injection, XSS, etc.),
sqlmap is used to discover and exploit hard-to-find SQL injections and arachni is used to
discover a larger set of vulnerabilities.
Since scanner modules extend the BaseScanner class, they can be easily created or
modified (see the section "Writing Scanner Modules" of this manual).
Htcap includes several tools to export the crawl and scan results.
For example, we can generate an interactive report containing the relevant information
about the website/webapp with the command below.
The relevant information includes, for example, the list of all pages that trigger
ajax calls or websockets and the pages that contain vulnerabilities.
$ htcap/htcap.py util report target.db target.html
For a list of discovered vulnerabilities, use the "lsvuln" utility:
$ htcap/htcap.py util lsvuln target.db
Htcap's commands can be chained using '\;' as a separator. When commands are chained they share the same database, even if the database file name has been dynamically generated.
$ htcap/htcap.py crawl https://www.fcvl.net/htcap/ htcap.db \; scan native \; scan sqlmap
SETUP
Requirements
- Python 3.3
- Nodejs and npm
- Sqlmap (for sqlmap scanner module)
- Arachni (for arachni scanner module)
Download and Run
$ git clone https://github.com/fcavallarin/htcap.git htcap
$ htcap/htcap.py
Alternatively you can download the latest zip here.
On first run, htcap will install all the needed nodejs modules. If you encounter dependency errors with Puppeteer, try to manually install "chromium-browser".
To install htcap system-wide:
# mv htcap /usr/share/
# ln -s /usr/share/htcap/htcap.py /usr/local/bin/htcap
DEMOS
You can find an online demo of the html report here
and a screenshot of the database view here
You can also explore the test pages here to see what the report
has been generated from. They also include a page to test ajax recursion.
EXPLORING DATABASE
In order to read the database it's possible to use the built-in utilities or any readily available sqlite3 client.
UTILITY EXAMPLES
Generate the html report. (demo report available here)
$ htcap/htcap.py util report target.db target.html
List all pages that trigger ajax requests:
$ htcap/htcap.py util lsajax target.db
Request ID: 6
Page URL: http://target.local/dashboard
Referer: http://target.local/
Ajax requests:
  [BUTTON txt=Statistics].click() -> GET http://target.local/api/get_stats
List all discovered SQL-Injection vulnerabilities:
$ htcap/htcap.py util lsvuln target.db "type='sqli'"
C O M M A N D
python /usr/local/bin/sqlmap --batch -u http://target.local/api/[...]

D E T A I L S
Parameter: name (POST)
  Type: error-based
  Title: PostgreSQL AND error-based - WHERE or HAVING clause
  Payload: id=1' AND 4163=CAST [...]
[...]
QUERY EXAMPLES
Search for login forms
SELECT referer, method, url, data FROM request WHERE type='form' AND (url LIKE '%login%' OR data LIKE '%password%')
Search inside the pages html
SELECT url FROM request WHERE html LIKE '%upload%' COLLATE NOCASE
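List the requests for which at least one vulnerability has been found (this query assumes the table layout described in the DATABASE STRUCTURE section below):
SELECT r.method, r.url, v.type FROM request r JOIN vulnerability v ON v.id_request = r.id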
AJAX CRAWLING
Htcap features an algorithm able to crawl ajax-based pages in a recursive manner.
The algorithm works by capturing ajax calls, mapping DOM changes to them and repeating
the process recursively against the newly added elements.
When a page is loaded, htcap starts by triggering all events and filling input values
with the aim of triggering ajax calls. When an ajax call is detected, htcap waits until
it has completed and its callback has been called; if, after that, the DOM has been
modified, htcap runs the same algorithm against the added elements and repeats it until
all the ajax calls have been fired.
[Flowchart: load page content -> interact with the new content -> if an ajax call is triggered, wait for it to complete -> if the content has been modified, interact with the new content again -> otherwise return]
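In pseudocode (a conceptual sketch, not htcap's actual nodejs implementation), the recursion looks roughly like this:

def crawl(elements, interact):
    # 'interact' fills input values, triggers events and waits for the resulting
    # ajax calls; it returns the DOM elements added by the ajax callbacks (if any)
    for element in elements:
        new_elements = interact(element)
        if new_elements:
            crawl(new_elements, interact)  # repeat against the newly added content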
COMMAND LINE ARGUMENTS
Crawler arguments
$ htcap crawl -h
usage: htcap [options] url outfile
Options:
  -h               this help
  -w               overwrite output file
  -q               do not display progress informations
  -v               be verbose
  -m MODE          set crawl mode:
                     - passive: do not intract with the page
                     - active: trigger events
                     - aggressive: also fill input values and crawl forms (default)
  -s SCOPE         set crawl scope
                     - domain: limit crawling to current domain (default)
                     - directory: limit crawling to current directory (and subdirecotries)
                     - url: do not crawl, just analyze a single page
  -D               maximum crawl depth (default: 100)
  -P               maximum crawl depth for consecutive forms (default: 10)
  -F               even if in aggressive mode, do not crawl forms
  -H               save HTML generated by the page
  -d DOMAINS       comma separated list of allowed domains (ex *.target.com)
  -c COOKIES       cookies as json or name=value pairs separaded by semicolon
  -C COOKIE_FILE   path to file containing COOKIES
  -r REFERER       set initial referer
  -x EXCLUDED      comma separated list of urls to exclude (regex) - ie logout urls
  -p PROXY         proxy string protocol:host:port - protocol can be 'http' or 'socks5'
  -n THREADS       number of parallel threads (default: 10)
  -A CREDENTIALS   username and password used for HTTP authentication separated by a colon
  -U USERAGENT     set user agent
  -t TIMEOUT       maximum seconds spent to analyze a page (default 180)
  -S               skip initial checks
  -G               group query_string parameters with the same name ('[]' ending excluded)
  -N               don't normalize URL path (keep ../../)
  -R               maximum number of redirects to follow (default 10)
  -I               ignore robots.txt
  -O               dont't override timeout functions (setTimeout, setInterval)
  -e               disable hEuristic page deduplication
  -l               do not run chrome in headless mode
  -E HEADER        set extra http headers (ex -E foo=bar -E bar=foo)
  -L SEQUENCE      set login sequence
Scanner arguments
$ htcap scan -h
Usage: scan [options] <scanner> <db_file> [scanner_options]
Options:
  -h                 this help
  -n THREADS         number of parallel threads
  -r REQUEST_TYPES   comma separated list of request types to pass to the scanner
  -m PATH            path to custom modules dir
  -v                 verbose mode
  -p PROXY | 0       proxy, set to 0 to disable default (default: crawler)
  -U USERAGENT       user agent (default: crawler)
  -c STRING | PATH   cookies (default: request)
  -E HEADER          extra http headers (default: crawler)
Scanner Options:
  those are scanner-specific options (if available), you should try -h ..
Crawl Modes
Htcap supports three crawl modes: passive, active and aggressive.
In passive mode, htcap does not interact with the page: no events are triggered and only
links are followed. In this mode htcap acts as a very basic web crawler that collects only the links found
in the page (A tags). This simulates a user that just clicks on links.
The active mode behaves like the passive mode but also triggers all discovered events. This simulates a user that interacts
with the page without filling input values.
The aggressive mode makes htcap also fill input values and submit forms. This simulates a user that performs
as many actions as possible on the page.
Example
Crawl http://www.target.local trying to be as stealthy as possible
$ htcap/htcap.py crawl -m passive www.target.local target.db
Crawl Scope
Htcap limits the crawling process to a specific scope. Available scopes are: domain, directory and url.
When the scope is set to domain, htcap will crawl only the domain of the target, plus the allowed domains (-d option).
If the scope is directory, htcap will crawl only the target directory and its subdirectories, and if the scope is
url, htcap will not crawl anything; it just analyzes a single page.
The excluded urls (-x option) are considered out of scope, so they get saved to the database but not crawled.
Examples
Crawl all discovered subdomains of http://target.local plus http://www.target1.local starting from http://www.target.local
$ htcap/htcap.py crawl -d '*.target.local,www.target1.local' www.target.local target.db
Crawl the directory admin and never go to the upper directory level
$ htcap/htcap.py crawl -s directory www.target.local/admin/ target.db
Excluded Urls
It's possible to exclude some urls from crawling by providing a comma separated list of regular expressions. Excluded urls are considered out of scope.
$ htcap/htcap.py crawl -x '.*logout.*,.*session.*' www.target.local/admin/ target.db
Crawl depth
Htcap is designed to limit the crawl depth to a specific threshold.
By default there are two depth limits: one for general crawling (-D) and the other
for sequential POST requests (-P).
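For example, to stop crawling after three levels of links and two consecutive form submissions (the values are purely illustrative):
$ htcap/htcap.py crawl -D 3 -P 2 www.target.local target.db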
Cookies
Cookies can be specified either as json or as a string, and can be passed as a command line option or inside a file.
The json must be formatted as follows; only the 'name' and 'value' properties are mandatory.
[
    {
        "name":"session",
        "value":"eyJpdiI6IkZXV1J",
        "domain":"target.local",
        "secure":false,
        "path":"/",
        "expires":1453990424,
        "httponly":true
    },
    {
        "name":"version",
        "value":"1.1",
        "domain":"target.local",
        "secure":false,
        "path":"/",
        "expires":1453990381,
        "httponly":true
    }
]
The string format is the classic list of name=value pairs separated by a semicolon:
session=eyJpdiI6IkZXV1J; version=1.1
Examples
$ htcap/htcap.py crawl -c 'session=someBetterToken; version=1' www.target.local/admin/ target.db
$ htcap/htcap.py crawl -c '[{name:"session",value:"someGoodToken"}]' www.target.local/admin/ target.db
$ htcap/htcap.py crawl -C cookies.json www.target.local/admin/ target.db
AUTHENTICATED CRAWLING
Htcap supports several methods to perform authenticated crawling/scanning. The easiest method is to provide session cookies via the command line, but in many situations this is not convenient and sometimes it's not even possible. The main disadvantage of this method is that session tokens usually get invalidated after a certain period of inactivity, so, for example, it's not possible to schedule a crawl to run each month. On top of that, modern web applications do not rely only on cookies for authentication; they may need token(s) placed in an HTTP header or in a REST payload.
To try to solve those problems, htcap implements so-called "login sequences". A login sequence is a json object that contains instructions on how to log in to the target platform.
A login sequence can be either "shared" or "standalone".
A shared login sequence is executed only once, before the crawling starts, to grab session cookies. In this case it's possible to specify which cookie(s) to acquire by listing their names in the "cookies" field.
A standalone login sequence should be used when each page requires login. If, at page load, the session is not automatically started by inheriting cookies from the parent request, the login sequence is executed.
The login sequence json is composed by the following fields:
type | It can be "shared" or "standalone" |
cookies | An array containing cookie names to grab. It's used only if type is "shared" |
url | The url of the login page |
loggedinCondition | A regexp to run against the page's html; if it matches, the session is considered started |
sequence | An array containing the sequence of actions to execute (for example, filling input values and clicking buttons). Each element is an array where the first element is the name of the action to take and the remaining elements are the "parameters" of that action. |
Sequence's actions are:
click <selector> | Clicks on the selected element and waits for ajax/fetch/jsonp requests to be completed |
write <selector> <text> | Types <text> into the selected element |
set <selector> <text> | Sets the value of the selected element to <text> |
sleep <millis> | Sleeps for <millis> milliseconds |
clickToNavigate <selector> [<timeout>] | Clicks on the selected element and waits for the page to navigate |
assertLoggedin | Checks loggedinCondition against the current html and exits if it fails |
A selector can be any valid CSS selector or, if the first character is '$', an XPath expression.
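For example, both of the following actions click the same login button, the first via a CSS selector and the second via an equivalent (purely illustrative) XPath expression:
["click", "#btn-login"]
["click", "$//button[@id='btn-login']"]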
An example of logging in to a simple form-based page:
{
    "type":"shared",
    "cookies":["PHPSESSID"],
    "url":"https://www.fcvl.net/htcap//scanme/login/",
    "loggedinCondition":"logout",
    "sequence": [
        ["write", "#username", "fcvl"],
        ["write", "#password", "passW00rD"],
        ["clickToNavigate", "#btn-login"],
        ["assertLoggedin"]
    ]
}
Example sequence to log in to a Gmail account:
"sequence": [
["write", "#identifierId", "filippo@fcvl.net"],
["click", "#identifierNext"],
["sleep", 1000],
["write", "input[name=password]", "*****************"],
["clickToNavigate", "#passwordNext", 800]
]
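Assuming the login sequence is saved to a json file (named here, for example, login.json), it can be passed to the crawler with the -L option:
$ htcap/htcap.py crawl -L login.json www.target.local target.db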
DATABASE STRUCTURE
Htcap's database is composed of the tables listed below:
CRAWL_INFO | Crawl information |
REQUEST | Contains the requests discovered by the crawler |
REQUEST_CHILD | Relations between requests |
ASSESSMENT | Each scanner run generates a new assessment |
VULNERABILITY | Vulnerabilities discovered by an assessment |
The CRAWL_INFO Table
The CRAWL_INFO table contains information about the crawl and, since each crawl has its own database, it contains one row only. It's composed of the following fields:
HTCAP_VERSION | Version of htcap |
TARGET | Target URL |
START_DATE | Crawl start date |
END_DATE | Crawl end date |
COMMANDLINE | Crawler commandline options |
USER_AGENT | User-agent set by the crawler |
The REQUEST Table
The REQUEST table contains all the requests discovered by the crawler.
It's composed of the following fields:
ID | Id of the request |
ID_PARENT | Id of the parent request |
TYPE | The type of the request |
METHOD | Request method |
URL | Request URL |
REFERER | Referer URL |
REDIRECTS | Number of redirects ahead of this page |
DATA | POST data |
COOKIES | Cookies as json |
HTTP_AUTH | Username:password used for basic http authentication |
OUT_OF_SCOPE | Equal to 1 if the URL is out of crawler scope |
TRIGGER | The html element and event that triggered the request |
CRAWLED | Equal to 1 if the request has been crawled |
CRAWLER_ERRORS | Array of crawler errors as json |
HTML | HTML generated by the page |
USER_OUTPUT | Array of user messages as Json - see ui.print() of User Script |
The parent request is the request from which the main request has been generated.
For example, each request inherits the cookies from its parent.
Note that the crawler follows just one path: this means that if page A is linked from page B and
page C, the crawler will load page A as if the navigation came from page B but not from
page C. To save all the connections between pages, the crawler uses a separate table.
This table, called REQUEST_CHILD, contains the following fields:
ID_REQUEST | Id of the parent request |
ID_CHILD | Id of the child request |
By combining these two tables it's possible to rebuild the whole site structure.
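For example, a query like the following (based on the table layout described above) lists every parent/child URL pair discovered by the crawler:
SELECT p.url AS parent, c.url AS child FROM request_child rc JOIN request p ON p.id = rc.id_request JOIN request c ON c.id = rc.id_child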
The ASSESSMENT Table
Each scanner run generates a new record in this table to save the scan information,
so basically an assessment is the execution of a scanner module.
It contains the following fields:
ID | Id of the assessment |
SCANNER | The name of the scanner |
START_DATE | Scan start date |
END_DATE | Scan end date |
The VULNERABILITY Table
The VULNERABILITY table contains all the vulnerabilities discovered by the various assessments. It contains the following fields:
ID | Id of the vulnerability |
ID_ASSESSMENT | Id of the assessment to which it belongs |
ID_REQUEST | Id of the request that has been scanned |
TYPE | vulnerability type (see vulntypes) |
DESCRIPTION | Description of the vulnerability |
WRITING SCANNER MODULES
A scanner module is a python class that extends the BaseScanner class. Each scanner executes multiple threads that are responsible for fuzzing a request and pushing the results back to the main thread. A scanner thread is an inner class (named Scan) that extends the ScannerThread class. A scanner may implement one or more fuzzers. A fuzzer is a class extending the BaseFuzzer class that generates the mutations of the request to fuzz, sends them out and parses the responses in order to find vulnerabilities.
The execution process of a scanner is basically as follows:
- The scanner is initialized by the parent class by calling its init method
- For each request, the parent class instantiates the Scan class and executes its run method
- The scanner thread can fuzz the request by implementing its own fuzzer or by executing an external program
- Once the fuzzer returns, the scanner thread can call the self.save_vulnerabilities method to save the findings to the database
Scanner Module Example
[...]

from core.scan.base_scanner import BaseScanner, ScannerThread
from core.scan.fuzzers.fileinclude import Fileinclude

class Demoscan(BaseScanner):
    def init(self, argv):
        """
        Function called on scanner startup

        Parameters
        ----------
        argv: string
            Command line arguments passed to scanner
        """
        self.vulns = []  # findings collected by add_vulns() and reported by end()

    def usage(self):
        return (
            "Options are:\n"
            "  -h   this help\n"
        )

    def add_vulns(self, vulns):
        """
        Example of a thread safe operation
        """
        self.lock.acquire()
        self.vulns.extend(vulns)
        self.lock.release()

    def end(self):
        """
        Function called before exit
        """
        print("\nScan ended, tot vulnerabilities: %d" % len(self.vulns))

    def get_settings(self):
        """
        Scanner settings
        """
        return dict(
            request_types = "form,link,redirect,xhr,fetch",
            num_threads = 10,
        )

    class Scan(ScannerThread):
        def run(self):
            """
            Thread main method that handles each request
            """
            if self.is_request_duplicated():
                return False

            self.sprint("Scanning %s %s" % (self.request.method, self.request.url))

            # Fuzzer class that handles request mutations
            fuzzer = Fileinclude(self)
            vulnerabilities = fuzzer.fuzz()

            self.save_vulnerabilities([{"type": 'fileinclude', "description": v} for v in vulnerabilities])
            self.scanner.add_vulns(vulnerabilities)
Fuzzer Module Example
import re

# BaseFuzzer is provided by htcap's fuzzing framework (import omitted here)

# Payloads used to generate the mutations. Each request parameter (get, post, xml, json, cookies)
# is replaced with each payload
payloads = [
    "/etc/passwd",
    "../../../../../../../../../../../../../../../../../../../../../../etc/passwd",
    "file:/c:\\windows\\system.ini",
    "c:\\windows\\system.ini",
]

# Regular expressions to check the response body
responses = [
    r'root\:[^ ]\:0\:0\:',
    r'\nwoafont=dosapp\.fon'
]

class Fileinclude(BaseFuzzer):
    def init(self):
        pass

    def is_vulnerable(self, body):
        body = self.utils.strip_html_tags(body)
        for regex in responses:
            if re.search(regex, body, re.M):
                return True
        return False

    def fuzz(self):
        vulnerabilities = []

        # Initialize the mutations iterator.
        # A mutation is an object that holds the original request with one parameter replaced by a payload
        mutations = self.get_mutations(self.request, payloads)

        for m in mutations:
            try:
                resp = m.send(ignore_errors=True)
            except Exception as e:
                self.sprint("Error: %s" % e)
                continue

            if not resp.body:
                continue

            if self.is_vulnerable(resp.body):
                vulnerabilities.append(str(m))
                # a vulnerable parameter has been found, skip its remaining payloads
                mutations.next_parameter()

        return vulnerabilities
Example of a Scanner Module that Executes an External Program
[...]

from core.scan.base_scanner import BaseScanner, ScannerThread

class Demoscan(BaseScanner):
    def init(self, argv):
        pass

    def usage(self):
        return (
            "Options are:\n"
            "  -h   this help\n"
        )

    def end(self):
        pass

    def get_settings(self):
        return dict(
            request_types = "form,link,redirect,xhr,fetch",
            num_threads = 10,
        )

    class Scan(ScannerThread):
        def run(self):
            if self.is_request_duplicated():
                return False

            cmd_options = [
                "--batch",
                "-u", self.request.url,
                "-v", "0"
            ]

            out = None
            try:
                cmd_out = self.utils.execmd("sqlmap", cmd_options)
                out = cmd_out['out']
            except Exception as e:
                self.sprint(e)

            # parse command output and save discovered vulnerabilities...
            [...]
Adding Modules
Custom scanner and fuzzer modules can be dynamically added with the -m command line option.
To add a scanner module called, for example, myscanner follow these steps:
- Create a file named myscanner.py inside a directory called custommods
- Inside that file create a Myscanner class that extends BaseScanner
To execute myscanner run the following command:
$ htcap/htcap.py scan -m custommods myscanner target.db
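A minimal skeleton for custommods/myscanner.py, modelled on the Demoscan example above (the settings values are just placeholders), could look like this:

from core.scan.base_scanner import BaseScanner, ScannerThread

class Myscanner(BaseScanner):
    def init(self, argv):
        pass

    def usage(self):
        return ""

    def end(self):
        pass

    def get_settings(self):
        return dict(
            request_types = "form,link,xhr,fetch",
            num_threads = 5,
        )

    class Scan(ScannerThread):
        def run(self):
            if self.is_request_duplicated():
                return False
            self.sprint("Scanning %s" % self.request.url)
            # fuzz self.request and call self.save_vulnerabilities([...]) with the findings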
URL Uniqueness
One of the biggest problems when scanning a web application is how to determine the
uniqueness of a page. Modern web applications usually use the same page to generate
different contents based on the url parameters or patterns. A vulnerability scanner
doesn't need to analyze the same page again just because its content has changed.
To solve this problem, or at least reduce its impact, htcap implements
an algorithm for url comparison that lets scanner modules skip "duplicated" urls.
Inside the run method it's possible to implement this feature using the code below.
if self.is_request_duplicated():
    return False
The algorithm is extremely simple: it just removes the values from the parameters
and sorts them alphabetically; for example http://www.test.local/a/?c=1&a=2&b=3
becomes http://www.test.local/a/?a=&b=&c= .
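A minimal Python sketch of this normalization (illustrative only, not htcap's actual implementation):

from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def normalize_url(url):
    # strip the parameter values and sort the parameter names alphabetically
    scheme, netloc, path, query, _ = urlsplit(url)
    names = sorted(name for name, _ in parse_qsl(query, keep_blank_values=True))
    return urlunsplit((scheme, netloc, path, urlencode([(n, "") for n in names]), ""))

# normalize_url("http://www.test.local/a/?c=1&a=2&b=3") -> "http://www.test.local/a/?a=&b=&c="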
A good idea would be to use the SimHash algorithm instead, but that would need extensive testing.
In case of POST requests the same algorithm is also applied to the following payloads:
- URL-Encoded
- XML
- JSON
Scan scope
Scanner modules analyze only in-scope requests. If the crawler scope is set to "url", any discovered request will be considered
out of scope, including ajax requests, jsonp, etc.
For example, if target.local/foo/bar.php has been crawled with the scope set to
"url" and it contains an ajax request to target.local/foo/ajax.php, that request won't be scanned. With a simple query it's possible to
make those requests visible to scanners.
UPDATE request set out_of_scope=0 where type in ('xhr','websocket','jsonp')
About
Htcap has been written by Filippo Cavallarin.
Please report bugs, comments, etc. to filippo.cavallarin[]wearesegment.com
Licensing
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.