Web content wordlists#
Summary#
Perimeter discovery is an important step of a web pentest and can, in some cases, lead to the compromise of a website. Several resources are available to carry out this reconnaissance, including web content wordlists for web fuzzing:
Name | First release | Last Update | Max Size (lines) |
---|---|---|---|
SecLists | 2012/02/20 | 2021/02/12 | 1.273.833 (directory-list-2.3-big.txt) |
Assetnote wordlists | 2020/11/16 | 2021/01/28 | 4.319.406 (httparchive_js_2020_11_18.txt) |
Dirb wordlists | 2015/06/16 | 2015/06/16 | 20.469 (big.txt) |
DirBuster wordlists | 2013/05/01 | 2013/05/01 | 220.560 (directory-list-2.3-medium.txt) |
Dirsearch dicc.txt | 2013/05/22 | 2021/02/10 | 9.021 (dicc.txt) |
Wfuzz wordlists | 2014/10/23 | 2019/03/14 | 45.459 (megabeast.txt) |
Wordlistctl (Bonus) | 2018/10/28 | 2018/11/02 | N/A |
* This post was written in February 2021.
Note that this post only covers wordlists of routes, files and folders. Wordlists of passwords, such as rockyou.txt, are therefore not covered.
SecLists#
SecLists is a collection of multiple types of wordlists, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more.
SecLists is the security tester's companion. [...] The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.
The repository is actively maintained: its last commit is less than two weeks old. The package is provided by most pentesting Linux distributions, such as BlackArch and Kali Linux.
Covered wordlists are located in Discovery/Web-Content/. There are a lot of wordlists available (121 in the main folder). Some of them are specific to a given technology (CGIs.txt, coldfusion.txt, oracle.txt...), others to a given language (common-and-french.txt, common-and-dutch.txt...). The main wordlist family in SecLists is the "RAFT Word Lists".
The RAFT wordlists were generated from the robots.txt files of 1.7 million websites and were originally provided by the RAFT tool in 2011. In this family, wordlists are organized as follows:
- 4 families (directories, extensions, files and words)
- 3 sizes per family (large, medium and small)
- 2 case options (normal and lowercase)
Name | Size (lines) large | Size (lines) medium | Size (lines) small |
---|---|---|---|
raft-*-directories.txt | 62.283 | 30.000 | 20.116 |
raft-*-directories-lowercase.txt | 56.163 | 26.584 | 17.770 |
raft-*-files.txt | 37.042 | 17.128 | 11.424 |
raft-*-files-lowercase.txt | 35.324 | 16.243 | 10.848 |
raft-*-extensions.txt | 2.449 | 1.289 | 963 |
raft-*-extensions-lowercase.txt | 2.366 | 1.233 | 914 |
raft-*-words.txt | 119.600 | 63.087 | 43.003 |
raft-*-words-lowercase.txt | 107.982 | 56.293 | 38.267 |
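The naming scheme above (4 categories, 3 sizes, 2 case options) gives 24 file names. As a minimal sketch, assuming the `raft-<size>-<category>[-lowercase].txt` pattern visible in the table, the names can be enumerated as follows (the function and constant names are illustrative, not part of SecLists):

```python
from itertools import product

# Values taken from the list above; the pattern is inferred from the table.
CATEGORIES = ["directories", "extensions", "files", "words"]
SIZES = ["large", "medium", "small"]
CASE_SUFFIXES = ["", "-lowercase"]


def raft_filenames():
    """Return every RAFT wordlist name implied by the naming scheme."""
    return [
        f"raft-{size}-{category}{suffix}.txt"
        for category, size, suffix in product(CATEGORIES, SIZES, CASE_SUFFIXES)
    ]


if __name__ == "__main__":
    for name in raft_filenames():
        print(name)
```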
Looking at raft-*-files.txt, we get the following extension distribution:
[Figures: histogram and pie chart of the extension distribution]
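An extension distribution like this one can be reproduced with a few lines of Python. This is only a sketch of one possible approach, not the method used to build the original charts: it assumes the extension is whatever follows the last dot of an entry, and it groups entries without one under a hypothetical `(none)` label.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_distribution(entries):
    """Count how many wordlist entries end with each file extension."""
    counts = Counter()
    for entry in entries:
        name = entry.strip()
        if not name:
            continue  # skip blank lines
        # Dotfiles such as ".htaccess" have no suffix and fall under "(none)".
        suffix = PurePosixPath(name).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    sample = ["index.php", "login.PHP", "robots.txt", ".htaccess", "README"]
    for extension, count in extension_distribution(sample).most_common():
        print(extension, count)
```

Since the function accepts any iterable of lines, an open file handle on a wordlist (for example a local copy of raft-large-files.txt) can be passed directly.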
SecLists also includes wordlists provided with dirbuster and dirb, covered in the rest of this post.
Assetnote wordlists#
Assetnote is a company that provides security tools and services to measure exposure to external attack. The company also provides a repository named Assetnote Wordlist.
These wordlists are generated monthly from Google BigQuery datasets with their Go client named commonspeak2, which produces content discovery and subdomain wordlists.
As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.
Wordlists are generated per technology. For this post we will focus on directories, API routes and the PHP, ASP.NET and JSP/JSPA technologies.
Note: as the January 2021 wordlists seem less complete than the previous ones, and the February 2021 wordlists are not available at the time of writing, we will focus on the November 2020 wordlists.
Name | Technology | Size (lines) |
---|---|---|
httparchive_directories_1m_2020_11_18.txt | Directories | 1.000.000 |
httparchive_apiroutes_2020_11_20.txt | API routes | 953.011 |
httparchive_php_2020_11_18.txt | PHP | 74.887 |
httparchive_aspx_asp_cfm_svc_ashx_asmx_2020_11_18.txt | ASP.NET | 63.200 |
httparchive_jsp_jspa_do_action_2020_11_18.txt | JSP | 10.506 |
[Figures: Assetnote Directories and Assetnote API routes]
Note: /, - and _ are considered as wildcards in the previous graphs.
Dirb wordlists#
Dirb is a web discovery tool already covered in a previous post. The tool is provided with multiple wordlists, the most common ones being:
Name | Size (lines) |
---|---|
common.txt (default wordlist for dirb) | 4.614 |
big.txt | 20.469 |
small.txt | 959 |
[Figure: charsets in the Dirb family]
These wordlists don't contain any extensions, and only 2% of the words contain capital letters. Note also that there are more "other" charsets in common.txt than in big.txt.
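Figures such as the share of words containing capital letters can be measured with a short script. The sketch below is an assumption about how such statistics might be computed, not the code behind the chart: it treats any character outside `A-Z`, `a-z` and `0-9` as "other", which may differ from the categories used in the original graph.

```python
import string

# Characters considered "regular"; everything else counts as "other" here.
ALPHANUMERIC = set(string.ascii_letters + string.digits)


def charset_stats(words):
    """Return the percentage of words with uppercase or 'other' characters."""
    words = [word.strip() for word in words]
    words = [word for word in words if word]
    total = len(words)
    if total == 0:
        return {"uppercase": 0.0, "other": 0.0}
    with_upper = sum(1 for w in words if any(c.isupper() for c in w))
    with_other = sum(1 for w in words if any(c not in ALPHANUMERIC for c in w))
    return {
        "uppercase": 100 * with_upper / total,
        "other": 100 * with_other / total,
    }


if __name__ == "__main__":
    print(charset_stats(["admin", "Admin", "cgi-bin", "img"]))
```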
DirBuster wordlists#
DirBuster is a web discovery tool that has also been covered in a previous post. The tool is provided with multiple wordlists, including the directory-list-2.3 wordlist family.
Name | Size (lines) |
---|---|
directory-list-2.3-big.txt | 1.273.833 |
directory-list-2.3-medium.txt | 220.560 |
directory-list-2.3-small.txt | 87.664 |
Some packaged versions may not include directory-list-2.3-big.txt.
Like the Dirb wordlists, directory-list-2.3 doesn't include any extensions.
[Figure: charsets in the directory-list-2.3 family]
Note: /, - and _ are considered as wildcards in the previous graph.
Dirsearch dicc.txt#
dicc.txt is a wordlist provided with the dirsearch tool. Its particularity is the variable extension %EXT%. The wordlist must therefore be used with tools that support the %EXT% format (see the post about web discovery tools).
The wordlist has a total of 9.021 lines distributed as follows:
[Figure: distribution of dicc.txt entries]
Note that "only" 500 words contain the %EXT% extension.
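To illustrate the %EXT% mechanism, here is a minimal sketch of how a tool could expand the placeholder into concrete extensions. It assumes entries shaped like `index.%EXT%` and a user-supplied list of extensions without leading dots; the actual behaviour of dirsearch or other tools may differ, and the function names are hypothetical.

```python
PLACEHOLDER = "%EXT%"


def expand_ext(entries, extensions, placeholder=PLACEHOLDER):
    """Replace the placeholder in each entry by every given extension.

    Entries without the placeholder are kept unchanged.
    """
    expanded = []
    for entry in entries:
        word = entry.strip()
        if not word:
            continue
        if placeholder in word:
            expanded.extend(word.replace(placeholder, ext) for ext in extensions)
        else:
            expanded.append(word)
    return expanded


def count_placeholder_entries(entries, placeholder=PLACEHOLDER):
    """Count how many entries rely on the placeholder."""
    return sum(1 for entry in entries if placeholder in entry)


if __name__ == "__main__":
    sample = ["admin/", "index.%EXT%", "config.%EXT%.bak"]
    print(expand_ext(sample, ["php", "asp"]))
    print(count_placeholder_entries(sample))
```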
Wfuzz wordlists#
Wfuzz is provided with a lot of wordlists. Some of them, in the "general" directory, are dedicated to directory and file enumeration. That is the case of megabeast.txt, big.txt, medium.txt and common.txt. None of these wordlists have words containing extensions. They are distributed as follows:
[Figure: charsets in the Wfuzz family]
Wordlistctl (Bonus)#
In some cases, an auditor may look for a specific wordlist. Wordlistctl is a tool designed to fetch, install, update and search for wordlists. This Python script offers more than 6400 wordlists and is maintained by the BlackArch Linux distribution.
    $ wordlistctl search wordpress
Security.txt (Bonus)#
Intro#
I (Alexandre ZANNI a.k.a. noraj) am adding a little bonus section about security.txt in web wordlists to the article by Alex GARRIDO (a.k.a. zeecka).
In 2020, I wrote an article about security.txt on the TurgenSec blog: Security.txt | Progress in Ethical Security Research.
I invite you to read the article to understand what security.txt is, what it is used for, and how widely adopted it has become.
Here we are only going to get an idea of how widely security.txt is included in security wordlists.
Stats#
SecLists#
Among the 233 lists used for web content discovery in SecLists, only 3 include at least one variant of the security.txt file.
    $ grep -rnE '^security.txt|.well-known/security.txt' /usr/share/seclists/Discovery/Web-Content
We can conclude that only 1.3% of the web content discovery lists in SecLists include security.txt.
However, SVNDigger/all.txt only includes security.txt, while common.txt and dirsearch.txt only include .well-known/security.txt. So no list includes both variants.
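The same check can be scripted to see which variant each list contains. The sketch below is not the command used for the stats above: it matches the two paths literally (with an optional leading slash), so its results may differ slightly from the grep pattern, whose unescaped dots match any character. The directory passed to `scan` is whatever folder holds your wordlists.

```python
import re
from pathlib import Path

ROOT_VARIANT = re.compile(r"^/?security\.txt$")
WELL_KNOWN_VARIANT = re.compile(r"^/?\.well-known/security\.txt$")


def variants_in(lines):
    """Return the set of security.txt variants found in the given lines."""
    found = set()
    for line in lines:
        entry = line.strip()
        if ROOT_VARIANT.match(entry):
            found.add("security.txt")
        elif WELL_KNOWN_VARIANT.match(entry):
            found.add(".well-known/security.txt")
    return found


def scan(directory):
    """Map each .txt wordlist under directory to the variants it contains."""
    results = {}
    for path in sorted(Path(directory).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="ignore") as handle:
            found = variants_in(handle)
        if found:
            results[str(path)] = found
    return results


if __name__ == "__main__":
    import sys

    for name, found in scan(sys.argv[1] if len(sys.argv) > 1 else ".").items():
        print(name, sorted(found))
```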
Assetnote Wordlists#
The Assetnote Wordlists are split into 3 categories:
- automated
- manual
- technologies
We'll exclude technologies from the stats since it focuses on specific products.
Among the 77 generic wordlists used for web content discovery in Assetnote Wordlists, only 3 include at least one variant of the security.txt file.
    $ grep -rnE '^security.txt|.well-known/security.txt' /tmp/assetnote-wordlists/{automated,manual}
We can conclude that only about 3.9% (3 out of 77) of the web content discovery lists in Assetnote Wordlists include security.txt.
However, all three only include security.txt and do not include the standard path .well-known/security.txt.
Conclusion#
If you are trying to find security.txt files, you should build your custom wordlists including the two following entries as most of the generic wordlists don't include them.
    security.txt
    .well-known/security.txt
An alternative is to keep running the common wordlists you are used to fuzzing with, and to build only an additional wordlist containing files like security.txt and other files missing from most wordlists, so you don't have to maintain the generic part on your own.
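Both approaches can be sketched in a few lines of Python. The example below assumes the two variants discussed in this section (security.txt and .well-known/security.txt) as the supplemental entries; the names `SUPPLEMENT` and `merge_wordlists` are illustrative only.

```python
# Supplemental entries missing from most generic wordlists (per the stats above).
SUPPLEMENT = ["security.txt", ".well-known/security.txt"]


def merge_wordlists(generic, supplement=SUPPLEMENT):
    """Append supplemental entries to a generic wordlist without duplicates.

    The original order is preserved and blank lines are dropped.
    """
    seen = set()
    merged = []
    for entry in list(generic) + list(supplement):
        word = entry.strip()
        if word and word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


if __name__ == "__main__":
    generic = ["admin", "security.txt", "robots.txt"]
    print("\n".join(merge_wordlists(generic)))
```

Keeping `SUPPLEMENT` in its own file and fuzzing with it as a second pass gives the alternative described above, without touching the generic wordlist.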
Comparative table#
Without further ado, here is a comparative table of the different wordlists discussed in this post. Colored cells represent a high correlation between wordlists. To understand the matrix you should read: "N% of the wordlist at line Y is contained in the wordlist at column X".
For example: 87% of wordlist n°17 (dirb - small) is contained in wordlist n°0 (seclists - raft-large-files).
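The reading rule above corresponds to a simple containment percentage, which is not symmetric. The following sketch shows one way to compute it; it is an illustration under the assumption that entries are compared as exact strings, and it is not the code from the sec-it/WL-Comparison repository.

```python
def containment(row_words, column_words):
    """Percentage of the row wordlist that is also present in the column wordlist."""
    row = set(row_words)
    column = set(column_words)
    if not row:
        return 0.0
    return 100 * len(row & column) / len(row)


def containment_matrix(wordlists):
    """Build a {row: {column: percentage}} matrix from a {name: words} mapping."""
    names = list(wordlists)
    return {
        row: {col: containment(wordlists[row], wordlists[col]) for col in names}
        for row in names
    }


if __name__ == "__main__":
    lists = {
        "small": ["admin", "login", "img", "tmp"],
        "large": ["admin", "login", "backup"],
    }
    for row, columns in containment_matrix(lists).items():
        print(row, {col: round(value, 1) for col, value in columns.items()})
```

A small list can be almost entirely contained in a large one while the reverse percentage stays low, which is why the matrix has to be read row by row.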
The sources used to generate this chart are available on this repository: sec-it/WL-Comparison.
An interactive version of the chart is available online.
About the author#
This piece was written by Alex GARRIDO a.k.a. zeecka. Alex is a pentester at SEC-IT.
Website: zeecka.fr