Web wordlists in 2021

Table of contents
  1. Web content wordlists
    1.1. Summary
    1.2. SecLists
    1.3. Assetnote wordlists
    1.4. Dirb wordlists
    1.5. DirBuster wordlists
    1.6. Dirsearch dicc.txt
    1.7. Wfuzz wordlists
    1.8. Wordlistctl (Bonus)
    1.9. Security.txt (Bonus)
      1.9.1. Intro
      1.9.2. Stats
        1.9.2.1. SecLists
        1.9.2.2. Assetnote Wordlists
      1.9.3. Conclusion
  2. Comparative table
  3. About the author

Web content wordlists#

Summary#

Perimeter discovery is an important step of a web pentest and can, in some cases, lead to the compromise of a website. Several resources support this reconnaissance, including web content wordlists for web fuzzing:

| Name | First release | Last update | Max size (lines) |
| --- | --- | --- | --- |
| SecLists | 2012/02/20 | 2021/02/12 | 1,273,833 (directory-list-2.3-big.txt) |
| Assetnote wordlists | 2020/11/16 | 2021/01/28 | 4,319,406 (httparchive_js_2020_11_18.txt) |
| Dirb wordlists | 2015/06/16 | 2015/06/16 | 20,469 (big.txt) |
| DirBuster wordlists | 2013/05/01 | 2013/05/01 | 220,560 (directory-list-2.3-medium.txt) |
| Dirsearch dicc.txt | 2013/05/22 | 2021/02/10 | 9,021 (dicc.txt) |
| Wfuzz wordlists | 2014/10/23 | 2019/03/14 | 45,459 (megabeast.txt) |
| Wordlistctl (Bonus) | 2018/10/28 | 2018/11/02 | N/A |

* This post was written in February 2021.

Note that this post only covers wordlists of routes, files and folders. Wordlists of passwords, such as rockyou.txt, are therefore not covered.

SecLists#

SecLists is a collection of multiple types of wordlists, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more.

SecLists is the security tester's companion. [...] The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.

The repository is actively maintained: at the time of writing, its last commit was less than two weeks old. The package is available in most pentesting Linux distributions, such as BlackArch and Kali Linux.

The wordlists covered here are located in Discovery/Web-Content/. There are a lot of them (121 in the main folder). Some are specific to a given technology (CGIs.txt, coldfusion.txt, oracle.txt ...), others to a given language (common-and-french.txt, common-and-dutch.txt ...). The main wordlist family in SecLists is the "RAFT Word Lists".

The RAFT wordlists were generated from the robots.txt files of 1.7 million websites and were originally provided with the RAFT tool in 2011. In this family, wordlists are organized as follows:

  • 4 families (directories, extensions, files and words)
  • 3 sizes per family (large, medium and small)
  • 2 case options (normal and lowercase)

| Name | Size (lines), large | Size (lines), medium | Size (lines), small |
| --- | --- | --- | --- |
| raft-*-directories.txt | 62,283 | 30,000 | 20,116 |
| raft-*-directories-lowercase.txt | 56,163 | 26,584 | 17,770 |
| raft-*-files.txt | 37,042 | 17,128 | 11,424 |
| raft-*-files-lowercase.txt | 35,324 | 16,243 | 10,848 |
| raft-*-extensions.txt | 2,449 | 1,289 | 963 |
| raft-*-extensions-lowercase.txt | 2,366 | 1,233 | 914 |
| raft-*-words.txt | 119,600 | 63,087 | 43,003 |
| raft-*-words-lowercase.txt | 107,982 | 56,293 | 38,267 |
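
The naming scheme above (4 families, 3 sizes, 2 case options) can be enumerated with a short script. This is only a sketch: the constant and function names are illustrative, and it assumes that the `*` in the table stands for the size keyword.

```python
from itertools import product

# Naming components taken from the list and table above.
FAMILIES = ("directories", "files", "extensions", "words")
SIZES = ("large", "medium", "small")
CASES = ("", "-lowercase")  # normal and lowercase variants


def raft_filenames():
    """Return every raft-<size>-<family>[-lowercase].txt file name."""
    return [
        f"raft-{size}-{family}{case}.txt"
        for family, size, case in product(FAMILIES, SIZES, CASES)
    ]


names = raft_filenames()
print(len(names))  # 4 families * 3 sizes * 2 case options
print(names[0])
```

Prefixing each name with the Discovery/Web-Content/ folder of a local SecLists copy gives the paths to feed to a fuzzer.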

Looking at raft-*-files.txt, we get the following extension distribution:

[Figures: histogram and pie chart of the extension distribution for the raft large, medium and small file wordlists]
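
A distribution like this one can be recomputed from any file wordlist. The sketch below is not the script used for the original charts: the function name and the sample entries are made up for illustration, and it only counts the last extension of each entry.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_counts(entries):
    """Count the (lowercased) last extension of each wordlist entry."""
    counts = Counter()
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        # Dotfiles such as .htaccess have no suffix according to pathlib.
        suffix = PurePosixPath(entry).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


# Hypothetical sample; a real run would iterate over the lines of a wordlist file.
sample = ["index.php", "login.php", "admin.php", "Default.ASPX",
          "robots.txt", "README", ".htaccess"]
counts = extension_counts(sample)
for extension, count in counts.most_common():
    print(f"{extension}\t{count}")
```

Passing an open file object instead of `sample` works as well, since the function only iterates over lines.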

SecLists also includes the wordlists provided with DirBuster and Dirb, which are covered later in this post.

Assetnote wordlists#

Assetnote is a company that provides security tools and services to measure exposure to external attack. The company also provides a repository named Assetnote Wordlist.

These wordlists are generated monthly from Google BigQuery datasets using Commonspeak2, their Go client, and consist of content discovery and subdomain wordlists.

As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.

[Figure: Assetnote Wordlist]

Wordlists are generated per technology. In this post, we focus on directories, API routes, and the PHP, ASP.NET and JSP/JSPA technologies.

Note: As the January 2021 wordlists seem less complete than the previous ones, and the February 2021 wordlists are not available at the time of writing, we focus on the November 2020 wordlists.

| Name | Technology | Size (lines) |
| --- | --- | --- |
| httparchive_directories_1m_2020_11_18.txt | Directories | 1,000,000 |
| httparchive_apiroutes_2020_11_20.txt | API routes | 953,011 |
| httparchive_php_2020_11_18.txt | PHP | 74,887 |
| httparchive_aspx_asp_cfm_svc_ashx_asmx_2020_11_18.txt | ASP.NET | 63,200 |
| httparchive_jsp_jspa_do_action_2020_11_18.txt | JSP | 10,506 |

[Figures: Assetnote Directories, Assetnote API routes]

Note: /, - and _ are treated as wildcards in the previous graphs.

Dirb wordlists#

Dirb is a web discovery tool already covered in a previous post. The tool ships with multiple wordlists, the most common being:

| Name | Size (lines) |
| --- | --- |
| common.txt (default wordlist for dirb) | 4,614 |
| big.txt | 20,469 |
| small.txt | 959 |

[Figures: charsets in the dirb family (big.txt, common.txt, small.txt)]

These wordlists contain no extensions, and only 2% of their words contain capital letters. Note also that common.txt contains more "other" charsets than big.txt.
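
The charset figures in this post can be approximated with a small classifier. The categories used for the original charts are not documented here, so the class names and regular expressions below are assumptions chosen for illustration; only the treatment of /, - and _ as wildcards comes from the notes in this post.

```python
import re
from collections import Counter

# "/", "-" and "_" are treated as wildcards: they are removed before classifying.
WILDCARDS = str.maketrans("", "", "/-_")

# Hypothetical charset classes, tested from the most to the least specific.
CLASSES = (
    ("digits", re.compile(r"[0-9]+")),
    ("lowercase", re.compile(r"[a-z]+")),
    ("uppercase", re.compile(r"[A-Z]+")),
    ("mixed case", re.compile(r"[A-Za-z]+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
)


def charset_of(word):
    """Return the first class whose pattern matches the whole word."""
    stripped = word.translate(WILDCARDS)
    for name, pattern in CLASSES:
        if pattern.fullmatch(stripped):
            return name
    return "other"  # dots, spaces, non-ASCII characters, empty words, ...


def charset_distribution(words):
    """Count how many non-empty words fall into each class."""
    return Counter(charset_of(word) for word in words if word)


def uppercase_share(words):
    """Percentage of words containing at least one capital letter."""
    words = [word for word in words if word]
    if not words:
        return 0.0
    with_upper = sum(any(char.isupper() for char in word) for word in words)
    return 100 * with_upper / len(words)


# Hypothetical sample entries.
sample = ["admin", "wp-content", "Admin", "CVS", "2021", "api_v2", ".htaccess"]
print(charset_distribution(sample))
print(round(uppercase_share(sample), 1))
```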

DirBuster wordlists#

DirBuster is a web discovery tool that has also been covered in a previous post. The tool ships with multiple wordlists, including the directory-list-2.3 family.

| Name | Size (lines) |
| --- | --- |
| directory-list-2.3-big.txt | 1,273,833 |
| directory-list-2.3-medium.txt | 220,560 |
| directory-list-2.3-small.txt | 87,664 |

Some packaged versions may not include directory-list-2.3-big.txt.

Like the dirb wordlists, the directory-list-2.3 wordlists don't include any extensions.

[Figures: charsets in the directory-list-2.3 family (directory-list-2.3-big.txt, directory-list-2.3-medium.txt, directory-list-2.3-small.txt)]

Note: /, - and _ are treated as wildcards in the previous graphs.

Dirsearch dicc.txt#

dicc.txt is the wordlist provided with the dirsearch tool. Its particularity is that it uses the extension placeholder %EXT%. The wordlist must therefore be used with tools that support the %EXT% format (see the post about web discovery tools). It has a total of 9,021 lines, distributed as follows:

[Figures: dicc.txt line distribution]

Note that "only" 500 entries contain the %EXT% placeholder.
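
To illustrate what supporting the %EXT% format means, here is a minimal sketch of the substitution. It is not dirsearch's implementation and the real tool may handle more cases: the function name and sample entries are hypothetical, and the sketch assumes that entries without the placeholder are kept unchanged.

```python
PLACEHOLDER = "%EXT%"


def expand_ext(entries, extensions):
    """Yield each entry, substituting %EXT% once per requested extension."""
    for entry in entries:
        if PLACEHOLDER in entry:
            for extension in extensions:
                yield entry.replace(PLACEHOLDER, extension)
        else:
            # Assumption: entries without the placeholder pass through as-is.
            yield entry


# Hypothetical entries in the dicc.txt style.
sample = ["admin/", "index.%EXT%", "config.%EXT%.bak"]
expanded = list(expand_ext(sample, ["php", "asp"]))
print(expanded)
```

A tool without this substitution step would request the literal string %EXT%, which is why such wordlists are tied to compatible tools.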

Wfuzz wordlists#

Wfuzz ships with many wordlists. Some of them, in the "general" directory, are dedicated to directory and file enumeration: megabeast.txt, big.txt, medium.txt and common.txt. None of these wordlists contain words with extensions. Their charsets are distributed as follows:

[Figures: charsets in the wfuzz family (megabeast.txt, big.txt, medium.txt, common.txt)]

Wordlistctl (Bonus)#

In some cases, an auditor may look for a specific wordlist. Wordlistctl is a tool designed to fetch, install, update and search for wordlists. This Python script offers more than 6,400 wordlists and is maintained by the BlackArch Linux distribution.

$ wordlistctl search wordpress

--==[ wordlistctl by blackarch.org ]==--

> wordpress (29.20 Kb)
> urls-wordpress-3 (36.62 Kb)
> wordpress-attacks-july2014 (88.00 B)
> wordpress_usernames (541.57 Mb)
> wordpress_attacks_july2014 (88 B)

$ wordlistctl fetch -l urls-wordpress-3
--==[ wordlistctl by blackarch.org ]==--

[*] downloading urls-wordpress-3.3.1.txt to /usr/share/wordlists/discovery/urls-wordpress-3.3.1.txt.part
[+] downloading urls-wordpress-3.3.1.txt completed

Security.txt (Bonus)#

Intro#

I (Alexandre ZANNI a.k.a. noraj) am adding a small bonus section about security.txt in web wordlists to this article by Alex GARRIDO (a.k.a. zeecka).

In 2020, I wrote an article about security.txt on TurgenSec blog: Security.txt | Progress in Ethical Security Research.

I invite you to read that article to understand what security.txt is, what it is used for, and how widely adopted it has become.

Here we only want to get an idea of how widely security.txt is included in security wordlists.

Stats#

SecLists#

Among the 233 lists used for Web content discovery in SecLists, only 3 include at least one variant of the security.txt file.

$ grep -rnE '^security.txt|.well-known/security.txt' /usr/share/seclists/Discovery/Web-Content
/usr/share/seclists/Discovery/Web-Content/SVNDigger/all.txt:15772:security.txt
/usr/share/seclists/Discovery/Web-Content/common.txt:87:.well-known/security.txt
/usr/share/seclists/Discovery/Web-Content/dirsearch.txt:933:.well-known/security.txt

$ find /usr/share/seclists/Discovery/Web-Content -type f | wc -l
233

We can conclude that only 1.3% of the Web content discovery lists in SecLists include security.txt.

However, SVNDigger/all.txt only includes security.txt, while common.txt and dirsearch.txt only include .well-known/security.txt. So no list includes both variants.
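
The same check can be scripted to report, for each list, which variants it contains. The sketch below is an illustration with hypothetical function names. It is stricter than the grep command above: it only accepts entries that are exactly one of the two variants, optionally preceded by slashes.

```python
from pathlib import Path

VARIANTS = ("security.txt", ".well-known/security.txt")


def variants_in(lines):
    """Return the set of security.txt variants found among wordlist lines."""
    found = set()
    for line in lines:
        entry = line.strip().lstrip("/")
        if entry in VARIANTS:
            found.add(entry)
    return found


def coverage(root):
    """Map every file under root to the variants it contains."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open(encoding="utf-8", errors="ignore") as handle:
                report[path] = variants_in(handle)
    return report


def summarize(report):
    """Return (lists with any variant, lists with both variants, share in %)."""
    hits = sum(1 for found in report.values() if found)
    both = sum(1 for found in report.values() if len(found) == len(VARIANTS))
    share = 100 * hits / len(report) if report else 0.0
    return hits, both, share


# In-memory demonstration with made-up lines.
print(sorted(variants_in(["admin\n", "security.txt\n", "index.php\n"])))
```

Calling `summarize(coverage(...))` on the Discovery/Web-Content folder of a local SecLists copy should give figures comparable to the ones above, although the exact counts depend on the SecLists version and on the stricter matching.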

Assetnote Wordlists#

The Assetnote Wordlists are split into 3 categories:

  • automated
  • manual
  • technologies

We'll exclude technologies from the stats since it focuses on specific products.

Among the 77 generic wordlists used for Web content discovery in Assetnote Wordlists, only 3 include at least one variant of the security.txt file.

$ grep -rnE '^security.txt|.well-known/security.txt' /tmp/assetnote-wordlists/{automated,manual}
/tmp/assetnote-wordlists/automated/httparchive_txt_2021_02_28.txt:623:security.txt
/tmp/assetnote-wordlists/automated/httparchive_txt_2021_01_28.txt:701:security.txt
/tmp/assetnote-wordlists/automated/httparchive_txt_2020_11_18.txt:366:security.txt

$ find /tmp/assetnote-wordlists/{automated,manual} -type f | wc -l
77

We can conclude that only about 3.9% (3 out of 77) of the generic Web content discovery lists in Assetnote Wordlists include security.txt.

Moreover, all three only include security.txt and not the standard path .well-known/security.txt.

Conclusion#

If you are trying to find security.txt files, you should build a custom wordlist that includes the two following entries, as most generic wordlists don't include them.

security.txt
.well-known/security.txt

An alternative is to keep running the common wordlists you are used to fuzzing with, and to maintain one small additional wordlist that only contains files such as security.txt that are missing from most wordlists, so you don't have to update the generic part on your own.
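
Both approaches boil down to combining a generic list with a small supplementary one. As a rough sketch, with a hypothetical function name and a made-up stand-in for the generic wordlist, the merge could look like this:

```python
def merge_wordlists(*wordlists):
    """Concatenate wordlists, dropping blanks and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for wordlist in wordlists:
        for line in wordlist:
            entry = line.strip()
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    return merged


# The two entries recommended above.
EXTRA = ["security.txt", ".well-known/security.txt"]

# Made-up stand-in for a generic wordlist read from disk.
generic = ["admin", "robots.txt", ".well-known/security.txt"]

merged = merge_wordlists(generic, EXTRA)
print(merged)
```

Keeping the supplementary entries in their own file means the generic wordlist can still be updated from upstream without losing them.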

Comparative table#

Without further ado, here is a comparative table of the wordlists discussed in this post. Colored cells represent a high correlation between wordlists. The matrix reads as follows: "N% of the wordlist in row Y is contained in the wordlist in column X".

E.g.: 87% of wordlist n°17 (dirb - small) is contained in wordlist n°0 (seclists - raft-large-files).
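
The reading rule of the matrix can be sketched as a set intersection. This is an assumption about how such a figure can be computed, not the code from the repository behind the chart: the function names and sample lists are hypothetical.

```python
def overlap_percent(row_list, column_list):
    """Percentage of the row wordlist's unique entries found in the column wordlist."""
    row = set(row_list)
    if not row:
        return 0.0
    return 100 * len(row & set(column_list)) / len(row)


def overlap_matrix(wordlists):
    """Map each (row name, column name) pair to its overlap percentage."""
    return {
        (row_name, column_name): overlap_percent(row_entries, column_entries)
        for row_name, row_entries in wordlists.items()
        for column_name, column_entries in wordlists.items()
    }


# Made-up wordlists, only to show that the matrix is not symmetric.
wordlists = {
    "small": ["admin", "images", "js", "css"],
    "big": ["admin", "images", "js", "index.php", "login.php", "backup"],
}
matrix = overlap_matrix(wordlists)
print(matrix[("small", "big")])
print(matrix[("big", "small")])
```

The two printed values differ because each percentage is relative to the size of the row wordlist, which is why a small list can be almost entirely contained in a large one while the reverse is not true.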

[Figure: Comparison between wordlists]

The sources used to generate this chart are available on this repository: sec-it/WL-Comparison.

An interactive version of the chart is available online.

About the author#

This piece was written by Alex GARRIDO a.k.a. zeecka. Alex is a pentester at SEC-IT.

Website: zeecka.fr