This post gives you an overview of how you can use data leaks and breaches in your investigations, how to find them, how to search them and some handy tools built around them. Some of the presented techniques are variations of the contents of Michael Bazzell's book “Open Source Intelligence Techniques” and I highly encourage you to buy this book. If you already have it, read carefully through the chapter on “Data Breaches and Leaks” and you might find some more gold. Throughout this post I will provide links to additional resources that explore some topics further.

In this post I will not include any screenshots or examples as they always include personal data I am not allowed to share. You could reproduce all steps with your own sample data to learn the techniques.

What are they:

Data leaks and breaches are the loss of data by companies, websites, forums or other entities, and that data will more often than not end up somewhere on the internet. If a company accidentally exposes a database with user information to the public internet, we call it a leak; if some black hats get in, steal the database and sell it, we call it a breach. You should know those terms because a lot of people are going to get upset if you do not. You can read up on the difference in this blog post: https://blog.f-secure.com/data-breach-and-data-leak-whats-the-difference/. For the rest of this post the distinction will not matter.

In this post we mostly discuss leaks and breaches containing personal data like emails, phone numbers and passwords, but they can contain a lot more valuable data like IP addresses, physical addresses, SSNs, scans of driver's licenses and passports and much more.

Where to download them:

Today there are many online services for searching through leak and breach data (discussed later). Nevertheless, having offline copies of those datasets is always an advantage, as it gives you the full data without having to pay for it and makes you independent of those services. But be aware that storing them can quickly consume a lot of disk space.

Disclaimer: I am in no way a legal expert. This is for educational purposes only. Having this data can be illegal or violate company or other policies. Do not use this to commit crimes. Read up on the law or contact a lawyer before you reproduce any steps. You do this at your own risk.

When you search for data like this, you will often find it offered for sale by likely criminals. Please never purchase data like this: you will get ripped off or encourage illegal activities, and it is unnecessary. And of course use proper OPSEC (VM, VPN, hardened browser) when searching for this type of data. Needless to say, .exe files are probably not the ones you are looking for.

I will not point you to any links but the following things could potentially lead you to some pretty juicy stuff:

  • Certain forums like Nulled, CrackingX and others will offer some datasets for free if you search through them. Most of the time you need a user account, so please use a burner email. As already mentioned, please do not ever purchase them.
  • If a new breach or leak of a company is reported, just googling for it will often only give you news reports. Instead try “company name” filetype:zip or “company name” ext:zip. Try this with multiple common file extensions like .rar, .7z, .sql, .gz and .txt.
  • Often data like this will be shared on pastebin so it is always worth looking for “whatever you look for” site:pastebin.com.
  • If some site gives you a lot of false positives, exclude it with -site:annoying.com.
  • Googling for “LIUsers.7z” will get you to a dataset on the webarchive that we will later use. It contains the user id and corresponding email of LinkedIn users.
  • A search for “SnapChat.7z” will reveal a similar dataset containing Snapchat usernames and phone numbers.
  • Googling for Gravatar site:crackingx.com could bring up the scraped database of Gravatar, which is discussed later.
  • A list of over 200GB of emails and passwords was compiled in 2021 and googling for “CompilationOfManyBreaches.7z” or “CompilationOfManyBreaches.7z” 18.65 may help. The h8mail post from the first search may be useful (tool will be discussed later).
  • Ransomware gangs now expose data of companies that do not pay. It is pretty well outlined in this post and the accompanying podcast: https://inteltechniques.com/blog/2021/07/23/personal-ransomware-exposure/.
  • Companies may accidentally leak data by failing to secure their MongoDB or ElasticSearch databases. For further information read https://habr.com/en/post/443132/ or https://inteltechniques.com/blog/2019/05/24/the-privacy-security-osint-show-episode-123/.

As you can see, Google-fu will often do the trick.
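The search patterns from the list above are easy to generate in bulk. The following is only a minimal sketch in Python: the function name build_queries, its parameters and the default extension list are made up for illustration and do not belong to any tool mentioned in this post.

```python
def build_queries(term, extensions=("zip", "rar", "7z", "sql", "gz", "txt"),
                  exclude_sites=()):
    """Return search-engine query strings for a quoted term.

    One query per file extension (filetype: operator) plus one query
    restricted to pastebin.com. Every site in exclude_sites is removed
    from the results with the -site: operator.
    """
    quoted = f'"{term}"'
    exclusions = "".join(f" -site:{site}" for site in exclude_sites)
    queries = [f"{quoted} filetype:{ext}{exclusions}" for ext in extensions]
    queries.append(f"{quoted} site:pastebin.com{exclusions}")
    return queries


if __name__ == "__main__":
    # Hypothetical example values, replace them with your own.
    for query in build_queries("company name", exclude_sites=["annoying.com"]):
        print(query)
```

The sketch only builds strings. You still paste them into the search engine yourself, which keeps you in control of what is actually sent out.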

When obtaining datasets like this, you should always validate them by cross-checking with various online sources, or with your own accounts or those of people you know that are exposed in this leak or breach.

If you go ahead and download all of the breaches available out there you will be out of disk space pretty soon, so consider doing clean-up on the datasets or storing them in a unified database. As I unfortunately have pretty bad habits when it comes to storing this kind of data and this is a whole separate topic, I advise you to do some research and figure out what works for you.
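One simple form of clean-up is removing duplicate lines across several text files. The following is a rough sketch under my own assumptions, not a recommendation for a specific storage layout: the function name dedupe_lines is hypothetical, the files are assumed to be line-based text, and only a hash of each line is kept in memory. For really large collections an external sort or a proper database will likely serve you better.

```python
import hashlib


def dedupe_lines(in_paths, out_path):
    """Copy every distinct non-empty line from in_paths to out_path once.

    Returns a tuple (lines_read, lines_written). Lines are compared
    exactly (case-sensitive) after stripping the line ending. Only a
    20-byte SHA-1 digest per distinct line is kept in memory.
    """
    seen = set()
    lines_read = 0
    lines_written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            # errors="replace" keeps going on lines with broken encoding.
            with open(path, "r", encoding="utf-8", errors="replace") as src:
                for line in src:
                    lines_read += 1
                    record = line.rstrip("\r\n")
                    if not record.strip():
                        continue  # skip empty lines
                    digest = hashlib.sha1(record.encode("utf-8")).digest()
                    if digest in seen:
                        continue  # duplicate of an earlier line
                    seen.add(digest)
                    out.write(record + "\n")
                    lines_written += 1
    return lines_read, lines_written
```

The comparison is deliberately case-sensitive, because two records that differ only in the case of a password are different records.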

If you are interested in collecting breach data, you may want to listen to: https://inteltechniques.com/blog/2023/04/28/the-privacy-security-osint-show-episode-295/.

Online sources for leaked and breached data:

While having all this data stored offline has benefits, it is simply impractical to get every dataset out there. Therefore I will now show you some online sources that allow you to work with this kind of data.

HaveIBeenPwned:

https://haveibeenpwned.com/

Probably the most common source for finding emails, phone numbers and passwords that were exposed in data breaches and leaks. This site will never give you any clear text password, but it will show you the source of the exposure. This is pretty useful if you are dealing with combo lists or want to verify an obtained dataset. It also helps you see which datasets your target is included in, so you can go from there and locate and download them. Keep in mind, though, that the main purpose of this tool is to check your own exposure and improve your OPSEC.
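For checking your own passwords, HIBP also offers the Pwned Passwords range API at https://api.pwnedpasswords.com/range/. You send only the first five hex characters of the SHA-1 hash and compare the returned suffixes locally, so the password itself never leaves your machine. The sketch below, in Python, shows only the local part of that check. The function names are my own, the network request is intentionally left out, and the response body in the example is fabricated sample data in the SUFFIX:COUNT format the API returns.

```python
import hashlib


def split_hash(password):
    """Return (prefix, suffix) of the upper-case SHA-1 hex digest.

    The 5-character prefix is what would be sent to the range API,
    the 35-character suffix stays on your machine.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix, response_text):
    """Look up suffix in a range response made of SUFFIX:COUNT lines.

    Returns the reported count, or 0 if the suffix is not listed.
    """
    for line in response_text.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix.upper():
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = split_hash("correct horse battery staple")
    print("would request range:", prefix)
    # Fabricated response body, only to show the parsing step.
    sample = "0018A45C4D1DEF81644B54AB7F969B88D65:3\r\n" + suffix + ":12\r\n"
    print("times seen:", count_in_range(suffix, sample))
```

Fetching the real response for the prefix is a plain HTTPS GET against the URL above followed by the five characters.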

Breachdirectory:

https://breachdirectory.org/

Works similarly to HIBP but shows you a partial password and sometimes a full hash, which is even better for validation purposes.

Dehashed:

https://www.dehashed.com/

A paid resource aggregating data breaches and leaks and making them fully searchable. Not my favourite as I do not want to put in the money but probably the easiest way to access this kind of data.

Snusbase:

https://snusbase.com/

Another paid service I have never used but I will leave it here for reference.

IntelX:

https://intelx.io/

This data aggregator is not only for data leaks and breaches but will give you a lot of the data contained within them. The free trial account will show you the sources and therefore help you to find additional datasets for your offline collection.

PSBDMP:

https://psbdmp.ws/

This website collects Pastebin data and makes it searchable. To view it, however, you have to register with a Google account, and much of the functionality relies on IntelX.

illicit services:

https://search.illicit.services/

Probably the most amazing service of them all. This one is a banger and I hope it stays. It allows for a variety of search options like email, first name, phone number, ASN and much more. On top of all that it gives you full access to all the data it has, and it has plenty.

Using them:

By now you have potentially thousands of GB of breaches and leaks on your device and many more sources to find them online, so let’s put them to good use. Most of my experience is with data concerning names, emails and passwords, so most of the examples here will focus on that, but do keep in mind that these datasets often include much more good stuff you can use during your investigation.

When working with all the data you have downloaded, you probably want to be in a Linux environment, as it provides the necessary tools to work with large amounts of data. Almost any GUI-based editor will crumble and it will take you ages to get anything done, which is why you should use either a VM or WSL to really get all the value out of it.

ripgrep:

https://github.com/BurntSushi/ripgrep

ripgrep is a fast and reliable tool for searching through the files in a directory with regex. See the above link for documentation, but most of the time you will use rg -a -F -i -N "your term", which searches for the literal term instead of a regex.
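The flags in that command mean: -a treats binary files as text, -F searches for a fixed string instead of a regex, -i ignores case and -N suppresses line numbers. To make clear what such a search does, here is a rough Python sketch of the same idea. It is only an illustration with a made-up function name, it is far slower than ripgrep, and unlike ripgrep's defaults it does not skip hidden or ignored files.

```python
import os


def search_dir(root, term):
    """Yield (path, line) for every line under root containing term.

    Mirrors the idea of rg -a -F -i -N: files are read as raw bytes
    (so binary files are searched too), the term is matched literally
    and case-insensitively (ASCII letters only), and no line numbers
    are reported.
    """
    needle = term.lower().encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    for raw in handle:
                        if needle in raw.lower():
                            text = raw.rstrip(b"\r\n").decode("utf-8", errors="replace")
                            yield path, text
            except OSError:
                continue  # unreadable file, move on


if __name__ == "__main__":
    # Hypothetical directory and search term.
    for path, line in search_dir("./datasets", "your term"):
        print(f"{path}:{line}")
```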

h8mail:

https://github.com/khast3x/h8mail

H8mail is a wonderful tool that integrates with many APIs to search leak data. It also includes the functionality to search local files. This is outlined, for example, in this article: https://khast3x.club/posts/2021-02-17-h8mail-with-COMB/.

CompilationOfManyBreaches:

The creators included a script to search for email addresses within the compilation at record speed. Note that the script can also search for partial emails if you are looking for the start of an email address (for example, “peter.example” will give you all the peter.examples at any domain). The folder structure is organized by the first letters of the pwned email addresses.

Gaining an initial foothold:

In many cases we can leverage data from leaks and breaches to gain an initial foothold on our target. Examples for this are:

  • Searching for the combination of First and Last Name in datasets like the Gravatar dataset, to obtain information on email addresses, user accounts and passwords. From there you can find other accounts of them.
  • Obtaining the user id from the LinkedIn account and retrieving the corresponding email address from the dataset. To find the id, inspect the source code of the profile and search for “member:”. Currently it is the second result from the bottom; it is the one that changes when you repeat this on other profiles.
  • Same thing as above just with Snapchat users and phone numbers.

Broadening the scope:

This kind of data can help us to find additional attack surface for our target. This could be achieved, for example, by the following methods:

  • Enter the phone number or email address of your target into HIBP or a similar site and obtain the websites they have accounts with. Specifically search for those datasets in other sources to gain more information. You can also investigate their accounts on those websites.
  • If the target has a unique password you can search the datasets for this password and get potential additional targets.
  • Search for partial email addresses. For example, if the target is [email protected], try “lucy.someone” to find other email addresses of your target.
  • Use information you find during the broadening, like email addresses, passwords and phone numbers, to further broaden the scope by using the techniques above.

Compromising the target:

Depending on the goal of your investigation, targets can be fully compromised using only data included in leaks and breaches. Compromising any target is ALWAYS illegal unless you have explicit permission to do so from all involved parties. Examples of compromise can be:

  • The target reuses publicly available passwords on other accounts and these accounts can be taken over.
  • The home address, true identity or other compromising information about the target is exposed in any dataset you can link to the target.
  • You can link the target to an account they have on any illegal or otherwise compromising website.
  • Data you obtain from any dataset gives you enough information to achieve a successful spear phishing attack.

Additional Tools:

Some tools built to aid your overall investigation are Maltego https://www.maltego.com/ and Spiderfoot https://www.spiderfoot.net/. They are not focused on data breaches and leaks, but they have often helped me when working with this data, because they automatically collect data from various sources and illustrate the relationships between data points by displaying them as graphs.

Conclusions:

Of course this post only covers methods that involve data breaches and leaks; in most investigations you have to combine those with the various other tools in your toolset. Nevertheless, this kind of data is an extremely valuable resource with devastating potential if used correctly. I strongly advise you to use all of this to test your own exposure and take appropriate measures to protect yourself against possible attacks.