Analyzing PhishTank’s verified online phishes.

Apr 11, 2020

Phishing attacks have been targeting individuals since the 1980s, as per Wikipedia, and we need to admit that threat actors have perfected this technique (via spear-phishing attacks) to the point where they can target more organizations and achieve their goals.

I’ve been working with my team to develop a cybersecurity framework, and detecting phishing attempts targeting organizations is one of our main points of research. We have the capability to correlate network connections against threat intel feeds in an automated fashion.

If you are as much of a believer in the so-called Pyramid of Pain as I am, you probably already know that phishing domains are low-hanging fruit in terms of detection (third level in the pyramid), yet having this functionality in place allows detecting potential attacks with minimum effort.

David Bianco’s Pyramid of Pain: https://www.oreilly.com/library/view/intelligence-driven-incident-response/9781491935187/ch04.html

Therefore, I spent some time looking at PhishTank with the idea of correlating our detections against their dataset. I was also interested in understanding how many phishes related to COVID-19 PhishTank had. As you can imagine, attackers are using the current global coronavirus situation to carry out attacks and compromise organizations.

What is PhishTank anyway?

As described on their website:

PhishTank is a free community site where anyone can submit, verify, track and share phishing data. PhishTank is free to everyone, both the website and the data (via the API). PhishTank is not protection. PhishTank is an information clearinghouse, which helps to pour sunshine on some of the dark alleys of the Internet. PhishTank provides accurate, actionable information to anyone trying to identify bad actors, whether for themselves or for others (i.e., building security tools). PhishTank is operated by OpenDNS, a company founded in 2005 to improve the Internet through safer, faster, and smarter DNS.

How many online valid phishes does PhishTank manage?

At the time of writing, there are 13,786 online, verified phishes. This number is relative: it depends on the submissions PhishTank receives and the criteria it uses to validate them. The number is also visible on their statistics page.

COVID-19 related phishes

I have created a simple application that parses the PhishTank dataset of online, verified phishes into Elasticsearch. The reason is that adding extra parsing and transformations makes the dataset easier to understand.

The code for this parsing can be found on GitHub.
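The repository remains the reference for the actual parser. As a rough illustration of the idea only, here is a minimal sketch that turns PhishTank-style records into the newline-delimited JSON body that the Elasticsearch bulk API accepts. The field names (phish_id, url, target, verified, online, submission_time, details) are assumptions about the feed format, and the index name phishtank and the sample record are made up for the example.

```python
import json
from urllib.parse import urlsplit


def to_bulk_lines(records, index="phishtank"):
    """Yield NDJSON lines (action line, then document line) for the bulk API."""
    for rec in records:
        # Assumed feed layout: "details" is a list of hosting records.
        details = rec.get("details") or [{}]
        doc = {
            "phish_id": rec["phish_id"],
            "url": rec["url"],
            # Extra transformation: extract the host so it can be aggregated on.
            "host": (urlsplit(rec["url"]).hostname or "").lower(),
            "target": rec.get("target", "Other"),
            "verified": rec.get("verified") == "yes",
            "online": rec.get("online") == "yes",
            "submission_time": rec.get("submission_time"),
            "country": details[0].get("country"),
        }
        action = {"index": {"_index": index, "_id": str(rec["phish_id"])}}
        yield json.dumps(action)
        yield json.dumps(doc)


# Hypothetical record shaped like an entry of the feed.
sample = [{
    "phish_id": "1",
    "url": "https://docs.google.com/forms/d/e/EXAMPLE/viewform",
    "target": "Other",
    "verified": "yes",
    "online": "yes",
    "submission_time": "2020-04-10T12:00:00+00:00",
    "details": [{"ip_address": "203.0.113.10", "country": "US"}],
}]

# The bulk API expects the body to end with a newline.
bulk_body = "\n".join(to_bulk_lines(sample)) + "\n"
print(bulk_body)
```

The resulting body could then be sent to the cluster's _bulk endpoint with any HTTP client. Extracting the host at indexing time is what later makes per-domain visualizations cheap to build.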

I managed to find 26 verified phishes related to COVID-19 in PhishTank. Take a look at the following picture:

PhishTank to ELK: Visualizing online, verified phishes associated with COVID-19
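One simple way to look for such phishes is a keyword match on the URL. The sketch below assumes that approach; the keyword list and the sample URLs are hypothetical, and a broad term like "corona" can also match unrelated URLs, so the hits still need a manual review.

```python
import re

# Hypothetical keyword list; extend it as new lures show up.
COVID_PATTERN = re.compile(r"covid|corona", re.IGNORECASE)


def covid_related(urls):
    """Return the URLs that contain one of the COVID-19 keywords."""
    return [u for u in urls if COVID_PATTERN.search(u)]


# Made-up URLs for illustration only.
sample_urls = [
    "http://covid19-relief.example.com/login",
    "https://docs.google.com/forms/d/e/EXAMPLE/viewform",
    "http://example.org/CoronaVirus/update.php",
]

matches = covid_related(sample_urls)
print(matches)
```

A keyword search only catches phishes that mention the theme in the URL itself. Pages hosted behind URL shorteners or generic form links would be missed.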

The power of visualizations

I’ve been a big fan of uploading datasets to Elasticsearch to better understand their demographics; that is, to analyze the data and get meaningful information out of it.

Such analysis is great for making decisions and understanding datasets faster. Here are some screenshots of the visualizations I have created for PhishTank:

Here are some distributions from the online, verified phishes pulled via the PhishTank API.

As depicted above, some of the domains hosting online, verified phishes include Google Docs, Firebase, bit.ly, Office Forms and Drive, among others. We can also see the targets of such phishes, including a large number where the target is unclear (Other), as well as targeted organizations such as PayPal, Microsoft, Facebook, eBay, etc.
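The same kind of distribution can be approximated without a dashboard by counting values directly. This is a minimal sketch, assuming each record carries a url and a target field; the sample records are invented and do not reflect the real proportions in the dataset.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_values(records, n=5):
    """Count hosting domains and targets, most frequent first."""
    hosts = Counter(
        (urlsplit(r["url"]).hostname or "").lower() for r in records
    )
    targets = Counter(r.get("target", "Other") for r in records)
    return hosts.most_common(n), targets.most_common(n)


# Made-up records for illustration only.
sample = [
    {"url": "https://docs.google.com/forms/d/e/A/viewform", "target": "Other"},
    {"url": "https://docs.google.com/forms/d/e/B/viewform", "target": "Microsoft"},
    {"url": "https://bit.ly/EXAMPLE", "target": "PayPal"},
    {"url": "http://phish.example.net/paypal/login", "target": "PayPal"},
]

top_hosts, top_targets = top_values(sample)
print(top_hosts)
print(top_targets)
```

This is roughly what a terms aggregation on the host and target fields does inside Elasticsearch, which is what bar or pie charts of this kind are typically built on.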

As we can see, several visualizations can be built to better understand the PhishTank dataset. This will allow us to create detection rules within our security products and detect when connections to these domains happen within the organization.

Takeaways:

Most of the phishes are hosted on legitimate domains, so if we create a rule to detect a network connection to, for example, docs.google.com, several false positives will show up, because there can be legitimate connections to these domains. From the threat actor’s point of view, this is a nice move: the malicious traffic will be lost in regular traffic unless the organization has other kinds of detection, such as analysis of the URL paths, historical records of connections to those URLs, or the length of the URL, among others.
Most of the connections go to the US, which is understandable considering that the majority of those phishes are hosted at docs.google.com.
I would have expected more phishes related to COVID-19. Threat actors will look for novel opportunities to hit organizations, and the global crisis caused by this disease is low-hanging fruit for them to exploit.
It is good to know who is being targeted. Here, target means the legitimate organization that the links pretend to belong to in order to deceive the end user.
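The first takeaway suggests a two-tier rule: match on the host alone for dedicated phishing domains, but require the full path when the host is a shared, legitimate service. The following is a sketch under that assumption, not our production rule. The SHARED_HOSTS set, including the forms.office.com hostname, and all URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list of legitimate services that also host phishes.
SHARED_HOSTS = {"docs.google.com", "bit.ly", "forms.office.com"}


def normalize(url):
    """Return (host, path) with the host lowercased and trailing slash removed."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def build_index(phish_urls):
    """Split known phishes into exact (host, path) pairs and host-only entries."""
    exact, hosts = set(), set()
    for url in phish_urls:
        host, path = normalize(url)
        if host in SHARED_HOSTS:
            exact.add((host, path))  # shared service: the path must match too
        else:
            hosts.add(host)          # dedicated domain: the host is enough
    return exact, hosts


def is_suspicious(url, exact, hosts):
    """Check an observed URL against the two-tier index."""
    host, path = normalize(url)
    if host in SHARED_HOSTS:
        return (host, path) in exact
    return host in hosts


# Made-up phishing URLs for illustration only.
phishes = [
    "https://docs.google.com/forms/d/e/PHISH/viewform",
    "http://login-paypal.example.net/signin",
]
exact, hosts = build_index(phishes)

print(is_suspicious("https://docs.google.com/forms/d/e/PHISH/viewform", exact, hosts))
print(is_suspicious("https://docs.google.com/document/d/LEGIT/edit", exact, hosts))
```

Path-level matching needs visibility of the full URL, for example from proxy logs. Where only hostnames are visible, the shared services cannot be separated this way and the other signals mentioned above, such as connection history, become more important.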