Why You Should Never Post Your Email To Social Media

This post was originally written back in late 2012 as part of a quest to better understand spammers. As part of the near-pointless crusade we looked at a large number of ways in which they obtained and harvested data in an attempt to determine ways of mitigating abuse. One such technique we looked at was the way in which they harvested emails (and other personal details) from social media. This article, while a little outdated, will look at some of the automated methods that spam-bots use to harvest emails from Twitter.

Note: With the new Twitter API, unauthenticated requests to Twitter are no longer permitted. For that reason, the code will no longer work as advertised. Authenticated requests will still be able to use the method as described. The same methods apply to any social network or public webpage.

In the time we emulated techniques used by spammers to harvest emails from Twitter, we saved 2,654,529 emails shared by users (in their tweets) to a local database.

If you read no further, the point of this post is clear: If you must make your email public, do so via an unreadable form with a pattern that is difficult to emulate, such as something {at} something [dot] com. Never post your email in full.
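As a rough illustration of that advice, the helper below rewrites an address into the bracketed form before you publish it. The function name obfuscateEmail() is made up for this example and the replacement tokens are simply the ones suggested above; treat it as a sketch, not a guarantee against a determined harvester.

```php
<?php

// Hypothetical helper: rewrite an address into the "{at} ... [dot]" form
// suggested above so that a plain email regex no longer matches it.
function obfuscateEmail(string $email): string
{
    // Replace the two separators that a harvester's pattern relies on.
    return str_replace(array('@', '.'), array(' {at} ', ' [dot] '), trim($email));
}

echo obfuscateEmail('something@something.com'), "\n";
// prints: something {at} something [dot] com
```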

Twitter is a spammer's paradise. By searching for emails in various feeds made available by Twitter - and assuming the email in each tweet belongs to the profile posting it - they can very easily determine very specific and targeted information relating to the owner of the (valid and confirmed) email address. Interests, geographic location, a website, your name, and a plethora of other personal details can all very easily be extracted and manufactured into a targeted profile. Greater profile accuracy can be achieved by adding specific #hashtags or keywords to the email search query.

In the time I've run my little experiment, I've saved 2,654,529 emails to my database. Keep in mind that I'm doing this for demonstration purposes only... but if I were an aggressive spammer I could probably have saved at least five times as many contacts. I'm only searching the two big free email providers. The fact that so many non-Gmail/Hotmail emails were caught up in the mix suggests that massive results could be achieved by widening search parameters (or using a regular expression on a wider search) to include other free and/or popular email services.

The emails I'm saving are just an encrypted username with the domain intact. I'm not saving actual emails.
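This post doesn't describe how the username is encrypted. One way to get a similar effect, assumed here purely for illustration, is to replace the part before the @ with a keyed SHA-256 hash and leave the domain readable. The function name maskEmail() and the key are placeholders.

```php
<?php

// Hypothetical helper: hash the local part, keep the domain intact.
// HMAC-SHA256 is an assumption; the post does not name the method used.
function maskEmail(string $email, string $secretKey): string
{
    $email = strtolower(trim($email));
    $at = strrpos($email, '@');
    if ($at === false) {
        throw new InvalidArgumentException('Not an email address');
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // The same input and key always give the same hash, so duplicates can
    // still be detected without storing the real username.
    return hash_hmac('sha256', $local, $secretKey) . '@' . $domain;
}

echo maskEmail('Jane.Doe@gmail.com', 'replace-with-a-private-key'), "\n";
```

Because the hash is deterministic, uniqueness checks and per-domain counts keep working on the masked values.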

What I'll show you in this post is very basic information on what spammers are doing and how they do it. It should be noted that I'm vigorously opposed to spam and don't condone use of the code I've provided below for anything other than educational purposes. The principles discussed have countless ethical applications and should be used as such. Unsolicited email is a crime worthy of castration.

If you're worried about what I've done, don't be; I'm not providing anything in this post that isn't already freely available to spammers elsewhere.

Email in Search Feeds

By searching Twitter's Search API, spammers can either parse the returned data for detailed information relating to each tweet or, by simply using a regular expression, extract all occurrences of an email from returned text. It's the latter technique I'll talk about in this post.

Search Twitter and Save Emails to a Text File

For the purpose of this example we're going to search Twitter for the most recent occurrences of only the terms gmail.com or hotmail.com (since they're the most popular free email providers). We'll search the returned text for all email addresses and save the unique results to a text file.

The first thing the spammer will do is make a request to Twitter and extract the data. The search that I'm using to retrieve results (in JSON format) is as follows:

<?php

http://search.twitter.com/search.json?q=gmail.com+OR+hotmail.com&rpp=100&include_entities=true&result_type=recent

To return an ATOM feed, use http://search.twitter.com/search.atom. For a list of all available URL parameters, and for information on using search (including use of search operators), see Twitter's API documentation.
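If you'd rather not hand-assemble the query string, PHP's http_build_query() can build it from an array. The sketch below reproduces the JSON search URL shown above; the variable names are my own, and the endpoint itself is the old unauthenticated one that no longer responds.

```php
<?php

// Parameters taken from the search URL shown above.
$params = array(
    'q'                => 'gmail.com OR hotmail.com',
    'rpp'              => 100,
    'include_entities' => 'true',
    'result_type'      => 'recent',
);

// http_build_query() URL-encodes each value (spaces become "+").
$searchUrl = 'http://search.twitter.com/search.json?' . http_build_query($params, '', '&');

echo $searchUrl, "\n";
// prints: http://search.twitter.com/search.json?q=gmail.com+OR+hotmail.com&rpp=100&include_entities=true&result_type=recent
```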

The CURL Function

Request Data from Twitter

<?php

function curlPage($proxy, $url, $referer, $agent, $header, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // When true, the response headers are prepended to the returned text.
    curl_setopt($ch, CURLOPT_HEADER, $header);
    // Return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Using a proxy? Uncomment both lines; tunnelling only applies with a proxy.
    // curl_setopt($ch, CURLOPT_PROXY, $proxy);
    // curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    // curl_exec() returns false if the request fails.
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$proxy = "IP:PORT";
$url = "http://search.twitter.com/search.json?q=QUERY";
$referer = "http://www.google.com/";
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8";
$header = false; // we only want the response body
$timeout = 5; // connection timeout in seconds

$twitterSearch = curlPage($proxy, $url, $referer, $agent, $header, $timeout);

The clever spammers would obviously randomise their $agent, $referer, $proxy and other details in an attempt to avoid detection.

file_get_contents() Alternative

If you wanted to test this out and your server didn't support cURL you could use PHP's file_get_contents() as an alternative.

<?php

$twitterSearch = file_get_contents("http://search.twitter.com/search.json?q=QUERY");

Finding the Emails

Based on the returned text, we will now perform a regular expression match using preg_match_all() to find all occurrences of any email. All the matches will be returned in an array.

<?php

preg_match_all("([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b)siU",$twitterSearch,$matches);
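To see what a pattern like this picks up, the sketch below runs a deliberately simplified email expression (not the one above) over a made-up tweet. The sample text and variable names are invented for the example.

```php
<?php

// Invented sample text: one plain address (posted twice with different
// casing) and one address written in the obfuscated form.
$sampleText = 'Mail Jane.Doe@gmail.com or jane.doe@gmail.com, or try bob {at} hotmail [dot] com';

// Simplified pattern for illustration only; it is looser than the one above.
$pattern = '/[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}/i';

preg_match_all($pattern, $sampleText, $found);

// Normalise case, then drop duplicates and re-index the array.
$unique = array_values(array_unique(array_map('strtolower', $found[0])));

print_r($unique);
```

Only jane.doe@gmail.com ends up in $unique. The bracketed address has no @ for the pattern to anchor on, so it is skipped.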

Based on the data returned in the $matches array, we must loop through and write each unique email to a text file. It should be noted that my code is grossly inefficient (I'm not overly concerned about writing functional spam code or clever enough to think of an alternative). For my own example (and other applications that use the same principle), I'm writing the unique emails to a MySQL database.

We'll iterate through the array and check each email against those that were already written to the email text file. To accomplish this, I'm using strpos() - not very efficient... but it's all I could think of.

<?php

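foreach ($matches[0] as $email_id) {
    $email = trim(strtolower($email_id));
    $File = "emails.txt";
    // Load the saved list (empty if the file doesn't exist yet).
    $file = file_exists($File) ? file_get_contents($File) : '';
    // Look for the address as a whole line, so that bob@gmail.com isn't
    // treated as a duplicate of jimbob@gmail.com.
    $posExist = strpos("\n" . $file, "\n" . $email . "\n");
    if ($posExist === false) {
        $Handle = fopen($File, 'a+');
        $Data = "$email\n";
        fwrite($Handle, $Data);
        echo "$email written to text file";
        fclose($Handle);
    } else {
        echo "$email is duplicate";
    }
}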

The email text file (emails.txt) should be in the same directory as your PHP script and it must have appropriate write permissions.
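If re-reading the file for every address becomes a bottleneck, one possible alternative is to load the file once into an array keyed by address and test membership with isset(). This is a sketch, not the code I used for the experiment; the function name appendUniqueEmails() is made up.

```php
<?php

// Hypothetical helper: append only the addresses not already present in $path.
// Returns the list of addresses that were newly written.
function appendUniqueEmails(string $path, array $emails): array
{
    // Load existing addresses once; array keys give constant-time lookups.
    $lines = is_file($path)
        ? file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();
    $seen = array_fill_keys($lines, true);

    $added = array();
    foreach ($emails as $email) {
        $email = strtolower(trim($email));
        if ($email === '' || isset($seen[$email])) {
            continue; // blank or duplicate
        }
        $seen[$email] = true;
        $added[] = $email;
    }

    if ($added) {
        // One locked append for the whole batch instead of one fopen() per address.
        file_put_contents($path, implode("\n", $added) . "\n", FILE_APPEND | LOCK_EX);
    }

    return $added;
}

// Usage with a throwaway file.
$file = tempnam(sys_get_temp_dir(), 'emails');
print_r(appendUniqueEmails($file, array('A@gmail.com', 'a@gmail.com', 'b@hotmail.com')));
unlink($file);
```

The trade-off is memory: the whole list sits in an array, which is fine for a text file but is another reason a database with a unique index suits larger lists better.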

To view all emails returned from your search, you can simply use PHP's print_r() function.

<?php

/* Wrap result in pre tags so the array is readable in a browser */

echo '<pre>' . htmlspecialchars(print_r($matches, true)) . '</pre>';

My Working Example

I started saving Gmail and Hotmail emails to a MySQL database for experimental and demonstration purposes only at 3pm on the 9th of August, 2012. I'm making a query only once per hour, and in that time, I've saved 2,654,529 emails. Again, all are unique, valid and confirmed emails that I could use to build a very targeted spam email list.

Practical Applications

I use the Twitter search in conjunction with their other API offerings all the time for statistical analysis of various types of information with a focus on airline Twitter trends. Selected data is compiled into both CSV files and graphical representations for various clients. From that data, and by utilising other known data about a user, an organisation is able to better understand how their brand is represented by different groups in different geographical areas. I'm also able to compare participation and reply rates to various tweets. It's one of the most powerful - if not the most powerful - marketing (or survey) tools available anywhere... and it's essentially free.

It's expected that the search API will become more restrictive in time (not unlike other API calls), meaning it'll be easier for Twitter to police this kind of activity. Twitter will know what applications are making what requests... meaning that they can block a specific application without affecting other users of a particular IP address.

The metadata in search can quite easily be used to give an emotional snapshot of a group of users. It can be used in a basic sense to draw word clouds and draw attention to regional trends or, in advanced cases, it allows you to closely monitor your competition. I could ramble forever...
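A word cloud starts from nothing more than word counts. The sketch below, using invented tweets and an invented function name, tallies the most frequent words across a set of tweet texts. A real analysis would also strip stop words and cope with non-English text.

```php
<?php

// Hypothetical helper: count how often each word appears across tweets.
function wordFrequencies(array $tweets, int $minLength = 4): array
{
    $counts = array();
    foreach ($tweets as $tweet) {
        // Split on anything that is not a letter, digit, underscore or "#".
        $words = preg_split('/[^\w#]+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) >= $minLength) {
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent first
    return $counts;
}

// Invented sample tweets.
$tweets = array(
    'Delayed again, thanks #airline',
    'Great crew on my flight today #airline',
    'Flight delayed two hours',
);
print_r(wordFrequencies($tweets));
```

The counts can then be fed to whatever draws the cloud, with font size scaled by frequency.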

Harvesting Emails

The ability to harvest emails is one of the less redeeming 'features' of the search API - but it's not the fault of Twitter - it's simply users that don't know any better.

I created a Twitter bot that would tweet random people who had posted their email address. It is, in itself, very spammy... so I gave up on the experiment. The tweet included a link to a page with all sorts of information I was able to automatically extract from their profile or tweet (including interest-based keywords etc).

How would you respond to an email that said:

Hi YourName, I just saw your post on Twitter and thought that you might also be interested in this http://tinyurl.com/internoetics. I hope you don't mind me emailing you directly (I got your email from one of your earlier posts). Regards, Marty @Internoetics.

The link would send you to a dodgy page with nefarious intent.

Again, if you must make your email public, do so via an unreadable form with a pattern that is difficult to emulate, such as something {at} something [dot] com. Never post your email in full.