Process of Data scraping from Social media

Name, location, age, job role, marital status, headshot? The amount of information people are comfortable with posting online varies.

But most people accept that whatever we put on our public profile page is out in the public domain.

So, how would you feel if all your information was catalogued by a hacker and put into a monster spreadsheet with millions of entries, to be sold online to the highest paying cyber-criminal?

That’s what a hacker calling himself Tom Liner did last month “for fun” when he compiled a database of 700 million LinkedIn users from all over the world, which he is selling for around $5,000 (£3,600; €4,200).

The incident, and other similar cases of social media scraping, have sparked a fierce debate about whether or not the basic personal information we share publicly on our profiles should be better protected.

In the case of Mr Liner, his latest exploit was announced at 08:57 BST in a post on a notorious hacking forum.

It was a strangely civilised hour for hackers, but of course we have no idea which time zone the hacker who calls himself Tom Liner lives in.

“Hi, I have 700 million 2021 LinkedIn records”, he wrote.

Included in the post was a link to a sample of a million records and an invite for other hackers to contact him privately and make him offers for his database.

Understandably the sale caused a stir in the hacking world and Tom tells me he is selling his haul to “multiple” happy customers for around $5,000 (£3,600; €4,200).

He won’t say who his customers are, or why they would want this information, but he says the data is likely being used for further malicious hacking campaigns.

The news has also set the cyber-security and privacy world alight with arguments about whether or not we should be worried about this growing trend of mega scrapes.

What’s important to understand here is that these databases aren’t being created by breaking into the servers or websites of social networks.

They are largely constructed by scraping the public-facing surface of platforms using automatic programmes to take whatever information is freely available about users.

In theory, most of the data being compiled could be found by simply picking through individual social media profile pages one by one, although of course it would take multiple lifetimes to gather as much data as the hackers are able to.

So far this year, there have been at least three other major “scraping” incidents.

In April, a hacker sold another database of around 500 million records scraped from LinkedIn.

In the same week another hacker posted a database of scraped information from 1.3 million Clubhouse profiles on a forum for free.

Also in April, 533 million Facebook user details were compiled from a mixture of old and new scraping before being given away on a hacking forum with a request for donations.

The hacker who says he is responsible for that Facebook database calls himself Tom Liner.

I spoke with Tom over three weeks via Telegram, a cloud-based instant messenger app. Some messages and even missed calls came in the middle of the night, and others during working hours, so there was no clue as to his location.

The only clues to his normal life were when he said he couldn’t talk on the phone as his wife was sleeping and that he had a daytime job and hacking was his “hobby”.

Tom told me he created the 700 million LinkedIn database using “almost the exact same technique” that he used to create the Facebook list.

He said: “It took me several months to do. It was very complex. I had to hack the API of LinkedIn. If you do too many requests for user data in one time then the system will permanently ban you.”

API stands for application programming interface and most social networks sell API partnerships, which enable other companies to access their data, perhaps for marketing purposes or for building apps.

Tom says he found a way to trick the LinkedIn API software into giving him the huge tranche of records without setting off alarms.
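The kind of request limit Tom describes can be pictured as a sliding-window rate limiter on the platform's side. The sketch below is illustrative only: the class name, the thresholds and the permanent-ban rule are assumptions made for the example, not a description of how LinkedIn's systems actually work.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds.

    Hypothetical sketch: a client that exceeds the limit is banned for good,
    mirroring the behaviour Tom describes rather than any real platform's rules.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                      # injectable so it can be tested
        self.history = defaultdict(deque)       # client id -> request timestamps
        self.banned = set()

    def allow(self, client_id):
        if client_id in self.banned:
            return False
        now = self.clock()
        stamps = self.history[client_id]
        # Forget requests that have fallen outside the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_requests:
            self.banned.add(client_id)          # too many requests at once
            return False
        stamps.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
print([limiter.allow("client-a") for _ in range(5)])
# prints [True, True, True, False, False]
```

A limiter like this only counts requests per client, which is why a scraper that spreads its requests out slowly, or across many clients, might stay under the threshold.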

Privacy Shark, which first discovered the sale of the database, examined the free sample and found it included full names, email addresses, gender, phone numbers and industry information.

LinkedIn insists that Tom Liner did not use their API but confirmed that the dataset “includes information scraped from LinkedIn, as well as information obtained from other sources”.

It adds: “This was not a LinkedIn data breach and no private LinkedIn member data was exposed. Scraping data from LinkedIn is a violation of our Terms of Service and we are constantly working to ensure our members’ privacy is protected.”

In response to its April data scare Facebook also brushed off the incident as an old scrape. The press office team even accidentally revealed to a reporter that their strategy is to “frame data scraping as a broad industry issue and normalise the fact that this activity happens regularly”.

However, the fact that hackers are making money from these databases is worrying some experts on cyber security.

Amir Hadžipašić, the chief executive and founder of SOS Intelligence, a company which provides firms with threat intelligence, sweeps hacker forums on the dark web day and night. As soon as news of the 700 million LinkedIn database spread, he and his team began analysing the data.

Mr Hadžipašić says the details in this, and other mass-scraping events, are not what most people would expect to be available in the public domain. He thinks API programmes, which give more information about users than the general public can see, should be more tightly controlled.

“Large-scale leaks like this are concerning, given the intricate detail, in some cases, of this information – such as geographic locations or private mobile and email addresses.

“To most people it will come as a surprise that there’s so much information held by these API enrichment services.

“This information in the wrong hands could be significantly impacting for some,” he said.

Tom Liner says he knows his database is likely to be used for malicious attacks.

He says it does “bother him” but would not say why he still continues to carry out scraping operations.

Mr Hadžipašić, who is based in southern England, says hackers who are buying the LinkedIn data could use it to launch targeted hacking campaigns on high-level targets, such as company bosses.

He also said there is value in the sheer number of active emails in the database that can be used to send out mass email phishing campaigns.

‘No ambiguity’

But cyber-security expert Troy Hunt, who spends most of his working life poring over the contents of hacked databases for his website haveibeenpwned.com, is less concerned about the recent scraping incidents and says we need to accept them as part of our public profile-sharing.

“These are definitely not breaches, there’s no ambiguity here. Most of this data is public anyway.

“The question to ask, in each case though, is how much of this information is by user choice publicly accessible and how much is not expected to be publicly accessible.”
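Mr Hunt's question can be put in concrete terms by comparing the fields that turn up in a scraped dataset with the fields a user chose to make public. The sketch below is a hypothetical illustration: the `Profile` structure, the field names and the sample values are all invented for the example and do not come from any real dataset or platform.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    values: dict                                       # field name -> stored value
    public_fields: set = field(default_factory=set)    # fields the user chose to show


def public_view(profile):
    """Return only the fields the user has marked as public."""
    return {k: v for k, v in profile.values.items() if k in profile.public_fields}


def unexpected_exposure(profile, exported_fields):
    """List exported fields that the user holds but never made public."""
    return sorted(
        name for name in exported_fields
        if name in profile.values and name not in profile.public_fields
    )


# Invented sample profile: only the name and industry are public.
profile = Profile(
    values={
        "full_name": "Alex Example",
        "industry": "Logistics",
        "email": "alex@example.com",
        "phone": "+44 0000 000000",
    },
    public_fields={"full_name", "industry"},
)

print(public_view(profile))
# prints {'full_name': 'Alex Example', 'industry': 'Logistics'}
print(unexpected_exposure(profile, ["full_name", "email", "phone"]))
# prints ['email', 'phone']
```

Anything in the second list is information the user did not expect to be publicly accessible, which is the category both experts say deserves tighter control.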

Troy agrees with Amir that controls on social networks’ API programmes need to be improved and says we can’t brush off these incidents.

“I don’t disagree with the stance of Facebook and others but I feel that the response of ‘this isn’t a problem’ is, whilst possibly technically accurate, missing the sentiment of how valuable this user data is and they’re perhaps downplaying their own roles in the creation of these databases.”

Mr Liner’s actions would be likely to get him sued by social networks for intellectual property theft or copyright infringement. He probably wouldn’t face the full force of the law for his actions if he were ever found but, when asked if he was worried about getting arrested, he said “no, anyone can’t find me” and ended our conversation by saying “have a nice time”.