Data Brokers are Machine Learning’s Rogue Traders
Charlie Sammonds
Apr 24
Machine learning, the process by which artificial intelligence (AI) can learn from experience, is built on reams of data.
Once initially programmed, machine learning software will analyse data, draw conclusions and, in turn, ‘learn’ from those conclusions.
Put simply, the more data a machine learning program can be fed, the more effective it becomes.
This is, in part, the reason why data has become the valuable commodity of recent years.
Giants like Google, Amazon and Facebook rely on data as their lifeblood; their overwhelming access to it puts them ahead of the competition on just about any new project that involves data processing or machine learning.
For smaller entities without the access to swathes of user data, there are alternate means of accessing the data necessary to build machine learning projects.
One of these is data brokerages.
An Industry Built on Sand
An entirely modern business, the model of the data broker is scarcely more complex than to hoard as much data as possible from as wide a variety of sources as possible.
Their product is packages of that data, which companies will buy for marketing purposes, machine learning training purposes, etc.
Such is the sweeping nature of the data held by data brokers that Jeffrey Chester, executive director of the Center for Digital Democracy, dubbed them “privacy deathstars”.
A regulatory vacuum has allowed them to multiply and prosper until now, but policy-makers have had to scramble to catch up.
Data brokers represent both opportunities and problems for machine learning as a field, but tightening regulations may make them less of a force going forward.
The sweeping GDPR legislation that came into effect on the 25th May last year represents a serious stumbling block to data brokerages.
While the likes of Facebook and Google have been and will continue to be the subject of scrutiny from the EU regulation, it is data brokers that have the most to lose.
GDPR stipulates that data use must be transparent, fair and accurate.
It should not be collected without the user’s knowledge, it should only be used for the purpose it was collected for and companies should endeavour to keep as little data as possible about their users.
It won’t take the most analytical mind to see that these principles are all at direct odds with the business model of a data broker.
A spokesperson from Experian said the company has “worked hard to ensure we are compliant with GDPR” and Acxiom has said that it takes the legislation “very seriously”.
Wired spoke with Ailidh Callander, legal officer at Privacy International, who believes that enforcement of GDPR legislation needs to be more stringent.
“The burden is on the companies to seriously look at what they do and look at the law and people’s rights and come up with a better solution,” she says.
“What they seem to have done is slightly amend their privacy policies.
“We consider these companies’ practices are failing to meet the standard — yet we’ve only been able to scratch the surface with regard to their data exploitation practices.
GDPR gives regulators teeth and now is the time to use them to hold these companies to account.”
The UK’s information commissioner, Elizabeth Denham, told the Financial Times that regulatory bodies are actively looking into the working practices of data brokers to find out if they are compliant with the laws.
She described a “dynamic tension” in the fundamental principles of GDPR and the way in which data brokers conduct their businesses.
If GDPR has succeeded in anything, it has been to highlight to the general public the importance of the data trail you leave online, and the fact that you can and should have a degree of ownership over it.
Tightening regulation, alongside public awareness of how personal data is being stored and used, is likely to strangle the stream of data that keeps brokers afloat. Cut off that supply and the whole industry all but dies.
A Quantity Over Quality Business
There is another problem: data brokers get people wrong.
Kalev Leetaru, writing in Forbes, explained how he began receiving AARP marketing emails which suggested that, as a 65-year-old, he could benefit from a membership.
As a man in his mid-30s, Kalev was at a loss as to why he had been targeted and decided to do some digging to see where AARP had gotten his name from.
The answer was DSA Direct, a small New Jersey data brokerage that had built a (wildly misinformed) profile on him and was clearly selling that information on to advertisers.
Kalev’s digging took him to Oracle Data Cloud, a company that ‘helps advertisers connect with the right customers.’
Except they don’t.
After an arduous process of getting Oracle to release the data it held on him, Kalev found that 78% of the categories he had been assigned to bore no resemblance to his life.
From having him down as someone who buys children’s lunchbox meals to someone who shops at Victoria’s Secret, Kalev had been assigned 85 categories which would be used for marketing purposes despite being entirely inapplicable to him.
Kalev was listed simultaneously as a successful single parent, a cosmopolitan professional and a golden grandparent over the age of 65.
With an equal interest in retirement services and young professional parenting, he must have presented a confusing figure to advertisers.
These aren’t obscure, inexperienced startups, either; these are serious companies with a wide reach.
If you multiply the wasted ad spend involved in offering Kalev retirement plans out across all the inaccuracies there must be in their data, the whole system looks incredibly flimsy and wasteful.
His is not the only story, either.
In 2017, Caitlin Renee Miller wrote in The Atlantic about her experience paying $50 to find out everything a data broker claimed it knew about her life.
By her estimate, 50% of the data points in the report were incorrect.
Old addresses, outdated job titles — a lot of the information you might deem useful for marketing purposes was woefully wide of the mark.
Machine Learning Needs Better
What, then, could this mean for machine learning algorithms using data provided by data brokers? Well, confused results.
Data brokers deal in ballparks as well as specifics.
Machine learning doesn’t have that luxury.
For smaller companies looking to build machine learning projects, perhaps with a budget so limited that only the bare minimum data can be bought for feeding the algorithm, it’s important that the data received is accurate.
Poor data quality is the number one reason why machine learning tools become redundant.
The long-standing adage ‘garbage-in, garbage-out’ applies even more pertinently to machine learning than other forms of data analytics.
This is because, with machine learning, the bad data affects not only the training process used to build the predictive model, but also the new data that model then uses to make future decisions.
Quality data is paramount to building effective, cost-efficient machine learning programs and it has become all too clear that data brokers don’t often possess it.
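To make the ‘garbage-in, garbage-out’ point concrete, here is a minimal sketch in Python. It is a toy, not a real pipeline: the ages, the labels and the helper names (fit_threshold, predict) are all invented for illustration. It "trains" the simplest possible model, a single age cut-off for interest in retirement services, once on clean records and once on records where a hypothetical broker has logged three ages wrongly. It then shows the second failure mode: a sound model fed a bad input at prediction time.

```python
# Toy illustration only: every age, label and function name here is invented.

def fit_threshold(ages, interested):
    """Learn the age cut-off that misclassifies the fewest training records.

    Ties are resolved in favour of the lowest cut-off.
    """
    best_cut, best_errors = None, len(ages) + 1
    for cut in sorted(set(ages)):
        errors = sum((age >= cut) != label for age, label in zip(ages, interested))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def predict(cut, age):
    """Predict interest in retirement services from a recorded age."""
    return age >= cut


# True means the person really is interested in retirement services.
interested = [False, False, False, False, True, True, True, True]
clean_ages = [25, 34, 41, 52, 65, 70, 75, 81]
# Same eight people, but three ages were recorded wrongly (52->68, 70->30, 75->35).
noisy_ages = [25, 34, 41, 68, 65, 30, 35, 81]

clean_cut = fit_threshold(clean_ages, interested)
dirty_cut = fit_threshold(noisy_ages, interested)

print("cut-off learned from clean data:", clean_cut)
print("cut-off learned from noisy data:", dirty_cut)

# Failure mode 1: bad training data. A genuine 35-year-old is now targeted.
print("noisy model targets a 35-year-old:", predict(dirty_cut, 35))

# Failure mode 2: bad input data. The clean model is fine, but the broker
# has recorded the same 35-year-old as 65, so the prediction is still wrong.
print("clean model, wrongly recorded age 65:", predict(clean_cut, 65))
```

With these made-up numbers the clean data yields a cut-off of 65 and the noisy data a cut-off of 30, so three bad records are enough to have the model pitching retirement plans to people in their thirties. Real models are more robust than a single threshold, but the mechanism is the same.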
There’s a manpower issue raised by the purchase of bad or inaccurate data, too.
Before training any predictive model, data scientists will be expected to ‘cleanse’ the data they are using.
According to Harvard Business Review, this process can take up to 80% of a data scientist’s time.
Smaller entities simply don’t have the manpower to be dealing with expensive reams of low-quality data.
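A small part of that cleansing work can be automated with simple consistency checks. The sketch below is one hedged example of the idea, assuming purchased profiles arrive with an age and a list of marketing segments. The Profile class, the segment names and the age ranges are hypothetical and do not come from any real broker’s schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical segments and plausible age ranges, invented for illustration.
SEGMENT_AGE_RANGES = {
    "golden_grandparent": (65, 120),
    "young_professional": (22, 40),
}


@dataclass
class Profile:
    name: str
    age: Optional[int]
    segments: List[str] = field(default_factory=list)


def audit(profile):
    """Return a list of human-readable problems found in one profile."""
    if profile.age is None:
        return ["missing age"]
    problems = []
    for segment in profile.segments:
        bounds = SEGMENT_AGE_RANGES.get(segment)
        if bounds is None:
            continue  # no rule defined for this segment, so nothing to check
        low, high = bounds
        if not low <= profile.age <= high:
            problems.append(f"segment '{segment}' conflicts with age {profile.age}")
    return problems


def split_clean_and_suspect(profiles):
    """Separate profiles that pass every check from those that need review."""
    clean, suspect = [], []
    for profile in profiles:
        (suspect if audit(profile) else clean).append(profile)
    return clean, suspect


profiles = [
    Profile("A", 35, ["young_professional", "golden_grandparent"]),
    Profile("B", 70, ["golden_grandparent"]),
    Profile("C", None, ["young_professional"]),
]

clean, suspect = split_clean_and_suspect(profiles)
print("usable:", [p.name for p in clean])
for p in suspect:
    print("needs review:", p.name, audit(p))
```

Checks like these only catch contradictions inside a record. They cannot tell you that an internally consistent profile is simply wrong, which is why buying low-quality data remains expensive in staff time.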
Ultimately, none of this may be relevant in the coming years; the sea of data brokerages may well be crushed by GDPR and a greater awareness of how each individual’s data is collected and processed online.
For now, though, it’s important that smaller organisations looking to build machine learning programs find cleaner ways of harvesting data for training and future analysis.
GDPR complicates matters in one sense, but in another it forces companies to think about not only how their data is sourced but its quality.
The business model of a data broker hinges on the notion that ‘more is more’ when it comes to data, but it is as much about quality as quantity. Don’t buy snake oil.