The Not-So-Hidden Problem With Big Data Sets

Molly Poppie (Image credit: Nielsen)

There’s been a lot of energy and excitement in media circles of late about the future of measurement and the promise of big data. At Nielsen, we’ve long understood the value of big data. In fact, just last month we announced additional details about how we are adding it to our national TV measurement service.

We also know that no panel is perfect, as the past few months have demonstrated. 

But when our teams of data scientists hear some of the big, broad claims about big data coming to save the day and fix all the perceived challenges in the industry, it’s hard not to be skeptical.

That’s because, for all their value and amazing potential, the big data sets that the industry currently has access to have very real limitations.

A Relevant Recent Example

After losing access to Nielsen’s Portable People Meters, Comscore reported that it will now use data sets from Experian’s ConsumerView to help it identify individual viewers for measurement purposes. The announcement was framed in the trade press as an advancement. After all, if big data is the future, any shift in that direction must be a good thing.

Unfortunately for its customers, and for consumers, that’s not the case.

There are a handful of third-party identity vendors out there who provide the ability to match data sets based on personally identifiable information and provide demographic characteristics, both directly collected and modeled. 

At Nielsen, we regularly check this data. We use directly measured information from our robust panels to validate how accurately these data sets 1) match to a household and 2) report demographics and characteristics.
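To make the two checks concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Nielsen's actual validation process: the household IDs, the age brackets and the function name `validate_against_panel` are all hypothetical.

```python
# Hypothetical records: household id -> age bracket of the head of household.
# All identifiers and values are invented for illustration only.
panel = {"hh1": "18-34", "hh2": "35-54", "hh3": "55+", "hh4": "18-34"}
vendor = {"hh1": "18-34", "hh2": "55+", "hh3": "55+"}


def validate_against_panel(panel, vendor):
    """Return (match_rate, demographic_accuracy) for a vendor data set."""
    matched = [hh for hh in panel if hh in vendor]
    # 1) Share of panel households the vendor data can be matched to.
    match_rate = len(matched) / len(panel) if panel else 0.0
    # 2) Among matched households, share whose demographic agrees with the panel.
    agree = sum(1 for hh in matched if vendor[hh] == panel[hh])
    accuracy = agree / len(matched) if matched else 0.0
    return match_rate, accuracy


match_rate, accuracy = validate_against_panel(panel, vendor)
print(f"match rate: {match_rate:.2f}, demographic accuracy: {accuracy:.2f}")
```

In this toy data, one of the four panel households has no match at all, and one of the three matched households carries the wrong age bracket. Those are the two kinds of error the checks are meant to surface.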

What we typically find should give advertisers pause. 

The majority of data sets out there today are built around billing information or online behavior collection, not demographic profiles. They don’t have the rich details about exactly who the people on their lists are, from age to income to race and ethnicity, the way you do with a robust panel. These data sets, because they’re created by machine-to-machine transfers, also increase the possibility of waste and fraud.

Because of that, the level of certainty they can provide around who actually lives in a given household is limited. And they have no ability to say who within a given home is watching a given program at a specific time. 

Even when you triangulate that data with other sources, you’re almost guaranteed to have massive gaps and errors in your estimates. This may be acceptable if the use case is targeting, but this data on its own does not provide the accuracy, objectivity and transparency required to deliver measurement. 

Why It Matters

So what does that mean, practically? Well, it has a few implications. 

In the case of Comscore, the shift is away from our Portable People Meters, which actually affix microphones to about 100,000 real-life, verified people and track exactly what they’re watching, and toward a model that uses billing data to provide guesstimates of who within a dwelling might be watching a given program at a given time. That will result in a less accurate read on who is watching what.

But the possibly bigger implication is that this shift is going to get the industry further away from capturing a true representation of the country. 

We know that many of these types of data sets do a better job of providing data around households when the people living there own their own home and have been there for a long time. And that stands to reason. The problem is that long-time homeowners tend to be more White, more affluent and significantly older than the nation as a whole. By design, these data sets undercount Black and Brown people, lower-income people and younger people at a time when all of those segments are growing, not shrinking.

The same is true of data sets built on set-top box data, which tends to overcount more affluent consumers who are willing to pay more for cable packages and thus disproportionately excludes lower-income consumers who are important targets for many marketers.
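One simple way to quantify this kind of skew is to divide each group's share of a data set by its share of a population benchmark. The sketch below assumes made-up shares and hypothetical group labels. It is not drawn from any real data set or from Nielsen's methodology.

```python
# Hypothetical population shares, invented for illustration only.
benchmark_share = {"long_time_homeowner": 0.30, "recent_mover": 0.25, "renter": 0.45}
dataset_share = {"long_time_homeowner": 0.45, "recent_mover": 0.20, "renter": 0.35}


def representation_index(dataset, benchmark):
    """Ratio of data-set share to benchmark share; 1.0 means proportional coverage."""
    return {group: dataset[group] / benchmark[group] for group in benchmark}


index = representation_index(dataset_share, benchmark_share)
for group, value in index.items():
    # Values above 1 indicate overcounting; values below 1 indicate undercounting.
    print(f"{group}: {value:.2f}")
```

An index well above 1 for long-time homeowners alongside values below 1 for renters and recent movers would be the pattern described above: the data set looks large, but it does not look like the country.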

The media industry has, rightly, made accurately representing Black and Brown communities a central priority. At Nielsen, our track record on this going back decades hasn’t been perfect, but today we have the most accurate and advanced view of the nation as it truly is. 

Big data-derived measurement tools that aren’t backed by a representative, validated and audited panel can’t make that claim. Nielsen panels can target many demographics within the U.S. Census with 1% variability, but the big data-focused options out there aren’t even close to that. The industry needs to be open and honest with itself about the challenges that big data presents when it comes to representation.

A Wider Problem

To be clear, this is not just a Comscore issue. This is an issue with all the big data sets out there currently. 

In August of 2020 the Association of National Advertisers, in partnership with the Media Rating Council and Sequent Partners, used Nielsen data as a benchmark in a study designed to understand the degree to which multicultural audiences were being accurately represented in media targeting. The study looked at an aggregated collection of high-quality marketing and media data and sought to understand how accurately it was targeting Black, Brown and Asian audiences. The findings were troubling, but not at all surprising to us.

The study found that the big data sets the industry relies on weren’t up to the task of accurately targeting these critical communities. In part because the data sets weren’t designed to capture rich data about who these consumers truly are, the way robust panels are, there was rampant misrepresentation and underrepresentation in the data. 

Now contrast that with Nielsen's robust panels, which provide a wealth of directly collected information from real-life people, representative of the entire U.S. population. Who lives in the home? How old are they? What race and ethnicity do they identify as? Who is watching the television at a given point in time? Nielsen's panel answers these questions. 

Again, panels on their own aren’t perfect, but there’s a reason other industries, namely pharmaceuticals, use approaches that are similar to panels in approving drugs. That’s because, when the stakes are high, there’s no substitute for real, verified people.  

We know that many industry players are excited about the promise of big data, and we are too. But as an industry we need to be honest about what big data can and can’t solve for. And we too understand that the future of media measurement is an approach that combines the reach of big data with the verified personal data of robust panels.

Molly Poppie

Molly Poppie is senior VP of data science at Nielsen.