Big Data’s Dirty Little Secret: Why Cleaning Up Set-Top Box Data Is Not Optional

With billions of advertising dollars at stake, it’s hard to overstate the importance of reliable and accurate TV audience measurement. When the chief brand officer for the biggest advertising spender in the world starts using words and phrases like “crappy media supply chain,” “crazy,” “unbelievable,” “complicated,” “clean-up” and “transparency” to describe digital advertising measurement and standards, we need to think carefully about how we plan to use big data to measure traditional TV advertising.

The race is on to understand identity across both TV and digital, and when it comes to using big data to understand audiences, there is no such thing as perfect information. Whether you are pulling data from a data management platform (DMP), mobile devices or set-top boxes, all big data used for audience measurement needs to be scrutinized and calibrated against a verified truth set purpose-built for audience measurement. We know poor-quality data yields poor-quality results, and insights are only as good as the data on which they are based. The biggest misconception today is that set-top box data represents the universe of actual TV viewing behavior. In reality, it falls far short, and we must first ask ourselves whether this data even represents the behavior of real people.

Scale vs. Representation

Set-top box data is not census measurement. On average, a given measurement provider receives data from only 40%-60% of a set-top box provider’s footprint in a market. Our own research has shown that homes that return data and homes that do not also view television differently. For one station, homes that cannot return data watched its programming four times more frequently than homes that can. The fact that millions of homes contribute to your ratings won’t matter if these differences are not accurately accounted for.
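
To see why scale alone does not fix this, consider a rough back-of-the-envelope sketch. The figures below are hypothetical, chosen only to mirror the kind of gap described above (half the footprint returning data, and non-returning homes viewing a station four times as often). The function and variable names are illustrative and do not describe any measurement provider's actual method.

```python
def blended_rating(return_share, rating_returning, rating_non_returning):
    """Rating across the whole footprint, weighting each group by its share of homes."""
    return return_share * rating_returning + (1 - return_share) * rating_non_returning


# Hypothetical inputs, for illustration only.
return_share = 0.5            # share of homes that return data (assumed midpoint of 40%-60%)
rating_returning = 1.0        # station rating (%) among homes that return data
rating_non_returning = 4.0    # assumed four times higher among homes that do not

naive = rating_returning      # estimate based on data-returning homes alone
actual = blended_rating(return_share, rating_returning, rating_non_returning)
shortfall = 1 - naive / actual

print(f"naive rating: {naive:.2f}")
print(f"footprint-wide rating: {actual:.2f}")
print(f"share of audience missed: {shortfall:.0%}")
```

With these made-up inputs the footprint-wide rating is 2.5, so the data-returning homes alone understate the station's audience by 60%, however many millions of homes sit in the sample.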

Beyond failing to reflect its own complete viewer base, set-top box data also cannot accurately represent viewing that occurs through other cable, satellite or telco providers, or in over-the-air (OTA) homes. Over-the-air homes can account for 10%-65% of a station’s audience for news and sports alone, depending on the network. Simply having a home in every zip code does not make the ratings representative of this growing, important viewing segment.

To provide an accurate competitive read of the market, all stations need to be represented. Two set-top box providers may represent no more than 10%-20% of the market and may not carry all stations in the market due to carriage agreements. In one TV market, the fifth-rated station is not carried by either of two major national set-top box providers. If measurement is based on set-top box data alone, how will this affect that station’s market rank and share? Incorporating direct in-market tuning measurement and panel-based calibrations is necessary to ensure measurement reflects all stations and cable networks, not just today but on an ongoing basis.

Ensuring Accuracy

To overcome the bias, coverage gaps and inaccuracies of big data, set-top box data must be cleaned up. This means using reliable, representative panel-based measurements as a benchmark against which data from set-top box providers is compared and, where necessary, adjusted.

The first step in the cleanup effort is examining household characteristics. We know third parties can be very useful in assigning household characteristics to big data, but their data can be subject to inaccuracies. For example, a third-party demographic provider recently reported no household characteristics at all for nearly 30% of households in its return-path data footprint. Even when household characteristics are provided, in many cases they are not accurate. In another example, households with persons 18-34 were correctly identified only 60% of the time, and the designation of a two-person household was correct only 30% of the time.

The second step is comparing the tuning records in set-top box data with panel measurement. This helps determine whether the event times, including time-shifted viewing, and the station assignments captured by set-top box data are correct. Nielsen has found that across all markets more than 25% of time-shifted viewing in the raw set-top box data was assigned to the incorrect station or network. In addition, in one example, 100% of time-shifted viewing was credited to the wrong programs. Left unfixed, both issues can cause major changes to a station’s ratings and rankings. Continuing to accurately reflect viewing audiences, whether Live, Live Same Day or Live +X, is critical to the buying of advertising.
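
As a minimal sketch of what such a comparison might look like, the snippet below lines up set-top box tuning events with panel events for the same household and time slot, then reports how often time-shifted viewing was credited to a different station. The record layout, field names, station labels and sample rows are all hypothetical; real tuning data and matching rules are considerably more involved.

```python
from typing import NamedTuple


class TuningEvent(NamedTuple):
    household_id: str
    start_minute: int      # minutes since midnight (hypothetical granularity)
    station: str
    time_shifted: bool


def station_mismatch_rate(stb_events, panel_events):
    """Share of time-shifted STB events whose station disagrees with the panel.

    Events are matched on (household_id, start_minute). STB events with no
    panel counterpart are skipped rather than counted as errors.
    """
    panel_lookup = {(e.household_id, e.start_minute): e for e in panel_events}
    compared = mismatched = 0
    for event in stb_events:
        truth = panel_lookup.get((event.household_id, event.start_minute))
        if truth is None or not event.time_shifted:
            continue
        compared += 1
        if event.station != truth.station:
            mismatched += 1
    return mismatched / compared if compared else 0.0


# Made-up sample rows, for illustration only.
panel = [
    TuningEvent("hh1", 1200, "WAAA", True),
    TuningEvent("hh1", 1260, "WBBB", True),
    TuningEvent("hh2", 1200, "WAAA", True),
    TuningEvent("hh2", 1230, "WCCC", False),
]
stb = [
    TuningEvent("hh1", 1200, "WAAA", True),
    TuningEvent("hh1", 1260, "WCCC", True),   # credited to the wrong station
    TuningEvent("hh2", 1200, "WAAA", True),
    TuningEvent("hh2", 1230, "WCCC", False),  # live viewing, not compared
    TuningEvent("hh3", 1200, "WBBB", True),   # no panel match, skipped
]

rate = station_mismatch_rate(stb, panel)
print(f"time-shifted station mismatch rate: {rate:.0%}")
```

In this toy sample, one of the three comparable time-shifted events is credited to the wrong station. Skipping unmatched events is a deliberate simplification: it measures assignment errors only where the panel offers a reference point, and says nothing about homes the panel does not cover.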

Once household characteristics and tuning behaviors are compared, data deficiencies can be corrected and validated to reflect real audience estimates. This is not a one-time effort. It must be established as an ongoing process so that the veracity of the ratings is never questioned.

Don't Whistle Past the Issues

No doubt, big data offers a wealth of information, but it’s only valuable to advertisers if it accurately reflects the actual viewing behaviors of the TV audience, both in each local market and nationally. To ensure set-top box data reflects the behavior of real people, it needs to be validated by high-quality panels built expressly for audience measurement, and the viewing that set-top box data fails to capture needs to be accurately accounted for.

As big data for TV measurement reaches sufficient quality and scale for deeper analyses, we can’t turn away from these serious issues. For advertisers to feel completely confident in TV audience ratings or audience-based buys that use set-top box data, the household and tuning data coming from these devices must be validated against the behavior of real people. Fixing errors and correcting for biases is imperative to delivering reliable and truly accurate measurement.

In a $70 billion advertising ecosystem, accuracy counts.