
Data Hygiene: Your Cleaning Checklist

By David Longstreet on May 7, 2019

David has over 20 years of experience building analytical research, statistical and econometric models. He leads FanThreeSixty’s core fan analytics capabilities, which include predictive sales and attendance models, enriched fan profiles and smarter audience segments.

You wouldn’t drink spoiled milk, so why would you use outdated data? While cleansing data may not sound exciting, it’s foundational on the journey to truly understand and predict the needs of each individual fan. In fact, it’s the first step to creating data-fueled activations that capture fans at the right moment.

Data-driven anything—marketing, sales or strategies—is only as good as the quality of the data. Using inaccurate and poorly structured data is like wearing a pair of glasses with the wrong prescription—your vision is fuzzy and everything is blurry. Similarly, bad data results in an incomplete picture of a fan and missed opportunities (see figure 1). In fact, dirty data costs the US economy $3.1 trillion per year (source: IBM).

Figure 1: Blurry vision

The new rules of fan engagement in the digital age are based upon using consumer data to personally engage with and motivate fans. And because data will shape all future automated activations, it is critical to use data of the highest quality.

For sports and entertainment brands, data can quickly get messy because of how fan data is currently collected. Most gather information about their fans from disparate sources such as ticketing, Wi-Fi, point of sale, websites, email, surveys, special events and promotions, group sales and a team app. Plus, the format is different across all these sources, making it difficult to clean and organize the data. As a result, fan databases often contain duplicate profiles, incomplete information or just plain wrong data: misspellings, typos or fake fan data.

It is pointless and expensive to collect data that is not actionable. According to the Harvard Business Review, cleaning and organizing data comprises 50 to 75 percent of analytics and data science efforts. This creates inefficiencies and can cause mistrust in data, which is problematic since fewer than half of all organizations trust the quality of the data being collected and stored (Harvard Business Review).

 

How We Got Here: Big Data

Big Data is all the rage. A Google search for the term returns almost 6 billion results. While it is a common buzzword in nearly every industry, the term itself is misleading because bigger is not always better. In fact, having lots of data and having useful data are two dramatically different things.

Access to data is not the issue; acting on it is.

In order for data to be actionable, it must be clean, fresh, accurate and connected across all data sources. When combined, this creates a single source of truth for data and results in a single fan view. Without a single source of truth, organizations will continue to get different results from disparate, incomplete data sets, setting off a domino effect across the organization: mistrust, inefficiencies and missed opportunities.

Bad Data Means Bad Marketing

Incomplete data also results in more waste: wasted time, money and effort. Bad data cripples marketing campaigns. Whether an email bounces, hits a spam trap or reaches the wrong inbox, outdated, incomplete or fake email addresses or names can lead to unhappy fans. They may not get the email, or it could be addressed to the wrong person, which can lead to higher churn rates and decreased loyalty.

Bad data can also result in targeting the wrong audience with the wrong message. If a database has several inaccuracies, teams may be creating target audiences based on false information, which will annoy fans and prospects. Without accurate data, sales teams can’t meet their numbers, marketers miss their targets and the business can’t make informed decisions to drive revenue.

This is why quality matters most in data, starting with hygiene. Yet only 30 percent of organizations report any formal activity to clean up data, and over 50 percent seldom or never verify data ingested from third parties (MIT Center for Information Systems Research). But if teams establish processes and adopt the right tools, they can prevent bad data from seeping into their organization and into their brand.

 

The Cleaning Checklist

Reference Data Sets

Every piece of consumed and saved data should follow a set of very specific rules, which should be documented and updated frequently. Using reference datasets and an automated process can quickly and accurately verify each piece of data accumulated from millions of fans, starting with the five below:

Name Verification

At FanThreeSixty, our data science team has amassed name verification reference sets for every piece of data consumed by sports teams. Our name verification reference set includes observations for nearly 25 million individuals and includes a list of 600,000 verified first names and over one million valid and verified last names. We also created a blacklist of names.
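To make the idea concrete, here is a minimal sketch of a reference-set lookup for names. The function name, the status labels and the tiny sample sets are hypothetical stand-ins; the real reference sets described above are far larger and are not reproduced here.

```python
def check_name(raw, verified_names, blocked_names):
    """Classify a name as 'verified', 'unverified' or 'rejected'.

    verified_names and blocked_names are hypothetical reference sets
    holding lowercase names.
    """
    # Collapse repeated whitespace and ignore case before the lookup.
    name = " ".join(raw.split()).lower()
    if not name or name in blocked_names:
        return "rejected"
    return "verified" if name in verified_names else "unverified"


# Hypothetical sample reference sets, for illustration only.
verified_first_names = {"maria", "david", "james"}
blocked_names = {"test", "asdf", "none"}

print(check_name("  Maria ", verified_first_names, blocked_names))
print(check_name("TEST", verified_first_names, blocked_names))
```

Names that are neither verified nor blocked are labeled "unverified" rather than dropped, so they can be reviewed instead of silently lost.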

Postal Codes

Postal codes follow a set of rules that vary by country. In the U.S., postal codes are called ZIP Codes and follow rules established by the US Postal Service. A standard ZIP Code is numeric and 5 digits long, making it one of the easiest data elements to check. A fan’s ZIP Code can be looked up in the reference dataset provided by the US Postal Service, so invalid ZIP Codes can be quickly removed or repaired.
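A ZIP Code check along these lines might look like the sketch below. It assumes the reference dataset has already been loaded into a set of 5-digit strings (the three sample values are placeholders). The two repair rules, trimming a ZIP+4 suffix and restoring leading zeros lost in a numeric column, are illustrative assumptions rather than a description of any specific pipeline.

```python
import re

FIVE_DIGITS = re.compile(r"[0-9]{5}")
SHORT_DIGITS = re.compile(r"[0-9]{3,4}")


def clean_zip(raw, valid_zips):
    """Return a repaired 5-digit ZIP Code, or None if it cannot be validated."""
    if raw is None:
        return None
    candidate = str(raw).strip()
    # Repair 1: keep only the first part of ZIP+4 input such as 64108-1234.
    candidate = candidate.split("-")[0]
    # Repair 2: restore leading zeros dropped when the value was stored as a number.
    if SHORT_DIGITS.fullmatch(candidate):
        candidate = candidate.zfill(5)
    if not FIVE_DIGITS.fullmatch(candidate):
        return None
    # The final word belongs to the reference dataset.
    return candidate if candidate in valid_zips else None


# Placeholder reference set; in practice this would come from the postal dataset.
reference_zips = {"64108", "02134", "10001"}

print(clean_zip(2134, reference_zips))
print(clean_zip("6410A", reference_zips))
```

Because every repaired value is still checked against the reference set, an over-eager repair cannot let an invalid code through.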

Emails

Email addresses follow a specific format and allow only a limited set of special characters. The domain after the @ can be verified using a “Whois” lookup, from which a whitelist of verified domains is built. A blacklist of invalid domains can also be created to flag and store bad emails and to inform future models, so only verified emails come into a team’s database. The blacklist would include known providers of fake “10 minute emails,” temporary addresses with a 10-minute lifespan.
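As a rough sketch, the format check and the two domain lists could be combined as follows. The regular expression is deliberately simplified and stricter than what the email standards permit, the domain lists are made-up examples, and the Whois lookup that would populate the whitelist is not shown.

```python
import re

# Simplified pattern (an assumption): local part, "@", dotted domain, alphabetic TLD.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def classify_email(raw, allowed_domains, blocked_domains):
    """Return 'invalid_format', 'blocked', 'verified' or 'needs_review'."""
    email = raw.strip().lower()
    if not EMAIL_PATTERN.fullmatch(email):
        return "invalid_format"
    domain = email.rsplit("@", 1)[1]
    if domain in blocked_domains:
        return "blocked"
    if domain in allowed_domains:
        return "verified"
    # Unknown domains are held for a later lookup instead of being trusted.
    return "needs_review"


# Hypothetical lists for illustration.
allowed = {"example.com"}
blocked = {"tempmail.example"}

print(classify_email(" Fan@Example.com ", allowed, blocked))
print(classify_email("fan@tempmail.example", allowed, blocked))
```

Returning a status instead of deleting the record keeps the bad addresses available for flagging and storage, as described above.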

Phone Number

Phone numbers follow a set of rules that also vary by country. In the U.S., telephone numbers follow the 10-digit North American Numbering Plan (NANP). The plan covers 24 countries, including the United States, Canada, Bermuda and 17 nations in the Caribbean.

The rules for phone numbers in other parts of the world can be messy. For example, in the United Kingdom, telephone numbers are of variable length and in Italy landline numbers start with 0 and mobile phone numbers start with 3. Nonetheless, the rules for phones can be learned through data science algorithms and machine learning. Once rules are established, any phone number that does not follow the prescribed rules for a specific region is removed from the database.
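For NANP numbers, the basic rule can be written by hand rather than learned. The sketch below checks only the general shape of a NANP number: 10 digits, with the area code and the exchange each starting with a digit from 2 to 9. It does not confirm that an area code is actually assigned, and the function name is a hypothetical choice.

```python
import re

# Area code and exchange must each start with 2-9.
NANP_PATTERN = re.compile(r"[2-9][0-9]{2}[2-9][0-9]{6}")


def normalize_nanp(raw):
    """Return the 10-digit NANP number as a string, or None if it breaks the rules."""
    digits = re.sub(r"[^0-9]", "", raw)
    # Drop the leading country code 1 if it is present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if NANP_PATTERN.fullmatch(digits) else None


print(normalize_nanp("+1 (816) 555-0123"))
print(normalize_nanp("123-456-7890"))
```

Numbers from other regions would need their own rule sets, selected by country before validation.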

Date of Birth

Believe it or not, it is not uncommon to see people with birth dates in the 1800s. This is pointless information to have. Most of the world follows a Day Month Year date format, versus the Month Day Year convention we’re used to in the U.S. Regardless, established rules can ensure that date of birth inputs follow the required format.

This, coupled with a method to ensure age is valid, helps ensure teams are using accurate age information. For instance, if a fan enters his or her date of birth incorrectly, or doesn’t enter the year at all, we can use data science to estimate the fan’s age. Let’s say this fan with no birth year buys beer during the game; we can then match POS data with the fan’s profile, revealing that this fan is at least 21 years old.
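The two ideas above, rejecting implausible birth dates and falling back on purchase behavior, can be sketched as follows. The 120-year ceiling, the function names and the `bought_alcohol` flag are assumptions made for illustration; matching POS data to a fan profile is assumed to have happened already.

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 120  # assumed cutoff; filters out birth dates in the 1800s


def validated_age(birth_date, today):
    """Return the age in whole years, or None if the birth date is missing or implausible."""
    if birth_date is None or birth_date > today:
        return None
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    return age if age <= MAX_PLAUSIBLE_AGE else None


def minimum_age(birth_date, bought_alcohol, today):
    """Return a known age, an inferred lower bound of 21, or None."""
    age = validated_age(birth_date, today)
    if age is not None:
        return age
    # No usable birth date: an alcohol purchase implies the fan is at least 21.
    return 21 if bought_alcohol else None


today = date(2019, 5, 7)
print(validated_age(date(1880, 1, 1), today))
print(minimum_age(None, True, today))
```

Note that the fallback yields a lower bound, not an exact age, so it is best stored separately from a verified date of birth.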

By taking the steps to validate age, we’re helping sales and marketing teams better target the right fans, which also has a strong impact on sponsorship. For example, a beer brand could get in serious trouble for targeting minors. But if we help validate (or invalidate) age, we’re removing the risk of that brand potentially advertising to underage fans.

 

Going Beyond Basic Datasets

There are a variety of other issues that can impact the cleanliness of data. By applying certain rules and processes for some of the most common mishaps in data aggregation and integration (identified below), the process of cleansing data can become a simple, quick task completed by a machine, as seen in figure 2.

Figure 2: FanThreeSixty's method to clean data

First/Last Transposition

It is not uncommon for first and last names to be transposed when importing data. If more than 10 percent of the last names appear to be first names or vice versa, then there is a good chance that first and last names were transposed. Rules can be put in place that automatically flip first and last names, eliminating the time needed to fix this error manually.
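One way to sketch the 10 percent rule is shown below. It assumes name reference sets like the ones described earlier, and it treats the rule as a batch-level decision: if enough records look swapped, the whole import is flipped. That interpretation, the helper names and the sample data are assumptions.

```python
def is_swapped(first, last, first_names, last_names):
    """True if the pair only makes sense with first and last names exchanged."""
    plausible_as_is = first in first_names and last in last_names
    plausible_swapped = last in first_names and first in last_names
    return plausible_swapped and not plausible_as_is


def fix_transposed_batch(records, first_names, last_names, threshold=0.10):
    """Flip every (first, last) pair if more than `threshold` of the batch looks swapped."""
    if not records:
        return records
    swapped = sum(
        is_swapped(first.lower(), last.lower(), first_names, last_names)
        for first, last in records
    )
    if swapped / len(records) > threshold:
        return [(last, first) for first, last in records]
    return records


# Hypothetical reference sets and import batch.
first_names = {"david", "maria", "james"}
last_names = {"longstreet", "garcia", "james"}
batch = [("Longstreet", "David"), ("Garcia", "Maria"), ("James", "James")]

print(fix_transposed_batch(batch, first_names, last_names))
```

Names that are valid in either position, such as "James", are not counted as evidence of a swap.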

Duplicate Fan Data

A very common problem is having multiple profiles and multiple emails for the same fan, so a single fan receives the same communication multiple times, unbeknownst to the team. Data cleansing eradicates duplicate data and consolidates all data into a single fan profile. This simplifies the process of enriching a fan profile and streamlines communication to fans. It also improves the accuracy of calculations and campaign measurement, such as clicks and open rates on email campaigns.
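The simplest form of consolidation, merging profiles that share the same email address after normalization, can be sketched as below. The field names are hypothetical. Linking one fan across several different email addresses needs additional matching signals and is not covered by this sketch.

```python
def merge_profiles(profiles):
    """Merge profile dicts that share an email, keeping the first non-empty value per field."""
    merged = {}
    for profile in profiles:
        # Normalize the key so "Fan@Example.com" and "fan@example.com " match.
        key = profile["email"].strip().lower()
        target = merged.setdefault(key, {"email": key})
        for field, value in profile.items():
            if field != "email" and value and not target.get(field):
                target[field] = value
    return list(merged.values())


# Hypothetical duplicate profiles from two different sources.
profiles = [
    {"email": "Fan@Example.com", "first_name": "Maria", "zip": ""},
    {"email": "fan@example.com ", "first_name": "", "zip": "64108"},
]

print(merge_profiles(profiles))
```

Each source fills in what the other is missing, so the merged profile is more complete than either original.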

Fraud/Invalid

Fraudulent or fake data can come from Wi-Fi data, surveys, purchased marketing lists and at times, ticketing data. Using machine learning algorithms, fraudulent and fake data can be identified, flagged and partitioned into its own database. The machine also continuously learns how to identify bad data so only the purest data filters into a team’s database.

 

Quality Over Quantity: What’s Next

While cleaning data may not be fun, it’s foundational in creating hyper-personalized experiences for fans. From increased productivity to reduced waste and improved ROI, clean data yields big returns. It’s the first step toward fan clarity and to making data actionable, which is the true MVP of sports today.

Now that we’ve gotten rid of that dirty data, it’s time to connect it in order to better understand fans. Stay tuned for the next blog in this series: Connected Fans. Connected Data. Better Results, or sign up below to receive an alert when that post is live. 
