If you’re a data scientist or analyst, you’re very lucky. You’ve landed a career that’s been touted as the “sexiest job of the 21st century.” Behind the scenes, though, working with data probably doesn’t feel all that sexy. In fact, you might be spending as much as 80 percent of your time just doing data preparation: cleaning and organizing hundreds of thousands of rows of data. You were hired to analyze this information and help the company get powerful insights; instead, you’ve been reduced to a glorified data washing machine.
Adding insult to injury, you’re probably hearing experts call this drudgery “data janitorial work.” Whoops. Hey, whatever happened to the sexiest job thing? Oh wait. Turns out that was just clickbait.
The good news is that this post is not just more clickbait. I work for a data-quality solutions company and have had the privilege of speaking with several data analysts in Fortune 500 organizations. I want to share some facts and debunk some myths to help leaders, recruiters, and data analysts better understand the data analyst role and how analysts can help an organization achieve its data goals.
Let’s dig in.
Why Analysts Spend So Much Time Cleaning Data
Data analysis should follow the 80/20 rule – 80 percent of your time should be spent on analysis and 20 percent on cleaning. Somehow, that ratio has flipped, and the root causes for that shift are procedural. Data analysts can’t spend 80 percent of their time analyzing if they’re expected to solve organizational process gaps that result in bad, dirty, and duplicated data. Here are three of the most common problems analysts encounter.
Human Data Entry Errors
Any time a human types in data, expect it to be flawed. The first problem is that employees often aren’t trained on proper data entry standards, so people enter data according to their best judgment. The other major problem is accidental mistakes such as fat-finger errors, which often go unnoticed. Only at a later stage when your marketing team sends out an email to Mr. Jonasthan instead of Jonhnathan (oops … I legit have fat fingers) — JONATHAN — does all hell break loose. The next thing you know, your customer is ranting on LinkedIn about how careless you are, sending out emails with typos!
Multiple Disconnected Apps and Systems
Your lead generation team uses a lead gen tool and LinkedIn to source leads. Your marketing team uses HubSpot CRM. Your sales team uses Salesforce. Your customer support team uses Jira. None of them are connected to a centralized database.
Your executives are demanding organizational insights, but you can’t seem to track anything because all these platforms and systems are working in silos. The lead isn’t connected to the CRM, the CRM isn’t connected to the sales process, and customer support is somehow in a completely different zone. You practically have to request each department physically hand you data from these different sources to get the insights you need.
This problem gets compounded at the enterprise level. Today, an average business runs 464 different pieces of software, all of which pull data from multiple systems. Teams of data analysts must constantly sort this data and decide what to keep and what to purge. This work is an ongoing effort and one that makes the job rather taxing. More than half of the analysts in one survey say that the data preparation process is the worst part of their job. That’s because it’s difficult and time-consuming but doesn’t reward the effort, as outcomes usually leave much to be desired.
Analysts are often pressed for time to merge data from multiple sources, and this involves iterative cleaning processes, each catering to just one problem at a time. For instance, to match [First Names] fields between two sources, the analyst will have to run scripts to check for completeness (are all fields filled?), then to check for spelling mistakes or typos, or to verify if the right titles have been assigned (Mr./Miss/Mrs./Dr. etc). They will have to repeat the same process for [Last Names], [Phone Numbers], [Address], and so on. This process is grinding, tedious work that, despite its difficulty, doesn’t even guarantee accurate outcomes.
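To make the grind concrete, here is a minimal sketch of what one of those single-purpose scripts might look like for the first-name field. It assumes records arrive as plain dictionaries; the field names, the VALID_TITLES set, and the check_first_name function are hypothetical and only illustrate the idea, using Python's standard difflib module to flag likely typos.

```python
import difflib

# Hypothetical set of accepted titles; a real rule set would come from the organization.
VALID_TITLES = {"Mr.", "Miss", "Mrs.", "Ms.", "Dr."}


def check_first_name(record, reference_names):
    """Return a list of issues found in one record's title and first-name fields."""
    issues = []
    name = (record.get("first_name") or "").strip()
    title = (record.get("title") or "").strip()

    if not name:
        # Completeness check: the field is empty or missing.
        issues.append("missing first name")
    elif name not in reference_names:
        # Typo check: the name is unknown but closely resembles a known one.
        close = difflib.get_close_matches(name, reference_names, n=1, cutoff=0.8)
        if close:
            issues.append(f"possible typo: {name} -> {close[0]}")

    # Title check: a title was entered but is not on the accepted list.
    if title and title not in VALID_TITLES:
        issues.append(f"unrecognized title: {title}")
    return issues


if __name__ == "__main__":
    known = ["Jonathan", "Maria"]
    print(check_first_name({"title": "Mr.", "first_name": "Jonasthan"}, known))
```

Even this small script covers only one field. Last names, phone numbers, and addresses would each need their own rules, which is exactly why the process eats so much time.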
Data Lakes Turned Into Dumping Grounds
Turns out that people love to hoard data! Businesses even create data lakes to store all the data that they can’t manage in real time with the hope of getting back to it when they need insights in the future. What businesses may miss when they’re filling these lakes is that data gets obsolete or decays at a rapid pace. For instance, one study reports that CRM data can decay as much as 30 percent per year. For businesses investing in big data, these lakes become repositories that are never touched because the companies lack the tools to extract information from this data in real time. Data analysts working with data lakes have to do a lot of cleaning to extract data that might turn out to be irrelevant or not fresh enough to provide reliable insights.
An analyst can’t be expected to analyze data if these process gaps and issues are not nipped in the bud. Most data analysts are buried so deep in fixing these problems that they hardly have any time left to study the data itself. This is the kind of inefficiency that disrupts operational processes, causing conflicts between departments when data isn’t available to fulfill objectives and ultimately delaying crucial business goals.
Empower Your Analysts With the Right Tools
Automated data cleaning tools are gradually replacing the need for manual code and scripts. These tools are designed to do the actual cleaning, allowing the analyst to spend more time reviewing and assessing the kind of data they want to keep, clean, merge or purge. In the recent past, an analyst’s programming and scripting skills were as important as their critical thinking skills. As automation picks up speed with powerful data preparation and matching tools, an analyst’s real skill is not how well they can write cleaning code, but how well they understand the data and how soon they can deliver on data-driven objectives.
Experienced data analysts know that change is on its way. They also know that poor data quality requiring manual cleaning and preparation is an indication of serious process flaws. They know that cleaning data is part of the job, but it should not take up 80 percent of their time.
According to data architects, cleaning up a data set with 1,000 rows takes seven weeks. Here’s a basic breakdown of where that time goes:
Week One – Gather data from multiple departments.
Week Two – Analyze it to figure out some of the basic issues, which could take up to two weeks depending on the dirtiness of the data.
Week Three – If existing scripts or cleansing rules aren’t available, the team needs to code new rules using Python for cleaning data. For instance, existing scripts can change all abbreviated city names into complete versions or ensure salutations are correct. But uncommon errors like the use of nicknames will require the analyst to create algorithms that detect nicknames and suggest alternatives. If the individual has varying names across multiple records, the analyst will have to use fuzzy matching algorithms to match the records before replacing the name.
Weeks Four to Six – If multiple lists need to be matched to remove duplicates and create a single record, then this process will be both lengthy and tedious as analysts have to try multiple algorithms to get the matching done right.
Week Seven – Reviewing changes. Repeat if data still has errors.
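For a sense of what the hand-coded rules in weeks three through six involve, here is a rough sketch under stated assumptions. The lookup tables, field names, and the 0.85 threshold are all hypothetical, and difflib's SequenceMatcher is just one simple stand-in for the fuzzy matching algorithms an analyst might try.

```python
import difflib

# Hypothetical lookup tables; real ones would be far larger and maintained separately.
CITY_ABBREVIATIONS = {"NYC": "New York City", "LA": "Los Angeles", "SF": "San Francisco"}
NICKNAMES = {"bill": "William", "bob": "Robert", "liz": "Elizabeth"}


def normalize(record):
    """Expand known city abbreviations and replace known nicknames."""
    city = record["city"].strip()
    first = record["first_name"].strip()
    return {
        "first_name": NICKNAMES.get(first.lower(), first),
        "last_name": record["last_name"].strip(),
        "city": CITY_ABBREVIATIONS.get(city.upper(), city),
    }


def similarity(a, b):
    """Similarity between two records' full names, from 0.0 to 1.0."""
    name_a = f"{a['first_name']} {a['last_name']}".lower()
    name_b = f"{b['first_name']} {b['last_name']}".lower()
    return difflib.SequenceMatcher(None, name_a, name_b).ratio()


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of near-identical names in the same city."""
    kept = []
    for rec in map(normalize, records):
        is_duplicate = any(
            rec["city"] == k["city"] and similarity(rec, k) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(rec)
    return kept


if __name__ == "__main__":
    rows = [
        {"first_name": "Bill", "last_name": "Smith", "city": "NYC"},
        {"first_name": "William", "last_name": "Smyth", "city": "New York City"},
        {"first_name": "Liz", "last_name": "Jones", "city": "LA"},
    ]
    for row in deduplicate(rows):
        print(row)
```

Note that this compares every record against every record already kept, so it slows down quickly as the list grows, and the threshold has to be tuned by trial and error. Both are part of why the matching weeks drag on.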
This calculation assumes the analyst works eight hours a day dedicated to this task. In the day-to-day reality of the business world, however, very few analysts will be able to devote entire days at a time to it. That means the project is likely to drag on for months. By the time it’s done, there’s more data to add. Then it’s time to repeat the entire process.
Solving this problem requires the use of proper tools: zero-code solutions designed to work with modern data structures and the demand for instant insights. With a good data preparation tool, an analyst could gather, consolidate, review, clean, and finalize data in just one week. Unreal, right?
Here’s an actual case study from my company. One of our clients, a consulting group, took just one week to fix their data. The client had large data sets from a database that dated back to 2005 and was preparing to move more than 100,000 records from a legacy system to a new one. Before they could attempt the migration, however, they had to make sure duplicates were removed and dirty data was treated. The analysts at the consulting group used Ruby and SQL tools to clean up the data, but their results were inaccurate. Worse, the process took them months! The team was nearing the migration deadline but wasn’t satisfied with the outcome. That’s when they decided to look for other feasible options. When they used my company’s data match solution, they were able to clean the data, remove duplicates and create a consolidated record a few days shy of the migration deadline. It took them around three days to normalize and match the data and present the most accurate record for migration. This data preparation case study and many others like it show that, with the right tools and solutions in place, analysts can save time and deliver on projects without being overwhelmed.
In an age of automation, relying on old, outdated methods to perform basic tasks is inefficient. Why should you be manually fixing your data when you have tools that can do it more efficiently? Automation is necessary for a data analyst to succeed in their role.
So What Should an Analyst Do?
Information about data analyst roles is often confusing and misleading because of recruiters or talent hunters who aren’t sure of what or who they need. Because C-level executives have knee-jerk reactions to dirty data, an analyst is brought in to “fix it.” A messy CRM? A database full of obsolete, incoherent data? Lots of data but no one to make sense of it? Let’s hire a data analyst! Although an analyst can help with these problems, you shouldn’t hire one for these jobs, and neither should they be expected to do the dirty work.
Instead of simple cleaning and prep, analysts are best deployed in these four key areas.
Creating Data Quality Rules and Frameworks
An analyst must have the power to create new processes, implement policies and bring on board new tools that can enable the organization to manage its data and ensure its quality. For instance, say the analyst figures out that 80 percent of errors happen at the point of data entry. They might solve this problem by training employees, setting up data entry protocols, creating governance rules and installing other procedures to minimize errors.
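As an illustration of what a data entry protocol can look like once it is written down as rules, here is a minimal sketch. The fields, the patterns, and the validate_entry function are hypothetical, and the name rule is ASCII-only for brevity, so a real rule set would need to be broader.

```python
import re


def _digits_only(value):
    """Strip common phone punctuation so only the digits remain."""
    return re.sub(r"[\s().+-]", "", value)


# Hypothetical rules; each maps a field name to a (description, check) pair.
ENTRY_RULES = {
    "first_name": (
        "letters, spaces, hyphens or apostrophes only",
        lambda v: re.fullmatch(r"[A-Za-z][A-Za-z' -]*", v) is not None,
    ),
    "email": (
        "must look like name@domain.tld",
        lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    ),
    "phone": (
        "7 to 15 digits after removing punctuation",
        lambda v: _digits_only(v).isdigit() and 7 <= len(_digits_only(v)) <= 15,
    ),
}


def validate_entry(entry):
    """Return {field: reason} for every field that is missing or breaks its rule."""
    errors = {}
    for field, (description, check) in ENTRY_RULES.items():
        value = (entry.get(field) or "").strip()
        if not value:
            errors[field] = "required field is empty"
        elif not check(value):
            errors[field] = description
    return errors


if __name__ == "__main__":
    print(validate_entry({"first_name": "J0nathan", "email": "jon@example", "phone": ""}))
```

Rules like these reject a bad record at the moment it is typed, which is far cheaper than finding it weeks later during cleanup.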
Lead the Company to a Truly Data-Driven Mindset
This term is quite a buzzword these days, but you don’t become data-driven by hoarding data or by pulling random reports willy-nilly as you go. A truly data-driven organization has a process for storing, managing, and using all its data. You should value quality over quantity, efficiency over traditionalism, and real success over fancy notions. For instance, using automation over ETL tools increases efficiency. Ensuring data is accurate and usable helps you achieve better targeted marketing goals than investing in cloud storage that does nothing to improve your data usability. Your data analyst/solution architect/engineer is the best person to help you achieve this goal.
Be a Crucial Part of M&As and Migration Projects
Planning a merger or a migration? You need an experienced data analyst before and after the process to ensure that your organization’s most valuable asset, its data, is safely transferred and that context and quality don’t vanish along the way.
Taming the Chaos of Big Data Projects
Any company that can afford to do so is investing in big data. It’s the next big thing. But little do companies know that big data is a nightmare for data-quality professionals. Imagine getting social media, firmographic, demographic, and psychographic data on a million people. Imagine the chaos you’ll be facing with inconsistent data (people using nicknames instead of actual names), duplicates, obsolete information and a dozen other problems. It’s here that your team of data analysts with their sharp analysis skills will be required to help organize, use and make sense of this chaos.
Does this sound sexy now? If it does, it’s exactly what an analyst’s role is supposed to be.
Let me keep the conclusion short. If you want to drive into a data-driven future with confidence, equip your analyst with the best systems, processes and tools to get the job done. This way, they can flip the equation to 80 percent analysis, 20 percent cleaning, and we can finally get better clickbait articles.