Sramana Mitra: Let’s look at use cases now that we understand what you do. Let’s do a before and after. If you were not in the picture, what are they able to do? With you in the picture, what are they able to do?
Nenshad Bardoliwalla: I’ll give you a story from my own life, because this is actually how we started designing the Paxata product. If you’re an analyst today, your number one weapon of choice is Excel. That has been the case for as long as Excel has been available. When it comes to data preparation, the analyst lifecycle looks like the following. I used to get a request from one of the management teams saying, “I need you to answer this analytical question. I need you to find out which of our accounts in Europe have this many escalations logged against them, find out who is assigned to those cases, and how long they have been in operation.”
The first thing I would do is find the raw data sources that might be relevant to answering this question. I would pull data from the transactional CRM system. I would pull data from the Access databases of my colleagues in the European organization. I might have to go into a customer success database to pull out the service request information. At least three to five sources of raw data would be used in the process of data preparation.
What would I do? Either I would have somebody write a SQL statement for me, or I would have access to the data myself. Then I would copy and paste each one of those data sets into a single Excel workbook. I would have four or five tabs with the data from each one of these systems. I would then have to explore that data and try to figure out which of these columns actually had the information that I cared about. Which of these columns had corrupt data or data that didn’t make sense? I would try to fix those issues by writing some pretty nasty formulas that would remove blanks, standardize the lengths, and check that a specific code was exactly eight digits. I would do that for all five sheets.
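For illustration, the kinds of cleanup formulas he describes might look like the following in Excel. The cell references, column layout, and validation rule here are hypothetical, not taken from his actual workbook:

```
=TRIM(A2)                          strips leading and trailing blanks from a text field
=IF(LEN(TRIM(B2))=8, TRIM(B2), "INVALID")   keeps a code only if it is exactly eight characters long
=UPPER(SUBSTITUTE(C2, "-", ""))    standardizes a key (uppercase, no dashes) before matching across sheets
```

Each formula would be filled down the length of a column and repeated, with variations, on every tab, which is part of why this step is so tedious to maintain by hand.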
Then I would try to start gluing the data together. I needed to tie the service request information to the customer account information and the geography information. I would use a technique that all analysts who use Excel know, which is called VLOOKUP. I would use that to pull the additional attributes that I needed onto my data set, so that I could assemble a richer data set with information from across these different data sources. Then I would have to do more normalization or cleaning, and perhaps pivot the data in a variety of ways to create different cuts of it.
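As a sketch of that VLOOKUP step (the sheet name, range, and column position below are made up for illustration):

```
=VLOOKUP(A2, Accounts!$A$2:$D$500, 3, FALSE)
```

This looks up the value in A2 (say, an account ID) in the first column of the Accounts tab and returns the third column of the matching row, for example the account’s region. The final FALSE argument forces an exact match, so an ID that was not standardized in the earlier cleanup step comes back as #N/A instead of silently matching the wrong row.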
At the end of this process, I would have a data set capable of answering the question that my management team asked me. The irony is that we would get these requests multiple times a day. By the time the analysis was due, I had just finished going through this horrendous process of preparing the data. I would spend the last 15 minutes doing very surface-level analysis, trying to figure out, “What was the trend here? Can we figure out if there’s a correlation between these two elements?” I spent so much time pulling the data, trying to clean it, trying to figure out how to connect it together, and trying to reshape it, that I spent very little time actually doing analysis. Most of my job – I’m talking 8 to 16 hours – went into just preparing the data set that could answer the question the management team asked. That is what you find in every analyst’s life today. Analysts spend an inordinate amount of time doing the exact same set of activities that I described.
This segment is part 5 in the series: Thought Leaders in Big Data: Nenshad Bardoliwalla, VP of Products, Paxata