Sramana Mitra: Does that mean they are giving comScore access to their Google Analytics?
Josh Rogers: No. They will download a piece of software that sits in the browser and that sends the information back to comScore.
SM: comScore doesn’t work with individuals, they work with corporations. Is that correct?
JR: Their customers are corporations, but the source of their data is two million individual panelists. That is one source of the data. The second source is that they track 15 of the 20 most visited webpages in the world. They index all the page views of those sites. That is a much larger data set. Together, these two sources allow them to understand who is visiting what information on the Internet. They sell these reports to corporations that are interested in this kind of information.
SM: What is the structure of the consumer data?
JR: URLs – weblog information. Those data sets power different product offerings, and together the two represent the bulk of the data that comScore collects. That equates to about one trillion records per month – a record, in their terms, is a page view in the form of a URL. They are processing about 30 terabytes per week, and the challenge they have is to do this in a cost-effective manner. Syncsort is a key enabling technology for them to do that. They use us on about 200 servers to preprocess all that data before it goes into their Hadoop cluster. What we do is take the URL streams – they have specific pieces of those streams that they care about – and help filter that data out, compress it, partition it, etc. Then they use the Hadoop cluster to execute a lot of the analytics and ETL-type [extract, transform, load] work to prepare the specific tables they need for reporting as well as the data sets they need for custom analytics.
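[Editor's sketch: the filter, partition, and compress steps described above might look roughly like the following in Python. This is only an illustration under stated assumptions, not Syncsort's or comScore's actual implementation. The function name `preprocess`, the choice to filter and partition by host name, and the use of CRC32 and gzip are all hypothetical.]

```python
import gzip
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def preprocess(urls, wanted_hosts, num_partitions):
    """Filter URL records, partition them by host, and gzip each partition.

    Hypothetical helper: the interview does not say which fields are kept
    or how the data is partitioned; host-based rules are assumed here.
    """
    partitions = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host not in wanted_hosts:
            continue  # filter: skip records that are not of interest
        # partition: a stable hash keeps one host's records together
        index = zlib.crc32(host.encode("utf-8")) % num_partitions
        partitions[index].append(url)
    # compress: one gzip blob per partition, one URL per line
    return {
        index: gzip.compress("\n".join(records).encode("utf-8"))
        for index, records in partitions.items()
    }
```

[Each compressed partition could then be loaded into a cluster independently, which is the general reason for partitioning before the load step.]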
SM: Let’s do one more use case.
JR: Another specific type of project we do a fair amount of work on is application modernization. You can think of it as a highly specialized data integration project. That is where customers are interested in taking mainframe applications and moving them to run in open systems technologies. We work with a number of partners on this front. There is an organization called DellAppMod, which is a part of Dell.
Then there are Clarity and Micro Focus, which has an offering called Micro Focus Server. These organizations have rehosting environments that run on either Windows or UNIX and run your JCL in place without having to make code changes. What we do is help them move data off the mainframe and into open systems, translate it from EBCDIC to ASCII, and make sure the application runs in the new operating system. We remain in place and provide a highly reliable, scalable sort so that those applications can continue to run as reliably as they did on the mainframe.
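[Editor's sketch: the EBCDIC-to-ASCII translation mentioned above can be illustrated with Python's built-in `cp037` codec, one common EBCDIC code page. This is a minimal example, not the tooling discussed in the interview. The function name `ebcdic_to_ascii` is hypothetical, and the assumption that a record is plain text in code page 037 will not hold for records containing binary or packed-decimal fields, which need field-by-field handling.]

```python
def ebcdic_to_ascii(record: bytes, codepage: str = "cp037") -> bytes:
    """Translate one text-only EBCDIC record to ASCII bytes.

    Assumes the whole record is character data in the given code page;
    characters with no ASCII equivalent are replaced with '?'.
    """
    text = record.decode(codepage)
    return text.encode("ascii", errors="replace")


# "HELLO" as EBCDIC (code page 037) bytes
sample = b"\xc8\xc5\xd3\xd3\xd6"
print(ebcdic_to_ascii(sample))  # b'HELLO'
```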
This segment is part 3 in the series: Thought Leaders in Big Data: Interview with Josh Rogers, SVP of Data Integration Business at Syncsort