
Thought Leaders in Big Data: Interview with Steve Shine, CEO of Actian (Part 2)

Posted on Friday, Nov 8th 2013

Steve Shine: People aren’t tapping into just an ERP system. They are tapping into applications and new forms of data sources that are either machine generated or cloud generated: social media, mobile, and streams of data on an entirely different scale from anything that legacy technologies can deal with. Frankly, this is why Hadoop has grown. Hadoop is quite good at dealing with massive scale.

We wanted to bring the classic ETL [extract, transform, load] capability into the massive scale capability of Hadoop. That is when we first came across Pervasive Software. What we found is that the Pervasive team had seen, five years earlier, that Hadoop was going to be one of the dominant technology platforms of the future. We also saw that they had tried to move all their rich ETL functionality onto Hadoop and found what everybody knows today, which is that Hadoop is great for scale but not great for performance. They realized that whilst MapReduce was a good basic building block, it was not great in terms of productivity and code development, and it is a very slow run-time environment. So they rewrote the MapReduce layer and produced a product that we call Dataflow. It is a very high-performance layer that replaces MapReduce, so that you can get really high-performance capability on Hadoop clusters. Then they moved all their ETL functionality to operate on that high-performance layer and on the massive scalability of Hadoop.
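To make the MapReduce programming model he is describing concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce phases. This is purely illustrative (the product names, like "sales per product," are invented for the example); it is not Hadoop or Actian Dataflow code, just the pattern those systems implement at cluster scale.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce model: map emits (key, value)
# pairs, shuffle groups them by key, reduce aggregates each group.
# Hadoop runs each phase distributed across a cluster.

def map_phase(records):
    # Emit (key, value) pairs; here: sale amounts keyed by product.
    for product, amount in records:
        yield product, amount

def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate each key's values; here: total sales per product.
    return {key: sum(values) for key, values in grouped.items()}

records = [("apples", 2.0), ("pears", 1.5), ("apples", 3.0)]
totals = reduce_phase(shuffle(map_phase(records)))
# totals == {"apples": 5.0, "pears": 1.5}
```

The point Shine makes is that while this model scales, expressing a rich ETL pipeline as chains of such jobs is slow to develop and slow at run time, which is the gap a higher-performance dataflow layer targets.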

For us that was a terrific first step, because it enables you to digest and process all that data at a completely different scale – at a high performance level and at a different cost point. I think that is probably one of the biggest challenges that you are seeing in the traditional legacy market – the cost to deal with that amount of data is just prohibitive. That is why a lot of people have adopted Hadoop and are excited by a completely different set of technologies – high performance and a completely different price point. That was a great first step, and it really allowed us to get data in, cleansed, prepared, and augmented to either run analytics on it in Hadoop or to pass that data on to our Vectorwise technology for lower latency/higher performance analytics.

The additional component for us was ParAccel. In April 2013 we also acquired ParAccel, which is an MPP [massively parallel processing] high-performance database for analytics. Why were we excited about it? Up until that point, we were able to offer analytics and data processing at massive scale on Hadoop, and the Dataflow technology brought not only ETL but also a full suite of analytics with it. That is great if your use case doesn’t require low latency or high performance. It is all about scale.

If you are in retail and you want to get a market basket analysis on three years’ worth of point-of-sale data, that is an enormous challenge, and Hadoop is absolutely the best platform for it. It may take you a day to get that analysis. But if you really do understand how people’s associated buying works, you can lay out your stores to increase your sales for every basket that goes through. Performance is not really an issue there. I don’t care if it takes a day to get that answer. We had that part of the solution. Vectorwise is where someone is really slicing and dicing and asking conversational questions over 10 to 20 terabytes of data – a lot of data. But there was a really big gap between five, 10, or 20 terabytes of querying and the Hadoop petabyte querying. It was that middle space where MPP does so well.
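The market-basket analysis he mentions boils down to counting how often products are bought together. A minimal sketch, with invented example baskets, of the pair-counting at the heart of it:

```python
from itertools import combinations
from collections import Counter

# Hypothetical example baskets; a real retailer would scan years of
# point-of-sale transactions, which is where Hadoop's scale matters.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
]

# Count co-occurrences: every unordered pair of items in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# ("bread", "butter") appear together in 2 of the 3 baskets.
```

Pairs that co-occur far more often than chance predicts drive the store-layout decisions described above; on a day-long batch job, latency is irrelevant, which is exactly Shine's point about this workload fitting Hadoop.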

Let’s say you are a risk platform in investment banking, for example. When an investment bank places a trade, regulation requires it to know the risk profile of all the other trades in its portfolio once this new trade is added to it. You cannot wait hours for that. Actually, you cannot even wait minutes. You need to know within seconds whether or not you are able to place that trade. The scale of data you are looking at is often hundreds of terabytes. That is where the MPP aspect really comes into play. For us, it means being able to say, “No matter what your analytics and reporting demands are, whether they are ultra low latency on relatively large footprints, or massive footprints with really low latency, or truly huge footprints like petabytes of data, we are able to offer the underlying engines that are optimized for the analytical outcomes.”

That is what got us to where we are today. We have compiled an end-to-end platform capability to connect, cleanse, and aggregate data on a Hadoop platform and then offer it to the reporting and analytics environments to meet whatever the business need is on the far end. It is all operating on commodity hardware at a much lower cost than legacy software can ever evolve to.

We think we are at a terrific point now: when you look at the macro trend of more and more data, and at the demands for cost, performance, and scale, that puts us in a strong and relatively unusual position.

This segment is part 2 in the series : Thought Leaders in Big Data: Interview with Steve Shine, CEO of Actian
