Sramana Mitra: I have a question in that context. There’s a lot of processing going on midstream of traffic coming in. Is it all happening in real time? How do you deal with delays and latencies?
Amir Husain: First of all, we’re not blocking things until the final answer arrives. In other words, we’re not inserting ourselves as a delay in the servicing of whatever requests our clients or customers are looking to service. All this data exhaust is going into our system, and a growing level of confidence is built up as deeper and deeper research happens. You clearly don’t want to run a real-time NLP research query while the customer is waiting to get their web page back.
Sramana Mitra: That’s right.
Amir Husain: We can do a lot of stuff in real time, which is quicker. It might be knowledge that we have already learned that we can apply. Still, there are things that might look fishy while the system continues to do what it would have done, as long as the action falls within a range of things that are not completely disastrous. You can’t just take a web service offline while you figure out whether a request is coming from a human being or some kind of price-scraping robot targeting the e-commerce property. That’s now a big problem.
According to certain studies, 56% of the traffic on the web is generated by robots that are basically eating up half your investment in your data centre while delivering no revenue, and potentially even causing harm, because many of them are security-scanning bots. The backend infrastructure is building up that confidence over time. It’s not blocking transactions. Of course, we use a very scalable infrastructure. We use Spark, Hive, Hadoop, and a lot of known, deployed, and tested components to power all of these algorithms.
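The out-of-band pattern Husain describes can be sketched in a few lines: requests are served immediately, while a separate analyzer accumulates suspicion per source and only flags once confidence is high enough. This is a minimal, hypothetical illustration; the signal names, weights, and threshold are invented for the sketch and are not SparkCognition’s actual system.

```python
from collections import defaultdict

# Hypothetical per-signal suspicion weights (illustrative only).
SUSPICION_WEIGHTS = {
    "no_js_executed": 0.2,   # client never ran the page's JavaScript
    "uniform_timing": 0.3,   # requests arrive at machine-regular intervals
    "catalog_sweep": 0.4,    # sequential crawl of every product page
}
FLAG_THRESHOLD = 0.8

class OutOfBandAnalyzer:
    """Builds confidence over time instead of blocking each request."""
    def __init__(self):
        self.scores = defaultdict(float)

    def observe(self, source_ip, signals):
        # Called from the log pipeline, never from the request path,
        # so the customer's page is served with no added latency.
        for s in signals:
            self.scores[source_ip] += SUSPICION_WEIGHTS.get(s, 0.0)
        return self.scores[source_ip] >= FLAG_THRESHOLD

analyzer = OutOfBandAnalyzer()
first = analyzer.observe("10.0.0.5", ["no_js_executed"])           # not enough yet
flagged = analyzer.observe("10.0.0.5", ["uniform_timing", "catalog_sweep"])
print(first, flagged)  # False True
```

The request path never waits on this loop; the score simply crosses a threshold once enough evidence of bot-like behavior has accumulated.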
Sramana Mitra: What would you like to illustrate with customers? I understand the algorithm, more or less. What would you like to illustrate further using customer examples?
Amir Husain: I’ll give you a couple of examples. One of our customers is ExamSoft. It’s a very large provider of online testing. They administer the California Bar Exam and a number of other state exams. Because they are delivering examinations, they are a frequent target for hackers looking to get keys or questions before they have been deployed in the context of an actual exam. We’ve been providing them with services for over a year now.
In their data streams, we have uncovered several incidents of attacks and types of threats that weren’t seen before. One example was back in September 2014 when we had just launched the service with them as one of our early customers. We started seeing these odd signatures of an attack in their web logs. The algorithms flagged that as being something very strange even though it didn’t match a hard signature of a known threat. As the NLP research generated additional evidence, it turned out that was one of the first incidents involving Shellshock. I’m not even sure if it was called Shellshock at that time. The name was given to it publicly on the 24th of September 2014.
During that time, we started to see these odd patterns. The NLP-based research was able to uncover that this was an instance of Shellshock before the threat had a name or before we’d seen its signature. That became a great validation for the kind of techniques being used. For another large customer that’s focused on e-commerce, we recently uncovered an attack whereby there was a lot of traffic going to a very well-known media property.
It’s one of the large news sites. On the surface, there wasn’t anything strange about it, and that large media property wouldn’t have been blacklisted. But the actual pattern of which internal nodes were generating that traffic, and what kind of traffic it was, made the algorithms suspicious. They started to research deeper into the requests that were being made. At the end of it, what we discovered was that for a brief moment, for less than one day, this media property had been hacked and malware had been placed on that domain, which bot malware inside our customer’s network was accessing and spreading across the network.
You would not have caught that with a blacklist rule. You would not have caught it simply by looking at the log and saying, “There’s a request going to a web property.” It was all of these things put together, and the way the cognitive analysis was done by the algorithm, that uncovered these threats. These are a couple of examples.
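The contrast Husain draws can be made concrete: a reputation blacklist says nothing about a well-known domain, while looking at which internal nodes are making the requests, and how regularly, exposes the beaconing. The rules, names, and threshold below are invented for this sketch, not the actual detection logic.

```python
# Reputation-based rule: a static blacklist of known-bad domains.
BLACKLIST = {"evil-domain.example"}

def blacklist_verdict(domain):
    """Returns True only if the domain is already known bad."""
    return domain in BLACKLIST

def behavioral_verdict(events):
    """Flag if internal server-tier nodes (which shouldn't browse news
    sites) contact the same domain at machine-regular intervals,
    regardless of the domain's reputation. Thresholds are illustrative."""
    gaps = [b["t"] - a["t"] for a, b in zip(events, events[1:])]
    regular = bool(gaps) and max(gaps) - min(gaps) <= 1  # near-constant period
    servers = all(e["node_role"] == "server" for e in events)
    return regular and servers

# A compromised but reputable news site: requests every 60 seconds
# from server-tier machines inside the customer's network.
events = [{"t": t, "node_role": "server"} for t in (0, 60, 120, 180)]
miss = blacklist_verdict("well-known-news.example")  # False: domain looks clean
hit = behavioral_verdict(events)                     # True: beaconing pattern
print(miss, hit)
```

The blacklist misses the briefly compromised domain entirely; only the combination of who is talking, to what, and how often reveals the infection.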
This segment is part 3 in the series : Thought Leaders in Cyber Security: Amir Husain, CEO of SparkCognition