If you haven’t yet seen the announcements and initial demo of Viv, you should. Built by several key members of the original team behind Siri, this is the next generation of virtual assistant. Many of the current tools from the big software players – Amazon’s Alexa, Microsoft’s Cortana, and the aforementioned Siri from Apple – have a large and growing list of commands they can execute, but they are essentially scripted responses.

If a programmer has written the routine to add a task to your to-do list (Wunderlist, I’m still waiting for Siri integration!), then the system can do it. Once you go off-script though, especially with a more complex request, the results deteriorate rapidly. Don’t ask your phone to find out how to cook the perfect BBQ turkey and order the necessary elements for delivery to your brother’s house: you will be offered a search page, at best. Where Viv has taken a step forward is in allowing a more sophisticated, conversational style of interface that acts as glue between multiple application services. Viv could handle the BBQ turkey example, provided it had access to the relevant applications. It can understand much more complex language (at least in the demo) and has the AI power to pull various services together to achieve the intent of your request.

But there are big limitations. When we’re at work, our daily tasks are not to order pizza, buy flowers, or get an Uber. If you work in a major industrial company, you may want to understand whether the electric submersible pump in Permian Basin Rig 279A has a problem with gas lock or is temporarily clogged with debris. Perhaps you’re a researcher looking at the causal effect of a specific gene on prostate cancer. Neither Siri nor Viv will be able to answer these questions.

Why? Because the question hasn’t been answered before, and these are much harder problems to solve (we’ll come to why in a moment). That’s where data scientists usually come in. They research the problem and find a scientific, data-driven answer. What if you could get Viv (or Siri) to be THAT smart? What if they could find answers to questions where the data is not readily available, or where the answer has not been taught before? Could next-generation chat bots be as smart as data scientists? And work faster and cheaper? The answer is yes, but first they need to be able to handle the following challenges:


1. Scalability: We’re looking at petabytes of data for many industrial-class problems. For example, Pratt & Whitney’s new jet engine has 5,000 sensors generating up to 10 GB of data per second (compared to 250 sensors on a traditional engine). Aviation Week Network, describing this engine, stated that ‘a single twin-engine aircraft with an average 12-hour flight-time can produce up to 844 TB of data.’ Given the number of these new engines expected to reach the market, Pratt & Whitney is looking at zettabytes (10^21 bytes) of engine data over a 2-3 year period. To put that in context, a couple of years ago Google revealed they were tracking 30 trillion web pages, in an index file that was approximately 1,000 TB in size. Imagine every jet engine manufacturer trying to analyze data that is at least a million times larger than Google’s index. Perhaps the end user of that data would want to scan across multiple engine types from different manufacturers, or combine engine data with airframe and weather data. Think of the volume of data involved in that advanced query. Incidentally, Pratt & Whitney has a significantly smaller market share than GE, Rolls-Royce, and CFM, and the world commercial airline industry is expected to triple in size over the next few years. If you think the Data Warehouse appliances of the noughties cured all our data performance issues, or that Hadoop has taken this problem away, think again. You can put any AI you want in front of this data, but you still need to rethink your approach to scalability if you’re going to get a response to your query inside a week.
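For anyone who wants to sanity-check those numbers, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above. The 1,024 GB per TB conversion is my assumption about how the 844 TB figure was reached; with decimal units the same flight comes out at 864 TB.

```python
# Back-of-the-envelope check of the engine data volumes quoted in the text.
GB_PER_SECOND_PER_ENGINE = 10   # peak rate quoted for the new engine
ENGINES_PER_AIRCRAFT = 2        # twin-engine aircraft
FLIGHT_HOURS = 12

flight_gb = GB_PER_SECOND_PER_ENGINE * ENGINES_PER_AIRCRAFT * FLIGHT_HOURS * 3600

# Assumption: the quoted 844 TB uses binary units (1 TB = 1,024 GB).
flight_tb_binary = flight_gb / 1024
flight_tb_decimal = flight_gb / 1000

# Compare a zettabyte with the ~1,000 TB index size quoted for Google.
ZETTABYTE_BYTES = 10**21
GOOGLE_INDEX_BYTES = 1_000 * 10**12
ratio = ZETTABYTE_BYTES // GOOGLE_INDEX_BYTES

print(f"One flight: {flight_tb_binary:.0f} TB (binary) or {flight_tb_decimal:.0f} TB (decimal)")
print(f"A zettabyte is {ratio:,} times the quoted index size")
```

The point of the exercise is the order of magnitude rather than the exact figure: a single long-haul flight at peak rate lands in the high hundreds of terabytes whichever convention you pick.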


2. Distributed nature of data. If we only had to worry about the volume of data, life would be easy! Unfortunately, this data is scattered across the world and generated in real time (by engines at 36,000 feet, for example), and despite what many vendors will tell you, it is simply not an option to stream all of this raw data to the cloud. Lots of information will live (and die) at the edge, or in the Fog, as Cisco likes to call this zone. We’ll need to process data at the edge while simultaneously transferring relevant pieces to the cloud, dynamically changing our ideas about which pieces we want where, while running applications centrally and locally. Imagine a smart city with cameras and audio sensors on every streetlight, perhaps somewhere like London. There are 5.5 million streetlights in the UK, so let’s assume there are 500,000 in the metropolitan area, all streaming audio and video data in real time (about 2 GB of data per hour each if we’re talking HD). That’s a petabyte per hour. Now imagine that we want to identify a gunshot, and transfer video footage and stills of the scene to the local police station, immediately alerting them to an incident. Are we going to do that by crunching a petabyte of data streaming into our data center? Or will we run that gunshot identification app locally? I think the answer is clear. The reality is that much of this data (perhaps 80%+) will never be streamed into the cloud. However, we still need to access it after an event – to get a clear video of a crime, or a good facial image of the perpetrator, for example.
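To make the edge-versus-cloud trade-off concrete, here is a minimal sketch in Python. It first reproduces the petabyte-per-hour estimate from the streetlight figures above, then shows the shape of an edge filter that forwards only the interesting frames. Everything named here (`AudioFrame`, `looks_like_gunshot`, `frames_to_upload`, the 0.9 threshold) is hypothetical and invented for illustration. In particular, the loudness check is only a placeholder for a real acoustic classifier.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Figures from the text: 500,000 streetlights at roughly 2 GB per hour each.
STREETLIGHTS = 500_000
GB_PER_LIGHT_PER_HOUR = 2
hourly_pb = STREETLIGHTS * GB_PER_LIGHT_PER_HOUR / 1_000_000  # 1 PB = 1,000,000 GB


@dataclass
class AudioFrame:
    """Hypothetical unit of audio captured by one streetlight."""
    streetlight_id: str
    timestamp: float
    samples: List[float]  # normalized amplitudes in [-1.0, 1.0]


def looks_like_gunshot(frame: AudioFrame, threshold: float = 0.9) -> bool:
    """Placeholder detector: flags any frame containing a very loud peak.

    A real deployment would run a trained acoustic model here instead.
    """
    return any(abs(sample) >= threshold for sample in frame.samples)


def frames_to_upload(frames: Iterable[AudioFrame]) -> Iterator[AudioFrame]:
    """Runs at the edge: yield only the frames worth sending to the cloud."""
    for frame in frames:
        if looks_like_gunshot(frame):
            yield frame


frames = [
    AudioFrame("light-001", 0.0, [0.01, -0.02, 0.03]),
    AudioFrame("light-002", 0.0, [0.05, 0.97, -0.40]),
    AudioFrame("light-003", 0.0, [0.10, -0.12, 0.08]),
]
alerts = list(frames_to_upload(frames))
print(f"{hourly_pb} PB generated per hour; {len(alerts)} of {len(frames)} frames uploaded")
```

The design choice worth noticing is that the filter sits next to the sensor, so the bulk of the raw stream never has to leave the streetlight unless someone asks for it after the event.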

[I should also mention distributed virtual assistants. You’ve probably noticed that you need an internet connection to use Siri or Cortana or Alexa, which is because the voice analysis takes place in the cloud, as it requires significant processing power. Before virtual assistants can really become ubiquitous, they need to function locally, remotely, off-grid. If someone can embed that in silicon, they’ll have a very popular chip.]


3. Quality and Meaning of the data. Any time you have this much data, especially when it’s distributed, and you need to combine the data to answer a question, there’s a problem. Data is dirty, incomplete, messy, and incorrectly calibrated, and different systems use different naming and coding schemes. There’s a large industry dedicated to helping analysts and end-users combine this data. Traditionally, this was known as ETL (Extract, Transform, Load) and was dominated by Informatica, Ab Initio, Talend, and in-house solutions from big vendors, like Microsoft’s SQL Server Integration Services. Recently, we have seen the rise of self-service tools like Alteryx, SourceThought, Paxata, and Trifacta. The primary function of these tools is to prepare data for analytics. Without such tools, it can take weeks or months before data is ready for analysis.
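As a toy illustration of what that preparation involves, here is a sketch in Python that blends records from two systems that name the same pumps differently and report in different units. The record layouts, field names, and values are all made up for the example, and real ETL or self-service tools do far more than this.

```python
# Hypothetical records from two systems describing the same pumps.
SYSTEM_A = [
    {"pump_id": "ESP-279A", "intake_temp_f": 212.0},
    {"pump_id": "ESP-301B", "intake_temp_f": None},   # missing reading
]
SYSTEM_B = [
    {"asset": " esp_279a ", "pressure_kpa": 1850.0},  # different naming scheme
    {"asset": "esp_301b", "pressure_kpa": 1720.0},
]


def normalize_id(raw: str) -> str:
    """Map both naming schemes onto one canonical identifier."""
    return raw.strip().upper().replace("_", "-")


def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32) * 5 / 9


def blend(system_a, system_b):
    """Merge both sources into one record per pump, in consistent units."""
    merged = {}
    for row in system_a:
        key = normalize_id(row["pump_id"])
        temp_f = row["intake_temp_f"]
        # Keep missing readings explicit rather than inventing a value.
        temp_c = None if temp_f is None else round(fahrenheit_to_celsius(temp_f), 1)
        merged.setdefault(key, {})["intake_temp_c"] = temp_c
    for row in system_b:
        key = normalize_id(row["asset"])
        merged.setdefault(key, {})["pressure_kpa"] = row["pressure_kpa"]
    return merged


for pump, reading in blend(SYSTEM_A, SYSTEM_B).items():
    print(pump, reading)
```

Even in this tiny case there are three separate decisions (how to match identifiers, which unit wins, what to do with gaps), which is why the work balloons at industrial scale.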


Not only do you have to clean and blend data though, you also need to understand its meaning. Google can look up facts for you (‘What is the state bird for California?’), but new semantic approaches like Maana (who recently closed a $26m Series B funding round) can provide detailed meaning, creating a knowledge graph that provides a view on specific assets or processes, wherever that data resides in an organization.


4. Automation of data science. Once you have your clean, meaningful data, what do you do with it? You might run a report, filter the results, add some additional data, change the visualization approach, and find a fragment of knowledge that sends you back into the loop, running and refining different steps until you have uncovered some truth about a situation. The common parlance is to call that an insight. This takes time and expertise, both of which seem to be in remarkably short supply. We have spoken to several companies that have an open book on recruiting data scientists – they are hard to find and keep, because demand is still outstripping supply. People are the bottleneck in the process of garnering further insights. If we want to talk about AI, this is surely the place to apply it. Suppose we can automate many of the recurring steps in data science, so that scientists and professional business users alike are augmented by super-smart assistants that understand Granger causality and vector autoregression, that run multiple steps in parallel, and that suggest not the best pizza in town but the most likely root cause of a specific scenario, based on the thousands of similar situations the system has reviewed. Then we have something genuinely transformative.
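To give a flavour of the kind of recurring step that could be automated, here is a small Python sketch that ranks candidate sensor signals by how strongly their past values track a target signal. To be clear, this is not a Granger causality test: a real test compares autoregressive models fitted with and without the candidate’s lagged values, whereas this sketch swaps in a much simpler lagged correlation. The sensor names and readings are invented for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lagged_correlation(cause, effect, max_lag=3):
    """Return (lag, r) for the lag at which past `cause` best tracks `effect`."""
    best_lag, best_r = 0, 0.0
    for lag in range(1, max_lag + 1):
        # Pair cause[t] with effect[t + lag].
        r = pearson(cause[:-lag], effect[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


def rank_candidates(candidates, target, max_lag=3):
    """Sort candidate signals by the strength of their best lagged correlation."""
    scored = [
        (name, *best_lagged_correlation(series, target, max_lag))
        for name, series in candidates.items()
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Invented readings: the fault signal echoes vibration two steps later.
candidates = {
    "vibration": [0, 1, 0, 3, 0, 2, 0, 5, 0, 1],
    "ambient_temp": [20, 21, 20, 21, 20, 21, 20, 21, 20, 21],
}
fault_signal = [0, 0, 0, 1, 0, 3, 0, 2, 0, 5]

for name, lag, r in rank_candidates(candidates, fault_signal):
    print(f"{name}: best lag {lag}, correlation {r:.2f}")
```

An assistant running thousands of such comparisons in parallel, with proper statistical tests in place of this shortcut, is the sort of augmentation I have in mind.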


I’m being a little unfair on Viv. This technology is not trying to solve the world’s problems. If it works as advertised, it will be a major component in our future lives, helping people talk naturally to their virtual assistant and combining various services in useful ways. I’m pretty excited by the proposition. In some ways though, Viv is purely an advanced user interface, only as powerful as the underlying applications. To gain traction, the team at Viv will need to attract a large portfolio of third-party developers, who will hope that their service will rise to the top – a chance to be first place in the search list of the ‘new’ Google.

The gauntlet has been thrown down, though. If someone can combine a Viv-like interface with the ability to manage vast quantities of distributed data, blending and federating data to eliminate quality and performance issues, so that automated data science can run at pace across meaningful information… well then we’ll truly be in a new world. I believe the pieces are already out there for this type of solution. Over the next few blog posts, I’ll take a look at how to solve each of the challenges, and how a large organisation with the right resources could assemble a revolutionary platform.