Internet of Things and Big Data – The concept of Data Gravity

A number of posts in this blog have dealt with the increasing variety and sheer number of ‘things’, be they sensors, wearables, appliances, actuators or industrial components, that will gain connectivity over the coming few years. There will, however, be no point adding connectivity to these items if they do not produce meaningful data, and in turn there will be no point collating that data unless it yields insights that can be acted upon. This is why the ‘Internet of Things’ is invariably coupled with its close relative in the family of hyped buzzwords – Mr ‘Big Data’.

To give an example of the scale of data being produced, Cisco estimates that by 2018, connected devices will generate 400 zettabytes of data annually. (I too needed to look that up. Apparently a zettabyte is 10²¹ bytes, or 1 billion terabytes.)
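For what it’s worth, a quick back-of-the-envelope calculation (a sketch of my own, using nothing more than Cisco’s 400-zettabyte figure) puts that in slightly more familiar units:

```python
# Rough sanity check of Cisco's 400 zettabytes/year estimate.
ZETTABYTE = 10**21            # bytes (decimal definition)
TERABYTE = 10**12             # bytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

annual_bytes = 400 * ZETTABYTE

print(annual_bytes / TERABYTE)          # 4e+11, i.e. 400 billion terabytes a year
print(annual_bytes / SECONDS_PER_YEAR)  # ~1.3e+16 bytes, i.e. roughly 12.7 petabytes every second
```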

Now this figure is as much conjecture as it is incomprehensible to most of us. Nevertheless, it highlights one of the key challenges in gaining value from all these connecting ‘things’, namely the requirement to process and analyse data, and act upon it to generate business value – whether by reducing operational costs, improving customer experience or increasing revenue. By way of example, acting upon the data could mean carrying out preventative maintenance in an industrial application, making a diagnosis in a health application or recommending an appropriate traffic routing in a particularly congested metropolitan area. So without analysis the data is worthless, and there is no justification for investing in the devices, sensors and communications infrastructure that an “Internet of Things” system requires.

This is not a problem for the likes of Google and Amazon, who have built their entire business on predicting customer behaviour with sophisticated machine-learning models. Similarly, in the industrial internet, a big player like GE can easily deal with the volume of data produced by its jet engines, power plants and countless other industrial machines. The real challenge falls to smaller companies, whose expertise lies in the specific industry within which they operate, not in the artificial intelligence and data science disciplines that underpin big data analysis. They range from the raft of industrial operations whose processes could be optimised by clever use of analytics, to the IoT start-up working on a shoestring and therefore unable to access the expertise offered by firms such as IBM, Accenture and Capgemini.

One answer to this problem comes in the form of self-service, cloud-based analytics engines. Two weeks ago at the Microsoft Futures event in London, much was made of the Azure Cloud Machine Learning service. Now also available via a free trial, it provides access to a wide range of algorithms, allowing users to build bespoke APIs on top of analytical engines underpinned by Microsoft’s decades of research into machine learning. The Microsoft package includes a host of wonderfully named algorithms, including Deep Neural Networks, Scalable Boosted Decision Trees and my favourite, Decision Jungles.
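To give a flavour of what ‘building a bespoke API’ amounts to in practice, the sketch below posts a row of sensor readings to a published scoring endpoint and reads back a prediction. The endpoint URL, API key and payload layout are placeholders of my own invention, not the actual Azure interface; the real schema comes from the service documentation once a model has been published.

```python
import json
import urllib.request

# Placeholder endpoint and key - the real values (and the exact JSON schema)
# are provided when a trained model is published as a web service.
ENDPOINT = "https://example-region.services.example.com/score"
API_KEY = "YOUR-API-KEY"

payload = {
    "inputs": [
        {"temperature": 78.4, "vibration": 0.92, "hours_in_service": 1450}
    ]
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,
    },
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # e.g. a predicted probability of failure
```

The point is less the plumbing than the division of labour: the heavy lifting happens in someone else’s data centre, and the device-side code stays trivial.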

Of course, Microsoft are neither the first nor the only outfit to offer this sort of analytical engine as a cloud service. Google, for example, offers its Google Prediction API, though it is very coy about which algorithms underpin it. Additionally, the (imaginatively titled) Predictive Analytics Today website lists another twenty-eight sites that provide such capabilities. The Microsoft offering does, however, appear to represent a significant step forward: the provision of a ‘no coding’ option, seamless integration with existing algorithms and easy integration with other Microsoft products all lower the bar to accessing cutting-edge big data capability.

While this field is undoubtedly exciting, in my view one of the main challenges is to understand how realistic it is to apply a centralised cloud-based analytics engine to an IoT scenario generating gigabytes of data. Especially where data arrives at a very high rate and a quick reaction time is required, it may simply not be practical to push everything to a centralised cloud API. Beau Cronin, a leading exponent in this space, calls this the problem of Data Gravity: for truly large data sets, it makes far more sense to take the compute to the data than the other way round. In other words, instead of sending data to the cloud, bring the algorithms on site, in direct contradiction to the thirty “machine learning in the cloud” offerings mentioned above. Although Dr Cronin does not offer a way forward, I feel that opening up the machine learning capability of Microsoft et al to a wide ecosystem of players in the world of connected devices promises so much value that these platforms will surely become a rich source of innovation and exciting services.
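To make the ‘take the compute to the data’ idea concrete, here is a minimal sketch of my own (not anything Dr Cronin prescribes): a simple filter running next to the sensor scores each reading locally, so that only the unusual ones ever make the trip to a cloud API.

```python
import random
from statistics import mean, stdev

def unusual_readings(readings, window=50, threshold=3.0):
    """Yield only the readings that sit more than `threshold` standard
    deviations from the rolling mean of the previous `window` samples.

    Run next to the sensor, this keeps the bulk of the data on site;
    only the anomalies need to travel to the cloud."""
    history = []
    for value in readings:
        if len(history) >= window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield value           # forward this reading upstream
            history.pop(0)
        history.append(value)

# Example: a noisy but steady signal with a single spike.
random.seed(1)
stream = [20.0 + random.uniform(-0.5, 0.5) for _ in range(300)]
stream[150] = 35.0
print(list(unusual_readings(stream)))  # the spike is flagged; the routine noise stays put
```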

[Dilbert cartoon on big data]
