We live in an incredible era of extremely rapid, disruptive innovation. Globalization, accelerated technological change, infinite cloud scale, ubiquitous connectivity, and an internet of smart things powered by artificial intelligence are enabling a fourth industrial revolution: the digital transformation.
Successful organizations are masters of data. A culture of analytics permeates today’s most advanced companies, and building that culture requires bringing together an organization’s two greatest assets: its people and its data. Organizations that can extract intelligence from massive volumes of data enjoy unprecedented opportunity in the digital transformation. Here are a few key technologies to understand as we modernize analytics for a new world of data.
Internet of Things (IoT)
Digital transformation is fueling the growing maturity and affordability of edge technologies that can communicate with the internet. According to Gartner estimates, by 2020 there will be more than 26 billion connected devices. From vehicles, appliances, machines, cellphones, and wearable devices to just about anything else you can think of, intelligent things powered by data will compute, communicate, sense, and respond. As more business processes and decisions get automated, scalable data storage, secure digital data lifecycle management, solid metadata management, and enhanced data quality procedures will rise in importance for efficiently sharing or monetizing data.
IoT drives demand for big data analytics to uncover hidden patterns, unknown correlations, and other useful information. In big data analytics, advanced analytical techniques such as deep learning are used with large diverse data sets of structured, unstructured, and streaming data ranging from terabytes to zettabytes. Unstructured data sources do not fit in traditional data warehouses. Thus, a new ecosystem of big data analytics technologies has been developed to ingest, process, store, and analyze unpredictable volumes, velocities, and varieties of data.
Hadoop for Big Data Analytics
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle extreme volumes of concurrent tasks or jobs. In a modern analytics architecture, Hadoop provides low-cost storage and data archival for offloading old historical data from the data warehouse into online cold storage. It is also used for IoT, data science, and unstructured analytics use cases.
Within the Hadoop framework, there is a plethora of related technologies for loading, organizing, and querying data. Here is a list of the most popular ones today.
Apache Spark – an open-source cluster computing framework with highly performant in-memory analytics and a growing number of related projects
Apache Kafka – a distributed streaming platform for building real-time data pipelines and streaming apps
MapReduce – a parallel processing software framework that takes inputs, partitions them into smaller problems, and distributes them to worker nodes
Hive – a data warehousing infrastructure with a SQL-like query language
Hadoop Distributed File System (HDFS) – the scalable system that stores data across multiple machines without prior organization
YARN (Yet Another Resource Negotiator) – provides resource management for the processes running on Hadoop
Ambari – a web interface for managing Hadoop services and components
Cassandra – a distributed database system
Flume – software for streaming data into HDFS
HBase – a non-relational, distributed database that runs on top of Hadoop
HCatalog – a table and storage management layer
Oozie – a Hadoop job scheduler
Pig – a platform for manipulating data stored in HDFS
Solr – a scalable search tool
Sqoop – moves data between Hadoop and relational databases
Zookeeper – an application that coordinates distributed processing
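The MapReduce pattern listed above can be sketched in plain Python. This is a simplified, single-process illustration of the map, shuffle, and reduce phases; in a real Hadoop cluster, these phases run distributed across many worker nodes over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input split
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big analytics", "data data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'analytics': 1}
```

The power of the model is that the map and reduce functions are embarrassingly parallel, so the same word-count logic scales from one laptop to thousands of machines.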
Notably, over the past two years, Apache Spark has moved from being a component of the Hadoop ecosystem to a big data analytics platform of choice, and it is growing faster than Hadoop itself. Spark provides dramatically faster data processing than Hadoop’s MapReduce. Spark spans many related projects, including the core Apache Spark runtime, Spark SQL, Spark Streaming, MLlib, ML, and GraphX. It is now the largest big data open source project, with 1,000-plus contributors from more than 250 organizations.
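A key reason for Spark’s speed is its programming model: transformations such as map and filter are recorded lazily and only evaluated, in memory, when an action is called. The toy `MiniRDD` class below is a hypothetical, plain-Python stand-in (not Spark’s actual API) that illustrates this lazy transformation-then-action style.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy,
    and actions trigger evaluation (illustrative only)."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # deferred transformations, applied in order

    def map(self, fn):
        # Transformation: nothing is computed yet, just recorded
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):
        # Action: run the whole recorded pipeline now
        result = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

squares = MiniRDD(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares.collect())  # [0, 4, 16]
```

In real Spark, the same chained style keeps intermediate results in cluster memory rather than writing them to disk between stages, which is where much of the speedup over MapReduce comes from.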
The ability to rapidly store just about anything, structured or unstructured, in the Hadoop Distributed File System (HDFS), combined with contemporary analytics tools that can query that data in a massively parallel, timely manner, is appealing and powerful, and it is driving the data platform modernization movement. While the Hadoop ecosystem continues to improve rapidly, exponentially growing volumes of data from digital devices are just starting to overwhelm traditional data architectures.
Open Source Analytics
Another market force changing analytics landscapes globally is the shift by most vendors to embrace Open Source projects such as Hadoop, Apache Spark, R, and Python. Competition drives innovation and pushes businesses forward, but harnessing the collective genius of a worldwide community of analytics developers is priceless. Once major vendors adopted Open Source, seeing opportunities to monetize it by providing better tools, security, maintenance, and support, the adoption risks that historically held Open Source back were reduced. That is one of the many reasons Open Source is everywhere these days.
Cloud and Hybrid Analytics
Although most analytics applications today still use older data warehouse and OLAP technologies on-premises, the pace of the cloud shift is significantly increasing. Internet infrastructure is getting better and is almost invisible in mature markets. Cloud fears are subsiding as more organizations witness the triumphs of early adopters. Instant, easy cloud solutions continue to win the hearts and minds of non-technical users. Cloud also accelerates time to market, allowing for innovation at faster speeds than ever before.
IoT-inspired cloud streaming analytics, cloud data warehouse, and cloud data lake technologies are exceptionally simple, fast, and cost-effective to spin up, scale up, or even scale down, versus investing in multi-million-dollar hardware purchases that are almost immediately outdated. Cloud analytics’ plug-and-play, point-and-click, no-code designs, along with pre-packaged templates, empower far more organizations to enjoy sophisticated, highly scalable, advanced analytics solutions that used to be extremely complex to build in-house.
As more groups leverage the cloud in digitalization strategies, the center of data gravity shifts, changing where analytics takes place. Analytics in the digital era often spans data residing on-premises and in the cloud. To ease the complexity of hybrid analytics, novel cross-domain solutions are being introduced.
For example, databases today offer optional storage locations on-premises or in virtual external tables in the cloud. A transformed class of data-as-a-service architectures, data virtualization, and logical data warehouses that enable data analysis without moving data is evolving. Lastly, a new generation of intelligent enterprise data catalogs is expanding to allow analytics professionals to manage metadata, improve data quality, run sophisticated data searches, and get smart, usage-based data recommendations powered by machine learning.
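The federated-query idea behind data virtualization and logical data warehouses can be sketched minimally: a logical layer runs the same query against several physical stores and merges the results, so no data has to be copied into a central warehouse first. This sketch uses two in-memory SQLite databases as hypothetical stand-ins for an on-premises store and a cloud store; real data virtualization products add query optimization, security, and much more.

```python
import sqlite3

def make_source(rows):
    # Stand-in for one physical data store (on-premises or cloud)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def federated_query(sql, sources):
    # The "logical warehouse": run the query at each source and
    # merge results, instead of physically consolidating the data
    merged = []
    for conn in sources:
        merged.extend(conn.execute(sql).fetchall())
    return merged

on_prem = make_source([("east", 100), ("west", 200)])
cloud = make_source([("east", 50)])
rows = federated_query("SELECT region, amount FROM sales", [on_prem, cloud])
print(sorted(rows))  # [('east', 50), ('east', 100), ('west', 200)]
```

The design choice to push queries to where the data lives, rather than moving the data, is exactly what eases the hybrid on-premises-plus-cloud complexity described above.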
Preparing for the Digital Future
The exponential growth of data, digitization, and internet connectivity is the “backbone” of the Fourth Industrial Revolution. It has the potential to propel societies forward, enable innovative business models, and help governments. Digitization doesn’t just enable what we do, it transforms it — not only business models, but also policy and social norms.
Like many changes in our lifetime, digital transformation is a blue ocean of opportunity for reinvention. There are also significant risks to mitigate as disruptive business models change the game. We are just beginning to see a future in which basic data visualization and data analysis, the foundations of analytics, will be partially automated with savvy smart data discovery. Cognitive intelligence and deep learning technologies will take on countless tasks humans have historically performed. To prepare, analytics professionals will need to think digital, think big, and enjoy diving into state-of-the-art analytics technologies to pave the path forward.
Jen Underwood is Founder and Principal of Impact Analytix, LLC, a boutique integrated product research, consulting, technical marketing, and creative digital media agency led by experienced hands-on practitioners. Jen can be reached on Twitter at @idigdata.