Overview of Big Data Analytics

Big Data Analytics offers a nearly endless source of business and informational insight that can lead to operational improvements and new opportunities for companies to capture unrealized revenue across almost every industry. With use cases ranging from customer personalization to risk mitigation, fraud detection, and internal operations analysis, and with new ones arising almost daily, the value hidden in company data has companies looking to build a cutting-edge analytics operation.

Discovering value within raw data poses many challenges for IT teams. Every company has different needs and different data assets. Business initiatives change quickly in an ever-accelerating marketplace, and keeping up with new directives requires agility and scalability. On top of that, a successful Big Data Analytics operation requires enormous computing resources, technological infrastructure, and highly skilled personnel.

Any of these challenges can cause an operation to fail before it delivers value. In the past, a lack of computing power and limited access to automation put a true production-scale analytics operation beyond the reach of most companies: Big Data was too expensive, involved too much hassle, and offered no clear ROI. With the rise of cloud computing and new technologies for managing compute resources, Big Data tools are more accessible than ever before.

Many of the early Big Data projects were open-sourced under the Apache Software Foundation, with the most significant contributions coming from companies such as Google, Yahoo, Facebook, and IBM, as well as from academia. Some of the most widely used engines are:

Apache Hive/Hadoop: (Hadoop developed largely at Yahoo!, building on designs published by Google; Hive developed at Facebook) is the workhorse for complex ETL and data preparation, serving data to many analytics environments or data stores for further analysis.

Apache Spark: (developed at the University of California, Berkeley) tends to be used for compute-heavy jobs, typically batch ETL and ML workloads, but is also used in conjunction with technologies such as Apache Kafka for streaming workloads.

Presto: (developed at Facebook) is a SQL engine that is lightning fast and reliable for reporting and ad-hoc analytics.