Ambari
A web interface for managing Hadoop services and components
Apache Kafka
A distributed streaming platform for building real-time data pipelines and streaming apps.
Apache Spark
Open-source cluster computing framework with highly performant in-memory analytics and a growing number of related projects
Cassandra
A distributed, wide-column NoSQL database system
Cubes
A cube is a set of related measures and dimensions that is used to analyze data.
• A measure is a transactional value or measurement that a user may want to aggregate. Measures are sourced from columns in one or more source tables, and are grouped into measure groups.
• A dimension is a group of attributes that represent an area of interest related to the measures in the cube, and which are used to analyze the measures in the cube. The attributes within each dimension can be organized into hierarchies to provide paths for analysis.
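The relationship between measures and dimensions can be illustrated with a small sketch. The rows, the attribute names (`region`, `year`, `amount`), and the `aggregate` helper below are hypothetical and chosen only for illustration; a real cube engine would add measure groups, hierarchies, and precomputed aggregations.

```python
from collections import defaultdict

# Hypothetical fact rows: "region" and "year" act as dimension attributes,
# "amount" is the measure a user may want to aggregate.
SALES = [
    {"region": "East", "year": 2023, "amount": 100.0},
    {"region": "East", "year": 2024, "amount": 150.0},
    {"region": "West", "year": 2023, "amount": 80.0},
    {"region": "West", "year": 2024, "amount": 120.0},
]


def aggregate(rows, dimensions, measure):
    """Sum `measure` for every combination of the given dimension attributes."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Analyze the same measure along different dimensions.
by_region = aggregate(SALES, ["region"], "amount")
by_region_year = aggregate(SALES, ["region", "year"], "amount")
grand_total = aggregate(SALES, [], "amount")
```

Here `by_region` rolls the measure up to one total per region, while `by_region_year` keeps the finer grain of both attributes.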
Edge computing
Edge computing is a method of optimizing cloud systems by performing data processing at the edge of the network, near the source of the data. Edge computing covers a range of technologies that includes mobile data acquisition and signature analysis, wireless sensor networks, and cooperative distributed peer-to-peer ad hoc networking and processing.
Flume
Software for streaming data into HDFS
Google BigQuery
BigQuery is Google’s fully managed, petabyte-scale enterprise data warehouse for analytics. BigQuery is serverless: there is no infrastructure to manage, and no database administrator is required.
Hadoop
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop Distributed File System (HDFS)
The scalable file system that stores data across multiple machines without prior organization
HBase
A non-relational, distributed database that runs on top of Hadoop
HCatalog
A table and storage management layer for Hadoop
Hive
A data warehouse system with a SQL-like query language
MapReduce
A parallel processing software framework that takes inputs, partitions them into smaller problems and distributes them to worker nodes
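The partition, distribute, and combine pattern can be sketched as a word count. This is a single-process illustration only: the function names are hypothetical, and a real MapReduce framework would run the map and reduce steps on separate worker nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for each word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the smaller problem handed to one worker.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big", "data pipelines"])
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across machines without coordination inside a phase.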
ODBC
ODBC stands for Open Database Connectivity, a standard interface for connecting applications to data sources.
Oozie
A Hadoop job scheduler
Pig
A platform for manipulating data stored in HDFS
Python
Python is a high-level programming language for general-purpose programming. Python emphasizes code readability and a syntax which allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java.
R
R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.
Solr
A scalable search tool
Sqoop
Moves data between Hadoop and relational databases
Welch’s Test
Welch’s Test for Unequal Variances (also called Welch’s t-test, Welch’s adjusted T, or the unequal variances t-test) is used to see if two sample means are significantly different without assuming that the two populations have equal variances. The null hypothesis for the test is that the means are equal. The alternative hypothesis is that the means are not equal.
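As a minimal sketch of the calculation behind the test, the helper below computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the sample values are hypothetical; turning the statistic into a p-value additionally requires the t distribution, which is not shown here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, using the sample variance (n - 1).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

In practice, a statistics library is usually preferable; for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test and also returns the p-value.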
YARN
(Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop
ZooKeeper
An application that coordinates distributed processing