A web interface for managing Hadoop services and components
a distributed streaming platform for building real-time data pipelines and streaming apps.
Open-source cluster computing framework with highly performant in-memory analytics and a growing number of related projects
A distributed database system
A cube is a set of related measures and dimensions that is used to analyze data.
- A measure is a transactional value or measurement that a user may want to aggregate. Measures are sourced from columns in one or more source tables, and are grouped into measure groups.
- A dimension is a group of attributes that represent an area of interest related to the measures in the cube, and which are used to analyze the measures in the cube. The attributes within each dimension can be organized into hierarchies to provide paths for analysis.
Software for streaming data into HDFS
BigQuery is Google’s fully managed, petabyte scale, enterprise data warehouse for analytics. BigQuery is serverless; there is no infrastructure to manage or a database administrator.
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop Distributed File System (HDFS)
the scalable system that stores data across multiple machines without prior organization.
A non-relational, distributed database that runs on top of Hadoop
A table and storage management layer
A data warehousing and SQL-like query language
a parallel processing software framework that takes inputs, partitions them into smaller problems and distributes them to worker nodes
ODBC stands for Open Data Base Connectivity, a connection method to data sources.
A Hadoop job scheduler
A platform for manipulating data stored in HDFS
Python is a high-level programming language for general-purpose programming. Python emphasizes code readability and a syntax which allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java.
R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
A scalable search tool
Moves data between Hadoop and relational databases
Welch’s Test for Unequal Variances (also called Welch’s t-test, Welch’s adjusted T or unequal variances t-test) is used to see if two sample means are significantly different. The null hypothesis for the test is that the means are equal. The alternate hypothesis for the test is that means are not equal.
(Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop
An application that coordinates distributed processing