Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

Last Update: Dec. 1, 2020, 9:16 p.m.

Thank you youngwookim & contributors
View Topic on GitHub: youngwookim/awesome-hadoop

Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop

1.76K
913
8d
Apache-2.0

Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

243
62
8y 0d
GPL-3.0

Run MapReduce jobs on Hadoop or Amazon Web Services

2.52K
590
16d
n/a

Visualize your HDFS cluster usage

223
85
1y 5m
Apache-2.0

Hadoop log aggregator and dashboard

192
68
7y 36d
n/a

Distributed Big Data Orchestration Service

1.41K
323
16d
Apache-2.0

A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop.

196
15
6y 15d
n/a

Framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

An Object Store for Apache Hadoop

Application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

Big Data Spatial Analytics for the Hadoop Framework

Pydoop is a package that provides a Python API for Hadoop.

Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

Distributed in-memory platform

YARN

Running MPICH2 on Yarn

113
60
6y 15d
n/a

Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

NoSQL

A developer-friendly Python library to interact with Apache HBase

550
152
7m
n/a

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

169
63
3y 10m
n/a

Haeinsa is a linearly scalable, multi-row, multi-table transaction library for HBase.

155
44
3y 9m
Apache-2.0

Secondary Index for HBase

579
289
6y 7m
Apache-2.0

A SQL skin over HBase supporting secondary indices

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

The Scalable Time Series Database

Column-oriented distributed datastore, inspired by BigTable.

SQL on Hadoop

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

A SQL skin over HBase supporting secondary indices

Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.

SQL-like query language for Cascading.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Distributed SQL query engine.

Distributed data warehouse system on Hadoop.

Schema-free SQL Query Engine

Data Management

Confluent Schema Registry for Kafka

1.37K
796
13d
n/a

Schema Registry

169
136
20d
Apache-2.0

Framework that allows efficient translation of queries involving heterogeneous and federated data.

Metadata tagging & lineage capture supporting complex business data taxonomies

Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Workflow, Lifecycle and Governance

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

19.39K
7.54K
3d
Apache-2.0

Data management framework.

Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

Netflix's distributed Data Pipeline

747
173
4y 11m
Apache-2.0

Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.

1.8K
666
2d
Apache-2.0

Distributed publish-subscribe messaging system.

DSL

Machine learning and natural language processing with Apache Pig

50
14
6y 11m
Apache-2.0

Packetpig - Open Source Big Data Security Analytics

309
91
4y 8m
n/a

A bunch of utility classes for Java, Hadoop, HBase, Pig, etc.

77
31
6y 8m
Apache-2.0

Pig Visualization framework

448
140
2y 10m
Apache-2.0

Map-Reduce for Clojure

524
57
3y 6m
Apache-2.0

A collection of libraries for working with large-scale data in Hadoop

Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop

Libraries and Tools

302
46
2y 14d
Apache-2.0

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

1.1K
384
1y 5m
Apache-2.0

A native go client for HDFS

937
251
83d
MIT

Web tool for Avro Schema Registry

342
90
1y 11d
n/a

A set of libraries, tools, examples, and documentation

Web application for interacting with Hadoop.

A web-based notebook that enables interactive data analytics

Data serialization system.

A graphical editor for editing Apache Oozie workflows inside Eclipse.

Columnar storage format that uses the record shredding and assembly algorithm described in the Dremel paper.

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Realtime Data Processing

Stream processing framework, based on Kafka and YARN.

Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

A high-performance, column-oriented, distributed data store.

Distributed Computing and Programming

Framework for in-memory cluster computing.

A community index of packages for Apache Spark

A community site for Apache Spark

Framework for data management/analytics on Hadoop.

A platform for efficient, distributed, general-purpose data processing.

Enterprise-grade unified stream and batch processing engine.

Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Packaging, Provisioning and Monitoring

A big data cluster management tool that creates and manages clusters of different technologies.

20
16
5y 7m
LGPL-3.0
187
64
1y 43d
n/a

Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

Operational framework for Hadoop management.

Centralized service for process management.

Java libraries for Apache ZooKeeper.

Search

Banana for Solr - A Port of Kibana

646
236
5m
n/a

Search and analytics engine based on Apache Lucene.

Search platform for Apache Lucene.

Search Engine Framework

Open source web crawler.

Security

Enhanced data protection for the Apache Hadoop ecosystem

90
41
5y 5m
n/a

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

An authorization module for Hadoop

A REST API Gateway for interacting with Hadoop clusters.

Benchmark

HiBench is a big data benchmark suite.

1.08K
640
2d
n/a

Yahoo! Cloud Serving Benchmark

3.36K
1.76K
2d
Apache-2.0

Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.

Machine learning and Big Data analytics

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.73K
412
17d
Apache-2.0

RHadoop

757
288
5y 10d
n/a

Machine Learning framework for Spark

The R Project for Statistical Computing.

SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Misc.

WebUI for query engines: Hive and Presto

185
48
3y 11m
n/a

Python interface to Hive and Presto. 🐝

1.33K
429
21d
n/a

An Open Source unit test framework for Hive queries based on JUnit 4 and 5

215
75
30d
Apache-2.0

A super simple utility for testing Apache Hive scripts locally for non-Java developers.

70
25
5y 35d
n/a

Unit test framework for hive and hive-service

69
50
3y 10m
n/a

Flume NG MongoDB source.

69
64
1y 7m
Apache-2.0

Flume plugin for RabbitMQ

59
47
3y 10m
Apache-2.0

Apache Flume source plugin allowing direct consumption of UDP messages

8
7
6y 8m
Apache-2.0

Websites

Presentations

Books

Hadoop and Big Data Events