Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

Thank you youngwookim & contributors
View Topic on GitHub: youngwookim/awesome-hadoop

Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop

1.8K
927
82d
Apache-2.0

Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

243
62
8y 5m
GPL-3.0

Run MapReduce jobs on Hadoop or Amazon Web Services

2.53K
592
5m
n/a

Visualize your HDFS cluster usage

225
86
1y 11m
Apache-2.0

Hadoop log aggregator and dashboard

191
67
7y 6m
n/a

Distributed Big Data Orchestration Service

1.45K
326
86d
Apache-2.0

A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop.

197
15
6y 5m
n/a

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

An Object Store for Apache Hadoop

application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

Big Data Spatial Analytics for the Hadoop Framework

Pydoop is a package that provides a Python API for Hadoop.

Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

Distributed in-memory platform

YARN

Running MPICH2 on Yarn

112
60
6y 5m
n/a

Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

NoSQL

A developer-friendly Python library to interact with Apache HBase

557
152
93d
n/a

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

169
63
4y 119d
n/a

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase

156
44
4y 74d
Apache-2.0

Secondary Index for HBase

583
291
7y 16d
Apache-2.0

A SQL skin over HBase supporting secondary indices

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

The Scalable Time Series Database

column-oriented distributed datastore, inspired by BigTable.

SQL on Hadoop

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

A SQL skin over HBase supporting secondary indices

Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop

SQL-like query language for Cascading.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

distributed SQL query engine.

distributed data warehouse system on Hadoop.

Schema-free SQL Query Engine

Data Management

Confluent Schema Registry for Kafka

1.43K
827
82d
n/a

Schema Registry

180
142
6m
Apache-2.0

framework that allows efficient translation of queries involving heterogeneous and federated data.

Metadata tagging & lineage capture supporting complex business data taxonomies

Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Workflow, Lifecycle and Governance

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

20.49K
8K
83d
Apache-2.0

data management framework.

Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

Netflix's distributed Data Pipeline

752
173
5y 5m
Apache-2.0

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.87K
677
83d
Apache-2.0

distributed publish-subscribe messaging system.

DSL

Machine learning and natural language processing with Apache Pig

50
14
7y 4m
Apache-2.0

Packetpig - Open Source Big Data Security Analytics

308
90
5y 65d
n/a

A bunch of utility classes for Java, Hadoop, HBase, Pig, etc.

76
31
7y 44d
Apache-2.0

Pig Visualization framework

451
140
3y 110d
Apache-2.0

Map-Reduce for Clojure

527
58
3y 11m
Apache-2.0

A collection of libraries for working with large-scale data in Hadoop

Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop

Libraries and Tools

302
46
2y 5m
Apache-2.0

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

1.11K
384
1y 11m
Apache-2.0

A native go client for HDFS

977
259
5m
MIT

Web tool for Avro Schema Registry

355
94
1y 5m
n/a

A set of libraries, tools, examples, and documentation

web application for interacting with Hadoop.

A web-based notebook that enables interactive data analytics

data serialization system.

A graphical editor for editing Apache Oozie workflows inside Eclipse.

Columnar storage format that uses the record shredding and assembly algorithm described in the Dremel paper.

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Realtime Data Processing

stream processing framework, based on Kafka and YARN.

Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

A high-performance, column-oriented, distributed data store.

Distributed Computing and Programming

framework for in-memory cluster computing.

A community index of packages for Apache Spark

A community site for Apache Spark

framework for data management/analytics on Hadoop.

A platform for efficient, distributed, general-purpose data processing.

Enterprise-grade unified stream and batch processing engine.

Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Packaging, Provisioning and Monitoring

A big data cluster management tool that creates and manages clusters of different technologies.

20
16
6y 24d
LGPL-3.0
187
64
1y 6m
n/a

Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

operational framework for Hadoop management.

centralized service for process management.

Java libraries for Apache ZooKeeper.

Search

Banana for Solr - A Port of Kibana

654
238
10m
n/a

Search and analytics engine based on Apache Lucene.

Search platform for Apache Lucene.

Search Engine Framework

open source web crawler.

Security

Enhanced data protection for the Apache Hadoop ecosystem

90
41
5y 10m
n/a

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

An authorization module for Hadoop

A REST API Gateway for interacting with Hadoop clusters.

Benchmark

HiBench is a big data benchmark suite.

1.12K
657
5m
n/a

Yahoo! Cloud Serving Benchmark

3.5K
1.79K
85d
Apache-2.0

Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.

Machine learning and Big Data analytics

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.74K
410
83d
Apache-2.0

RHadoop

757
288
5y 5m
n/a

Machine Learning framework for Spark

The R Project for Statistical Computing.

SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Misc.

WebUI for query engines: Hive and Presto

187
48
4y 4m
n/a

Python interface to Hive and Presto. 🐝

1.37K
444
6m
n/a

An Open Source unit test framework for Hive queries based on JUnit 4 and 5

221
77
4m
Apache-2.0

A super simple utility for testing Apache Hive scripts locally for non-Java developers.

70
25
5y 6m
n/a

Unit test framework for hive and hive-service

69
50
4y 105d
n/a

Flume NG MongoDB source.

69
64
2y 31d
Apache-2.0

Flume plugin for RabbitMQ

59
47
4y 114d
Apache-2.0

Apache Flume source plugin allowing direct consumption of UDP messages

8
8
7y 50d
Apache-2.0

Websites

Presentations

Books

Hadoop and Big Data Events