Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

Thank you youngwookim & contributors
View Topic on GitHub:
youngwookim/awesome-hadoop

Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop

1.8K
927
8m
Apache-2.0

Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

243
62
8y 10m
GPL-3.0

Run MapReduce jobs on Hadoop or Amazon Web Services

2.53K
592
11m
n/a

Visualize your HDFS cluster usage

225
86
2y 4m
Apache-2.0

Hadoop log aggregator and dashboard

191
67
8y 0d
n/a

Distributed Big Data Orchestration Service

1.45K
326
8m
Apache-2.0

A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop.

197
15
6y 11m
n/a

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

An Object Store for Apache Hadoop

application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

Big Data Spatial Analytics for the Hadoop Framework

Pydoop is a package that provides a Python API for Hadoop.

Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

Distributed in-memory platform

YARN

Running MPICH2 on Yarn

112
60
6y 11m
n/a

Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

NoSQL

A developer-friendly Python library to interact with Apache HBase

557
152
8m
n/a

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

169
63
4y 9m
n/a

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase

156
44
4y 8m
Apache-2.0

Secondary Index for HBase

583
291
7y 6m
Apache-2.0

A SQL skin over HBase supporting secondary indices

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

The Scalable Time Series Database

column-oriented distributed datastore, inspired by BigTable.

SQL on Hadoop

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

A SQL skin over HBase supporting secondary indices

Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop

SQL-like query language for Cascading.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

distributed SQL query engine.

distributed data warehouse system on Hadoop.

Schema-free SQL Query Engine

Data Management

Confluent Schema Registry for Kafka

1.43K
827
8m
n/a

Schema Registry

180
142
11m
Apache-2.0

framework that allows efficient translation of queries involving heterogeneous and federated data.

Metadata tagging & lineage capture supporting complex business data taxonomies

Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Workflow, Lifecycle and Governance

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

20.49K
8K
8m
Apache-2.0

data management framework.

Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

Netflix's distributed Data Pipeline

752
173
5y 10m
Apache-2.0

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.87K
677
8m
Apache-2.0

distributed publish-subscribe messaging system.

DSL

Machine learning and natural language processing with Apache Pig

50
14
7y 10m
Apache-2.0

Packetpig - Open Source Big Data Security Analytics

308
90
5y 7m
n/a

A bunch of utility classes for Java, Hadoop, HBase, Pig, etc.

76
31
7y 7m
Apache-2.0

Pig Visualization framework

451
140
3y 9m
Apache-2.0

Map-Reduce for Clojure

527
58
4y 5m
Apache-2.0

A collection of libraries for working with large-scale data in Hadoop

Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop

Libraries and Tools

302
46
2y 11m
Apache-2.0

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

1.11K
384
2y 4m
Apache-2.0

A native go client for HDFS

977
259
11m
MIT

Web tool for Avro Schema Registry

355
94
1y 11m
n/a

A set of libraries, tools, examples, and documentation

web application for interacting with Hadoop.

A web-based notebook that enables interactive data analytics

data serialization system.

A graphical editor for editing Apache Oozie workflows inside Eclipse.

Columnar storage format that uses the record shredding and assembly algorithm described in the Dremel paper.

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Realtime Data Processing

stream processing framework, based on Kafka and YARN.

Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

A high-performance, column-oriented, distributed data store.

Distributed Computing and Programming

framework for in-memory cluster computing.

A community index of packages for Apache Spark

A community site for Apache Spark

framework for data management/analytics on Hadoop.

A platform for efficient, distributed, general-purpose data processing.

Enterprise-grade unified stream and batch processing engine.

Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Packaging, Provisioning and Monitoring

A big data cluster management tool that creates and manages clusters of different technologies.

20
16
6y 6m
LGPL-3.0

187
64
2y 7d
n/a

Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

operational framework for Hadoop management.

centralized service for process management.

Java libraries for Apache ZooKeeper.

Search

Banana for Solr - A Port of Kibana

654
238
1y 4m
n/a

Search and analytics engine based on Apache Lucene.

Search platform for Apache Lucene.

Search Engine Framework

open source web crawler.

Security

Enhanced data protection for the Apache Hadoop ecosystem

90
41
6y 4m
n/a

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

An authorization module for Hadoop

A REST API Gateway for interacting with Hadoop clusters.

Benchmark

HiBench is a big data benchmark suite.

1.12K
657
10m
n/a

Yahoo! Cloud Serving Benchmark

3.5K
1.79K
8m
Apache-2.0

Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.

Machine learning and Big Data analytics

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.74K
410
8m
Apache-2.0

RHadoop

757
288
5y 11m
n/a

Machine Learning framework for Spark

The R Project for Statistical Computing.

SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Misc.

WebUI for query engines: Hive and Presto

187
48
4y 10m
n/a

Python interface to Hive and Presto. 🐍

1.37K
444
11m
n/a

An Open Source unit test framework for Hive queries based on JUnit 4 and 5

221
77
9m
Apache-2.0

A super simple utility for testing Apache Hive scripts locally for non-Java developers.

70
25
5y 12m
n/a

Unit test framework for hive and hive-service

69
50
4y 9m
n/a

Flume NG MongoDB source.

69
64
2y 6m
Apache-2.0

Flume plugin for RabbitMQ

59
47
4y 9m
Apache-2.0

Apache Flume source plugin allowing direct consumption of UDP messages

8
8
7y 7m
Apache-2.0

Websites

Presentations

Books

Hadoop and Big Data Events