Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

Last Update: Dec. 4, 2021, 11:20 a.m.

Thank you youngwookim & contributors
GitHub repository: youngwookim/awesome-hadoop

Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop

1.84K
945
32d
Apache-2.0

Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

244
62
5y 11m
GPL-3.0

Run MapReduce jobs on Hadoop or Amazon Web Services

2.57K
600
1y 18d
n/a

Visualize your HDFS cluster usage

226
86
1y 52d
Apache-2.0

Hadoop log aggregator and dashboard

192
66
8y 38d
n/a

Distributed Big Data Orchestration Service

1.53K
346
57d
Apache-2.0

A fast-to-develop, fast-to-run, Go-based toolkit for ETL and feature extraction on Hadoop.

200
16
7y 17d
n/a

YARN

NoSQL

A developer-friendly Python library to interact with Apache HBase

577
160
9m
n/a

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

168
62
3y 11m
n/a

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase

158
46
4y 9m
Apache-2.0

Secondary Index for HBase

587
291
4y 6m
Apache-2.0

SQL on Hadoop

Data Management

Workflow, Lifecycle and Governance

Data Ingestion and Integration

Netflix's distributed Data Pipeline

765
177
5y 11m
Apache-2.0

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.99K
709
31d
Apache-2.0

DSL

Machine learning and natural language processing with Apache Pig

51
15
7y 11m
Apache-2.0

Packetpig - Open Source Big Data Security Analytics

305
89
3y 6m
n/a

A bunch of utility classes for Java, Hadoop, HBase, Pig, etc.

76
31
7y 8m
Apache-2.0

Pig Visualization framework

456
142
3y 10m
Apache-2.0

Map-Reduce for Clojure

531
60
4y 6m
Apache-2.0

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

Packaging, Provisioning and Monitoring

Search

Search Engine Framework

Security

Benchmark

HiBench is a big data benchmark suite.

1.21K
693
31d
n/a

Yahoo! Cloud Serving Benchmark

3.86K
1.91K
30d
Apache-2.0

Machine learning and Big Data analytics

Misc.

WebUI for query engines: Hive and Presto

192
49
4y 11m
n/a

Python interface to Hive and Presto.

1.48K
494
60d
n/a

An Open Source unit test framework for Hive queries based on JUnit 4 and 5

235
78
7m
Apache-2.0

A super simple utility for testing Apache Hive scripts locally for non-Java developers.

70
23
4y 9m
n/a

Unit test framework for Hive and hive-service

68
50
1y 53d
n/a

Flume NG MongoDB source.

68
63
2y 7m
Apache-2.0

Flume plugin for RabbitMQ

59
46
4y 10m
Apache-2.0

Apache Flume source plugin allowing direct consumption of UDP messages

8
9
7y 7m
Apache-2.0

Websites

Presentations

Books

Hadoop and Big Data Events