A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
Thank you youngwookim & contributors
Elasticsearch real-time search and analytics natively integrated with Hadoop
Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.
Run MapReduce jobs on Hadoop or Amazon Web Services
Visualize your HDFS cluster usage
Hadoop log aggregator and dashboard
Distributed Big Data Orchestration Service
A fast-to-develop, fast-to-run, Go-based toolkit for ETL and feature extraction on Hadoop.
Framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
Application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data.
Big Data Spatial Analytics for the Hadoop Framework
Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Running MPICH2 on Yarn
Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
A developer-friendly Python library to interact with Apache HBase
Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
Secondary Index for HBase
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
SQL on Hadoop
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
SQL-like query language for Cascading.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
Confluent Schema Registry for Kafka
Framework that allows efficient translation of queries involving heterogeneous and federated data.
Metadata tagging & lineage capture supporting complex business data taxonomies
Workflow, Lifecycle and Governance
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Data Ingestion and Integration
Netflix's distributed Data Pipeline
Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.
Machine learning and natural language processing with Apache Pig
Packetpig - Open Source Big Data Security Analytics
A bunch of utility classes for Java, Hadoop, HBase, Pig, etc.
Pig Visualization framework
Map-Reduce for Clojure
A collection of libraries for working with large-scale data in Hadoop
Libraries and Tools
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
A native go client for HDFS
Web tool for Avro Schema Registry
A set of libraries, tools, examples, and documentation
A web-based notebook that enables interactive data analytics
A graphical editor for editing Apache Oozie workflows inside Eclipse.
Columnar storage format that uses the record shredding and assembly algorithm described in the Dremel paper.
Realtime Data Processing
Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
Apache Pulsar (incubating) is a highly scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
Distributed Computing and Programming
A platform for efficient, distributed, general-purpose data processing.
Enterprise-grade unified stream and batch processing engine.
Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications that require fine-grained interaction with many Spark contexts can be built on top of Apache Spark.
Packaging, Provisioning and Monitoring
A big data cluster management tool that creates and manages clusters of different technologies.
Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Banana for Solr - A Port of Kibana
Search Engine Framework
Enhanced data protection for the Apache Hadoop ecosystem
Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
HiBench is a big data benchmark suite.
Yahoo! Cloud Serving Benchmark
Machine learning and Big Data analytics
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
WebUI for query engines: Hive and Presto
Python interface to Hive and Presto. 🐝
An Open Source unit test framework for Hive queries based on JUnit 4 and 5
A super simple utility for testing Apache Hive scripts locally for non-Java developers.
Unit test framework for Hive and hive-service
Flume NG MongoDB source.
Flume plugin for RabbitMQ
Apache Flume source plugin allowing direct consumption of UDP messages