Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness.

Last Update: Nov. 26, 2020, 3:15 p.m.

Thank you onurakpolat & contributors
View Topic on GitHub:
onurakpolat/awesome-bigdata

RDBMS

The world's most popular open source database.

Powerful object-relational database system. PostgreSQL licence

object-relational database management system.

high-performance MPP data warehouse platform.

Frameworks

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.

926
123
5d
MIT

High Throughput Real-time Stream Processing Framework

275
34
3y 7m
Apache-2.0

Machine Learning Platform for Kubernetes

2.64K
254
6d
Apache-2.0

platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.

Distributed Programming

437
91
4m
n/a

Hadoop MapReduce in idiomatic Clojure.

259
19
5y 4m
Apache-2.0

Tuple MapReduce for Hadoop: Hadoop API made easy

58
12
4y 5m
Apache-2.0

Map-Reduce for Clojure

523
57
3y 6m
Apache-2.0

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

13.87K
2.2K
2d
Apache-2.0

High performance distributed data processing engine

385
51
1y 11m
Apache-2.0

Develop streaming applications for IBM Streams in Python, Java & Scala.

28
44
6d
Apache-2.0

Big Data Science Swiss Army Knife - http://www.tuktu.io

56
17
2y 9m
n/a

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter

3.49K
616
3d
Apache-2.0

A Scala API for Cascading

3.27K
685
58d
Apache-2.0

Streaming MapReduce with Scalding and Storm

2.09K
261
1y 9m
Apache-2.0

run Spark on Hadoop MapReduce v1.

a unified, enterprise platform for big data stream and batch processing.

a unified model and set of language-specific SDKs for defining and executing data processing workflows.

a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.

collection of user-defined functions for Hadoop and Pig developed by LinkedIn.

A platform for efficient, distributed, general-purpose data processing.

real-time big data streaming engine based on Akka.

framework for in-memory data model and persistence.

Apache Hama is an Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce.

programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

high level language to express data analysis programs for Hadoop.

retainable evaluator execution framework to simplify and unify the lower layers of big data systems.

framework for stream processing, implementation of S4.

framework for in-memory cluster computing.

framework for stream processing, part of Spark.

framework for stream processing by Twitter also on YARN.

stream processing framework, based on Kafka and YARN.

application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

abstraction over YARN that reduces the complexity of developing distributed applications.

an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.

High Performance, Custom Data Warehouse on Top of MapReduce.

framework for data management/analytics on Hadoop.

real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.

Hadoop enhancement which removes single point of failure.

create data pipelines to help them ingest, transform and analyze data.

fault tolerant stream processing framework.

platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)

declarative programming language for working with structured, semi-structured and unstructured data.

is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

framework for real-time analysis of large datasets.

MapReduce framework developed by Nokia.

Distributed computation for the cloud.

Python MapReduce and HDFS API for Hadoop.

multi-tenant distributed metric processing system

general purpose cluster computing framework.

useful for counting activities of event streams over different time windows and finding the most active one.

The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.

Distributed Filesystem

Distributed object store

1.41K
252
5d
Apache-2.0

SeaweedFS is a distributed object store and file system to store and serve billions of files fast! Object store has O(1) disk seek, transparent cloud integration. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV.

11.02K
1.41K
3d
Apache-2.0

The Baidu File System.

2.66K
536
1y 12m
BSD-3-Clause

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Hadoop's storage layer to enable fast analytics on fast data.

formerly FhGFS, parallel distributed file system.

software storage platform designed.

GGFS, Hadoop compliant in-memory file system.

high-performance distributed filesystem.

scale-out network-attached storage file system.

reliable file sharing at memory speed across cluster frameworks.

decentralized cloud storage system.

Distributed Index

Pilosa is an open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

2.1K
191
34d
Apache-2.0

Document Data Model

commercial object-oriented database management systems.

is an open source massively scalable data store. It requires zero administration.

Facebook’s Paxos-like NoSQL database.

document oriented datastore over Hadoop.

horizontally scalable document-oriented NoSQL data store.

Schema-agnostic Enterprise NoSQL database technology.

NoSQL cloud database service with protocol support for MongoDB

Document-oriented database system.

A transactional, open-source Document Database.

document database that supports queries like table joins and group by.

Key Map Data Model

An Internet-Scale Database.

1.79K
437
1y 10m
n/a

InfiniDB Data Warehouse

235
92
3y 44d
GPL-2.0

Apache Tephra: Transactions for HBase.

156
44
4y 16d
Apache-2.0

distributed key/value store, built on Hadoop.

column-oriented distributed datastore, inspired by BigTable.

column-oriented distributed datastore, inspired by BigTable.

column-oriented distributed datastore.

is a fully managed, schemaless database for storing non-relational data over BigTable.

column-oriented distributed datastore, inspired by BigTable.

real-time, multi-tenant distributed database for Twitter scale.

column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.

Key-value Data Model

An embedded key/value database for Go.

11.18K
1.38K
2y 9m
MIT

Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more

107
37
7d
MIT

BuntDB is an embeddable, in-memory key/value database for Go with custom indexing and geospatial support

2.99K
215
23d
MIT

An Erlang implementation of Redis

453
37
5y 75d
Apache-2.0

Distributed database specialized in exporting key/value data from Hadoop

540
52
6y 5m
BSD-3-Clause

GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

658
32
75d
BSD-3-Clause

Graviton Database: ZFS for key-value stores.

383
14
81d
GPL-3.0

GridDB is a next-generation open source database that makes time series IoT and big data fast, and easy.

1.34K
3.13K
8d
n/a

HyperDex is a scalable, searchable key-value store

1.36K
163
3y 12m
BSD-3-Clause

SNA team homepage

24
4
8y 84d
n/a

Riak is a decentralized datastore from Basho Technologies.

3.47K
555
98d
Apache-2.0

Storehaus is a library that makes it easy to work with asynchronous key value stores

461
80
1y 5m
Apache-2.0

In-memory NoSQL database with ACID transactions, Raft consensus, and Redis API

1.29K
75
1y 12d
n/a

Get your data in RAM. Get compute close to data. Enjoy the performance.

2.48K
253
12d
n/a

Distributed transactional key-value database, originally created to complement TiDB

8.37K
1.29K
2d
Apache-2.0

Real-time Geospatial and Geofencing

7.13K
427
17d
MIT

The DB that's replicated, sharded and transactional.

183
23
5y 28d
Apache-2.0

NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."

Provides a scalable, low-latency NoSQL online Database Service backed by SSDs.

a fast, simple, efficient, and persistent key-value store written natively in Go.

distributed time series database.

is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.

distributed key/value storage system.

distributed key-value database by Oracle Corporation.

Advanced key-value store. 3-clause BSD

Graph Data Model

Native GraphQL Database with graph backend

14.31K
1.06K
2d
n/a

EliasDB is a graph-based database.

594
32
7m
MPL-2.0

A large-scale entity and relation database supporting aggregation of properties

1.6K
328
10d
Apache-2.0

An open-source graph database

13.67K
1.22K
4m
Apache-2.0

A Graph Traversal Language (no longer active - see Apache TinkerPop)

1.86K
233
3y 83d
n/a

RDF-Centric Map/Reduce Framework and Freebase data conversion tool

148
19
5y 11m
n/a

Microsoft Graph Engine

1.92K
280
5m
n/a

Phoebus is a distributed framework for large scale graph processing written in Erlang.

383
39
8y 10m
Apache-2.0

A distributed, fault-tolerant graph database

3.24K
262
3y 8m
n/a

a new generation multi-model graph database for the modern complex data environment.

implementation of Pregel, based on Hadoop.

implementation of Pregel, part of Spark.

multi model distributed database.

TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.

A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.

resilient Distributed Graph System on Spark.

tools to construct large-scale graphs on top of Hadoop.

open-source, distributed graph database

Massively Parallel Graph processing on GPUs.

graph database written entirely in Java.

document and graph database.

distributed graph database, built over Cassandra.

A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.

Columnar Databases

An open-source columnar data format designed for fast & realtime analytics with big data.

431
123
1y 4m
Apache-2.0

Massively parallel, high performance analytics database that will rapidly devour all of your data.

1.2K
57
4m
n/a

an explanation of what columnar storage is and when you might want it.

column-oriented analytic database.

an open-source column-oriented database management system that allows generating analytical data reports in real time.

a distributed, column-oriented database built for large-scale event collection and analytics.

column store database.

columnar storage format for Hadoop.

purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.

is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.

Google's cloud offering backed by their pioneering work on Dremel.

Provides petabyte-scale data warehousing with columnar storage and multi-node compute.

NewSQL Databases

ActorDB distributed SQL database

1.84K
75
1y 57d
MPL-2.0

A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. New implementation in http://github.com/probcomp/bayeslite

882
56
5y 65d
Apache-2.0

CockroachDB - the open source, cloud-native distributed SQL database.

19.42K
2.41K
2d
n/a

Bloomberg's distributed RDBMS

952
157
3d
n/a

Haeinsa is linearly scalable multi-row, multi-table transaction library for HBase

155
44
3y 9m
Apache-2.0

A Relational Database Backed by Apache Kafka

351
21
2d
Apache-2.0

TiDB is an open source distributed HTAP database compatible with the MySQL protocol

25.92K
4.04K
2d
Apache-2.0

The high-performance distributed SQL database for global, internet-scale apps.

4.59K
473
3d
n/a

commercially supported, open-source SQL relational database management system.

data warehouse service, based on PostgreSQL.

a simple, modular, networked and distributed transaction layer built atop SQLite.

scales out PostgreSQL through sharding and replication.

distributed database, inspired by F1.

distributed SQL database built on Spanner.

globally distributed semi-relational database.

is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.

infinitely scalable RDBMS.

GPU in-memory database, big data analysis and visualization platform.

in-memory SQL database with optimized columnar storage on flash.

SQL/ACID compliant distributed database.

in-memory, relational database management system with persistence and recoverability.

Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.

is an in-memory, column-oriented, relational database management system.

distributed, realtime, semi-structured database.

Sky

database used for flexible, high performance analysis of behavioral data.

open source software for both file and database synchronization.

claims to be fastest in-memory database.

Time-Series Databases

Fast scalable time series database

1.57K
335
6d
Apache-2.0

An open-source big data platform designed and optimized for the Internet of Things (IoT).

13.09K
3.38K
84d
AGPL-3.0

Beringei is a high performance, in-memory storage engine for time series data.

3.06K
296
2y 4m
n/a

Apache Druid: a high performance real-time analytics database.

10.03K
2.68K
84d
Apache-2.0

Time-series database

742
74
6m
Apache-2.0

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

123
11
5m
MIT

See gitlab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

706
44
2y 8m
MIT

A distributed system designed to ingest and process time series data

593
100
2y 5m
Apache-2.0

Accumulo backed time series database

349
105
20d
Apache-2.0

SiriDB is a highly-scalable, robust and super fast time series database. Built from the ground up, SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB's unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series.

426
41
7d
MIT

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.

6.69K
989
3d
Apache-2.0

VictoriaMetrics: fast, cost-effective monitoring solution and time series database

3.26K
250
2d
Apache-2.0

Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.

a time series storage built to store time series highly compressed and for fast access times.

is a scalable time series database based on Cassandra and Elasticsearch.

scalable, general-purpose time series database.

a distributed time series database that can be used for storing realtime metrics at long retention.

a time series database based on Apache Cassandra.

distributed time series database on top of HBase.

Open-source service monitoring system and time series database

an efficient tool for storing and querying series of events.

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

SQL-like processing

The Streaming SQL Database powered by Timely Dataflow

1.62K
93
56d
n/a

Apache Spark - A unified analytics engine for large-scale data processing

28.11K
22.91K
3d
Apache-2.0

high performance interactive SQL access to all Hadoop data.

framework for interactive analysis, inspired by Dremel.

table and storage management layer for Hadoop.

SQL-like data warehouse system for Hadoop.

framework that allows efficient translation of queries involving heterogeneous and federated data.

SQL-like analytic processing for MapReduce.

framework for interactive analysis, inspired by Dremel.

SQL-like query language for Cascading.

full SQL query engine for big datasets.

an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.

distributed SQL query engine.

framework for interactive analysis, implementation of Dremel.

an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.

SQL-like data warehouse system for Hadoop.

database for storing petabyte-scale volumes of structured and semi-structured data.

Manipulating Structured Data Using Spark.

a full-featured SQL-on-Hadoop RDBMS with ACID transactions.

distributed data warehouse system on Hadoop.

enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

Apache Pulsar - distributed pub-sub messaging system

6.84K
1.67K
3d
Apache-2.0

Scribe is a server for aggregating log data streamed in real time from a large number of servers.

3.9K
801
6y 6m
Apache-2.0

Build platforms that flexibly mix SQL, batch, and stream processing paradigms

202
30
70d
MIT

DEPRECATED: Data collection and processing made easy.

3.42K
560
4y 114d
n/a

Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop.

87
32
7y 7m
Apache-2.0

simple, distributed message queue system (active — currently managed by Papertrail)

6
2
2y 9m
n/a

DocId set compression and set operation library

22
8
6y 8m
Apache-2.0

Hadoop log aggregator and dashboard

192
68
7y 30d
n/a

Netflix's distributed Data Pipeline

747
173
4y 11m
Apache-2.0

Secor is a service implementing Kafka log persistence

1.63K
511
3d
Apache-2.0

Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.

1.8K
667
6d
Apache-2.0

A probabilistic data structure service and storage

781
63
4y 6m
MIT

StreamSets Data Collector - Continuous big data and cloud platform ingest infrastructure

1.07K
578
5d
Apache-2.0

Privacy and Security focused Segment-alternative, in Golang and React

2.14K
99
3d
AGPL-3.0

Provides real-time data processing over large, distributed data streams.

Prepare and load data to data stores.

data collection system.

service to manage large amount of log data.

distributed publish-subscribe messaging system.

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

tool to transfer data between Hadoop and a structured datastore.

open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

tool to collect events and logs.

geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.

horizontally scalable document-oriented NoSQL data store.

a tool for managing events and logs.

data pipeline as a service enabling moving data sources such as MySQL into data warehouses.

Service Programming

Serverless proxy for Spark cluster

303
66
1y 51d
Apache-2.0

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

1.57K
77
118d
MIT

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

13.94K
2.2K
8d
Apache-2.0

Spring XD makes it easy to solve common big data problems such as data ingestion and export, real-time analytics, and batch workflow orchestration

474
299
3y 8m
n/a

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

1.1K
384
1y 5m
Apache-2.0

runtime for distributed, and fault tolerant event-driven applications on the JVM.

data serialization system.

Java libraries for Apache ZooKeeper.

OSGi runtime that runs on top of any OSGi framework.

framework to build binary protocols.

centralized service for process management.

a lock service for loosely-coupled distributed systems.

horizontally scalable document-oriented NoSQL data store.

message passing framework.

decentralized solution for service discovery and orchestration.

asynchronous network stack for the JVM.

Scheduling

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

19.36K
7.53K
1d
Apache-2.0

A data orchestrator for machine learning, analytics, and ETL.

2.36K
237
2d
Apache-2.0

Schedoscope is a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days.

94
28
1y 13d
Apache-2.0

Sparrow scheduling platform (U.C. Berkeley).

294
85
7y 19d
Apache-2.0

is a service scheduler that runs on top of Apache Mesos.

data management framework.

workflow job scheduler.

cloud-based pipeline orchestration for on-prem, cloud and HDInsight

distributed and fault-tolerant scheduler.

batch workflow job scheduler.

Machine Learning

[UNMAINTAINED] Simple feed-forward neural network in JavaScript

8.04K
943
2y 9m
MIT

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.73K
411
11d
Apache-2.0

Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

10.07K
2.01K
4y 2d
MIT

ETL Library for Machine Learning - data pipelines, data munging and wrangling

267
176
5m
Apache-2.0

Flexible and Extensible Machine Learning in Ruby

392
62
11y 91d
MIT

Scalable Machine Learning in Scalding

359
59
2y 9m
MIT

Feature Store for Machine Learning

1.2K
208
3d
Apache-2.0

Open Source Fast Scalable Machine Learning Platform For Smarter Applications: Deep Learning, Gradient Boosting & XGBoost, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

5.07K
1.79K
2d
Apache-2.0

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

1.04K
125
3d
GPL-3.0

Deep Learning for humans

50.21K
18.67K
37d
n/a

A column-oriented approach to feature engineering. Feature engineering and machine learning: together at last!

1
20
2y 78d
MIT

Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

468
30
9d
GPL-3.0

Fast multilayer perceptron neural network library for iOS and Mac OS X

903
237
4y 4m
BSD-2-Clause

🛠 All-in-one web-based IDE specialized for machine learning and data science.

1.53K
197
7d
Apache-2.0

Fast, Scientific and Numerical Computing for the JVM (NDArrays)

1.71K
533
2y 6m
Apache-2.0

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

6.19K
1.58K
1y 35d
AGPL-3.0

Deep Reinforcement Learning for the JVM (Deep-Q, A3C)

324
122
58d
n/a

scikit-learn: machine learning in Python

43.15K
20.65K
3d
BSD-3-Clause

An Open Source Machine Learning Framework for Everyone

150.76K
83.32K
2d
Apache-2.0
111
26
3y 7m
Apache-2.0

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

7.2K
1.68K
85d
n/a

CPU and GPU-accelerated Machine Learning Library

907
172
5m
n/a

Cloud-based AzureML, R, Python Machine Learning platform

machine learning library for Cascading.

machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.

text classification with machine learning.

A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.

An Apache-backed machine learning library for Hadoop.

distributed machine learning libraries for the BDAS stack.

MOA

MOA performs big data stream mining in real time, and large scale machine learning.

Text mining made easy. Extract and classify data from text.

machine learning server built on Hadoop, Mahout and Cascading.

distributed streaming machine learning framework.

a Spark implementation of some common machine learning (ML) functionality.

System for Large Scale Machine Learning at Google.

Weka is a collection of machine learning algorithms for data mining tasks.

Benchmarking

Statistical Workload Injector for MapReduce - Project at UC Berkeley AMP Lab

120
92
6y 6m
n/a

HiBench is a big data benchmark suite.

1.07K
638
36d
n/a

Repo to track dl4j benchmark code.

32
15
1y 85d
n/a

micro-benchmarks for testing Hadoop performance.

benchmark suite for MapReduce applications.

Hadoop cluster benchmarking from Yahoo's engineering team.

Security

🌲 Configuration flaws detector for Hadoop, MongoDB, MySQL, and more!

105
14
5m
MIT

Central security admin & fine-grained authorization for Hadoop

real time monitoring solution

single point of secure access for Hadoop clusters.

security module for data stored in Hadoop.

System Deployment

Mirror of Apache Slider

78
73
2y 6m
Apache-2.0

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.

3.99K
879
28d
Apache-2.0

Linkis helps easily connect to various back-end computation/storage engines(Spark, Python, TiDB...), exposes various interfaces(REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.

1.49K
484
80d
Apache-2.0

operational framework for Hadoop management.

system deployment framework for the Hadoop ecosystem.

cluster management framework.

set of libraries for running cloud services.

library that simplifies application deployment and management.

Similar to Apache BigTop based on Groovy language.

web application for interacting with Hadoop.

multi-datacenter replication system.

job scheduling and monitoring system.

job scheduling and monitoring system.

application that can deploy HBase cluster on YARN.

a system for automating deployment, scaling, and management of containerized applications.

Applications

An Alert Management Web Application

939
112
1y 5m
MIT

Next-generation web analytics processing with Scala, Spark, and Parquet.

336
59
5y 8m
Apache-2.0

Time series monitoring and alerting platform.

461
137
1y 98d
BSD-3-Clause

SQL-based streaming analytics platform at scale

1.16K
276
2y 115d
Apache-2.0

In-memory dimensional time series database.

2.77K
230
15d
Apache-2.0

Easy & Flexible Alerting With ElasticSearch

7.13K
1.62K
17d
Apache-2.0

An open source event analytics platform

1.3K
136
6y 102d
n/a

Fast and reliable message broker built on top of Kafka.

655
145
6d
n/a

Open source framework for processing, monitoring, and alerting on time series data

1.97K
469
6d
n/a

A convenient R tool for manipulating tables in PostgreSQL type databases and a wrapper of Apache MADlib.

115
49
21d
n/a

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

767
101
6d
AGPL-3.0

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

989
192
99d
n/a

Cloud-native web, mobile and event analytics, running on AWS and GCP

5.52K
1.14K
44d
Apache-2.0

a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.

open source web crawler.

capturing, processing and sharing of data for NASA's scientific archives.

content analysis toolkit.

open source mobile and web analytics platform, based on Node.js & MongoDB.

Run, scale, share, and deploy models — without any infrastructure.

Eclipse-based reporting system.

Large scale analytics platform by Indeed.

Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.

Notebook and project application for interactive data science and scientific computing across all programming languages.

data-processing library of an RDBMS to analyze data.

open source Distributed Analytics Engine from eBay.

auto-scaling Hadoop cluster, built-in data connectors.

analyzer for machine-generated data.

cloud based analyzer for machine-generated data.

unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.

Search engine and framework

Elassandra = Elasticsearch + Apache Cassandra

1.5K
191
10d
Apache-2.0

A flexible, partial, out-of-order and real-time typeahead search library

542
76
7y 15d
Apache-2.0

realtime search/indexing system

354
132
6y 9m
n/a

A library for efficient similarity search and clustering of dense vectors.

11.61K
2.03K
4d
MIT

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

7.83K
858
55d
Apache-2.0

Weaviate is a cloud-native, realtime vector search engine that allows you to bring your machine learning models to scale

421
38
6d
BSD-3-Clause

Search engine library.

Search platform for Apache Lucene.

Search and analytics engine based on Apache Lucene.

Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.

implementation of Percolator, part of HBase.

quickly and easily search for any content stored in HBase.

is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.

MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.

is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

MySQL forks and evolutions

21
5
3y 10d
GPL-3.0

Provides a scalable database server with MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB support.

evolution of MySQL 6.0.

MySQL databases in Google's cloud.

enhanced, drop-in replacement for MySQL.

MySQL implementation using NDB Cluster storage engine.

enhanced, drop-in replacement for MySQL.

TokuDB is a storage engine for MySQL and MariaDB.

is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

PostgreSQL forks and evolutions

high-performance data warehouse appliances.

Scalable Open Source PostgreSQL-based Database Cluster.

Open Source Recommendation Engine Built Entirely Inside PostgreSQL.

open source MPP database system solely targeted at data warehousing and data mart applications.

multi-petabyte database / MPP derived from PostgreSQL.

An open-source time-series database optimized for fast ingest and complex queries

an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.

Memcached forks and evolutions

A fast, light-weight proxy for memcached and redis

10.26K
1.83K
21d
Apache-2.0

Memcache on SSD

1.28K
172
3y 7m
Apache-2.0

Twemcache is the Twitter Memcached

882
150
1y 5m
BSD-3-Clause

Embedded Databases

Erlang LSM BTree Storage

285
55
4y 8m
Apache-2.0

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

22.29K
5.18K
85d
BSD-3-Clause

commercially supported, open-source SQL relational database management system.

a software library that provides a high-performance embedded database for key/value data.

ultra-fast, ultra-compact key-value embedded data store developed by Symas.

embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Business intelligence made simple

2.67K
338
2d
MIT

The simplest, fastest way to get business intelligence and analytics to everyone in your company

22.8K
3.07K
2d
n/a

business intelligence platform in the cloud.

lean business intelligence platform to visualize and explore your data.

self-service business intelligence tool in the cloud.

platform for data products and embedded analytics.

powerful business intelligence suite.

customisable Business Intelligence platform.

Interactive Big Data Analytics.

Performance Monitoring for Amazon Redshift

business intelligence software and platform.

software platforms for business intelligence, mobile intelligence, and network applications.

Fast, clean SQL client and business intelligence.

business intelligence platform.

business intelligence and analytics platform.

Open source analytics platform.

open source business intelligence platform. (former SpagoBi)

modern B.I platform powered by Apache Spark.

business intelligence platform.

Big Data Analytics.

Data Visualization

Web UI for PrestoDB.

2.72K
473
4y 93d
Apache-2.0

a graph visualization library using web workers and jQuery

2.56K
604
8y 6m
n/a

Banana for Solr - A Port of Kibana

644
236
5m
n/a

Web UI for Impala

16
4
3y 8m
Apache-2.0

Location Intelligence & Data Visualization tool

2.46K
641
3d
BSD-3-Clause

Simple responsive charts

12.39K
2.58K
1y 26d
n/a

Cubism.js: A JavaScript library for time series visualization.

4.9K
575
1y 92d
n/a

Compose complex, data-driven visualizations from reusable charts and components with d3

700
24
8m
MIT

A powerful, interactive charting and data visualization library for browser

43.73K
16.1K
3d
Apache-2.0

Dynamic HTML5 visualization

1.57K
260
7y 6m
MIT