Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness.

Last Update: Oct. 24, 2021, 12:09 a.m.

Thank you onurakpolat & contributors
View Topic on GitHub:
onurakpolat/awesome-bigdata

RDBMS

The world's most popular open source database.

Powerful object-relational database system. PostgreSQL licence

object-relational database management system.

high-performance MPP data warehouse platform.

Frameworks

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.

942
126
8m
MIT

High Throughput Real-time Stream Processing Framework

274
34
4y 6m
Apache-2.0

Machine Learning Platform for Kubernetes

2.73K
265
8m
Apache-2.0

An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc...) streaming applications

289
315
8m
n/a

platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.

Distributed Programming

437
91
1y 115d
n/a

Hadoop MapReduce in idiomatic Clojure.

258
19
6y 97d
Apache-2.0

Tuple MapReduce for Hadoop: Hadoop API made easy

58
13
9m
Apache-2.0

Map-Reduce for Clojure

527
58
4y 5m
Apache-2.0

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

14.88K
2.4K
8m
Apache-2.0

High performance distributed data processing engine

395
55
4m
Apache-2.0

Develop streaming applications for IBM Streams in Python, Java & Scala.

27
43
9m
Apache-2.0

Big Data Science Swiss Army Knife - http://www.tuktu.io

57
17
3y 8m
n/a

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter

3.5K
614
8m
Apache-2.0

A Scala API for Cascading

3.29K
685
8m
Apache-2.0

Streaming MapReduce with Scalding and Storm

2.09K
260
2y 8m
Apache-2.0

run Spark on Hadoop MapReduce v1.

a unified, enterprise platform for big data stream and batch processing.

a unified model and set of language-specific SDKs for defining and executing data processing workflows.

a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.

collection of user-defined functions for Hadoop and Pig developed by LinkedIn.

A platform for efficient, distributed, general-purpose data processing.

real-time big data streaming engine based on Akka.

framework for in-memory data model and persistence.

Apache Hama is an Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce.

programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

high level language to express data analysis programs for Hadoop.

retainable evaluator execution framework to simplify and unify the lower layers of big data systems.

framework for stream processing, implementation of S4.

framework for in-memory cluster computing.

framework for stream processing, part of Spark.

framework for stream processing by Twitter also on YARN.

stream processing framework, based on Kafka and YARN.

application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

abstraction over YARN that reduces the complexity of developing distributed applications.

an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.

High Performance, Custom Data Warehouse on Top of MapReduce.

framework for data management/analytics on Hadoop.

real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.

Hadoop enhancement which removes single point of failure.

create data pipelines that help ingest, transform and analyze data.

fault tolerant stream processing framework.

platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)

declarative programming language for working with structured, semi-structured and unstructured data.

is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

framework for real-time analysis of large datasets.

MapReduce framework developed by Nokia.

Distributed computation for the cloud.

Python MapReduce and HDFS API for Hadoop.

multi-tenant distributed metric processing system

general purpose cluster computing framework.

useful for counting activities of event streams over different time windows and finding the most active one.

The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.

Distributed Filesystem

Distributed object store

1.43K
252
8m
Apache-2.0

SeaweedFS is a distributed object store and file system to store and serve billions of files fast! Object store has O(1) disk seek, local tiering, cloud tiering. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV.

11.49K
1.48K
8m
Apache-2.0

The Baidu File System.

2.69K
541
2y 11m
BSD-3-Clause

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Hadoop's storage layer to enable fast analytics on fast data.

formerly FhGFS, parallel distributed file system.

software storage platform.

GGFS, Hadoop compliant in-memory file system.

high-performance distributed filesystem.

scale-out network-attached storage file system.

reliable file sharing at memory speed across cluster frameworks.

decentralized cloud storage system.

Distributed Index

Pilosa is an open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

2.13K
199
9m
Apache-2.0

Document Data Model

commercial object-oriented database management system.

is an open source massively scalable data store. It requires zero administration.

Facebook’s Paxos-like NoSQL database.

document oriented datastore over Hadoop.

horizontally scalable document-oriented NoSQL data store.

Schema-agnostic Enterprise NoSQL database technology.

NoSQL cloud database service with protocol support for MongoDB

Document-oriented database system.

A transactional, open-source Document Database.

document database that supports queries like table joins and group by.

Key Map Data Model

An Internet-Scale Database.

1.81K
439
2y 9m
n/a

InfiniDB Data Warehouse

235
92
4y 11d
GPL-2.0

Apache Tephra: Transactions for HBase.

157
44
4y 11m
Apache-2.0

distributed key/value store, built on Hadoop.

column-oriented distributed datastore, inspired by BigTable.

column-oriented distributed datastore, inspired by BigTable.

column-oriented distributed datastore.

is a fully managed, schemaless database for storing non-relational data over BigTable.

column-oriented distributed datastore, inspired by BigTable.

real-time, multi-tenant distributed database for Twitter scale.

column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.

Key-value Data Model

An embedded key/value database for Go.

11.52K
1.46K
3y 7m
MIT

Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more

115
38
8m
MIT

BuntDB is an embeddable, in-memory key/value database for Go with custom indexing and geospatial support

3.12K
224
8m
MIT

An Erlang implementation of Redis

455
37
6y 42d
Apache-2.0

Distributed database specialized in exporting key/value data from Hadoop

541
53
7y 4m
BSD-3-Clause

GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

683
35
8m
BSD-3-Clause

Graviton Database: ZFS for key-value stores.

383
14
1y 48d
GPL-3.0

GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.

1.39K
3.27K
8m
n/a

HyperDex is a scalable, searchable key-value store

1.37K
162
4y 10m
BSD-3-Clause

SNA team homepage

24
4
9y 51d
n/a

Riak is a decentralized datastore from Basho Technologies.

3.5K
556
9m
Apache-2.0

Storehaus is a library that makes it easy to work with asynchronous key value stores

458
79
2y 4m
Apache-2.0

In-memory NoSQL database with ACID transactions, Raft consensus, and Redis API

1.29K
74
1y 11m
n/a

Get your data in RAM. Get compute close to data. Enjoy the performance.

2.54K
254
8m
n/a

Distributed transactional key-value database, originally created to complement TiDB

8.83K
1.37K
8m
Apache-2.0

Real-time Geospatial and Geofencing

7.26K
441
8m
MIT

The DB that's replicated, sharded and transactional.

183
23
5y 12m
Apache-2.0

NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."

Provides a scalable, low-latency NoSQL online Database Service backed by SSDs.

a fast, simple, efficient, and persistent key-value store written natively in Go.

distributed time series database.

is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.

distributed key/value storage system.

distributed key-value database by Oracle Corporation.

Advanced key-value store. 3-clause BSD

Graph Data Model

Native GraphQL Database with graph backend

15.42K
1.12K
8m
n/a

EliasDB a graph-based database.

607
32
10m
MPL-2.0

A large-scale entity and relation database supporting aggregation of properties

1.6K
330
8m
Apache-2.0

An open-source graph database

13.76K
1.23K
1y 100d
Apache-2.0

A Graph Traversal Language (no longer active - see Apache TinkerPop)

1.88K
232
4y 50d
n/a

RDF-Centric Map/Reduce Framework and Freebase data conversion tool

149
19
6y 10m
n/a

Microsoft Graph Engine

1.94K
284
1y 4m
n/a

Phoebus is a distributed framework for large scale graph processing written in Erlang.

384
38
9y 9m
Apache-2.0

A distributed, fault-tolerant graph database

3.25K
259
4y 7m
n/a

a new generation multi-model graph database for the modern complex data environment.

implementation of Pregel, based on Hadoop.

implementation of Pregel, part of Spark.

multi model distributed database.

TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.

A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.

resilient Distributed Graph System on Spark.

tools to construct large-scale graphs on top of Hadoop.

open-source, distributed graph database

Massively Parallel Graph processing on GPUs.

graph database written entirely in Java.

document and graph database.

distributed graph database, built over Cassandra.

A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.

Columnar Databases

An open-source columnar data format designed for fast, real-time analytics on big data.

430
123
2y 100d
Apache-2.0

Massively parallel, high performance analytics database that will rapidly devour all of your data.

1.24K
58
8m
n/a

an explanation of what columnar storage is and when you might want it.

column-oriented analytic database.

an open-source column-oriented database management system that allows generating analytical data reports in real time.

a distributed, column-oriented database built for large-scale event collection and analytics.

column store database.

columnar storage format for Hadoop.

purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.

is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.

Google's cloud offering backed by their pioneering work on Dremel.

Provides petabyte-scale data warehousing with columnar storage and multi-node compute.

NewSQL Databases

ActorDB distributed SQL database

1.84K
75
2y 24d
MPL-2.0

A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. New implementation in http://github.com/probcomp/bayeslite

883
56
6y 32d
Apache-2.0

CockroachDB - the open source, cloud-native distributed SQL database.

19.96K
2.54K
8m
n/a

Bloomberg's distributed RDBMS

973
166
8m
n/a

Haeinsa is linearly scalable multi-row, multi-table transaction library for HBase

156
44
4y 7m
Apache-2.0

A Relational Database Backed by Apache Kafka

356
21
11m
Apache-2.0

TiDB is an open source distributed HTAP database compatible with the MySQL protocol

26.9K
4.22K
8m
Apache-2.0

The high-performance distributed SQL database for global, internet-scale apps.

4.84K
525
8m
n/a

commercially supported, open-source SQL relational database management system.

data warehouse service, based on PostgreSQL.

a simple, modular, networked and distributed transaction layer built atop SQLite.

scales out PostgreSQL through sharding and replication.

distributed database, inspired by F1.

distributed SQL database built on Spanner.

globally distributed semi-relational database.

is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.

infinitely scalable RDBMS.

GPU in-memory database, big data analysis and visualization platform.

in-memory SQL database with optimized columnar storage on flash.

SQL/ACID compliant distributed database.

in-memory, relational database management system with persistence and recoverability.

Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.

is an in-memory, column-oriented, relational database management system.

distributed, realtime, semi-structured database.

Sky

database used for flexible, high performance analysis of behavioral data.

open source software for both file and database synchronization.

claims to be fastest in-memory database.

Time-Series Databases

Fast scalable time series database

1.58K
336
11m
Apache-2.0

An open-source big data platform designed and optimized for the Internet of Things (IoT).

13.09K
3.38K
1y 51d
AGPL-3.0

Beringei is a high performance, in-memory storage engine for time series data.

3.08K
300
3y 106d
n/a

Apache Druid: a high performance real-time analytics database.

10.03K
2.68K
1y 51d
Apache-2.0

Time-series database

748
75
1y 5m
Apache-2.0

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

123
11
1y 4m
MIT

See gitlab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

702
44
3y 7m
MIT

A distributed system designed to ingest and process time series data

593
100
3y 4m
Apache-2.0

Accumulo backed time series database

354
104
11m
Apache-2.0

SiriDB is a highly scalable, robust and super fast time series database. Built from the ground up, SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB's unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series data.

435
43
8m
MIT

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.

7.54K
1.11K
8m
Apache-2.0

VictoriaMetrics: fast, cost-effective monitoring solution and time series database

3.8K
304
8m
Apache-2.0

Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.

a time series storage built to store time series highly compressed and for fast access times.

is a scalable time series database based on Cassandra and Elasticsearch.

scalable, general-purpose time series database.

a distributed time series database that can be used for storing realtime metrics at long retention.

a time series database based on Apache Cassandra.

distributed time series database on top of HBase.

Open-source service monitoring system and time series database.

an efficient tool for storing and querying series of events.

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

SQL-like processing

The Streaming SQL Database powered by Timely Dataflow

1.62K
93
1y 23d
n/a

Apache Spark - A unified analytics engine for large-scale data processing

28.83K
23.38K
8m
Apache-2.0

high performance interactive SQL access to all Hadoop data.

framework for interactive analysis, inspired by Dremel.

table and storage management layer for Hadoop.

SQL-like data warehouse system for Hadoop.

framework that allows efficient translation of queries involving heterogeneous and federated data.

SQL-like analytic processing for MapReduce.

framework for interactive analysis, inspired by Dremel.

SQL-like query language for Cascading.

full SQL query engine for big datasets.

an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.

distributed SQL query engine.

framework for interactive analysis, implementation of Dremel.

an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.

SQL-like data warehouse system for Hadoop.

database for storing petabyte-scale volumes of structured and semi-structured data.

Manipulating Structured Data Using Spark.

a full-featured SQL-on-Hadoop RDBMS with ACID transactions.

distributed data warehouse system on Hadoop.

enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

Apache Pulsar - distributed pub-sub messaging system

7.3K
1.83K
8m
Apache-2.0

Scribe is a server for aggregating log data streamed in real time from a large number of servers.

3.91K
805
7y 5m
Apache-2.0

Build platforms that flexibly mix SQL, batch, and stream processing paradigms

228
30
1y 37d
MIT

DEPRECATED: Data collection and processing made easy.

3.41K
559
5y 81d
n/a

Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop.

87
32
8y 6m
Apache-2.0

simple, distributed message queue system (active — currently managed by Papertrail)

6
2
3y 8m
n/a

DocId set compression and set operation library

22
8
7y 7m
Apache-2.0

Hadoop log aggregator and dashboard

191
67
7y 12m
n/a

Netflix's distributed Data Pipeline

752
173
5y 10m
Apache-2.0

Secor is a service implementing Kafka log persistence

1.66K
513
8m
Apache-2.0

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.87K
677
8m
Apache-2.0

A probabilistic data structure service and storage

777
62
5y 5m
MIT

StreamSets Data Collector - Continuous big data and cloud platform ingest infrastructure

1.12K
603
8m
Apache-2.0

Privacy and Security focused Segment-alternative, in Golang and React

2.28K
111
8m
AGPL-3.0

Provides real-time data processing over large, distributed data streams.

Prepare and load data to data stores.

data collection system.

service to manage large amount of log data.

distributed publish-subscribe messaging system.

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

tool to transfer data between Hadoop and a structured datastore.

open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

tool to collect events and logs.

geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.

horizontally scalable document-oriented NoSQL data store.

a tool for managing events and logs.

data pipeline as a service enabling moving data sources such as MySQL into data warehouses.

Service Programming

Serverless proxy for Spark cluster

305
69
2y 18d
Apache-2.0

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

1.62K
80
9m
MIT

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

14.23K
2.23K
8m
Apache-2.0

Spring XD makes it easy to solve common big data problems such as data ingestion and export, real-time analytics, and batch workflow orchestration

477
301
4y 7m
n/a

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

1.11K
384
2y 4m
Apache-2.0

runtime for distributed, and fault tolerant event-driven applications on the JVM.

data serialization system.

Java libraries for Apache ZooKeeper.

OSGi runtime that runs on top of any OSGi framework.

framework to build binary protocols.

centralized service for process management.

a lock service for loosely-coupled distributed systems.

horizontally scalable document-oriented NoSQL data store.

message passing framework.

decentralized solution for service discovery and orchestration.

asynchronous network stack for the JVM.

Scheduling

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

20.49K
8K
8m
Apache-2.0

A simple, distributed task scheduler and runner with a web based UI.

958
158
9m
n/a

A data orchestrator for machine learning, analytics, and ETL.

2.93K
297
8m
Apache-2.0

Schedoscope is a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days.

94
28
1y 11m
Apache-2.0

Sparrow scheduling platform (U.C. Berkeley).

295
86
7y 11m
Apache-2.0

is a service scheduler that runs on top of Apache Mesos.

data management framework.

workflow job scheduler.

cloud-based pipeline orchestration for on-prem, cloud and HDInsight

distributed and fault-tolerant scheduler.

batch workflow job scheduler.

Machine Learning

[UNMAINTAINED] Simple feed-forward neural network in JavaScript

8.04K
941
3y 8m
MIT

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.74K
410
8m
Apache-2.0

Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

10.13K
2.02K
4y 11m
MIT

ETL Library for Machine Learning - data pipelines, data munging and wrangling

271
178
1y 4m
Apache-2.0

Flexible and Extensible Machine Learning in Ruby

392
61
12y 58d
MIT

Scalable Machine Learning in Scalding

359
60
3y 8m
MIT

Feature Store for Machine Learning

1.45K
252
8m
Apache-2.0

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

5.21K
1.82K
8m
Apache-2.0

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

1.16K
140
9m
GPL-3.0

Deep Learning for humans

50.73K
18.71K
8m
n/a

A column-oriented approach to feature engineering. Feature engineering and machine learning: together at last!

1
19
3y 45d
MIT

Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

495
32
8m
GPL-3.0

Fast multilayer perceptron neural network library for iOS and Mac OS X

904
236
5y 104d
BSD-2-Clause

🛠 All-in-one web-based IDE specialized for machine learning and data science.

1.74K
232
9m
Apache-2.0

Fast, Scientific and Numerical Computing for the JVM (NDArrays)

1.74K
539
3y 5m
Apache-2.0

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

6.19K
1.58K
2y 2d
AGPL-3.0

A Temporal Extension Library for PyTorch Geometric

349
40
8m
MIT

Deep Reinforcement Learning for the JVM (Deep-Q, A3C)

330
121
9m
n/a

scikit-learn: machine learning in Python

44.62K
21.08K
8m
BSD-3-Clause

A data-driven approach to quantify the value of classifiers in a machine learning ensemble.

15
2
9m
MIT

An Open Source Machine Learning Framework for Everyone

153.46K
84.06K
8m
Apache-2.0

111
26
4y 6m
Apache-2.0

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

7.2K
1.68K
1y 52d
n/a

CPU and GPU-accelerated Machine Learning Library

909
171
1y 4m
n/a

Cloud-based AzureML, R, Python Machine Learning platform

machine learning library for Cascading.

machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.

text classification with machine learning.

A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.

An Apache-backed machine learning library for Hadoop.

distributed machine learning libraries for the BDAS stack.

MOA

MOA performs big data stream mining in real time, and large scale machine learning.

Text mining made easy. Extract and classify data from text.

machine learning server built on Hadoop, Mahout and Cascading.

distributed streaming machine learning framework.

a Spark implementation of some common machine learning (ML) functionality.

System for Large Scale Machine Learning at Google.

Weka is a collection of machine learning algorithms for data mining tasks.

Benchmarking

Statistical Workload Injector for MapReduce - Project at UC Berkeley AMP Lab

120
92
7y 5m
n/a

HiBench is a big data benchmark suite.

1.12K
657
10m
n/a

Repo to track dl4j benchmark code.

33
15
2y 52d
n/a

micro-benchmarks for testing Hadoop performance.

benchmark suite for MapReduce applications.

Hadoop cluster benchmarking from the Yahoo engineering team.

Security

🌲 Configuration flaws detector for Hadoop, MongoDB, MySQL, and more!

105
14
1y 4m
MIT

Central security admin & fine-grained authorization for Hadoop

real time monitoring solution

single point of secure access for Hadoop clusters.

security module for data stored in Hadoop.

System Deployment

Mirror of Apache Slider

78
73
3y 5m
Apache-2.0

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.

3.99K
880
12m
Apache-2.0

Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB...), exposes various interfaces (REST, JDBC, Java...), with multi-tenancy, high performance, and resource control.

1.49K
484
1y 47d
Apache-2.0

operational framework for Hadoop management.

system deployment framework for the Hadoop ecosystem.

cluster management framework.

set of libraries for running cloud services.

library that simplifies application deployment and management.

Similar to Apache BigTop, based on the Groovy language.

web application for interacting with Hadoop.

multi-datacenter replication system.

job scheduling and monitoring system.

job scheduling and monitoring system.

application that can deploy HBase cluster on YARN.

a system for automating deployment, scaling, and management of containerized applications.

Applications

An Alert Management Web Application

943
112
2y 4m
MIT

Next-generation web analytics processing with Scala, Spark, and Parquet.

335
59
6y 7m
Apache-2.0

Time series monitoring and alerting platform.

466
137
2y 65d
BSD-3-Clause

SQL-based streaming analytics platform at scale

1.17K
281
3y 82d
Apache-2.0

In-memory dimensional time series database.

2.83K
237
8m
Apache-2.0

Easy & Flexible Alerting With ElasticSearch

7.29K
1.67K
11m
Apache-2.0

An open source event analytics platform

1.3K
135
7y 69d
n/a

Fast and reliable message broker built on top of Kafka.

663
152
8m
n/a

Open source framework for processing, monitoring, and alerting on time series data

2.01K
481
8m
n/a

A convenient R tool for manipulating tables in PostgreSQL-type databases and a wrapper of Apache MADlib.

116
49
11m
n/a

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

769
102
8m
AGPL-3.0

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

995
195
10m
n/a

Cloud-native web, mobile and event analytics, running on AWS and GCP

5.62K
1.16K
8m
Apache-2.0

a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.

open source web crawler.

capturing, processing and sharing of data for NASA's scientific archives.

content analysis toolkit.

open source mobile and web analytics platform, based on Node.js & MongoDB.

Run, scale, share, and deploy models — without any infrastructure.

Eclipse-based reporting system.

Large scale analytics platform by Indeed.

Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.

Notebook and project application for interactive data science and scientific computing across all programming languages.

data-processing library of an RDBMS to analyze data.

open source Distributed Analytics Engine from eBay.

auto-scaling Hadoop cluster, built-in data connectors.

analyzer for machine-generated data.

cloud based analyzer for machine-generated data.

unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

Elassandra = Elasticsearch + Apache Cassandra

1.52K
194
9m
Apache-2.0

A flexible, partial, out-of-order and real-time typeahead search library

541
75
7y 11m
Apache-2.0

realtime search/indexing system

357
132
7y 8m
n/a

A library for efficient similarity search and clustering of dense vectors.

12.5K
2.14K
8m
MIT

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

8.14K
893
10m
Apache-2.0

Weaviate is a cloud-native, modular, real-time vector search engine

486
40
8m
BSD-3-Clause

Search engine library.

Search platform for Apache Lucene.

Search and analytics engine based on Apache Lucene.

Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.

implementation of Percolator, part of HBase.

quickly and easily search for any content stored in HBase.

is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.

MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.

is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

MySQL forks and evolutions

21
5
3y 11m
GPL-3.0

Provides a scalable database server with MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB support.

evolution of MySQL 6.0.

MySQL databases in Google's cloud.

enhanced, drop-in replacement for MySQL.

MySQL implementation using NDB Cluster storage engine.

enhanced, drop-in replacement for MySQL.

TokuDB is a storage engine for MySQL and MariaDB.

is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

PostgreSQL forks and evolutions

high-performance data warehouse appliances.

Scalable Open Source PostgreSQL-based Database Cluster.

Open Source Recommendation Engine Built Entirely Inside PostgreSQL.

open source MPP database system solely targeted at data warehousing and data mart applications.

multi-petabyte MPP database derived from PostgreSQL.

An open-source time-series database optimized for fast ingest and complex queries

an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.

Memcached forks and evolutions

A fast, light-weight proxy for memcached and redis

10.44K
1.87K
11m
Apache-2.0

Memcache on SSD

1.28K
173
4y 6m
Apache-2.0

Twemcache is the Twitter Memcached

890
150
2y 4m
BSD-3-Clause

Embedded Databases

Erlang LSM BTree Storage

287
56
5y 7m
Apache-2.0

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

23.06K
5.33K
8m
BSD-3-Clause

commercially supported, open-source SQL relational database management system.

a software library that provides a high-performance embedded database for key/value data.

ultra-fast, ultra-compact key-value embedded data store developed by Symas.

embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Business intelligence made simple

2.86K
348
8m
MIT

The simplest, fastest way to get business intelligence and analytics to everyone in your company

23.96K
3.18K
8m
n/a

business intelligence platform in the cloud.