Awesome Data Engineering

A curated list of data engineering tools for software developers

Last Update: Dec. 5, 2020, 6:15 a.m.

Thank you igorbarinov & contributors
View Topic on GitHub:
igorbarinov/awesome-data-engineering

Databases

The lightweight, distributed relational database built on SQLite.

6.34K
346
6d
MIT

TiDB is an open source distributed HTAP database compatible with the MySQL protocol

26.05K
4.05K
2d
Apache-2.0

Pinterest MySQL Management Tools

877
146
1y 5m
GPL-2.0

HyperDex is a scalable, searchable key-value store

1.36K
163
4y 5d
BSD-3-Clause

Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library)

217
34
3y 8m
GPL-3.0

IonDB, a key-value datastore for resource constrained systems.

548
44
2y 9m
BSD-3-Clause

A script to easily create and destroy an Apache Cassandra cluster on localhost

1.15K
279
64d
Apache-2.0

NoSQL data store using the seastar framework, compatible with Apache Cassandra

6.48K
756
3d
AGPL-3.0

Distributed Prometheus time series database

1.25K
211
4d
Apache-2.0

Distributed Transactional In-Memory Database (the world's first MongoDB with distributed transaction support)

592
194
2y 7m
Apache-2.0

A distributed, fault-tolerant graph database

3.24K
262
3y 8m
n/a

A large-scale entity and relation database supporting aggregation of properties

1.6K
328
3d
Apache-2.0

Scalable datastore for metrics, events, and real-time analytics

20.05K
2.83K
2d
MIT

A scalable, distributed Time Series Database.

4.31K
1.22K
40d
LGPL-2.1

Fast scalable time series database

1.57K
334
15d
Apache-2.0

The Heroic Time Series Database

822
106
15d
Apache-2.0

Apache Druid: a high performance real-time analytics database.

10.32K
2.77K
2d
Apache-2.0

Time-series database

742
74
6m
Apache-2.0

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

123
11
5m
MIT

See GitLab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

705
44
2y 8m
MIT

A distributed system designed to ingest and process time series data

594
100
2y 6m
Apache-2.0

Accumulo backed time series database

349
105
29d
Apache-2.0

Get your data in RAM. Get compute close to data. Enjoy the performance.

2.48K
254
3d
n/a

Greenplum Database

4.29K
1.23K
2d
n/a

An open-source graph database

13.68K
1.22K
4m
Apache-2.0

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

989
192
108d
n/a

The world's most popular open source database.

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®

An enhanced, drop-in replacement for MySQL.

Powerful object-relational database system. PostgreSQL licence

Provides a scalable database server with MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB support.

An open source, massively scalable data store that requires zero administration.

Advanced key-value store. 3-clause BSD

A distributed database designed to deliver maximum data availability by distributing data across multiple servers.

Provides a scalable, low-latency NoSQL online Database Service backed by SSDs.

A high performance NoSQL database supporting many data structures, an alternative to Redis

The right choice when you need scalability and high availability without compromising performance.

This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.

The Hadoop database, a distributed, scalable, big data store.

Provides petabyte-scale data warehousing with columnar storage and multi-node compute.

Distributed, MPP columnar database with extensive analytics SQL.

An open-source, document database designed for ease of development and scaling.

Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.

Search and analytics engine based on Apache Lucene.

The highest performing NoSQL distributed database.

Document database that supports queries like table joins and group by.

A transactional, open-source Document Database.

Graph database written entirely in Java.

2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.

Multi-model distributed database.

A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.

The fully transactional, cloud-ready, distributed database.

An open source, distributed, in-memory database for scale-out applications.

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.

Data Ingestion

Change data capture from PostgreSQL into Kafka

1.48K
148
3y 108d
Apache-2.0

KafkaT-ool

470
78
4y 8d
Apache-2.0

Generic command line non-JVM Apache Kafka producer and consumer

2.96K
276
59d
n/a

INACTIVE: A PostgreSQL extension to produce messages to Apache Kafka.

108
15
5y 8m
n/a

The Apache Kafka C/C++ library

4.73K
2.17K
15d
n/a

Dockerfile for Apache Kafka

4.82K
2.2K
110d
Apache-2.0

CMAK is a tool for managing Apache Kafka clusters

9.6K
2.23K
103d
Apache-2.0

Node.js client for Apache Kafka 0.8 and later.

2.44K
629
1y 32d
MIT

Secor is a service implementing Kafka log persistence

1.63K
512
3d
Apache-2.0

A kafka logger for winston

45
8
2y 56d
MIT

DEPRECATED: Data collection and processing made easy.

3.42K
559
4y 4m
n/a

Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.

1.81K
666
3d
Apache-2.0

Pandas on AWS

1.23K
204
3d
Apache-2.0

Publish-subscribe messaging rethought as a distributed commit log.

Provides real-time data processing over large, distributed data streams.

Robust messaging for applications.

An open source data collector for unified logging layer.

An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.

Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.

Apache Pulsar is an open-source distributed pub-sub messaging system.

File System

A pure python HDFS client

823
216
4y 4m
Apache-2.0

Utils for streaming large files (S3, HDFS, gzip, bz2...)

1.85K
261
4d
MIT

The GA Release of SnackFS

14
5
6y 9m
n/a

SeaweedFS is a distributed object store and file system to store and serve billions of files fast! Object store has O(1) disk seek, transparent cloud integration. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV.

11.06K
1.42K
1d
Apache-2.0

A full-featured file system for online data storage

639
68
117d
GPL-3.0

Provides Web Service based storage.

Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce

Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability

Orange File System is a branch of the Parallel Virtual File System

Gluster Filesystem

Fault-tolerant distributed file system for all storage needs

LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

Serialization format

A fast compressor/decompressor

4.41K
771
2d
n/a

Protocol Buffers - Google's data interchange format

45.1K
12.1K
2d
n/a

Java binary serialization and cloning: fast, efficient, automatic

4.87K
739
4d
BSD-3-Clause

Data interchange format with dynamic typing, untagged data, and absence of manually assigned IDs.

Columnar storage format based on assembly algorithms from Google's paper on Dremel.

A parallel implementation of gzip for modern multi-processor, multi-core machines.

The smallest, fastest columnar storage for Hadoop workloads

Data interchange format that originated at Facebook.

SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats

Stream Processing

High-performance time-series aggregation for PostgreSQL

2.35K
213
1y 7m
Apache-2.0

Python Stream Processing

5.07K
422
57d
n/a

A unified model and set of language-specific SDKs for defining and executing data processing workflows.

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

Open source platform for distributed stream and batch data processing.

Realtime computation system.

Apache Samza is a distributed stream processing framework

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

Apache Hudi is an open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.

VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture.

Streaming and tasks execution between Spring Boot apps

Bonobo is a data-processing toolkit for python 3.5+

Batch Processing

Connecting Apache Spark with different data stores [DEPRECATED]

197
44
4y 5m
Apache-2.0

A general-purpose data analysis engine radically changing the way batch and stream data is processed

0
0
2y 90d
MIT

Mirror of Apache Hivemall (incubating)

273
107
100d
Apache-2.0

Python interface to Hive and Presto. 🐝

1.33K
431
24d
n/a

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner

Lightning-fast cluster computing

A community index of packages for Apache Spark

Livy, the REST Spark Server

A web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Tez

An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

H2O

Fast statistical, machine learning & math runtime.

An environment for quickly creating scalable performant machine learning applications.

Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.

An iterative graph processing system built for high scalability.

Apache Spark's API for graphs and graph-parallel computation.

A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Schema-free SQL Query Engine

Charts and Dashboards

Python helpers for building dashboards using Flask and React

2.25K
269
2y 9m
MIT

Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

13.45K
1.37K
5d
MIT

Apache Superset is a Data Visualization and Data Exploration Platform

31.41K
6.47K
2d
Apache-2.0

The simplest, fastest way to get business intelligence and analytics to everyone in your company

22.88K
3.08K
2d
n/a

Interactive charts for web.

Library written in vanilla JS for big data visualization.

D3-based reusable chart library.

Allows the user to manipulate documents based on data to render charts in SVG.

D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.

A JavaScript Charting Library for Streaming Data.

Interactive and realtime 2D/3D/Image plotting and science/engineering widgets.

Workflow

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

13.98K
2.2K
17d
Apache-2.0

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

19.45K
7.57K
2d
Apache-2.0

Pinball is a scalable workflow manager

1.04K
141
12m
Apache-2.0

A data orchestrator for machine learning, analytics, and ETL.

2.38K
244
3d
Apache-2.0

Java based application development platform.

Batch workflow job scheduler.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs

Data Lake Management

An open source platform that delivers resilience and manageability to object-storage based data lakes

85
5
95d
Apache-2.0

ELK Elastic Logstash Kibana

Docker image for Logstash 1.4

238
95
4y 11m
MIT

JDBC importer for Elasticsearch

2.81K
713
3y 8m
n/a

Making Postgres and Elasticsearch work together like it's 2020

3.08K
143
11d
n/a

Docker

Package Go services into minimal Docker containers.

665
17
2y 10m
n/a

Container data volume manager for your Dockerized application

3.29K
299
3y 11m
Apache-2.0

Simple, resilient multi-host containers networking and more.

5.96K
595
100d
n/a

A lightweight tool for easy deployment and rollback of dockerized applications.

183
19
10m
Apache-2.0

Analyzes resource usage and performance characteristics of running containers.

11.48K
1.72K
11d
n/a

Docker microservice for saving/restoring volume data to S3

9
0
1y 73d
n/a

Docker composition tool with idempotency features for deploying apps composed of multiple containers.

409
21
2y 8m
n/a

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

7.17K
1.2K
3d
MPL-2.0

RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers

Application Containers for Masses

Visualize Docker images and the layers that compose them

Realtime

Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

313
67
2y 11m
n/a

The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.

Real-time data is available including comments, submissions and links posted to reddit

Data Dumps

GitHub's public timeline since 2011, updated every hour

Open source repository of web crawl data

Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Prometheus

The Prometheus monitoring system and time series database.

34.14K
5.39K
2d
Apache-2.0

Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

444
173
31d
Apache-2.0

Forums

News, tips and background on Data Engineering

Subreddit focused on ETL

Conferences

DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Podcasts

The show about modern data infrastructure.