Awesome Data Engineering

A curated list of data engineering tools for software developers

Last Update: Oct. 28, 2021, 12:07 a.m.

Thank you igorbarinov & contributors
View Topic on GitHub: igorbarinov/awesome-data-engineering

For repository entries, the figures under each description are, in order: GitHub stars, forks, time since last commit, and license.

Databases

The lightweight, distributed relational database built on SQLite

7.77K
398
8m
MIT

TiDB is an open source distributed HTAP database compatible with the MySQL protocol

26.9K
4.22K
8m
Apache-2.0

Pinterest MySQL Management Tools

876
147
2y 4m
GPL-2.0

HyperDex is a scalable, searchable key-value store

1.37K
162
4y 11m
BSD-3-Clause

Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library)

230
34
4y 6m
GPL-3.0

IonDB, a key-value datastore for resource constrained systems.

551
44
3y 8m
BSD-3-Clause

A script to easily create and destroy an Apache Cassandra cluster on localhost

1.16K
283
9m
Apache-2.0

NoSQL data store using the seastar framework, compatible with Apache Cassandra

6.64K
781
8m
AGPL-3.0

Distributed Prometheus time series database

1.28K
215
8m
Apache-2.0

Distributed Transactional In-Memory Database (the world's first MongoDB supporting distributed transactions)

595
194
3y 6m
Apache-2.0

A distributed, fault-tolerant graph database

3.25K
259
4y 7m
n/a

A large-scale entity and relation database supporting aggregation of properties

1.6K
330
8m
Apache-2.0

Scalable datastore for metrics, events, and real-time analytics

20.52K
2.88K
8m
MIT

A scalable, distributed Time Series Database.

4.36K
1.23K
1y 2d
LGPL-2.1

Fast scalable time series database

1.58K
336
11m
Apache-2.0

The Heroic Time Series Database

836
107
8m
Apache-2.0

Apache Druid: a high performance real-time analytics database.

10.55K
2.84K
8m
Apache-2.0

Time-series database

748
75
1y 5m
Apache-2.0

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

123
11
1y 4m
MIT

See gitlab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

702
44
3y 7m
MIT

A distributed system designed to ingest and process time series data

593
100
3y 4m
Apache-2.0

Accumulo backed time series database

354
104
11m
Apache-2.0

Get your data in RAM. Get compute close to data. Enjoy the performance.

2.54K
254
8m
n/a

Greenplum Database

4.39K
1.25K
1y 10d
n/a

An open-source graph database

13.76K
1.23K
1y 104d
Apache-2.0

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

995
195
10m
n/a

The world's most popular open source database.

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®

Enhanced, drop-in replacement for MySQL.

Powerful object-relational database system. PostgreSQL licence

Provides a scalable database server with MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB support.

An open source, massively scalable data store that requires zero administration.

Advanced key-value store. 3-clause BSD

A distributed database designed to deliver maximum data availability by distributing data across multiple servers.

Provides a scalable, low-latency NoSQL online Database Service backed by SSDs.

A high performance NoSQL database supporting many data structures, an alternative to Redis

The right choice when you need scalability and high availability without compromising performance.

This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.

The Hadoop database, a distributed, scalable, big data store.

Provides petabyte-scale data warehousing with columnar storage and multi-node compute.

Distributed, MPP columnar database with extensive analytics SQL.

An open-source, document database designed for ease of development and scaling.

Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.

Search and analytics engine based on Apache Lucene.

The highest performing NoSQL distributed database.

Document database that supports queries like table joins and group by.

A transactional, open-source Document Database.

Graph database written entirely in Java.

2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.

Multi-model distributed database.

A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.

The fully transactional, cloud-ready, distributed database.

An open source, distributed, in-memory database for scale-out applications.

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.

Data Ingestion

Change data capture from PostgreSQL into Kafka

1.49K
151
4y 70d
Apache-2.0

KafkaT-ool

482
78
4y 11m
Apache-2.0

Generic command line non-JVM Apache Kafka producer and consumer

3.13K
295
9m
n/a

INACTIVE: A PostgreSQL extension to produce messages to Apache Kafka.

109
15
6y 7m
n/a

The Apache Kafka C/C++ library

4.93K
2.23K
8m
n/a

Dockerfile for Apache Kafka

5.07K
2.28K
9m
Apache-2.0

CMAK is a tool for managing Apache Kafka clusters

9.83K
2.25K
1y 65d
Apache-2.0

Node.js client for Apache Kafka 0.8 and later.

2.48K
628
1y 11m
MIT

Secor is a service implementing Kafka log persistence

1.66K
513
8m
Apache-2.0

A kafka logger for winston

45
9
3y 18d
MIT

DEPRECATED: Data collection and processing made easy.

3.41K
559
5y 85d
n/a

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.87K
677
8m
Apache-2.0

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

1.5K
236
8m
Apache-2.0

Publish-subscribe messaging rethought as a distributed commit log.

Provides real-time data processing over large, distributed data streams.

Robust messaging for applications.

An open source data collector for unified logging layer.

An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.

Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.

Apache Pulsar is an open-source distributed pub-sub messaging system.
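
Several entries in this section describe publish-subscribe messaging built on an append-only log. As a rough illustration of that idea only, and not the API of Kafka, Pulsar, or any other tool listed here, a minimal in-memory sketch in Python follows. The names `CommitLog`, `append`, and `read` are hypothetical.

```python
from collections import defaultdict


class CommitLog:
    """Toy in-memory append-only log; hypothetical, not any listed tool's API."""

    def __init__(self):
        self._records = []                 # ordered; entries are never modified
        self._offsets = defaultdict(int)   # next offset to read, per consumer

    def append(self, message):
        """Publish: add a message to the end of the log and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, consumer, max_records=10):
        """Subscribe: return unread messages for a consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
log.append("signup")
log.append("click")
first = log.read("analytics")    # messages arrive in append order
log.append("purchase")
second = log.read("analytics")   # a later read resumes from the stored offset
third = log.read("billing")      # another consumer starts independently at offset 0
```

Because each consumer only tracks an offset, the log itself stays immutable and any number of readers can replay it at their own pace.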

File System

A pure python HDFS client

827
217
5y 82d
Apache-2.0

Utils for streaming large files (S3, HDFS, gzip, bz2...)

1.94K
271
8m
MIT

The GA Release of SnackFS

14
5
7y 8m
n/a

SeaweedFS is a distributed object store and file system to store and serve billions of files fast! Object store has O(1) disk seek, local tiering, cloud tiering. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV.

11.49K
1.48K
8m
Apache-2.0

A full-featured file system for online data storage

639
68
1y 79d
GPL-3.0

Provides Web Service based storage.

Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce

Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability

Orange File System is a branch of the Parallel Virtual File System

Gluster Filesystem

Fault-tolerant distributed file system for all storage needs

LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

Serialization format

A fast compressor/decompressor

4.54K
793
8m
n/a

Protocol Buffers - Google's data interchange format

46.46K
12.42K
8m
n/a

Java binary serialization and cloning: fast, efficient, automatic

4.95K
743
9m
BSD-3-Clause

Data interchange format with dynamic typing, untagged data, and absence of manually assigned IDs.

Columnar storage format based on assembly algorithms from Google's paper on Dremel.

A parallel implementation of gzip for modern

The smallest, fastest columnar storage for Hadoop workloads

Data interchange format that originated at Facebook.

SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats

Stream Processing

High-performance time-series aggregation for PostgreSQL

2.36K
217
2y 6m
Apache-2.0

Python Stream Processing

5.32K
438
1y 19d
n/a

The database built for IoT streaming data storage and real-time stream processing.

258
21
6m
n/a

A lightweight IoT edge analytics software

396
113
4m
Apache-2.0

A unified model and set of language-specific SDKs for defining and executing data processing workflows.

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

Open source platform for distributed stream and batch data processing.

Realtime computation system.

Apache Samza is a distributed stream processing framework

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

Apache Hudi is an open source framework for managing storage for real-time processing; one of its most interesting features is the upsert.

VoltDB is an ACID-compliant RDBMS that uses a shared-nothing architecture.

Streaming and tasks execution between Spring Boot apps

Bonobo is a data-processing toolkit for python 3.5+

Batch Processing

Connecting Apache Spark with different data stores [DEPRECATED]

198
44
5y 4m
Apache-2.0

A general-purpose data analysis engine radically changing the way batch and stream data is processed

0
0
3y 52d
MIT

Mirror of Apache Hivemall (incubating)

280
108
1y 62d
Apache-2.0

Python interface to Hive and Presto. 🐝

1.37K
444
11m
n/a

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner

Lightning-fast cluster computing

A community index of packages for Apache Spark

Livy, the REST Spark Server

A web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Tez

An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

H2O

Fast statistical, machine learning & math runtime.

An environment for quickly creating scalable performant machine learning applications.

Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.

An iterative graph processing system built for high scalability.

Apache Spark's API for graphs and graph-parallel computation.

A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Data warehouse software facilitates querying and managing large datasets residing in distributed storage.

Schema-free SQL Query Engine

Charts and Dashboards

Python helpers for building dashboards using Flask and React

2.25K
270
3y 7m
MIT

Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

13.96K
1.43K
8m
MIT

Apache Superset is a Data Visualization and Data Exploration Platform

35.42K
6.9K
8m
Apache-2.0

The simplest, fastest way to get business intelligence and analytics to everyone in your company

23.96K
3.18K
8m
n/a

Interactive charts for web.

Library written in vanilla JS for big data visualization.

D3-based reusable chart library.

Allows the user to manipulate documents based on data to render charts in SVG.

D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.

A JavaScript Charting Library for Streaming Data.

Interactive and realtime 2D/3D/Image plotting and science/engineering widgets.

Workflow

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

14.23K
2.23K
8m
Apache-2.0

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

20.5K
8.01K
8m
Apache-2.0

Pinball is a scalable workflow manager

1.05K
144
1y 10m
Apache-2.0

A data orchestrator for machine learning, analytics, and ETL.

2.93K
297
8m
Apache-2.0

Java based application development platform.

Batch workflow job scheduler.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs
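
The workflow tools in this section run tasks in an order that respects their dependencies. The sketch below shows dependency resolution alone, using Python's standard `graphlib` module (Python 3.9+) rather than the API of Luigi, Airflow, or any other listed tool. The task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load": {"clean"},
    "report": {"aggregate", "load"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
for task in order:
    print("running", task)
```

`TopologicalSorter` raises `graphlib.CycleError` if the tasks depend on each other in a loop, which is the kind of check a workflow manager would typically perform before scheduling anything.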

Data Lake Management

An open source platform that delivers resilience and manageability to object-storage based data lakes

85
5
1y 57d
Apache-2.0

ELK Elastic Logstash Kibana

Docker image for Logstash 1.4

239
95
5y 10m
MIT

JDBC importer for Elasticsearch

2.82K
717
4y 7m
n/a

Making Postgres and Elasticsearch work together like it's 2021

3.23K
148
9m
n/a

Docker

Package golang service into minimal docker containers.

669
17
3y 8m
n/a

Container data volume manager for your Dockerized application

3.31K
300
4y 10m
Apache-2.0

Simple, resilient multi-host containers networking and more.

6.01K
610
8m
n/a

A lightweight tool for easy deployment and rollback of dockerized applications.

186
19
1y 9m
Apache-2.0

Analyzes resource usage and performance characteristics of running containers.

11.75K
1.77K
8m
n/a

Docker microservice for saving/restoring volume data to S3

10
0
2y 35d
n/a

Docker composition tool with idempotency features for deploying apps composed of multiple containers.

410
21
3y 7m
n/a

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

8.11K
1.27K
8m
MPL-2.0

RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers

Application Containers for Masses

Visualize Docker images and the layers that compose them

Realtime

Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

331
71
3y 10m
n/a

The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.

Real-time data is available including comments, submissions and links posted to reddit

Data Dumps

GitHub's public timeline since 2011, updated every hour

Open source repository of web crawl data

Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Prometheus

The Prometheus monitoring system and time series database.

35.44K
5.67K
8m
Apache-2.0

Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

462
185
9m
Apache-2.0

Forums

News, tips and background on Data Engineering

Subreddit focused on ETL

Conferences

DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Podcasts

The show about modern data infrastructure.