Awesome Data Engineering

A curated list of data engineering tools for software developers

Last Update: Nov. 30, 2021, 11:21 a.m.

Thank you igorbarinov & contributors
View Topic on GitHub:
igorbarinov/awesome-data-engineering

Databases

The lightweight, distributed relational database built on SQLite

9.07K
481
7d
MIT

TiDB is an open source distributed HTAP database compatible with the MySQL protocol

29.64K
4.76K
6d
Apache-2.0

Pinterest MySQL Management Tools

880
146
2y 5m
GPL-2.0

HyperDex is a scalable, searchable key-value store

1.37K
161
5y 0d
BSD-3-Clause

Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library)

238
38
2y 106d
GPL-3.0

IonDB, a key-value datastore for resource constrained systems.

554
47
1y 11m
BSD-3-Clause

A script to easily create and destroy an Apache Cassandra cluster on localhost

1.18K
287
110d
Apache-2.0

NoSQL data store using the seastar framework, compatible with Apache Cassandra

7.28K
850
26d
AGPL-3.0

Distributed Prometheus time series database

1.32K
217
26d
Apache-2.0

Distributed Transactional In-Memory Database (the world's first MongoDB with distributed transaction support)

596
195
3y 7m
Apache-2.0

A distributed, fault-tolerant graph database

3.28K
264
4y 8m
n/a

A large-scale entity and relation database supporting aggregation of properties

1.64K
336
29d
Apache-2.0

Scalable datastore for metrics, events, and real-time analytics

22.32K
3.05K
27d
MIT

A scalable, distributed Time Series Database.

4.54K
1.25K
27d
LGPL-2.1

Fast scalable time series database

1.63K
343
40d
Apache-2.0

The Heroic Time Series Database

841
107
8m
Apache-2.0

Apache Druid: a high performance real-time analytics database.

11.31K
3.08K
26d
Apache-2.0

Time-series database

771
79
1y 5m
Apache-2.0

See GitLab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

703
45
2y 9m
MIT

A distributed system designed to ingest and process time series data

591
102
61d
Apache-2.0

Accumulo backed time series database

362
110
29d
Apache-2.0

Get your data in RAM. Get compute close to data. Enjoy the performance.

2.73K
285
27d
n/a

Greenplum Database - Massively Parallel PostgreSQL for Analytics. An open-source massively parallel data platform for analytics, machine learning and AI.

4.84K
1.38K
26d
n/a

An open-source graph database

13.98K
1.26K
90d
Apache-2.0

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

1.01K
200
34d
n/a

Data Ingestion

Change data capture from PostgreSQL into Kafka

1.5K
149
4y 103d
Apache-2.0

KafkaT-ool

492
82
2y 6m
Apache-2.0

Generic command line non-JVM Apache Kafka producer and consumer

3.77K
354
29d
n/a

INACTIVE: A PostgreSQL extension to produce messages to Apache Kafka.

111
15
6y 8m
n/a

The Apache Kafka C/C++ library

5.56K
2.5K
7d
n/a

Dockerfile for Apache Kafka

5.78K
2.52K
27d
Apache-2.0

CMAK is a tool for managing Apache Kafka clusters

10.46K
2.37K
42d
Apache-2.0

Node.js client for Apache Kafka 0.8 and later.

2.57K
636
57d
MIT

Secor is a service implementing Kafka log persistence

1.72K
523
27d
Apache-2.0

A Kafka logger for winston

44
10
3y 51d
MIT

DEPRECATED: Data collection and processing made easy.

3.41K
553
2y 8m
n/a

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

1.99K
709
27d
Apache-2.0

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

2.32K
378
8d
Apache-2.0

File System

A pure python HDFS client

841
220
2y 2d
Apache-2.0

Utils for streaming large files (S3, HDFS, gzip, bz2...)

2.27K
310
35d
MIT

The GA Release of SnackFS

13
5
6y 4m
n/a

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

13.19K
1.63K
7d
Apache-2.0

A full-featured file system for online data storage

806
87
57d
GPL-3.0

Serialization format

A fast compressor/decompressor

4.99K
853
33d
n/a

Protocol Buffers - Google's data interchange format

51.9K
13.45K
6d
n/a

Java binary serialization and cloning: fast, efficient, automatic

5.22K
763
20d
BSD-3-Clause

Stream Processing

Batch Processing

Charts and Dashboards

Python helpers for building dashboards using Flask and React

2.27K
273
3y 9m
MIT

Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

15.35K
1.58K
27d
MIT

Apache Superset is a Data Visualization and Data Exploration Platform

41.44K
8.08K
26d
Apache-2.0

The simplest, fastest way to get business intelligence and analytics to everyone in your company

26.58K
3.61K
6d
n/a

Workflow

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

15.16K
2.31K
23d
Apache-2.0

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

23.85K
9.63K
7d
Apache-2.0

Pinball is a scalable workflow manager

1.05K
142
1y 11m
Apache-2.0

An orchestration platform for the development, production, and observation of data assets.

3.97K
470
26d
Apache-2.0

Data Lake Management

Git-like capabilities for your object storage

1.82K
186
26d
Apache-2.0

ELK (Elasticsearch, Logstash, Kibana)

Docker image for Logstash 1.4

239
94
5y 11m
MIT

JDBC importer for Elasticsearch

2.84K
719
2y 9m
n/a

Making Postgres and Elasticsearch work together like it's 2021

3.74K
168
29d
n/a

Docker

Package Go services into minimal Docker containers.

665
17
3y 9m
n/a

Container data volume manager for your Dockerized application

3.32K
299
4y 6m
Apache-2.0

Simple, resilient multi-host containers networking and more.

6.2K
630
50d
n/a

A lightweight tool for easy deployment and rollback of dockerized applications.

188
20
1y 10m
Apache-2.0

Analyzes resource usage and performance characteristics of running containers.

12.7K
1.9K
27d
n/a

Docker microservice for saving/restoring volume data to S3

10
1
2y 68d
n/a

Docker composition tool with idempotency features for deploying apps composed of multiple containers.

409
22
3y 8m
n/a

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

10.19K
1.43K
26d
MPL-2.0

Realtime

Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

362
79
2y 12m
n/a

Data Dumps

Prometheus

The Prometheus monitoring system and time series database.

39.75K
6.59K
6d
Apache-2.0

Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

514
204
34d
Apache-2.0

Forums

Conferences

Podcasts