Awesome Data Engineering
A curated list of data engineering tools for software developers
Thank you igorbarinov & contributors
The lightweight, distributed relational database built on SQLite.
TiDB is an open source distributed HTAP database compatible with the MySQL protocol
Pinterest MySQL Management Tools
HyperDex is a scalable, searchable key-value store
Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library)
IonDB, a key-value datastore for resource constrained systems.
A script to easily create and destroy an Apache Cassandra cluster on localhost
NoSQL data store using the Seastar framework, compatible with Apache Cassandra
Distributed Prometheus time series database
Distributed Transactional In-Memory Database (the world's first MongoDB to support distributed transactions)
A distributed, fault-tolerant graph database
A large-scale entity and relation database supporting aggregation of properties
Scalable datastore for metrics, events, and real-time analytics
A scalable, distributed Time Series Database.
Fast scalable time series database
The Heroic Time Series Database
Apache Druid: a high performance real-time analytics database.
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
See GitLab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb
A distributed system designed to ingest and process time series data
Accumulo backed time series database
Get your data in RAM. Get compute close to data. Enjoy the performance.
An open-source graph database
Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®
Powerful object-relational database system.
Provides a scalable database server with MySQL, Oracle, SQL Server, PostgreSQL, and MariaDB support.
An open source, massively scalable data store that requires zero administration.
A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
Provides a scalable, low-latency NoSQL online Database Service backed by SSDs.
A high performance NoSQL database supporting many data structures, an alternative to Redis
The right choice when you need scalability and high availability without compromising performance.
This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
Provides petabyte-scale data warehousing with columnar storage and multi-node compute.
An open-source, document database designed for ease of development and scaling.
Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
A document database that supports queries such as table joins and group by.
2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
An open source, distributed, in-memory database for scale-out applications.
Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
Change data capture from PostgreSQL into Kafka
Generic command line non-JVM Apache Kafka producer and consumer
INACTIVE: A PostgreSQL extension to produce messages to Apache Kafka.
The Apache Kafka C/C++ library
Dockerfile for Apache Kafka
CMAK is a tool for managing Apache Kafka clusters
Node.js client for Apache Kafka 0.8 and later.
Secor is a service implementing Kafka log persistence
A Kafka logger for winston
DEPRECATED: Data collection and processing made easy.
Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems. Gobblin features integrations with Apache Hadoop, Apache Kafka, Salesforce, S3, MySQL, Google etc.
Pandas on AWS
Provides real-time data processing over large, distributed data streams.
An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.
Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.
A pure Python HDFS client
Utils for streaming large files (S3, HDFS, gzip, bz2...)
The GA Release of SnackFS
SeaweedFS is a distributed object store and file system to store and serve billions of files fast! Object store has O(1) disk seek, transparent cloud integration. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV.
A full-featured file system for online data storage
Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
Orange File System is a branch of the Parallel Virtual File System
A fast compressor/decompressor
Protocol Buffers - Google's data interchange format
Java binary serialization and cloning: fast, efficient, automatic
Data interchange format with dynamic typing, untagged data, and absence of manually assigned IDs.
Columnar storage format based on assembly algorithms from Google's paper on Dremel.
High-performance time-series aggregation for PostgreSQL
Python Stream Processing
A unified model and set of language-specific SDKs for defining and executing data processing workflows.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Open source platform for distributed stream and batch data processing.
Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
Apache Hudi is an open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.
VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture.
Streaming and task execution between Spring Boot apps
Connecting Apache Spark with different data stores [DEPRECATED]
A general-purpose data analysis engine radically changing the way batch and stream data is processed
Mirror of Apache Hivemall (incubating)
Python interface to Hive and Presto. 🐝
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
An environment for quickly creating scalable performant machine learning applications.
Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
Apache Spark's API for graphs and graph-parallel computation.
A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Charts and Dashboards
Python helpers for building dashboards using Flask and React
Apache Superset is a Data Visualization and Data Exploration Platform
The simplest, fastest way to get business intelligence and analytics to everyone in your company
Allows the user to manipulate documents based on data to render charts in SVG.
D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pinball is a scalable workflow manager
A data orchestrator for machine learning, analytics, and ETL.
Data Lake Management
An open source platform that delivers resilience and manageability to object-storage based data lakes
ELK (Elasticsearch, Logstash, Kibana)
Docker image for Logstash 1.4
JDBC importer for Elasticsearch
Making Postgres and Elasticsearch work together like it's 2020
Package Go services into minimal Docker containers.
Container data volume manager for your Dockerized application
Simple, resilient multi-host container networking and more.
A lightweight tool for easy deployment and rollback of dockerized applications.
Analyzes resource usage and performance characteristics of running containers.
Docker microservice for saving/restoring volume data to S3
Docker composition tool with idempotency features for deploying apps composed of multiple containers.
Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.
The Prometheus monitoring system and time series database.
Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption
News, tips and background on Data Engineering