Awesome Apache Spark
A curated list of awesome Apache Spark packages and resources.
Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark
Language Bindings
A Clojure DSL for Apache Spark
C# and F# language binding and extensions to Apache Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
R interface for Apache Spark
Haskell on Apache Spark.
Notebooks and IDEs
Interactive and Reactive Data Science using Scala and Spark.
Jupyter magics and kernels for working with remote Spark clusters
General Purpose Libraries
A library that brings useful functions from various modern database management systems to Apache Spark
Essential Spark extensions and helper methods
PySpark methods to enhance developer productivity
Mirror of Apache DataFu
Joblib Apache Spark Backend
SQL Data Sources
CSV Data Source for Apache Spark 1.x
Avro Data Source for Apache Spark
XML data source for Spark SQL and DataFrames
DataStax Spark Cassandra Connector
The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV
The MongoDB Spark Connector
Apache Spark datasource for OrientDB
Storage
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Bioinformatics
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Scalable genomic data analysis.
GIS
Geo Spatial Data Analytics on Spark
A cluster computing framework for processing large-scale geospatial data
Time Series Analytics
A library for time series analysis on Apache Spark
A Time Series Library for Apache Spark
Graph Processing
Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Machine Learning Extension
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
An implementation of DBSCAN running on top of Apache Spark
(Deprecated) Scikit-learn integration package for Apache Spark
PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Sparkling Water provides H2O functionality inside Spark cluster
Building Large-Scale AI Applications for Distributed Big Data
MLeap: Deploy ML Pipelines to Production
Microsoft Machine Learning for Apache Spark
[status unknown] - linear algebra DSL and optimizer with R-like syntax.
Middleware
Mirror of Apache Livy (Incubating)
REST job server for Apache Spark
Serverless proxy for Spark cluster
Mirror of Apache Toree (Incubating)
Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark
Monitoring
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
Utilities
something to help you spark
Helpers & syntactic sugar for PySpark.
Apache (Py)Spark type annotations (stub files).
A command-line tool for launching Apache Spark clusters.
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Natural Language Processing
Stanford CoreNLP wrapper for Apache Spark
State of the Art Natural Language Processing
Streaming
Interfaces
NumPy and Pandas interface to Big Data
Koalas: pandas API on Apache Spark
Testing
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Base classes to use when writing tests with Spark
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Web Archives
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Workflow Management
Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
Books
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Jacek Laskowski. Focused on different aspects of Spark internals.
Papers
MOOCS
A series of courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
Workshops
UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.
Projects Using Spark
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
A scalable machine learning library on Apache Spark
DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities
Docker Images
Ready-to-run Docker images containing Jupyter applications
Miscellaneous
Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.