Awesome Apache Spark
A curated list of awesome Apache Spark packages and resources.
Thanks to awesome-spark and its contributors.
A Clojure DSL for Apache Spark
C# and F# language binding and extensions to Apache Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
R interface for Apache Spark
Haskell on Apache Spark.
Notebooks and IDEs
Interactive and Reactive Data Science using Scala and Spark.
Jupyter magics and kernels for working with remote Spark clusters
A Scala kernel for Jupyter.
A web-based notebook that enables interactive data analytics
Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
General Purpose Libraries
A library that brings excellent and useful functions from various modern database management systems to Apache Spark
SQL Data Sources
CSV Data Source for Apache Spark 1.x
Avro Data Source for Apache Spark
XML data source for Spark SQL and DataFrames
Spark library for easy MongoDB access
DataStax Spark Cassandra Connector
The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV
The MongoDB Spark Connector
Apache Spark datasource for OrientDB
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Scalable genomic data analysis.
Geo Spatial Data Analytics on Spark
A cluster computing framework for processing large-scale geospatial data
Time Series Analytics
A library for time series analysis on Apache Spark
A Time Series Library for Apache Spark
Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
Machine Learning Extensions
C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.
An implementation of DBSCAN running on top of Apache Spark
(Deprecated) Scikit-learn integration package for Apache Spark
PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Sparkling Water provides H2O functionality inside Spark cluster
BigDL: Distributed Deep Learning Framework for Apache Spark
MLeap: Deploy ML Pipelines to Production
Declarative machine learning framework on top of Spark.
A linear algebra DSL and optimizer with R-like syntax.
Mirror of Apache Livy (Incubating)
REST job server for Apache Spark
Serverless proxy for Spark cluster
Mirror of Apache Toree (Incubating)
something to help you spark
Helpers & syntactic sugar for PySpark.
Apache (Py)Spark type annotations (stub files).
A command-line tool for launching Apache Spark clusters.
Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Natural Language Processing
Stanford CoreNLP wrapper for Apache Spark
State of the Art Natural Language Processing
NumPy and Pandas interface to Big Data
Koalas: pandas API on Apache Spark
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Base classes to use when writing tests with Spark
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks.
Slightly outdated (Spark 1.3) introduction to Spark API. Good source of knowledge about basic concepts.
Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.
Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
Paper introducing a core distributed memory abstraction.
Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
Projects Using Spark
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
A scalable machine learning library on Apache Spark
DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities
Ready-to-run Docker images containing Jupyter applications
A place to discuss and ask questions about using Scala for Spark programming, started by @deanwampler.