Awesome Apache Spark
A curated list of awesome Apache Spark packages and resources.
Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark
Language Bindings
A Clojure DSL for Apache Spark
C# and F# language binding and extensions to Apache Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
R interface for Apache Spark
Haskell on Apache Spark.
Notebooks and IDEs
Interactive and Reactive Data Science using Scala and Spark.
Jupyter magics and kernels for working with remote Spark clusters
almond-sh/almond - A Scala kernel for Jupyter.
A web-based notebook that enables interactive data analytics
Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
General Purpose Libraries
A library that brings excellent and useful functions from various modern database management systems to Apache Spark
amplab/succinct - Support for efficient queries on compressed data.
SQL Data Sources
CSV Data Source for Apache Spark 1.x
Avro Data Source for Apache Spark
XML data source for Spark SQL and DataFrames
Spark library for easy MongoDB access
DataStax Spark Cassandra Connector
The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV
The MongoDB Spark Connector
Apache Spark datasource for OrientDB
Storage
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Bioinformatics
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Scalable genomic data analysis.
GIS
Geo Spatial Data Analytics on Spark
A cluster computing framework for processing large-scale geospatial data
Time Series Analytics
A library for time series analysis on Apache Spark
A Time Series Library for Apache Spark
Graph Processing
Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.
Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
sparkling-graph/sparkling-graph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction, etc.).
Machine Learning Extension
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
An implementation of DBSCAN running on top of Apache Spark
(Deprecated) Scikit-learn integration package for Apache Spark
PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Sparkling Water provides H2O functionality inside Spark cluster
BigDL: Distributed Deep Learning Framework for Apache Spark
MLeap: Deploy ML Pipelines to Production
apache/systemml - Declarative machine learning framework on top of Spark.
[Status unknown] Linear algebra DSL and optimizer with R-like syntax.
mitdbg/modeldb - A system to manage machine learning models for spark.ml and scikit-learn.
Middleware
Mirror of Apache Livy (Incubating)
REST job server for Apache Spark
Serverless proxy for Spark cluster
Mirror of Apache Toree (Incubating)
Kyuubi JDBC.
Utilities
something to help you spark
Helpers & syntactic sugar for PySpark.
Apache (Py)Spark type annotations (stub files).
A command-line tool for launching Apache Spark clusters.
Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Natural Language Processing
Stanford CoreNLP wrapper for Apache Spark
State of the Art Natural Language Processing
Streaming
apache/bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
Interfaces
NumPy and Pandas interface to Big Data
Koalas: pandas API on Apache Spark
A unified model and set of language-specific SDKs for defining and executing data processing workflows.
Testing
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Base classes to use when writing tests with Spark
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Web Archives
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Workflow Management
Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
Books
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
Slightly outdated (Spark 1.3) introduction to Spark API. Good source of knowledge about basic concepts.
Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.
Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
Papers
Paper introducing a core distributed memory abstraction.
Paper introducing relational underpinnings, code generation and Catalyst optimizer.
MOOCS
Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
Scala oriented introductory course. Part of Functional Programming in Scala Specialization.
Workshops
Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.
Projects Using Spark
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
A scalable machine learning library on Apache Spark
DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities
Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
Blogs
Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.
Docker Images
Ready-to-run Docker images containing Jupyter applications
Miscellaneous
"A place to discuss and ask questions about using Scala for Spark programming", started by @deanwampler.
and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.