Awesome Apache Spark

A curated list of awesome Apache Spark packages and resources.

Last Update: Dec. 5, 2020, 6:15 a.m.

Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark

Language Bindings

A Clojure DSL for Apache Spark

606
90
2y 4m
EPL-1.0

C# and F# language binding and extensions to Apache Spark

923
210
1y 7m
MIT

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

1.5K
214
64d
MIT

R interface for Apache Spark

761
282
3d
Apache-2.0

Haskell on Apache Spark.

413
28
1y 5m
n/a

Notebooks and IDEs

Interactive and Reactive Data Science using Scala and Spark.

2.98K
648
1y 9m
Apache-2.0

Jupyter magics and kernels for working with remote Spark clusters

902
320
17d
n/a

A Scala kernel for Jupyter.

A web-based notebook that enables interactive data analytics

Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.

General Purpose Libraries

Support for efficient queries on compressed data.

SQL Data Sources

CSV Data Source for Apache Spark 1.x

1.02K
449
3y 11m
Apache-2.0

Avro Data Source for Apache Spark

534
318
1y 11m
Apache-2.0

XML data source for Spark SQL and DataFrames

325
183
50d
Apache-2.0

Spark library for easy MongoDB access

307
101
4y 98d
Apache-2.0

DataStax Spark Cassandra Connector

1.75K
843
32d
Apache-2.0

The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV

52
29
3y 8m
Apache-2.0

The MongoDB Spark Connector

574
250
5m
n/a

Apache Spark datasource for OrientDB

18
8
1y 10m
n/a

Storage

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

2.85K
634
47d
Apache-2.0

Bioinformatics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

869
293
30d
n/a

Scalable genomic data analysis.

687
184
2d
MIT

GIS

Geo Spatial Data Analytics on Spark

498
146
2y 5m
Apache-2.0

A cluster computing framework for processing large-scale geospatial data

805
379
2d
Apache-2.0

Time Series Analytics

A library for time series analysis on Apache Spark

1.14K
422
3y 8m
Apache-2.0

A Time Series Library for Apache Spark

852
173
1y 4m
Apache-2.0

Graph Processing

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

374
110
1y 22d
Apache-2.0
683
189
110d
Apache-2.0

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

232
85
15d
Apache-2.0

Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction, etc.).

Machine Learning Extension

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

90
11
4d
Apache-2.0

An implementation of DBSCAN running on top of Apache Spark

154
50
2y 11m
Apache-2.0

(Deprecated) Scikit-learn integration package for Apache Spark

1.05K
236
1y 3d
Apache-2.0

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

89
41
1y 10m
AGPL-3.0

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

614
170
2y 9m
GPL-3.0

Sparkling Water provides H2O functionality inside Spark cluster

881
360
4d
Apache-2.0

BigDL: Distributed Deep Learning Framework for Apache Spark

3.67K
898
2d
Apache-2.0

MLeap: Deploy Spark Pipelines to Production

1.15K
264
12d
Apache-2.0

Declarative machine learning framework on top of Spark.

(Status unknown) Linear algebra DSL and optimizer with R-like syntax.

Type safe machine learning pipelines with RDDs.

A system to manage machine learning models for spark.ml and scikit-learn.

Middleware

Mirror of Apache Livy (Incubating)

573
378
112d
Apache-2.0

REST job server for Apache Spark

2.62K
984
82d
n/a

Serverless proxy for Spark cluster

303
66
1y 60d
Apache-2.0

Mirror of Apache Toree (Incubating)

663
216
104d
Apache-2.0

Kyuubi JDBC.

271
84
54d
Apache-2.0

Utilities

something to help you spark

17
0
3y 6m
Apache-2.0

Helpers & syntactic sugar for PySpark.

43
3
32d
Apache-2.0

Apache (Py)Spark type annotations (stub files).

86
30
23d
Apache-2.0

A command-line tool for launching Apache Spark clusters.

542
104
42d
n/a

Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

959
192
16d
Apache-2.0

Natural Language Processing

Stanford CoreNLP wrapper for Apache Spark

417
122
2y 21d
GPL-3.0

State of the Art Natural Language Processing

1.73K
382
3d
Apache-2.0

Streaming

Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

NumPy and Pandas interface to Big Data

2.91K
376
1y 113d
n/a

Koalas: pandas API on Apache Spark

2.51K
290
3d
Apache-2.0

A unified model and set of language-specific SDKs for defining and executing data processing workflows.

Testing

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

1.33K
257
38d
Apache-2.0

Base classes to use when writing tests with Spark

1.2K
331
74d
Apache-2.0

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

234
40
44d
MIT

Web Archives

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

97
30
21d
Apache-2.0

Workflow Management

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

617
224
4d
n/a

Books

Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks

297
66
3y 6m
n/a

Slightly outdated (Spark 1.3) introduction to the Spark API. A good source of knowledge about basic concepts.

Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.

Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.

Petar Zečević, Marko Bonaći. (2016)

Papers

MOOCS

Workshops

A periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.73K
412
20d
Apache-2.0

A scalable machine learning library on Apache Spark

757
183
31d
n/a

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

168
54
1y 16d
Apache-2.0

Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

Blogs

A great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Docker Images

Ready-to-run Docker images containing Jupyter applications

5.3K
2.11K
101d
NOASSERTION
766
297
9m
Apache-2.0

Miscellaneous

"A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.

and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.