Awesome Apache Spark

A curated list of awesome Apache Spark packages and resources.

Last Update: Feb. 26, 2021, 12:11 a.m.

Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark

Language Bindings

A Clojure DSL for Apache Spark

608
90
2y 7m
EPL-1.0

C# and F# language binding and extensions to Apache Spark

929
211
1y 10m
MIT

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

1.5K
214
4m
MIT

R interface for Apache Spark

772
283
11d
Apache-2.0

Haskell on Apache Spark.

416
29
21d
n/a

Notebooks and IDEs

Interactive and Reactive Data Science using Scala and Spark.

3.01K
656
1y 11m
Apache-2.0

Jupyter magics and kernels for working with remote Spark clusters

941
343
22d
n/a

almond-sh/almond - A Scala kernel for Jupyter.

A web-based notebook that enables interactive data analytics

polynote/polynote - Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. It originated at Netflix.

General Purpose Libraries

A library that brings excellent and useful functions from various modern database management systems to Apache Spark

10
0
9d
Apache-2.0

amplab/succinct - Support for efficient queries on compressed data.

SQL Data Sources

CSV Data Source for Apache Spark 1.x

1.03K
449
4y 49d
Apache-2.0

Avro Data Source for Apache Spark

536
318
2y 70d
Apache-2.0

XML data source for Spark SQL and DataFrames

339
186
7d
Apache-2.0

Spark library for easy MongoDB access

307
101
4y 6m
Apache-2.0

DataStax Spark Cassandra Connector

1.77K
849
29d
Apache-2.0

The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV

52
29
3y 11m
Apache-2.0

The MongoDB Spark Connector

587
254
23d
n/a

Apache Spark datasource for OrientDB

18
8
2y 30d
n/a

Storage

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

2.85K
634
4m
Apache-2.0

Bioinformatics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

874
293
14d
n/a

Scalable genomic data analysis.

700
188
7d
MIT

GIS

Geo Spatial Data Analytics on Spark

507
147
2y 7m
Apache-2.0

A cluster computing framework for processing large-scale geospatial data

837
395
12d
Apache-2.0

Time Series Analytics

A library for time series analysis on Apache Spark

1.15K
425
3y 11m
Apache-2.0

A Time Series Library for Apache Spark

874
173
1y 6m
Apache-2.0

Graph Processing

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

375
110
1y 105d
Apache-2.0

683
189
6m
Apache-2.0

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

242
89
9d
Apache-2.0

sparkling-graph/sparkling-graph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction, etc.).

Machine Learning Extension

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

110
11
31d
Apache-2.0

An implementation of DBSCAN running on top of Apache Spark

155
50
3y 49d
Apache-2.0

(Deprecated) Scikit-learn integration package for Apache Spark

1.05K
234
1y 86d
Apache-2.0

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

91
41
2y 42d
AGPL-3.0

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

614
171
3y 13d
GPL-3.0

Sparkling Water provides H2O functionality inside Spark cluster

887
361
7d
Apache-2.0

BigDL: Distributed Deep Learning Framework for Apache Spark

3.7K
904
36d
Apache-2.0

MLeap: Deploy ML Pipelines to Production

1.21K
267
15d
Apache-2.0

apache/systemml - Declarative machine learning framework on top of Spark.

Linear algebra DSL and optimizer with R-like syntax (status unknown).

Type safe machine learning pipelines with RDDs.

mitdbg/modeldb - A system to manage machine learning models for spark.ml and scikit-learn.

Middleware

Mirror of Apache livy (Incubating)

604
397
6m
Apache-2.0

REST job server for Apache Spark

2.67K
990
20d
n/a

Serverless proxy for Spark cluster

305
69
1y 4m
Apache-2.0

Mirror of Apache Toree (Incubating)

663
219
6m
Apache-2.0

Kyuubi JDBC.

271
84
4m
Apache-2.0

Utilities

something to help you spark

17
0
3y 9m
Apache-2.0

Helpers & syntactic sugar for PySpark.

43
3
115d
Apache-2.0

Apache (Py)Spark type annotations (stub files).

96
31
36d
Apache-2.0

A command-line tool for launching Apache Spark clusters.

562
106
11d
n/a

Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

978
194
3d
Apache-2.0

Natural Language Processing

Stanford CoreNLP wrapper for Apache Spark

421
121
2y 104d
GPL-3.0

State of the Art Natural Language Processing

1.95K
418
8d
Apache-2.0

Streaming

apache/bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

NumPy and Pandas interface to Big Data

2.93K
378
1y 6m
n/a

Koalas: pandas API on Apache Spark

2.66K
298
3d
Apache-2.0

A unified model and set of language-specific SDKs for defining and executing data processing workflows.

Testing

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

1.47K
279
53d
Apache-2.0

Base classes to use when writing tests with Spark

1.22K
334
5m
Apache-2.0

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

244
43
11d
MIT

Web Archives

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

99
30
30d
Apache-2.0

Workflow Management

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one-off use cases and massive-scale production environments.

645
232
7d
n/a

Books

Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

305
69
3y 9m
n/a

Slightly outdated (Spark 1.3) introduction to the Spark API. Good source of knowledge about basic concepts.

Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.

Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.

Petar Zečević, Marko Bonaći. (2016)

Papers

MOOCs

Workshops

Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.74K
410
8d
Apache-2.0

A scalable machine learning library on Apache Spark

760
183
114d
n/a

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

166
54
1y 99d
Apache-2.0

Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

Blogs

Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Docker Images

Ready-to-run Docker images containing Jupyter applications

5.3K
2.11K
6m
NOASSERTION

766
297
1y 15d
Apache-2.0

Miscellaneous

"A place to discuss and ask questions about using Scala for Spark programming", started by @deanwampler.

and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.