Awesome Apache Spark

A curated list of awesome Apache Spark packages and resources.

Last Update: June 26, 2022, 6:15 p.m.

Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark

Language Bindings

A Clojure DSL for Apache Spark

609
88
3y 11m
EPL-1.0

C# and F# language binding and extensions to Apache Spark

928
215
3y 64d
MIT

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

1.71K
259
8m
MIT

R interface for Apache Spark

819
291
8m
Apache-2.0

Haskell on Apache Spark.

429
30
9m
n/a

Notebooks and IDEs

Interactive and Reactive Data Science using Scala and Spark.

3.07K
659
8m
Apache-2.0

Jupyter magics and kernels for working with remote Spark clusters

1.06K
381
8m
n/a

General Purpose Libraries

A library that brings useful functions from various modern database management systems to Apache Spark

30
2
1y 48d
Apache-2.0

Essential Spark extensions and helper methods ✨😲

621
137
1y 7d
MIT

pyspark methods to enhance developer productivity 📣 👯 🎉

290
33
1y 92d
n/a

Mirror of Apache DataFu

83
57
7m
Apache-2.0

Joblib Apache Spark Backend

174
20
7m
Apache-2.0

SQL Data Sources

CSV Data Source for Apache Spark 1.x

1.04K
455
3y 6m
Apache-2.0

Avro Data Source for Apache Spark

540
317
3y 6m
Apache-2.0

XML data source for Spark SQL and DataFrames

385
201
8m
Apache-2.0

DataStax Spark Cassandra Connector

1.81K
872
8m
Apache-2.0

The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV

54
28
5y 103d
Apache-2.0

The MongoDB Spark Connector

621
270
7m
n/a

Apache Spark datasource for OrientDB

19
9
1y 15d
n/a

Storage

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

3.77K
892
7m
Apache-2.0

Bioinformatics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

904
298
8m
n/a

Scalable genomic data analysis.

756
205
7m
MIT

GIS

Geo Spatial Data Analytics on Spark

513
146
10m
Apache-2.0

A cluster computing framework for processing large-scale geospatial data

1.02K
454
6m
Apache-2.0

Time Series Analytics

A library for time series analysis on Apache Spark

1.16K
430
1y 8m
Apache-2.0

A Time Series Library for Apache Spark

915
182
1y 11m
Apache-2.0

Graph Processing

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

377
113
9m
Apache-2.0
816
204
8m
Apache-2.0

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

267
97
7m
Apache-2.0

Machine Learning Extension

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

117
12
1y 5m
Apache-2.0

An implementation of DBSCAN running on top of Apache Spark

163
51
4y 5m
Apache-2.0

(Deprecated) Scikit-learn integration package for Apache Spark

1.06K
235
2y 6m
Apache-2.0

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

91
43
3y 5m
AGPL-3.0

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

618
171
3y 11m
GPL-3.0

Sparkling Water provides H2O functionality inside Spark cluster

912
367
7m
Apache-2.0

Building Large-Scale AI Applications for Distributed Big Data

3.79K
942
7m
n/a

MLeap: Deploy ML Pipelines to Production

1.31K
287
8m
Apache-2.0

Microsoft Machine Learning for Apache Spark

2.46K
579
7m
MIT

[status unknown] - linear algebra DSL and optimizer with R-like syntax.

Middleware

Mirror of Apache Livy (Incubating)

653
457
8m
Apache-2.0

REST job server for Apache Spark

2.74K
998
9m
n/a

Serverless proxy for Spark cluster

316
68
1y 8m
Apache-2.0

Mirror of Apache Toree (Incubating)

684
219
7m
Apache-2.0

Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

807
250
7m
Apache-2.0

Monitoring

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

185
24
11m
n/a

Utilities

something to help you spark

17
0
4y 11m
Apache-2.0

Helpers & syntactic sugar for PySpark.

47
3
8m
Apache-2.0

Apache (Py)Spark type annotations (stub files).

108
34
5m
Apache-2.0

A command-line tool for launching Apache Spark clusters.

584
111
1y 14d
n/a

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

1.18K
211
4m
Apache-2.0

Natural Language Processing

Stanford CoreNLP wrapper for Apache Spark

424
118
3y 7m
GPL-3.0

State of the Art Natural Language Processing

2.45K
506
7m
Apache-2.0

Streaming

Interfaces

NumPy and Pandas interface to Big Data

3.02K
380
2y 10m
n/a

Koalas: pandas API on Apache Spark

3.08K
330
8m
Apache-2.0

Testing

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

1.96K
355
8m
Apache-2.0

Base classes to use when writing tests with Spark

1.31K
342
7m
Apache-2.0

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

302
58
8m
MIT

Web Archives

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

109
32
7m
Apache-2.0

Workflow Management

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

742
259
7m
n/a

Books

Papers

MOOCs

Workshops

UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.79K
411
11m
Apache-2.0

A scalable machine learning library on Apache Spark

779
185
10m
n/a

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

170
53
2y 7m
Apache-2.0

Docker Images

Ready-to-run Docker images containing Jupyter applications

6.31K
2.56K
7m
n/a
769
296
1y 108d
Apache-2.0

Miscellaneous