Awesome Apache Spark

A curated list of awesome Apache Spark packages and resources.

Last Update: Nov. 30, 2022, 3:01 p.m.

Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark

Language Bindings

A Clojure DSL for Apache Spark

609
88
4y 4m
EPL-1.0

C# and F# language binding and extensions to Apache Spark

928
215
3y 7m
MIT

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

1.71K
259
1y 31d
MIT

R interface for Apache Spark

819
291
1y 54d
Apache-2.0

Haskell on Apache Spark.

429
30
1y 89d
n/a

Notebooks and IDEs

Interactive and Reactive Data Science using Scala and Spark.

3.07K
659
1y 42d
Apache-2.0

Jupyter magics and kernels for working with remote Spark clusters

1.06K
381
1y 36d
n/a

General Purpose Libraries

A library that brings useful functions from various modern database management systems to Apache Spark

30
2
1y 6m
Apache-2.0

Essential Spark extensions and helper methods ✨😲

621
137
1y 5m
MIT

pyspark methods to enhance developer productivity 📣 👯 🎉

290
33
1y 8m
n/a

Mirror of Apache DataFu

83
57
1y 28d
Apache-2.0

Joblib Apache Spark Backend

174
20
1y 27d
Apache-2.0

SQL Data Sources

CSV Data Source for Apache Spark 1.x

1.04K
455
3y 11m
Apache-2.0

Avro Data Source for Apache Spark

540
317
3y 11m
Apache-2.0

XML data source for Spark SQL and DataFrames

385
201
1y 33d
Apache-2.0

DataStax Spark Cassandra Connector

1.81K
872
1y 36d
Apache-2.0

The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV

54
28
5y 8m
Apache-2.0

The MongoDB Spark Connector

621
270
1y 27d
n/a

Apache Spark datasource for OrientDB

19
9
1y 5m
n/a

Storage

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

3.77K
892
1y 26d
Apache-2.0

Bioinformatics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

904
298
1y 34d
n/a

Scalable genomic data analysis.

756
205
1y 26d
MIT

GIS

Geo Spatial Data Analytics on Spark

513
146
1y 96d
Apache-2.0

A cluster computing framework for processing large-scale geospatial data

1.02K
454
11m
Apache-2.0

Time Series Analytics

A library for time series analysis on Apache Spark

1.16K
430
2y 48d
Apache-2.0

A Time Series Library for Apache Spark

915
182
2y 5m
Apache-2.0

Graph Processing

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

377
113
1y 90d
Apache-2.0
816
204
1y 43d
Apache-2.0

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

267
97
1y 27d
Apache-2.0

Machine Learning Extension

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

117
12
1y 10m
Apache-2.0

An implementation of DBSCAN running on top of Apache Spark

163
51
4y 10m
Apache-2.0

(Deprecated) Scikit-learn integration package for Apache Spark

1.06K
235
2y 12m
Apache-2.0

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

91
43
3y 10m
AGPL-3.0

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

618
171
4y 4m
GPL-3.0

Sparkling Water provides H2O functionality inside Spark cluster

912
367
1y 27d
Apache-2.0

Building Large-Scale AI Applications for Distributed Big Data

3.79K
942
1y 26d
n/a

MLeap: Deploy ML Pipelines to Production

1.31K
287
1y 34d
Apache-2.0

Microsoft Machine Learning for Apache Spark

2.46K
579
1y 26d
MIT

[status unknown] - linear algebra DSL and optimizer with R-like syntax.

Middleware

Mirror of Apache livy (Incubating)

653
457
1y 39d
Apache-2.0

REST job server for Apache Spark

2.74K
998
1y 68d
n/a

Serverless proxy for Spark cluster

316
68
2y 32d
Apache-2.0

Mirror of Apache Toree (Incubating)

684
219
1y 26d
Apache-2.0

Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

807
250
1y 26d
Apache-2.0

Monitoring

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

185
24
1y 4m
n/a

Utilities

something to help you spark

17
0
5y 4m
Apache-2.0

Helpers & syntactic sugar for PySpark.

47
3
1y 48d
Apache-2.0

Apache (Py)Spark type annotations (stub files).

108
34
10m
Apache-2.0

A command-line tool for launching Apache Spark clusters.

584
111
1y 5m
n/a

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

1.18K
211
9m
Apache-2.0

Natural Language Processing

Stanford CoreNLP wrapper for Apache Spark

424
118
4y 16d
GPL-3.0

State of the Art Natural Language Processing

2.45K
506
1y 27d
Apache-2.0

Streaming

Interfaces

NumPy and Pandas interface to Big Data

3.02K
380
3y 108d
n/a

Koalas: pandas API on Apache Spark

3.08K
330
1y 40d
Apache-2.0

Testing

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

1.96K
355
1y 35d
Apache-2.0

Base classes to use when writing tests with Spark

1.31K
342
1y 27d
Apache-2.0

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

302
58
1y 35d
MIT

Web Archives

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

109
32
1y 29d
Apache-2.0

Workflow Management

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one-off use cases and massive-scale production environments

742
259
1y 26d
n/a

Books

Papers

MOOCS

Workshops

UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.79K
411
1y 4m
Apache-2.0

A scalable machine learning library on Apache Spark

779
185
1y 92d
n/a

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

170
53
3y 11d
Apache-2.0

Docker Images

Ready-to-run Docker images containing Jupyter applications

6.31K
2.56K
1y 26d
n/a
769
296
1y 8m
Apache-2.0

Miscellaneous