Awesome Apache Spark

A curated list of awesome Apache Spark packages and resources.

Last Update: Nov. 30, 2021, 11:21 a.m.

Thank you awesome-spark & contributors
View Topic on GitHub:
awesome-spark/awesome-spark

Language Bindings

A Clojure DSL for Apache Spark

609
88
3y 4m
EPL-1.0

C# and F# language binding and extensions to Apache Spark

932
215
2y 7m
MIT

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

1.71K
259
31d
MIT

R interface for Apache Spark

819
291
54d
Apache-2.0

Haskell on Apache Spark.

429
30
89d
n/a

Notebooks and IDEs

Interactive and Reactive Data Science using Scala and Spark.

3.07K
659
42d
Apache-2.0

Jupyter magics and kernels for working with remote Spark clusters

1.06K
381
36d
n/a

General Purpose Libraries

A library that brings useful functions from various modern database management systems to Apache Spark

30
2
6m
Apache-2.0

Essential Spark extensions and helper methods ✨😲

621
137
5m
MIT

pyspark methods to enhance developer productivity 📣 👯 🎉

290
33
8m
n/a

Mirror of Apache DataFu

83
57
28d
Apache-2.0

SQL Data Sources

CSV Data Source for Apache Spark 1.x

1.04K
455
2y 11m
Apache-2.0

Avro Data Source for Apache Spark

540
317
2y 11m
Apache-2.0

XML data source for Spark SQL and DataFrames

385
201
33d
Apache-2.0

Spark library for easy MongoDB access

306
102
5y 93d
Apache-2.0

DataStax Spark Cassandra Connector

1.81K
872
36d
Apache-2.0

The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV

54
28
4y 8m
Apache-2.0

The MongoDB Spark Connector

621
270
27d
n/a

Apache Spark datasource for OrientDB

19
9
5m
n/a

Storage

An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

3.77K
892
26d
Apache-2.0

Bioinformatics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

904
298
34d
n/a

Scalable genomic data analysis.

756
205
26d
MIT

GIS

Geo Spatial Data Analytics on Spark

513
146
96d
Apache-2.0

A cluster computing framework for processing large-scale geospatial data

992
440
27d
Apache-2.0

Time Series Analytics

A library for time series analysis on Apache Spark

1.16K
430
1y 48d
Apache-2.0

A Time Series Library for Apache Spark

915
182
1y 5m
Apache-2.0

Graph Processing

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

377
113
90d
Apache-2.0
816
204
43d
Apache-2.0

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

267
97
27d
Apache-2.0

Machine Learning Extension

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

117
12
10m
Apache-2.0

An implementation of DBSCAN running on top of Apache Spark

163
51
3y 10m
Apache-2.0

(Deprecated) Scikit-learn integration package for Apache Spark

1.06K
235
1y 12m
Apache-2.0

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

91
43
2y 10m
AGPL-3.0

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

618
171
3y 4m
GPL-3.0

Sparkling Water provides H2O functionality inside Spark cluster

912
367
27d
Apache-2.0

Building Large-Scale AI Applications for Distributed Big Data

3.79K
942
26d
n/a

MLeap: Deploy ML Pipelines to Production

1.31K
287
34d
Apache-2.0

Microsoft Machine Learning for Apache Spark

2.46K
579
26d
MIT

[status unknown] - linear algebra DSL and optimizer with R-like syntax.

Middleware

Mirror of Apache Livy (Incubating)

653
457
39d
Apache-2.0

REST job server for Apache Spark

2.74K
998
68d
n/a

Serverless proxy for Spark cluster

316
68
1y 32d
Apache-2.0

Mirror of Apache Toree (Incubating)

684
219
26d
Apache-2.0

Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

807
250
26d
Apache-2.0

Monitoring

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

185
24
4m
n/a

Utilities

something to help you spark

17
0
4y 4m
Apache-2.0

Helpers & syntactic sugar for PySpark.

47
3
48d
Apache-2.0

Apache (Py)Spark type annotations (stub files).

107
33
50d
Apache-2.0

A command-line tool for launching Apache Spark clusters.

584
111
5m
n/a

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

1.14K
211
13d
Apache-2.0

Natural Language Processing

Stanford CoreNLP wrapper for Apache Spark

424
118
3y 16d
GPL-3.0

State of the Art Natural Language Processing

2.45K
506
27d
Apache-2.0

Streaming

Interfaces

NumPy and Pandas interface to Big Data

3.01K
385
2y 108d
n/a

Koalas: pandas API on Apache Spark

3.03K
328
40d
Apache-2.0

Testing

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

1.96K
355
35d
Apache-2.0

Base classes to use when writing tests with Spark

1.31K
342
27d
Apache-2.0

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

302
58
35d
MIT

Web Archives

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

109
32
29d
Apache-2.0

Workflow Management

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

742
259
26d
n/a

Books

Papers

MOOCS

Workshops

UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

1.78K
412
4m
Apache-2.0

A scalable machine learning library on Apache Spark

779
185
92d
n/a

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

170
53
2y 11d
Apache-2.0

Docker Images

Ready-to-run Docker images containing Jupyter applications

6.31K
2.56K
26d
n/a
769
296
8m
Apache-2.0

Miscellaneous