Awesome Data Science

An awesome Data Science repository to learn from and apply to real-world problems.

Last Update: Oct. 27, 2021, 6:08 a.m.

Thank you academic & contributors
View Topic on GitHub:
academic/awesome-datascience

What is Data Science?

Colleges

Intensive Programs

MOOCs

Course materials for the Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1

3.71K
30.47K
5y 7m
n/a

Data Science Harvard University Assignments Lecture Notes Readings

Tutorials

Official repo for the #tidytuesday project

2.85K
1.07K
1y 0d
CC0-1.0

Ways of doing Data Science Engineering and Machine Learning in R and Python

526
248
3y 6m
n/a

🐍 Quick reference guide to common patterns & functions in PySpark.

105
37
9m
MIT

source code from the book Genetic Algorithms with Python by Clinton Sheppard

797
351
1y 112d
Apache-2.0

splearn: package for signal processing and machine learning with Python. Contains tutorials on understanding and applying signal processing.

0
0
11m
BSD-3-Clause

Free Courses

Toolboxes - Environment

The Data Science Lifecycle Process is a process for taking data science teams from Idea to Value repeatedly and sustainably. The process is documented in this repo.

212
35
1y 43d
MIT

Template repository for data science lifecycle project

59
19
1y 117d
n/a

A Temporal Extension Library for PyTorch Geometric

349
40
8m
MIT

Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

495
32
8m
GPL-3.0

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

1.16K
140
9m
GPL-3.0

🛠 All-in-one web-based IDE specialized for machine learning and data science.

1.74K
232
9m
Apache-2.0

Lightweight, Python library for fast and reproducible experimentation

126
31
2y 11m
MIT

Curated set of transformers that make your work with steppy faster and more effective

21
8
2y 11m
MIT

A GUI for Pandas DataFrames

1.75K
92
11m
MIT

Serverless proxy for Spark cluster

305
69
2y 21d
Apache-2.0

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware

3.85K
840
2y 5m
Apache-2.0

High performance distributed data processing engine

395
55
5m
Apache-2.0

Intel® Deep Learning Framework

315
90
5y 4m
n/a

Julia kernel for Jupyter

2.19K
356
8m
MIT

An open source python library for automated feature engineering

5.4K
708
8m
BSD-3-Clause

Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

978
194
8m
Apache-2.0

Fast image augmentation library and easy to use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about library: https://www.mdpi.com/2078-2489/11/2/125

7.35K
960
8m
MIT

🦉Data Version Control | Git for Data & Models

7.31K
695
8m
Apache-2.0

Feature engineering and machine learning: together at last!

0
0
2y 9m
MIT

Feature Store for Machine Learning

1.45K
252
8m
Apache-2.0

Machine Learning Platform for Kubernetes

2.73K
265
8m
Apache-2.0

ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, ML-Ops and Data-Management

2.18K
328
8m
Apache-2.0

Hopsworks - Data-Intensive AI platform with a Feature Store

392
61
8m
n/a

Predictive AI layer for existing databases.

3.23K
413
8m
GPL-3.0

Lightwood is Legos for Machine Learning.

104
24
8m
GPL-3.0

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

1.5K
236
8m
Apache-2.0

♾️ CML - Continuous Machine Learning | CI/CD for ML

2.17K
134
8m
Apache-2.0

Grid studio is a web-based application for data science with full integration of open source data science frameworks and languages.

8.07K
1.4K
1y 25d
AGPL-3.0

Python Data Science Handbook: full text in Jupyter Notebooks

28.15K
12.51K
2y 11m
n/a

A data-driven approach to quantify the value of classifiers in a machine learning ensemble.

15
2
10m
MIT

easily explore, visualize, analyze, and transform data using familiar languages, such as Python and SQL, interactively.

is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.

The R Project for Statistical Computing.

IDE – powerful user interface for R. It’s free and open source, and works on Windows, Mac, and Linux.

Machine learning in Python. sklearn

A fundamental package for scientific computing with Python.

A Python-based ecosystem of open-source software for mathematics, science, and engineering.

Take numerical, textual, image, GIS or other data and give it the Wolfram treatment, carrying out a full spectrum of data science analysis and visualization and automatically generating rich interactive reports—all powered by the revolutionary knowledge-based Wolfram Language.

💲 Datadog is a full-stack monitoring service for large-scale cloud environments that aggregates metrics/events from servers, databases, and applications. It includes support for Docker, Kubernetes, and Mesos.

Build powerful data visualizations for the web without writing JavaScript

The Kite Software Development Kit (Apache License, Version 2.0), or Kite for short, is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

Run, scale, share, and deploy your models — without any infrastructure or setup.

A platform for efficient, distributed, general-purpose data processing.

Apache Hama is an Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce.

Weka is a collection of machine learning algorithms for data mining tasks.

GNU Octave is a high-level interpreted language, primarily intended for numerical computations. (Free Matlab)

Lightning-fast cluster computing

Deep Learning Framework

Scientific computing framework with wide support for machine learning algorithms, used by Facebook, Google, and more.

A machine learning package built for humans.

An open source data visualization platform helping everyone to create simple, correct and embeddable charts. Also at github.com

TensorFlow is an Open Source Software Library for Machine Intelligence

A leading platform for building Python programs to work with human language data.

high-level, high-performance dynamic programming language for technical computing

Web-based notebook that enables data-driven,

Text Annotation Tool for teams

A Pandas-like interface, but for larger-than-memory data and "under the hood" parallelism. Very interesting, but only needed when you're getting advanced.

Topic Modelling for Humans.

A library for industrial-strength natural language processing in Python and Cython.

Machine Learning in General Purpose

A scikit-learn based module for multi-label et al. classification

632
122
2y 5m
BSD-2-Clause

Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models

452
67
4y 78d
n/a

open-source feature selection repository in python

1.05K
350
1y 11m
GPL-2.0

A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.

309
58
8m
MIT

Sequence learning toolkit for Python

586
98
5y 8m
MIT

Python package for Bayesian Machine Learning with scikit-learn API

420
108
1y 9m
MIT

scikit-learn inspired API for CRFsuite

360
154
1y 10m
n/a

Use evolutionary algorithms instead of gridsearch in scikit-learn

628
112
1y 10m
MIT

SigOpt wrappers for scikit-learn methods

69
12
1y 6m
MIT

Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.

286
26
10m
MIT

Image processing in Python

4.2K
1.76K
8m
n/a

Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm, Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP (Traveling Salesman)

1.98K
463
8m
MIT

Multiple Pairwise Comparisons (Post Hoc) Tests in Python

178
19
8m
MIT

Simple structured learning framework for python

630
169
3y 26d
BSD-2-Clause

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

2.84K
510
1y 7m
Apache-2.0

cuML - RAPIDS Machine Learning Library

1.95K
313
8m
Apache-2.0

Uplift modeling and causal inference with machine learning algorithms

1.68K
263
8m
n/a

mlpack: a scalable C++ machine learning library

3.56K
1.32K
8m
n/a

A library of extension and helper modules for Python's data analysis and machine learning libraries.

3.35K
695
8m
n/a

A modular active learning framework for Python

1.1K
183
9m
MIT

PySpark + Scikit-learn = Sparkit-learn

1.06K
239
4y 4d
Apache-2.0

50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

1.2K
108
1y 39d
BSD-3-Clause

A toolkit for making real world machine learning and data analysis applications in C++

9.91K
2.88K
8m
BSL-1.0

Python implementation of the rulefit algorithm

219
67
11m
MIT

[HELP REQUESTED] Generalized Additive Models in Python

561
105
1y 104d
Apache-2.0

The most popular Python library for Machine Learning.

Machine learning toolbox.

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

46.37K
12.33K
8m
n/a

Datasets, Transforms and Models specific to Computer Vision

8.42K
4.35K
8m
BSD-3-Clause

Data loaders and abstractions for text and NLP

2.65K
614
8m
BSD-3-Clause

Data manipulation and transformation for audio signal processing, powered by PyTorch

1.23K
280
8m
BSD-2-Clause

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

3.27K
438
8m
BSD-3-Clause

Simple tools for logging and visualizing, loading and training

1.28K
190
9m
BSD-3-Clause

A simplified framework and utilities for PyTorch

452
49
8m
LGPL-3.0

A scikit-learn compatible neural network library that wraps PyTorch

3.79K
281
8m
BSD-3-Clause

Python package facilitating the use of Bayesian Deep Learning methods with Variational Inference for PyTorch

315
46
2y 10m
MIT

Geometric Deep Learning Extension Library for PyTorch

10.27K
1.76K
8m
MIT

A highly efficient and modular implementation of Gaussian Processes in PyTorch

2.3K
328
8m
MIT

Deep universal probabilistic programming with Python and PyTorch

6.74K
826
8m
Apache-2.0

Accelerated deep learning R&D

2.44K
280
9m
Apache-2.0

A standard framework for modelling Deep Learning Models for tabular data

318
26
5m
MIT

tensorflow

An Open Source Machine Learning Framework for Everyone

153.46K
84.06K
8m
Apache-2.0

Deep Learning and Reinforcement Learning Library for Scientists and Engineers 🔥

6.5K
1.46K
8m
n/a

Deep learning library featuring a higher-level API for TensorFlow.

9.52K
2.43K
11m
n/a

TensorFlow-based neural network library

8.77K
1.26K
8m
Apache-2.0

A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

5.93K
1.77K
8m
Apache-2.0

TensorFlow Reinforcement Learning

3.06K
371
1y 6m
Apache-2.0

Machine Learning Platform for Kubernetes

2.73K
265
8m
Apache-2.0

NeuPy is a Tensorflow based python library for prototyping and building neural networks

665
148
2y 56d
MIT

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

346
39
9m
BSD-3-Clause

TensorFlow ROCm port

543
65
8m
Apache-2.0

Deep learning with dynamic computation graphs in TensorFlow

1.8K
279
3y 12m
Apache-2.0

📝 Wrapper library for text generation / language models at char and word level with RNN in TensorFlow

62
30
3y 6m
MIT

TensorLight - A high-level framework for TensorFlow

9
2
4y 5m
MIT

Mesh TensorFlow: Model Parallelism Made Easier

873
156
8m
Apache-2.0

Ludwig is a toolbox that allows you to train and evaluate deep learning models without the need to write code.

7.57K
889
8m
Apache-2.0

TF-Agents is a library for Reinforcement Learning in TensorFlow

1.8K
471
8m
Apache-2.0

Tensorforce: a TensorFlow library for applied reinforcement learning

2.88K
489
8m
Apache-2.0

keras

Keras community contributions

1.49K
615
1y 10m
MIT

Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization

2.08K
301
10m
MIT

Distributed Deep learning with Keras & Spark

1.45K
288
8m
MIT

Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.

496
51
4y 5m
MIT

Graph Neural Networks with Keras and Tensorflow 2.

1.63K
198
8m
MIT

QKeras: a quantization deep learning library for Tensorflow Keras

246
50
8m
Apache-2.0

Deep Reinforcement Learning for Keras.

4.96K
1.3K
1y 11m
MIT

Hyperparameter Optimization for TensorFlow, Keras and PyTorch

1.38K
229
11m
MIT

A high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. Keras compatible

Visualization Tools - Environments

Library for animated data visualizations and data stories.

386
10
8d
Apache-2.0

Debugging, monitoring and visualization for Python Machine Learning and Data Science

3.03K
326
9m
MIT

Declarative statistical visualization library for Python. Can easily perform many data transformations within the code to create graphs.

Three libraries for traditional charts, stock, and maps. Features a hand-drawn style theme option.

Set of products for charting different types of data. Has a special Oracle Apex integration option.

Allows the user to manipulate documents based on data to render charts in SVG.

A data visualization package based on the grammar of graphics.

A series of charting libraries for a variety of uses. Can be compatible back to IE6.

A Python 2D plotting library.

list of open source data visualization tools

A python visualization library based on matplotlib.

A high-productivity software for complex networks.

C3

D3-based reusable chart library

Journals, Publications and Magazines

Presentations

Podcasts

Books

Free e-book complemented by an online course

Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. This book will teach you the core concepts behind neural networks and deep learning.

Bloggers

Greg Reda Personal Blog

Kevin Davenport Personal Blog

Recurse Center alumna

Tech Blog on Master Data Management And Every Buzz Surrounding It

The Open Source Data Science Masters

Based in the UK and working globally, Cloud of Data's consultancy services help clients understand the implications of taking data and more to the Cloud.

Data Science London is a non-profit organization dedicated to the free, open, dissemination of data science.

by Peter Skomoroch. MACHINE LEARNING, DATA MINING, AND MORE

Data Science Questions and Answers from experts

a PhD student at Berkeley

MDS, Inc. Helps Build Careers in Data Science, Advanced Analytics, Big Data Architecture, and High Performance Software Engineering

a technology guy with a penchant for the web and for data, big and small

about helping professional programmers to confidently apply machine learning algorithms to address complex problems.

data-driven consulting and design

a data scientist at Twitch. I handle the whole data pipeline, from tracking to model-building to reporting.

Data Mining, Analytics, Big Data, Data Science; not a blog, a portal

is some of, all of, or much more than the above and this blog explores its impact on information technology, the business world, government agencies, and our lives.

How a Social Scientist Jumps into the World of Big Data

Thoughts on Statistical Computing and Visualization

Learning To Be A Data Scientist

Musings on data science, machine learning and stats.

The File Drawer (http://chris-said.io/) - Chris Said's science blog

Visualization and Statistics

A Machine Learning Craftsmanship Blog

Handbook and recipes for data-driven solutions of real-world problems

A blog on the new emerging data economy

A blog with resources for data science learners

A full-fledged website about data science and analytics study material.

Data science tutorials for beginners!

Blog for understanding Neural Networks!

Blog for NLP and transfer learning!

Dedicated to clear explanations of machine learning!

Data Science with Esoteric programming languages

Facebook Accounts

Twitter Accounts

Rapid-fire, live tryouts for data scientists seeking to monetize their models as trading strategies

Big Data, Data Science, Predictive Modeling, Business Analytics, Hadoop, Decision and Operations Research.

Data scientist at Twitter

Dev, Design, Data Science @mattermark #hackerei

datascientist @Ekimetrics. #machinelearning #dataviz #DynamicCharts #Hadoop #R #Python #NLP #Bitcoin #dataenthousiast

Data Science Central is the industry's single resource for Big Data practitioners.

Data Science. Big Data. Data Hacks. Data Junkies. Data Startups. Open Data

Documenting my path from SQL Data Analyst pursuing an Engineering Master's Degree to Data Scientist

Mission is to help guide & advance careers in Data Science & Analytics

Tips and Tricks for Data Scientists around the world! #datascience #bigdata

White House Data Chief, VP @ RelateIQ.

Data nerd, hacker, student of conflict.

Running with #BigData--enjoying a love/hate relationship with its hype. @iSchoolSU #DataScience Program Mgr.

Working @ GrubHub about data and pandas

KDnuggets President, Analytics/Big Data/Data Mining/Data Science expert, KDD & SIGKDD co-founder, was Chief Scientist at 2 startups, part-time philosopher.

Data Scientist in Residence at @accel.

ReTweeting about data science

Scientist at Facebook and Julia developer. Author of Machine Learning for Hackers and Bandit Algorithms for Website Optimization. Tweets reflect my views only.

Principal Data Scientist @ Microsoft Data Science Team

Hacker - Pandas - Data Analyze

The Economist's Data Editor and co-author of Big Data (http://big-data-book.com).

Data science instructor, and founder of Data School

Interactive data visualization and tools. Data flaneur.

DataScientist, PhD Astrophysicist, Top #BigData Influencer.

PhD Student. Programming, Mobile, Web. Artificial Intelligence, Intelligent Robotics Machine Learning, Data Mining, Natural Language Processing, Data Science.

Opinions of full-stack Python guy, author, instructor, currently playing Data Scientist. Occasional fathering, husbanding, ult|goalt-imate, organic gardening.

Data Scientist at BizQualify, Developer

Data @ Jawbone. Turned data into stories & products at LinkedIn. Text mining, applied machine learning, recommender systems. Ex-gamer, ex-machine coder; namer.

Visualization & interaction designer. Practical cyclist. Author of vis books: http://www.oreilly.com/pub/au/4419

Cloud Computing/ Big Data/ Open Data Analyst & Consultant. Writer, Speaker & Moderator. Gigaom Research Analyst.

Creating intelligent systems to automate tasks & improve decisions. Entrepreneur, ex Principal Data Scientist @LinkedIn. Machine Learning, ProductRei, Networks

Solution Architect @ IBM, Master Data Management, Data Quality & Data Governance Blogger. Data Science, Hadoop, Big Data & Cloud.

Tweet blog posts from the R blogosphere, data science conferences and (!) open jobs for data scientists.

Computer scientist researching artificial intelligence. Data tinkerer. Community leader for @DataIsBeautiful. #OpenScience advocate.

Data Science geek @ UALR

Data scientist, genetic origamist, hardware aficionado

Social Scientist. Hacker. Facebook Data Science Team. Keywords: Experiments, Causal Inference, Statistics, Machine Learning, Economics.

Data Scientist at BBVA Compass

Enjoys ABM, SNA, DM, ML, NLP, HI, Python, Java. Top percentile kaggler/data scientist

Complex Event Processing, Big Data, Artificial Intelligence and Machine Learning. Passionate about programming and open-source.

InfoGov; Bigdata; Data as a Service; Data Science; Open, Social & Business Data Convergence

IT analyst with Ovum covering Big Data & data management with some systems engineering thrown in.

Data Scientist | Author | Entrepreneur. Co-founder @DataCommunityDC. Founder @DistrictDataLab. #DataScience #BigData #DataDC

Data Science @ PayPal. #NLP, #machinelearning; PhD, Carnegie Mellon alumni (Blog: https://allthingsds.wordpress.com)

Pandas (Python Data Analysis library).

Senior Manager - @Seagate Big Data Analytics | @McKinsey Alum | #BigData + #Analytics Evangelist | #Hadoop, #Cloud, #Digital, & #R Enthusiast

The data news crew at @WNYC. Practicing data-driven journalism, making it visual and showing our work.

Newsletters

YouTube Videos & Channels