Awesome Empirical Software Engineering

A curated repository of software engineering repository mining data sets

Last Update: Dec. 3, 2020, 12:15 a.m.

Thank you dspinellis & contributors
View Topic on GitHub:
dspinellis/awesome-msr

😎 Awesome lists about all kinds of interesting topics

147.36K
19.29K
5d
CC0-1.0

Repositories

SIR

Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data.

About 20 datasets related to software engineering research.

Collaborative collection and analysis of free/libre/open source project data.

Software data collections in CERN's open-access repository.

Data Sets

A Database of Real Faults and an Experimental Infrastructure to Enable Controlled Experiments in Software Engineering Research

318
145
16d
n/a

The Bug Catalog of the Maven Ecosystem

1
3
5y 5m
n/a

Generating the Blueprints of the Java Ecosystem (MSR Data Paper 2015)

0
0
5y 6m
n/a

Multi-extract and Multi-level Dataset of Mozilla Issue Tracking History

6
1
4y 6m
Apache-2.0

A Data Set of OCL Expressions on GitHub

4
1
2y 5m
n/a

Continuous Unix commit history from 1970 until today

4.47K
356
50y 5m
n/a

Graph-based dataset of commit history of 8,431 real-world Android apps.

Collection of Android Applications.

Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.

Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.

Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.

Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.

Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.

All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'.

Bug Dataset of 15 Java open-source projects characterized by static source code metrics.

GitHub data accessible through Google's BigQuery platform.

Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.

Developer tool interaction data.

Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.

Data set containing a collection of engineered software projects from GHTorrent.

Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation.

STAMINA (STAte Machine INference Approaches) data, used to benchmark techniques for learning deterministic finite state machines (FSMs).

Anonymized dump of all user-contributed content on the Stack Exchange network.

Provides free and easy-to-use Travis CI build analyses.

Data about various aspects of Debian (e.g. packages, bugs, maintainers) in a single SQL database.

Datasets based on static source code, including the Bugcatchers Bug Dataset, the Bug Prediction Dataset, the Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets from the PROMISE repository.

Tools

A tool for mining commits and diffs from Git repositories to automatically extract code change pattern instances and features with AST analysis

55
17
4m
MIT

Collect and view OSS cryptocurrency development.

7
0
1y 5m
MIT

Database smell detector

10
1
2y 10m
MIT

Detects smells and computes metrics of Java code

90
32
4m
Apache-2.0

An agile tool to analyze Git repositories

15
5
8m
LGPL-3.0

This project mines Maven Central and creates a global dependency graph

21
7
11m
n/a

Send Sir Perceval on a quest to retrieve and gather data from software repositories.

192
111
10d
GPL-3.0

Smell detection tool for Puppet code

37
12
68d
Apache-2.0

Python Framework to analyse Git repositories

387
73
13d
Apache-2.0

C Quality Metrics

40
9
4m
n/a

Calculate the score of a repository based on best engineering practices.

76
16
2y 9m
Apache-2.0

A vulnerability patch gathering tool

22
13
1y 10m
Apache-2.0

Boa

Domain-specific language and infrastructure that eases mining software repositories.

Chidamber and Kemerer Java Metrics.

Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.

Free/Libre/Open Source tools for Software Development Analytics.

Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.

Research Outlets