Awesome Web Archiving

An Awesome List for getting started with web archiving

Last Update: Nov. 30, 2020, 6:07 a.m.

Thank you iipc & contributors
View Topic on GitHub: iipc/awesome-web-archiving

Training/Documentation

Resources for Web Publishers

Tools & Software

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

58
8
2y 65d
CC-BY-SA-4.0

A curated list of awesome tools for website diffing and change monitoring.

167
14
105d
CC0-1.0

Acquisition

22120 - Self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile and WebMemex, but gooderer.

926
43
21d
n/a

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
94d
MIT

A Tool To Push Web Resources Into Web Archives

212
27
4m
MIT

brozzler - distributed browser-based web crawler

411
75
107d
Apache-2.0

NPM package and CLI tool for saving webpages

0
0
26d
GPL-3.0

Offline-first web browser

59
5
1y 10m
MIT

Web archiving using Google Chrome

33
5
11m
MIT

A commandline tool and Python library for archiving data from Facebook using the Graph API.

71
11
2y 10m
NOASSERTION

Snapshots a web page to get it as a static, self-contained HTML document.

138
12
118d
Unlicense

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

616
64
115d
NOASSERTION

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

1.8K
662
89d
n/a

simple script to convert web resources to a single warc file

6
2
4y 11m
MIT

โฌ›๏ธ CLI tool for saving complete web pages as a single HTML file

3.87K
114
4m
Unlicense

Go package and CLI tool for saving a web page as a single HTML file

46
4
4m
MIT

Web Extension for Firefox/Chrome/Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file

2.28K
240
92d
AGPL-3.0

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

107
18
6m
Apache-2.0

A command line tool (and Python library) for archiving Twitter JSON

866
195
4m
MIT

A dockerized, queued high fidelity web archiver based on Squidwarc

35
7
4m
GPL-3.0

Web Archiving Integration Layer: One-Click User Instigated Preservation

222
23
27d
MIT

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

17
4
3y 53d
MIT

Wget with Lua extension

19
9
4y 11m
GPL-3.0

Wget-compatible web downloader and crawler.

387
60
8m
GPL-3.0

A simple web crawler in Golang. (Stable)

An open source website copying utility. (Stable)

A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)

Open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo.

A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)

A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Web archiving service anyone can use for free to save web pages.

An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)

Replay

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

329
26
47d
MIT

The OpenWayback Development

350
223
7m
Apache-2.0

Core Python Web Archiving Toolkit for replay and recording of web archives

715
108
4m
GPL-3.0

Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).

A browser-based, fully client-side replay engine for both local and remote WARC files.

Search & Discovery

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

25
2
101d
MIT

WARC and ARC indexing and discovery tools.

83
19
5m
Unknown

Prototype SOLR-powered web archive exploration UI.

36
8
6m
Apache-2.0

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.

37
5
91d
Apache-2.0

A Rails engine supporting the discovery of web archives.

36
6
96d
NOASSERTION

16
3
2y 10d
MIT

Historical and current WHOIS,

Temporal web archive search based on Delicious tags. (Stable)

Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., [email protected] in Tempas). (Stable)

Utilities

A collection of tools for archiving and analysing the internet.

39
13
8m
GPL-3.0

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

25
2
2y 41d
Apache-2.0

Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

2
0
1y 8m
GPL-3.0

A Tool to Summarize Web Archive Holdings

2
0
6m
MIT

A Memento Aggregator CLI and Server in Go

33
7
8m
MIT

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

0
1
3y 4m
MIT

Web archive index server based on RocksDB

12
12
107d
Apache-2.0

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

10
3
1y 44d
BSD-3-Clause

Tika based link extractor for httpreserve

3
1
10m
Unknown

Java application to download WARCs from WASAPI

4
5
6m
NOASSERTION

Partition (W)ARC Files by MIME Type and Year

0
1
3y 9m
MIT

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

32
7
2y 12m
MIT

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.

361
94
94d
GPL-3.0

Service to return the status of a web page or save it to the Internet Archive. Returns JSON via browser or command line via CURL using GET (Golang Package). (Stable)

The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).

Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).

WARC I/O Libraries

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

8
2
2y 9m
Unknown

Java library for reading and writing WARC files with a typed API

23
4
5m
Apache-2.0

Parse And Create Web ARChive (WARC) files with node.js

44
13
4m
MIT

Tool and library for handling Web ARChive (WARC) files.

87
12
11m
GPL-3.0

Streaming WARC/ARC library for fast web archive IO

169
34
111d
Apache-2.0

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

85
27
95d
MIT

golang readers for ARC and WARC webarchive formats

11
0
1y 9m
Unknown

Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

Analysis

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

110
14
11m
MIT

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

5
0
5m
Apache-2.0

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

97
30
16d
Apache-2.0

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

5
1
5m
Apache-2.0

Archives Unleashed Cloud (AUK) is a web interface for analysing web archives. Currently, it can sync with Archive-It collections and extract hyperlink networks, full text, and other information from your collections. (Stable)

Quality Assurance

Powerful yet simple to use screenshot software

6.14K
428
104d
GPL-3.0

fake keyboard/mouse input, window management, and more

1.3K
197
114d
NOASSERTION

Browser extension: a link checker with more options.

Browser extension: link harvester on a page.

Browser extension: opens multiple URLs and also extracts URLs from text.

Browser extension: switches between browser tabs.

For running Xenu and Notepad++ on Ubuntu.

For running Xenu and Notepad++ on macOS.

Windows built-in tool for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (the keyboard shortcut for taking a partial screen capture).

For running Xenu and Notepad++ on macOS.

Desktop link checker for Windows.

Other Awesome Lists

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
94d
MIT

A list of things related to software, literature, and other content for 🕣 Memento

47
6
9m
CC0-1.0

Blogs and Scholarship

Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.

An open-source book that provides a conceptual overview of web archiving research, as well as several case studies.

The Web Science and Digital Libraries Research Group blogs about various web archiving related topics, scholarly work, and academic trip reports.

David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.

Mailing Lists

Slack

[Fill out this request form](https://docs.google.com/forms/d/e/1FAIpQLScXPIH0Ssw63yWqyMkUqHVYmz2-ItBMzHiJQ-sOlJwTA8u5AQ/viewform?usp=sf_link) for access to a researcher group of people working with web archives.

[Invite yourself](https://archivers-slack.herokuapp.com/) to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.

Twitter