Awesome Web Archiving

An Awesome List for getting started with web archiving

Thank you iipc & contributors
View Topic on GitHub:
iipc/awesome-web-archiving

Training/Documentation

Resources for Web Publishers

Tools & Software

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

58
8
3y 24d
CC-BY-SA-4.0

A curated list of awesome tools for website diffing and change monitoring.

167
14
1y 64d
CC0-1.0

Acquisition

22120 - Self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile and WebMemex, but gooderer.

926
43
11m
n/a

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
1y 53d
MIT

A Tool To Push Web Resources Into Web Archives

212
27
1y 93d
MIT

Run a high-fidelity browser-based crawler in a single Docker container

79
17
25d
AGPL-3.0

brozzler - distributed browser-based web crawler

411
75
1y 66d
Apache-2.0

NPM package and CLI tool for saving webpages

0
0
11m
GPL-3.0

Offline-first web browser

59
5
2y 9m
MIT

Web archiving using Google Chrome

33
5
1y 9m
MIT

A command-line tool and Python library for archiving data from Facebook using the Graph API.

71
11
3y 8m
NOASSERTION

Snapshots a web page to get it as a static, self-contained HTML document.

138
12
1y 77d
Unlicense

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

616
64
1y 74d
NOASSERTION

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

1.8K
662
1y 48d
n/a

Simple script to convert web resources to a single WARC file

6
2
5y 9m
MIT

โฌ›๏ธ CLI tool for saving complete web pages as a single HTML file

3.87K
114
1y 79d
Unlicense

Go package and CLI tool for saving a web page as a single HTML file

46
4
1y 100d
MIT

Web Extension for Firefox/Chrome/Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file

2.28K
240
1y 51d
AGPL-3.0

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

107
18
1y 5m
Apache-2.0

A command line tool (and Python library) for archiving Twitter JSON

866
195
1y 106d
MIT

Web Archiving Integration Layer: One-Click User Instigated Preservation

222
23
11m
MIT

WARC writing MITM HTTP/S proxy

266
46
50d
n/a

A dockerized, queued high fidelity web archiver based on Squidwarc

35
7
1y 93d
GPL-3.0

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

17
4
4y 12d
MIT

Wget with Lua extension

19
9
5y 10m
GPL-3.0

Wget-compatible web downloader and crawler.

387
60
1y 7m
GPL-3.0

A simple web crawler in Golang. (Stable)

An open source website copying utility. (Stable)

A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)

Open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo.

A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)

A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Web archiving service anyone can use for free to save web pages.

An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)

Replay

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

348
27
9m
MIT

The OpenWayback Development

350
223
1y 6m
Apache-2.0

Core Python Web Archiving Toolkit for replay and recording of web archives

715
108
1y 101d
GPL-3.0

Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).

A browser-based, fully client-side replay engine for both local and remote WARC files.

Search & Discovery

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

25
2
1y 60d
MIT

Playback with fun!

0
0
94d
GPL-3.0

WARC and ARC indexing and discovery tools.

83
19
1y 110d
Unknown

Prototype SOLR-powered web archive exploration UI.

36
8
1y 4m
Apache-2.0

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.

37
5
1y 50d
Apache-2.0

A Rails engine supporting the discovery of web archives.

36
6
1y 55d
NOASSERTION

16
3
2y 11m
MIT

Historical and current WHOIS,

Temporal web archive search based on Delicious tags. (Stable)

Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., [email protected] in Tempas). (Stable)

Utilities

A collection of tools for archiving and analysing the internet.

39
13
1y 7m
GPL-3.0

4
1
23d
Apache-2.0

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

25
2
3y 0d
Apache-2.0

Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

2
0
2y 7m
GPL-3.0

Converts HTTrack crawls to WARC files

11
2
34d
Apache-2.0

A Tool to Summarize Web Archive Holdings

2
0
1y 4m
MIT

A Memento Aggregator CLI and Server in Go

33
7
1y 7m
MIT

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with node.js

0
1
4y 93d
MIT

Web archive index server based on RocksDB

12
12
1y 66d
Apache-2.0

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

10
3
2y 3d
BSD-3-Clause

Tika based link extractor for httpreserve

3
1
1y 8m
Unknown

Java application to download WARCs from WASAPI

4
5
1y 5m
NOASSERTION

Partition (W)ARC Files by MIME Type and Year

0
1
4y 8m
MIT

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

32
7
3y 10m
MIT

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.

361
94
1y 53d
GPL-3.0

Service to return the status of a web page or save it to the Internet Archive. Returns JSON via browser or command line via CURL using GET (Golang Package). (Stable)

The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).

Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).

WARC I/O Libraries

A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

8
2
3y 8m
Unknown

Java library for reading and writing WARC files with a typed API

23
4
1y 119d
Apache-2.0

Parse And Create Web ARChive (WARC) files with node.js

44
13
1y 96d
MIT

Tool and library for handling Web ARChive (WARC) files.

87
12
1y 9m
GPL-3.0

Streaming WARC/ARC library for fast web archive IO

169
34
1y 70d
Apache-2.0

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

85
27
1y 54d
MIT

Go readers for ARC and WARC web archive formats

11
0
2y 7m
Unknown

Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

Analysis

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

110
14
1y 10m
MIT

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

5
0
1y 113d
Apache-2.0

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

99
30
8m
Apache-2.0

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

5
1
1y 113d
Apache-2.0

Quality Assurance

Powerful yet simple to use screenshot software

6.14K
428
1y 63d
GPL-3.0

fake keyboard/mouse input, window management, and more

1.3K
197
1y 73d
NOASSERTION

Browser extension: a link checker with more options.

Browser extension: link harvester on a page.

Browser extension: opens multiple URLs and also extracts URLs from text.

Browser extension: switches between browser tabs.

For running Xenu and Notepad++ on Ubuntu.

For running Xenu and Notepad++ on macOS.

Built-in Windows tool for partial screen capture and annotation. On macOS, the keyboard shortcut Command + Shift + 4 takes a partial screen capture.

For running Xenu and Notepad++ on macOS.

Desktop link checker for Windows.

Curation

Zotero extension that submits to and reads from web archives. Source on GitHub. Supersedes leonkt/zotero-memento.

Other Awesome Lists

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
1y 53d
MIT

A list of things related to software, literature, and other content for 🕣 Memento

47
6
1y 7m
CC0-1.0

Blogs and Scholarship

Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.

An open-source book that provides a conceptual overview of web archiving research, as well as several case studies.

The Web Science and Digital Libraries Research Group blogs about various web archiving topics, scholarly work, and academic trip reports.

David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.

Mailing Lists

Slack

Fill out this request form (https://docs.google.com/forms/d/e/1FAIpQLScXPIH0Ssw63yWqyMkUqHVYmz2-ItBMzHiJQ-sOlJwTA8u5AQ/viewform?usp=sf_link) for access to a group of researchers working with web archives.

Invite yourself (https://archivers-slack.herokuapp.com/) to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.

Twitter