Awesome Web Archiving

An Awesome List for getting started with web archiving

Thank you iipc & contributors
View Topic on GitHub:
iipc/awesome-web-archiving

Training/Documentation

Resources for Web Publishers

Tools & Software

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

58
8
2y 5m
CC-BY-SA-4.0

A curated list of awesome tools for website diffing and change monitoring.

167
14
6m
CC0-1.0

Acquisition

22120 - Self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile and WebMemex, but gooderer.

926
43
113d
n/a

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
6m
MIT

A Tool To Push Web Resources Into Web Archives

212
27
7m
MIT

brozzler - distributed browser-based web crawler

411
75
6m
Apache-2.0

NPM package and CLI tool for saving webpages

0
0
118d
GPL-3.0

Offline-first web browser

59
5
2y 48d
MIT

Web archiving using Google Chrome

33
5
1y 63d
MIT

A commandline tool and Python library for archiving data from Facebook using the Graph API.

71
11
3y 33d
NOASSERTION

Snapshots a web page to get it as a static, self-contained HTML document.

138
12
7m
Unlicense

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

616
64
6m
NOASSERTION

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

1.8K
662
6m
n/a

simple script to convert web resources to a single warc file

6
2
5y 64d
MIT

โฌ›๏ธ CLI tool for saving complete web pages as a single HTML file

3.87K
114
7m
Unlicense

Go package and CLI tool for saving a web page as a single HTML file

46
4
7m
MIT

Web Extension for Firefox/Chrome/Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file

2.28K
240
6m
AGPL-3.0

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

107
18
9m
Apache-2.0

A command line tool (and Python library) for archiving Twitter JSON

866
195
7m
MIT

A dockerized, queued high fidelity web archiver based on Squidwarc

35
7
7m
GPL-3.0

Web Archiving Integration Layer: One-Click User Instigated Preservation

222
23
119d
MIT

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

17
4
3y 4m
MIT

Wget with Lua extension

19
9
5y 77d
GPL-3.0

Wget-compatible web downloader and crawler.

387
60
11m
GPL-3.0

A simple web crawler in Golang. (Stable)

An open source website copying utility. (Stable)

A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)

Open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo.

A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)

A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Web archiving service anyone can use for free to save web pages.

An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)

Replay

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

348
27
48d
MIT

The OpenWayback Development

350
223
10m
Apache-2.0

Core Python Web Archiving Toolkit for replay and recording of web archives

715
108
7m
GPL-3.0

Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).

A browser-based, fully client-side replay engine for both local and remote WARC files.

Search & Discovery

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

25
2
6m
MIT

WARC and ARC indexing and discovery tools.

83
19
8m
Unknown

Prototype SOLR-powered web archive exploration UI.

36
8
9m
Apache-2.0

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.

37
5
6m
Apache-2.0

A Rails engine supporting the discovery of web archives.

36
6
6m
NOASSERTION

16
3
2y 102d
MIT

Historical and current WHOIS,

Temporal web archive search based on Delicious tags. (Stable)

Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., [email protected] in Tempas). (Stable)

Utilities

A collection of tools for archiving and analysing the internet.

39
13
11m
GPL-3.0

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

25
2
2y 4m
Apache-2.0

Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

2
0
1y 11m
GPL-3.0

A Tool to Summarize Web Archive Holdings

2
0
9m
MIT

A Memento Aggregator CLI and Server in Go

33
7
12m
MIT

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

0
1
3y 7m
MIT

Web archive index server based on RocksDB

12
12
6m
Apache-2.0

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

10
3
1y 4m
BSD-3-Clause

Tika based link extractor for httpreserve

3
1
1y 37d
Unknown

Java application to download WARCs from WASAPI

4
5
9m
NOASSERTION

Partition (W)ARC Files by MIME Type and Year

0
1
4y 18d
MIT

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

32
7
3y 89d
MIT

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.

361
94
6m
GPL-3.0

Service to return the status of a web page or save it to the Internet Archive. Returns JSON via browser or command line via cURL using GET (Golang package). (Stable)

The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).

Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).

WARC I/O Libraries

A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

8
2
3y 24d
Unknown

Java library for reading and writing WARC files with a typed API

23
4
8m
Apache-2.0

Parse And Create Web ARChive (WARC) files with node.js

44
13
7m
MIT

Tool and library for handling Web ARChive (WARC) files.

87
12
1y 60d
GPL-3.0

Streaming WARC/ARC library for fast web archive IO

169
34
6m
Apache-2.0

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

85
27
6m
MIT

golang readers for ARC and WARC webarchive formats

11
0
2y 1d
Unknown

Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)

Analysis

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

110
14
1y 80d
MIT

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

5
0
8m
Apache-2.0

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

99
30
34d
Apache-2.0

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

5
1
8m
Apache-2.0

Archives Unleashed Cloud (AUK) is a web interface for analysing web archives. Currently, it can sync with Archive-It collections and extract hyperlink networks, full text, and other information from your collections. (Stable)

Quality Assurance

Powerful yet simple to use screenshot software

6.14K
428
6m
GPL-3.0

fake keyboard/mouse input, window management, and more

1.3K
197
6m
NOASSERTION

Browser extension: a link checker with more options.

Browser extension: link harvester on a page.

Browser extension: opens multiple URLs and also extracts URLs from text.

Browser extension: switches between browser tabs.

For running Xenu and Notepad++ on Ubuntu.

For running Xenu and Notepad++ on macOS.

Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture).

For running Xenu and Notepad++ on macOS.

Desktop link checker for Windows.

Other Awesome Lists

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

7.09K
427
6m
MIT

A list of things related to software, literature, and other content for 🕣 Memento

47
6
1y 6d
CC0-1.0

Blogs and Scholarship

Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.

An open-source book that provides a conceptual overview to web archiving research, as well as several case studies.

The Web Science and Digital Libraries Research Group blogs about various web archiving-related topics, scholarly work, and academic trip reports.

David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.

Mailing Lists

Slack

Fill out this request form (https://docs.google.com/forms/d/e/1FAIpQLScXPIH0Ssw63yWqyMkUqHVYmz2-ItBMzHiJQ-sOlJwTA8u5AQ/viewform?usp=sf_link) for access to a researcher group of people working with web archives.

Invite yourself (https://archivers-slack.herokuapp.com/) to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.

Twitter