Awesome Web Archiving

An Awesome List for getting started with web archiving

Last Update: Dec. 4, 2021, 7:06 p.m.

Thank you iipc & contributors
View Topic on GitHub:
iipc/awesome-web-archiving

Training/Documentation

Resources for Web Publishers

Tools & Software

📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity

81
10
3y 69d
CC-BY-SA-4.0

A curated list of awesome tools for website diffing and change monitoring.

368
22
10m
CC0-1.0

Acquisition

22120 - NodeJS product to self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile, SingleFileZ, and WebMemex, but gooderer. Full text search coming soon

2.61K
97
31d
n/a

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

12.15K
663
34d
MIT

A Tool To Push Web Resources Into Web Archives

292
36
9m
MIT

Run a high-fidelity browser-based crawler in a single Docker container

91
18
35d
AGPL-3.0

brozzler - distributed browser-based web crawler

489
86
53d
Apache-2.0

NPM package and CLI tool for saving webpages

3
1
32d
GPL-3.0

Offline-first web browser

70
4
2y 10m
MIT

Web archiving using Google Chrome

38
5
1y 11m
MIT

A command-line tool and Python library for archiving data from Facebook using the Graph API.

73
10
3y 10m
n/a

Snapshots a web page to get it as a static, self-contained HTML document.

188
15
53d
Unlicense

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

767
83
40d
n/a

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

2.08K
691
32d
n/a

Simple script to convert web resources to a single WARC file

8
2
5y 11m
MIT

โฌ›๏ธ CLI tool for saving complete web pages as a single HTML file

4.4K
142
44d
CC0-1.0

Go package and CLI tool for saving a web page as a single HTML file

93
6
1y 12d
MIT

Web Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file

4.24K
403
33d
AGPL-3.0

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

141
21
1y 6m
Apache-2.0

A command line tool (and Python library) for archiving Twitter JSON

1.09K
227
37d
MIT

Web Archiving Integration Layer: One-Click User Instigated Preservation

258
28
86d
MIT

WARC writing MITM HTTP/S proxy

268
46
95d
n/a

A dockerized, queued high fidelity web archiver based on Squidwarc

42
7
1y 4m
GPL-3.0

A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond

169
13
31d
GPL-3.0

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

21
4
4y 57d
MIT

Wget with Lua extension

20
9
5y 11m
GPL-3.0

Wget-compatible web downloader and crawler.

438
64
1y 8m
GPL-3.0

Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Replay

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

408
31
60d
MIT

The OpenWayback Development

406
250
108d
Apache-2.0

Core Python Web Archiving Toolkit for replay and recording of web archives

884
144
43d
GPL-3.0

Converts WARC files to static HTML

5
0
26d
Apache-2.0

Search & Discovery

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

30
3
86d
MIT

Playback with fun!

0
0
4m
GPL-3.0

WARC and ARC indexing and discovery tools.

93
22
40d
n/a

Prototype SOLR-powered web archive exploration UI.

37
9
1y 6m
Apache-2.0

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.

51
10
30d
Apache-2.0

A Rails engine supporting the discovery of web archives.

42
9
31d
n/a

19
4
6m
MIT

Utilities

A collection of tools for archiving and analysing the internet.

46
15
1y 8m
GPL-3.0

4
1
51d
Apache-2.0

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

34
3
3y 45d
Apache-2.0

Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

2
0
6m
GPL-3.0

Converts HTTrack crawls to WARC files

11
2
79d
Apache-2.0

A Tool to Summarize Web Archive Holdings

5
0
5m
MIT

A Memento Aggregator CLI and Server in Go

40
9
8m
MIT

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

0
1
4y 4m
MIT

Web archive index server based on RocksDB

20
17
43d
Apache-2.0

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

12
4
2y 48d
BSD-3-Clause

Tika based link extractor for httpreserve

6
1
6m
n/a

Java application to download WARCs from WASAPI

4
5
4m
n/a

Partition (W)ARC Files by MIME Type and Year

0
1
4y 9m
MIT

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

34
8
4y 1d
MIT

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.

448
108
66d
GPL-3.0

WARC I/O Libraries

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

9
4
3y 10m
n/a

Java library for reading and writing WARC files with a typed API

31
6
72d
Apache-2.0

Parse and create Web ARChive (WARC) files with Node.js

65
18
94d
MIT

Tool and library for handling Web ARChive (WARC) files.

101
15
115d
GPL-3.0

Streaming WARC/ARC library for fast web archive IO

223
44
44d
Apache-2.0

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

98
27
1y 99d
MIT

Go readers for the ARC and WARC web archive formats

15
0
2y 9m
n/a

Analysis

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

118
15
57d
MIT

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

7
3
33d
Apache-2.0

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

109
32
33d
Apache-2.0

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

7
2
1y 20d
Apache-2.0

Quality Assurance

Curation

Zotero extension that submits to and reads from web archives. Source on GitHub. Supersedes leonkt/zotero-memento.

Other Awesome Lists

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

12.15K
663
34d
MIT

A list of things related to software, literature, and other content for 🕣 Memento

59
6
1y 9m
CC0-1.0

Blogs and Scholarship

Mailing Lists

Slack

Fill out this request form for access to a researcher group of people working with web archives.

Invite yourself to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.

Twitter