Awesome Web Archiving
An Awesome List for getting started with web archiving
Thank you iipc & contributors
View Topic on GitHub:
iipc/awesome-web-archiving
Training/Documentation
The Web Archiving Lifecycle Model is an attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.
community HTML version of the official specification and hub for new proposals.
Resources for Web Publishers
Tool for estimating how likely a web page is to be archived successfully.
Tools & Software
📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity
A curated list of awesome tools for website diffing and change monitoring.
Acquisition
22120 - Self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile and WebMemex, but gooderer.
🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
A Tool To Push Web Resources Into Web Archives
brozzler - distributed browser-based web crawler
NPM package and CLI tool for saving webpages
Offline-first web browser
Web archiving using Google Chrome
A commandline tool and Python library for archiving data from Facebook using the Graph API.
Snapshots a web page to get it as a static, self-contained HTML document.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
simple script to convert web resources to a single warc file
⬛️ CLI tool for saving complete web pages as a single HTML file
Go package and CLI tool for saving web page as single HTML file
Web Extension for Firefox/Chrome/Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
A command line tool (and Python library) for archiving Twitter JSON
A dockerized, queued high fidelity web archiver based on Squidwarc
Web Archiving Integration Layer: One-Click User Instigated Preservation
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Wget with Lua extension
Wget-compatible web downloader and crawler.
A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. (Stable)
Open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo.
A collection of resources for building low-latency, scalable web crawlers on Apache Storm. (Stable)
A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)
Web archiving service anyone can use for free to save web pages.
An open source file retrieval utility that, as of version 1.14, supports writing WARCs. (Stable)
Replay
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
The OpenWayback Development
Core Python Web Archiving Toolkit for replay and recording of web archives
Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).
A browser-based, fully client-side replay engine for both local and remote WARC files.
Search & Discovery
Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy
WARC and ARC indexing and discovery tools.
Prototype SOLR-powered web archive exploration UI.
A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
A Rails engine supporting the discovery of web archives.
Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages). (Stable)
Utilities
A collection of tools for archiving and analysing the internet.
Convert HTTP Archive (HAR) -> Web Archive (WARC) format
Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB
A Tool to Summarize Web Archive Holdings
A Memento Aggregator CLI and Server in Go
Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js
Web archive index server based on RocksDB
A client for the Archive-It And Webrecorder WASAPI Data Transfer API
Tika based link extractor for httpreserve
Java application to download WARCs from WASAPI
Partition (W)ARC Files by MIME Type and Year
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.
Service to return the status of a web page or save it to the Internet Archive. Returns JSON via browser or command line via curl using GET (Golang Package). (Stable)
The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app).
Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app).
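Several of the utilities above read or serve CDXJ indexes. As a rough illustration only, assuming the common layout of one entry per line made of a sort-friendly URL key, a 14-digit timestamp, and a single-line JSON block, a line can be split with the Python standard library. The sample line, the `CdxjEntry` type, and the JSON field names (`url`, `status`, `mime`) are hypothetical and not taken from any of the listed tools.

```python
import json
from typing import NamedTuple


class CdxjEntry(NamedTuple):
    """One index entry (hypothetical helper type for this sketch)."""
    urlkey: str      # sort-friendly URL key, e.g. SURT form
    timestamp: str   # capture time as YYYYMMDDhhmmss
    fields: dict     # remaining attributes decoded from the JSON block


def parse_cdxj_line(line: str) -> CdxjEntry:
    """Split a CDXJ-style line into key, timestamp, and JSON fields.

    Assumes exactly two space-separated prefix fields before the JSON block.
    """
    urlkey, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    return CdxjEntry(urlkey, timestamp, json.loads(json_block))


# Illustrative sample line; the field names are assumptions.
sample = (
    'com,example)/ 20200101000000 '
    '{"url": "http://example.com/", "status": "200", "mime": "text/html"}'
)
entry = parse_cdxj_line(sample)
print(entry.urlkey, entry.timestamp, entry.fields["url"])
```

Splitting with a maximum of two splits keeps any spaces inside the JSON block intact, which is why the JSON is parsed last.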
WARC I/O Libraries
A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Java library for reading and writing WARC files with a typed API
Parse And Create Web ARChive (WARC) files with node.js
Tool and library for handling Web ARChive (WARC) files.
Streaming WARC/ARC library for fast web archive IO
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
golang readers for ARC and WARC webarchive formats
Libraries and tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)
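For orientation, the libraries above all read or write the same basic record shape: a `WARC/1.0` version line, named header fields, a blank line, the content block, and two trailing CRLFs. The following is a minimal standard-library sketch of that shape, not a substitute for any of the listed libraries; it skips compression, digests, and validation, and the function names `build_warc_record` and `read_warc_record` are hypothetical.

```python
import io
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Serialize one uncompressed WARC/1.0 'resource' record."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The content block is followed by two CRLFs that close the record.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_warc_record(stream):
    """Read one record from a binary stream; return (headers, payload) or None at EOF."""
    version = stream.readline()
    if not version:
        return None
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):
            break  # blank line ends the header section
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # consume the trailing CRLF CRLF
    return headers, payload


# Round trip: two records concatenated in one in-memory "file".
buf = io.BytesIO(
    build_warc_record("http://example.com/", b"<html>hi</html>")
    + build_warc_record("http://example.com/b", b"two", "text/plain")
)
while (record := read_warc_record(buf)) is not None:
    rec_headers, rec_payload = record
    print(rec_headers["WARC-Target-URI"], len(rec_payload))
```

Reading the payload by `Content-Length` rather than scanning for a delimiter is what lets records hold arbitrary binary content.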
Analysis
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
Archives Unleashed Cloud (AUK) is a web interface for analysing web archives. Currently, it can sync with Archive-It collections and extract hyperlink networks, full text, and other information from your collections. (Stable)
Quality Assurance
Powerful yet simple to use screenshot software
fake keyboard/mouse input, window management, and more
Browser extension: a link checker with more options.
Browser extension: basic link checker.
Browser extension: link harvester on a page.
Browser extension: opens multiple URLs and also extracts URLs from text.
Browser extension: switches between browser tabs.
Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture).
Other Awesome Lists
🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
A list of things related to software, literature, and other content for 🕣 Memento
Blogs and Scholarship
Unofficial blog of the Web Archiving Roundtable of the Society of American Archivists maintained by the members of the Web Archiving Roundtable.
An open-source book that provides a conceptual overview to web archiving research, as well as several case studies.
Web Science and Digital Libraries Research Group blogs about various Web archiving related topics, scholarly work, and academic trip reports.
David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field.
Mailing Lists
Slack
Fill out this request form (https://docs.google.com/forms/d/e/1FAIpQLScXPIH0Ssw63yWqyMkUqHVYmz2-ItBMzHiJQ-sOlJwTA8u5AQ/viewform?usp=sf_link) for access to a researcher group of people working with web archives.
Invite yourself (https://archivers-slack.herokuapp.com/) to a multi-disciplinary effort for archiving projects run in affiliation with EDGI and Data Together.