A collection of tools and advice for web scraping

Some tools that are likely to be useful:

grab-site, the archival web crawler used by Archive Team

heritrix, the archival web crawler used by the Internet Archive

ArchiveBox: a suite of Python tools with both command-line and web interfaces; it can also be installed with Docker. It saves web pages and media in a variety of formats, including WARC via wget.

Feed it a bunch of URLs, or install the Chrome/Chromium extension (which can run on a different machine) and archive as you browse.
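As a minimal sketch of the command-line route (assuming ArchiveBox is installed locally; the URL is a placeholder):

```shell
# Minimal ArchiveBox CLI sketch; assumes a local install (e.g. pip install archivebox).
# The archivebox calls are guarded so this is a no-op where it isn't installed.
seed='https://example.com'          # placeholder URL
if command -v archivebox >/dev/null 2>&1; then
  archivebox init                   # set up a collection in the current directory
  archivebox add "$seed"            # archive one URL
  # archivebox also accepts a list of URLs on stdin
fi
```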

Also very easy to set up via https://pikapods.com

But: This tool only archives up to level 1 (all links once removed), not any further.

I was just gonna use wget. Is there an advantage to using one of these other tools instead?

wget is a great place to start - but on its own it doesn’t pull the CSS and other assets. ArchiveBox and Heritrix pull whole websites, including everything needed to recreate them faithfully.

wget
I’ve been using the following commands to scrape the sitemap(s) and to download the HTML:

  1. wget -qO- "https://www.arts.gov/sitemap.xml?page=1" | grep -Po "<loc>\K.+?(?=</loc>)" > urls1.txt
  2. wget --convert-links --no-parent --wait=2 -i urls6.txt -x 2>&1 | tee arts.gov-sitemap1.log

There are better ways for sure, but that’s what I have been using.
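The `<loc>` extraction in step 1 can be sanity-checked offline against a small inline sitemap fragment (requires GNU grep for `-P`):

```shell
# Offline check of the <loc> extraction used in step 1 (GNU grep -P required).
printf '<urlset><url><loc>https://example.com/a</loc></url><url><loc>https://example.com/b</loc></url></urlset>' \
  | grep -Po "<loc>\K.+?(?=</loc>)"
# prints:
#   https://example.com/a
#   https://example.com/b
```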

This is how I use wget:

wget -m -p -E -k -np https://domain.com

-m, --mirror
Enable recursion/time-stamps, set infinite recursion depth, and retain FTP directory listings

-p, --page-requisites
Download all images and other resources needed to properly display the page

-E, --adjust-extension
Save HTML/CSS documents with the appropriate extensions (.html, .css)

-k, --convert-links
Convert links to point to local files

-np, --no-parent
Don’t ascend to the parent directory

this works so well!

Yeah, I don’t recall where I found this. Someone else crafted the recipe and I’ve just stored it with an alias in my zsh file.
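For reference, such an alias might look like this in `~/.zshrc` (the alias name here is made up):

```shell
# Hypothetical zsh alias wrapping the mirror recipe above.
alias mirror-site='wget -m -p -E -k -np'
# usage: mirror-site https://domain.com
```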

The tools from Webrecorder (“Web Archiving for All”) are good for particularly tricky cases that need browser-assisted archiving, e.g. Browsertrix.

There’s also iipc/awesome-web-archiving, an Awesome List for getting started with web archiving, which I’ve helped maintain over the years.

Great to have you here!
We will restructure this forum a bit and add your links to a short tutorial.
If you know of any other good resources on web scraping - feel free to add them here!

But: This tool only archives up to level 1 (all links once removed), not any further.

True (though it’s better than nothing); one workaround, if the target site offers it: when there’s a sitemap (either XML or a regular web page) or an RSS feed, you can use that as the starting URL with depth=1.
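With ArchiveBox that might look like the sketch below (assuming a local install and the `--depth` option of recent versions; the URL is a placeholder):

```shell
# Sketch: seed the archive with a sitemap so depth=1 covers every listed page.
# Assumes ArchiveBox is installed; guarded so this is a no-op otherwise.
sitemap='https://example.com/sitemap.xml'   # placeholder seed URL
if command -v archivebox >/dev/null 2>&1; then
  archivebox add --depth=1 "$sitemap"       # archive the sitemap plus every page it links to
fi
```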
