A collection of tools and/or advice for webscraping
Some tools that are likely to be useful:
grab-site, the archival web crawler from ArchiveTeam
Heritrix, the archival web crawler used by the Internet Archive
ArchiveBox: suite of Python tools including command-line and web interfaces, can be installed with Docker. Saves web pages and media in a variety of formats including WARC via wget.
Feed it a bunch of URLs, or install the Chrome/Chromium extension (it can run on a different machine) and archive as you browse.
Also very easy to set up via https://pikapods.com
But: This tool only archives up to level 1 (all links once removed), not any further.
I was just gonna use wget. Is there an advantage to using one of these other tools instead?
wget is a great place to start, but on its own it doesn’t pull in the CSS and other assets. ArchiveBox and Heritrix pull whole websites, including everything needed to recreate them in full.
wget
I’ve been using the following commands to scrape the sitemap(s) and to download the HTML:
wget -qO- "https://www.arts.gov/sitemap.xml?page=1" | grep -Po "<loc>\K.+?(?=</loc>)" > urls1.txt
wget --convert-links --no-parent --wait=2 -i urls1.txt -x 2>&1 | tee arts.gov-sitemap1.log
There are better ways for sure, but that’s what I have been using.
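One way to sanity-check that extraction pipeline before pointing it at a live site is to run the same grep pattern against a small sitemap saved locally. A sketch, assuming GNU grep (the -P flag); the file name and URLs below are made up for illustration:

```shell
# A tiny throwaway sitemap in the standard sitemaps.org format.
cat > sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-a</loc></url>
  <url><loc>https://example.com/page-b</loc></url>
</urlset>
EOF

# \K drops the <loc> prefix from the match; the lookahead drops </loc>.
grep -Po "<loc>\K.+?(?=</loc>)" sitemap.xml > urls.txt
cat urls.txt
# https://example.com/page-a
# https://example.com/page-b
```

Once the output looks right, the same pattern can be fed the real sitemap URL via `wget -qO-` as above.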
This is how I use wget:
wget -m -p -E -k -np https://domain.com
-m, --mirror
Enable recursion/time-stamps, set infinite recursion depth, and retain FTP directory listings
-p, --page-requisites
Get all images/resources needed to display as HTML
-E, --adjust-extension
Save HTML/CSS files with their extensions
-k, --convert-links
Convert links to point to local files
-np, --no-parent
Don’t ascend to the parent directory
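To see what those flags actually collect without hitting a live site, here is a sketch that mirrors a tiny throwaway site served locally. It assumes python3 (3.7+, for --directory) and wget are installed; all file names and the port are made up:

```shell
# Build a minimal site: one page, one linked page, one stylesheet.
mkdir -p demo_site
cat > demo_site/index.html <<'EOF'
<html><head><link rel="stylesheet" href="style.css"></head>
<body><a href="page2.html">next</a></body></html>
EOF
echo '<html><body>second page</body></html>' > demo_site/page2.html
echo 'body { color: black; }' > demo_site/style.css
# An empty robots.txt so the recursive crawl gets a 200 instead of a 404.
touch demo_site/robots.txt

# Serve it locally, mirror it with the flags above, then stop the server.
python3 -m http.server 8031 --directory demo_site >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1
wget -q -m -p -E -k -np -P mirror http://127.0.0.1:8031/
kill $SERVER_PID

# wget stores the copy under a host:port directory:
ls mirror/127.0.0.1:8031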
Yeah, I don’t recall where I found this. Someone else crafted the recipe and I’ve just stored it with an alias in my zsh file.
The tools from Webrecorder are good for particularly tricky cases that need browser-assisted archiving, e.g. Browsertrix.
Great to have you here!
We will restructure this forum a bit and add your links to a short tutorial.
If you know of any good ones on web scraping, feel free to add them here!
But: This tool only archives up to level 1 (all links once removed), not any further.
True (though it’s better than nothing). One workaround, if the target site offers it: when there’s a sitemap (either an XML one or a regular web page) or an RSS feed, use that as the starting URL with depth=1.
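That trick can be tried end-to-end on a throwaway local site: treat one page as the sitemap entry, feed it to wget with depth 1, and check that pages two links away are left alone. A sketch, assuming python3 (3.7+) and wget; the names and port are made up:

```shell
# A three-page chain: seed -> page2 -> page3.
mkdir -p chain_site
echo '<html><body><a href="page2.html">2</a></body></html>' > chain_site/seed.html
echo '<html><body><a href="page3.html">3</a></body></html>' > chain_site/page2.html
echo '<html><body>end of chain</body></html>' > chain_site/page3.html
touch chain_site/robots.txt

python3 -m http.server 8041 --directory chain_site >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Pretend chain_urls.txt came out of a sitemap; crawl each entry to depth 1.
echo 'http://127.0.0.1:8041/seed.html' > chain_urls.txt
wget -q -r -l 1 -np -P chain_out -i chain_urls.txt
kill $SERVER_PID

# seed.html and page2.html are fetched; page3.html (two hops away) is not.
```

With a full sitemap in the URL list, depth 1 is usually enough, since every page the site wants indexed is already a starting point.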