Café du Vesper

Archiving websites with wget


While making some improvements to my site, I was re-exploring some of the websites I had linked to. At one point, I went to artemis.sh and found this page, in which artemis describes how you can archive their website by running a single wget command:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent 'https://artemis.sh'

I was in some kind of awe. All it took was that simple command? I immediately decided to try it on this website, and it worked! Here's the command I used:

wget -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent 'https://cafeduvesper.net'

Notice the -e robots=off near the beginning: it's there because, at the time of writing, I have a robots.txt file that blocks everything by default, and wget follows robots.txt unless told otherwise.
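
For reference, a robots.txt that blocks everything is about as minimal as a config gets; the gist of it is just two lines (a generic example, not necessarily my file verbatim), and this is what wget would otherwise obey:

User-agent: *
Disallow: /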

My reasoning for this robots.txt is: I don't want Google, OpenAI, Bing, or whoever else to index my content, or use it, or anything. I know nothing prevents anyone (including those guys) from just ignoring it, but eh. It gives me some peace of mind, knowing I'm at least trying.

Also, I really like the thought of people finding the website by looking around and exploring, not by typing in some keywords. That's boring. However, I wouldn't mind giving some cool search engines a pass, the likes of wiby for example, if wiby crawled anything.

Anyway, if you want to archive my website, zip it up and send it to your friends (I don't know?), or just download it, this is how you do it, and you have my permission to do it!
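
In case it helps, here's roughly what that could look like once the wget command above has finished. The directory name comes from the host, and the archive name and browser are just examples:

# wget puts the mirror in a directory named after the host
zip -r cafeduvesper.zip cafeduvesper.net

# or browse the local copy directly (xdg-open on Linux; any browser will do)
xdg-open cafeduvesper.net/index.html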