Archiving websites with wget
While doing some improvements to my site, I was re-exploring some websites I had linked to. At one point, I went to artemis.sh and found this page. In it, artemis describes how you can archive their website by running a single wget command:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent 'https://artemis.sh'
I was in some kind of awe. All it took was that simple command? I immediately decided to try it on this website, and it worked! Here's the command I used:
wget -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent 'https://cafeduvesper.net'
Notice the -e robots=off near the beginning: that's there because I currently (at the time of writing) have a robots.txt file which blocks everything by default, and wget respects robots.txt unless you tell it not to.
My reasoning for this robots.txt is: I don't want google, openai, bing, whatever, to index my content, or use it, or anything. I know nothing prevents anyone (including those guys) from just ignoring it, but eh. It gives me some peace of mind, knowing I'm at least trying.
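For reference, a blanket "block everything" robots.txt is only a couple of lines. This isn't a copy of my exact file, just the general shape of one:

# block every crawler from every path
User-agent: *
Disallow: /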
Also, I really like the thought of people finding the website by looking around and exploring, not by typing some keywords into a search box. That's boring. However, I wouldn't mind giving some cool search engines a pass, the likes of wiby for example, if wiby crawled anything.
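If I ever did want to let one well-behaved crawler through while keeping the blanket block, robots.txt can express that too. Here's a rough sketch with a made-up user-agent name (wiby doesn't actually crawl, so the name is purely hypothetical):

# hypothetical crawler that gets a pass (an empty Disallow means "allow everything")
User-agent: SomeCoolCrawler
Disallow:

# everyone else stays blocked
User-agent: *
Disallow: /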
Anyway, if you want to archive my website, zip it up and send it to your friends (I don't know?), or just download it, this is how you do it, and you have my permission to do it!
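And if you do want to zip it up: by default wget's --mirror puts everything into a directory named after the host, so (assuming you kept the defaults and didn't add -nH or --cut-dirs) something like this should be all it takes. The archive filename is just my own suggestion:

# bundle the mirrored directory into a single compressed archive
tar czf cafeduvesper-mirror.tar.gz cafeduvesper.net/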