HowTo: Make an archival copy of every page, image, video, and audio file on an entire website using wget

I recently announced that my blog archives will not remain publicly available for much longer:

Let me repeat that: while I am still “on Tumblr” and so on for now, my archives will not remain available for very long. If you find something of mine useful, you will need to make a copy of it and host it yourself.

[…]

The errors you see when you just punch in my web address in your browser or follow a link from Google are not happening because my blogs “broke.” The errors are intentional; my blogs have simply become invisible to some while still being easily accessible to others. […] Think of my web presence like Harry Potter’s Diagon Alley; so hidden from Muggles that they don’t even know what they’re missing, but if you know which brick to tap, a whole world of exciting new things awaits you….

As a result, a number of you have already asked the logical question: “Is there some easy way to automatically download your archives, instead of manually copy-and-pasting almost a decade of your posts? That would take forever!”

The answer, of course, is yes. This post is a short tutorial that I hope gives you the knowledge you need to download an entire website for offline viewing. This will work for any simple website like most blogs and personal sites, including mine. Archival geeks, this one’s for you. ;)

Preparation

A sculptor must understand stone: Know thy materials

A website is just a bunch of files. On a server, it’s organized much like the files and folders on your own computer. A page is a file. A slash (/) in a web address indicates a folder.

Let’s say you have a website called “my-blog.com.” When you go to this website in a Web browser, the address bar says: http://my-blog.com/ What that address bar is saying, in oversimplified English, is something like, “Hey, Web browser, connect to the computer at my-blog.com and open the first file in the first folder you find for me.” That file is usually the home page. On a blog, this is usually the list of recent posts.

Then, to continue the example, let’s say you click on a blog post’s title, which is a link to a page that only contains that one blog post. This is often called a “permalink.” When the page loads, the address bar changes to something like http://my-blog.com/posts/123456. Again, in oversimplified English, what the address bar is saying is something like, “Hey, Web browser, make another connection to the computer at my-blog.com and open up the file called 123456 inside that computer’s posts folder.”

And that’s how Web browsing works, in a nutshell. Since websites are just files inside folders, the same basic rules apply to webpages as to the files and folders on your own laptop. To save a file, you give it a name and put it in a folder. When you move a file from one folder to another, it stops being available at the old location and becomes available at the new location. You can copy a file from one folder into another folder, and now you have two copies of that file.

In the case of the web, a “file” is just a “page,” so “copying webpages” is the exact same thing as “copying files.”
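To make the analogy concrete, here is roughly what the example website above would look like as files and folders on its server. (This is a simplified sketch; the actual file names, such as index.html for the home page, vary from site to site.)

my-blog.com/
    index.html          ← the home page you get at http://my-blog.com/
    posts/
        123456          ← the permalink page at http://my-blog.com/posts/123456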

Now, as many of you already surmised, you could manually go to a website, open the File menu in your Web browser, choose the Save option, give the file a name, put it in a folder, then click the link to the first entry on the web page to load that post, open the File menu in your Web browser, choose the Save option, give the file another name, put it in a folder, and so on and so on until your eyes bled and you went insane from treating yourself in the same dehumanizing way your bosses already treat you at work. Or you could realize that doing the same basic operation many times in quick succession is what computers were invented to do, and you could automate the process of downloading websites like this by using a software program (a tool) designed to do exactly that.

It just so happens that this kind of task is so common that there are dozens of software programs that do exactly this thing.

A sculptor must understand a chisel: Know thy toolbox

I’m not going to go through the many dozens if not hundreds of tools available to automatically download things from the Web. There is almost certainly an “auto-downloader” plugin available for your favorite Web browser. Feel free to find one and give it a try. Instead, I’m going to walk you through how to use what is simply the best, most efficient, and most powerful of these tools. It’s called wget. The name stands for “Web get” and, as it implies, the program “gets stuff from the Web.”

If you’re on Windows, the easiest way to use wget is through a program called WinWGet, which is actually two programs: the wget program itself, plus a point-and-click graphical user interface that lets you drive it with your mouse instead of only your keyboard. There’s a good article on Lifehacker about how to use WinWGet to copy an entire website (an act commonly called “mirroring”). If you’re intimidated by a command line, go get WinWGet, because the wget program itself doesn’t have a point-and-click user interface, so you’ll want the extra window dressing WinWGet provides.

If you’re not on Windows, or if you just want to learn how to use wget to copy a website directly, then read on. You may also want to read on to learn more about the relevant options you can enable in wget so it works even under the most hostile conditions (like a flaky Wi-Fi connection).

Relevant wget options

While there are dozens upon dozens of wget options, so many that I know of no one who has read the entire wget manual front to back, only three of them really matter for our purposes. These are:

-m or --mirror
This option turns on settings suitable for mirroring. In other words, with this option enabled, wget will look at the URL you give it, copy the page at that URL, then follow every link on that page that starts with the same address, and every link on those pages, and so on, until there are no more links left to follow. How handy! ;)
-k or --convert-links
The manual describes this option better than I could. It reads:

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

So in other words, after the download finishes, all links that originally pointed to “the computer at my-blog.com” will now point to the archived copy of the file wget downloaded for you, so you can click links in your archived copy and they will work just as they did on the original site. Woot!

--retry-connrefused
This option isn’t strictly necessary, but if you’re on a flaky Wi-Fi network, or the server hosting the website you’re trying to download is itself kind of flaky (that is, maybe it goes down every once in a while and you don’t always know when that will be), then adding this option makes wget keep trying to download the pages you’ve told it are there, even when it can’t make a connection to the website. Basically, this option makes wget trust that the stuff you told it to download really is there, and to keep retrying whenever a connection attempt fails. I strongly suggest using this option to get archives of my sites. (A sketch combining all three options appears just below this list.)
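For reference, each option’s short and long forms are interchangeable on the command line. The two commands sketched below are equivalent ways of combining all three options on the example address from earlier; --retry-connrefused has no short form, so it is spelled out both times:

wget --mirror --convert-links --retry-connrefused http://my-blog.com/
wget -m -k --retry-connrefused http://my-blog.com/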

Okay, with that necessary background explained, let’s move on to actually using wget to copy whole websites.

Preparation: Get wget if you don’t already have it

If you don’t already have wget, download and install it. For Mac OS X users, the simplest wget installation option is the installer package made available by the folks at Rudix. For Windows users, again, you probably want WinWGet. Linux users probably already have wget installed. ;)

Step 1: Make a new folder to keep all the stuff you’re about to download

This is easy. Just make a new folder to keep all the pages you’re going to copy. Yup, that’s it. :)
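If you like, you can make the folder from the command line, too. Something like the following would create the folder used as an example in the next step (the name and location are just that, an example; put yours wherever you like):

mkdir ~/Desktop/"Mirror of my-blog.com"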

Step 2: Run wget with its mirroring options enabled

Now that we have a place to keep all the stuff we’re about to download, we need to let wget do its work for us. So, first, go to the folder you made. If you’ve made a folder called "Mirror of my-blog.com" on your Desktop, then you can go into that folder by typing cd ~/"Desktop/Mirror of my-blog.com" at a command prompt. (Keep the ~ outside the quotation marks so your shell expands it to your home folder; quoting the rest takes care of the spaces in the folder name.)

Next, run wget:

wget --mirror --convert-links --retry-connrefused http://my-blog.com/

Windows users will have to dig around the WinWGet options panes and make sure the “mirror” and “convert-links” checkboxes are enabled, rather than just typing those options out on the command line. Obviously, replace http://my-blog.com/ with whatever website you want to copy. For instance, replace it with http://days.maybemaimed.com/ to download everything I’ve ever posted to my Tumblr blog. You’ll immediately see a lot of output from your terminal that looks like this:

wget --mirror --convert-links --retry-connrefused http://days.maybemaimed.com/

--2015-02-27 15:08:06--  http://days.maybemaimed.com/
Resolving days.maybemaimed.com (days.maybemaimed.com)... 66.6.42.22, 66.6.43.22
Connecting to days.maybemaimed.com (days.maybemaimed.com)|66.6.42.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘days.maybemaimed.com/index.html’

    [ <=>                                                       ] 188,514     --.-K/s   in 0.1s    

Last-modified header missing -- time-stamps turned off.
2015-02-27 15:08:08 (1.47 MB/s) - ‘days.maybemaimed.com/index.html’ saved [188514]

Now just sit back, relax, and let wget work for as long as it needs to (which could take hours, depending on the quality of your Internet connection). Meanwhile, rejoice in the knowledge that you never need to treat yourself like a piece of dehumanized machinery ever again because, y’know, we actually have machines for that.

Even before wget finishes its work, though, you’ll see files start appearing inside the folder you made. You can now drag and drop one of those files into your Web browser window to open it. It will look exactly like the blog web page from which it was downloaded. Voilà! Archive successfully made!
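If you’d rather not leave the command line at all, the archived home page from the Tumblr example above can also be opened in your default browser with a command like this (open is the macOS command; on most Linux desktops, xdg-open does the same job):

open days.maybemaimed.com/index.html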

Special secret bonuses

The above works easily on any publicly accessible website, that is, any website you don’t need to log into in order to see. But you can also do the same thing on websites that do require you to log into them, though I’ll leave that as an exercise for the reader. All you have to do is learn a few additional wget options, which are all explained in the wget manual. (Hint: The option you want to read up on is the --load-cookies option.)
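As a rough sketch of where that reading will take you: you export your browser’s cookies for the site to a text file (wget expects the old Netscape cookies.txt format, which several browser extensions can produce), then hand that file to wget alongside the same mirroring options as before. The file name cookies.txt below is just a placeholder:

wget --mirror --convert-links --retry-connrefused --load-cookies cookies.txt http://my-blog.com/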

What I do want to explain, however, is that the above procedure won’t currently work on some of my other blogs because of additional techno-trickery I’m doing to keep the Muggles out, as I mentioned at the start of this post. However, I’ve already created an archive copy of my other (non-Tumblr) sites, so you don’t have to.1 Still, though, if you can figure out which bricks to tap, you can still create your own archive of my proverbial Diagon Alley.

Anyway, I’m making that other archive available on BitTorrent. Here’s the torrent metafile for an archive of maybemaimed.com. If you don’t already know how to use BitTorrent, this might be a good time to read through my BitTorrent howto guide.

Finally, if data archival and preservation is something that really spins your propeller and you don’t already know about it, consider browsing on over to The Internet Archive at Archive.org. If you live in San Francisco, they offer free lunches to the public every Friday (which are FUCKING CATERED AND DELICIOUS, I’VE BEEN), and they always have need of volunteers.

  1. If you’re just curious, the archive contains every conference presentation I’ve ever given, including video recordings, presentation slides, and so on, as well as audio files of some podcasts and interviews I’ve given, transcripts of every one of these, all pictures uploaded to my site, etc., and weighs in at approximately 1 gigabyte, uncompressed.

10 comments

  1. Meitar M says:

    :) I have a laptop and move around a lot, as I think you know, so I’m not always connected to the ‘net and can’t always seed. But I’ll see about making a webseed for this purpose. You’ll need to re-download the .torrent file when that’s ready, but that’s all.

  2. Josh Samuels says:

    Ooo…didn’t know about those. Neat! (And it looks like maybe you did that and the torrent updated itself automatically, thus negating the need for a re-download; my client says the old torrent is currently connected to and downloading from a webseed…) Also, do you plan to blog more there beyond what’s archived and update the torrent with new stuff once it’s gone dark?

  3. Josh Samuels says:

    Oop, disregard that last bit. As I understand it, this archive is only for stuff that was on the now permanently dark maybemaimed blog, and doesn’t include the currently semi-active maybe days tumblr (which is what I was confusing it with) or the seemingly-dark everything in between blog from maymay.net; is that right?

  4. Meitar M says:

    Well, actually I am still blogging on all the "dark" sites. How do you think I made this post? (Answer: by crossposting from my blog.) But yes, the archive available via BitTorrent is just a static snapshot of what was on maybemaimed.com as of February 27th, 2015. New stuff won’t be in the archive. I’m only making new stuff available to people here, in the (non-corporate-owned social media) Federation.

  5. Meitar M says:

    Also I’m pretty sure the torrent didn’t update itself automatically; you just happened to get the torrent after I already added the webseed as an entry in the metafile but before I actually created the webseed. I’m forward-thinking like that. :P

  6. Josh Samuels says:

    Yeah, I meant "dark" more in the sense of not available on the public web in their "native" forms (as far as I can tell) but as you say, the new stuff ::is:: on the public web, just consolidated here at D* – which I actually almost added as a clarification in my first comment, but then edited out. Anywho, thanks for being forward thinking and for prompting me to start poking around again over here. ;)

  7. Meitar M says:

    I see that "Everything In Between" at least ::is:: still accessible natively, albeit only if you’re at specific post or category URLs.

    This is also true of maybemaimed.com, FYI.

Comments are closed.