Let me repeat that: while I am still “on Tumblr” and so on for now, my archives will not remain available for very long. If you find something of mine useful, you will need to make a copy of it and host it yourself.
The errors you see when you just punch in my web address in your browser or follow a link from Google are not happening because my blogs “broke.” The errors are intentional; my blogs have simply become invisible to some while still being easily accessible to others. […] Think of my web presence like Harry Potter’s Diagon Alley; so hidden from Muggles that they don’t even know what they’re missing, but if you know which brick to tap, a whole world of exciting new things awaits you….
As a result, a number of you have already asked the logical question: “Is there some easy way to automatically download your archives, instead of manually copy-and-pasting almost a decade of your posts? That would take forever!”
The answer, of course, is yes. This post is a short tutorial that I hope gives you the knowledge you need to download an entire website for offline viewing. This will work for any simple website like most blogs and personal sites, including mine. Archival geeks, this one’s for you. ;)
A sculptor must understand stone: Know thy materials
A website is just a bunch of files. On a server, it usually looks exactly like your own computer’s desktop. A page is a file. A slash (/) indicates a folder.
Let’s say you have a website called “my-blog.com.” When you go to this website in a Web browser, the address bar says:
http://my-blog.com/
What that address bar is saying, in oversimplified English, is something like, “Hey, Web browser, connect to the computer at my-blog.com and open the first file in the first folder you find for me.” That file is usually the home page. On a blog, this is usually the list of recent posts.
Then, to continue the example, let’s say you click on a blog post’s title, which is a link to a page that only contains that one blog post. This is often called a “permalink.” When the page loads, the address bar changes to something like http://my-blog.com/posts/123456. Again, in oversimplified English, what the address bar is saying is something like, “Hey, Web browser, make another connection to the computer at my-blog.com and open up the file called 123456 inside that computer’s posts folder.”
And that’s how Web browsing works, in a nutshell. Since websites are just files inside folders, the same basic rules apply to webpages as the ones that apply to files and folders on your own laptop. To save a file, you give it a name and put it in a folder. When you move a file from one folder to another, it stops being available at the old location and becomes available at the new location. You can copy a file from one folder as a new file in another folder, and now you have two copies of that file.
In the case of the web, a “file” is just a “page,” so “copying webpages” is the exact same thing as “copying files.”
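If it helps to see the “files inside folders” idea spelled out, here is a minimal sketch that takes a web address apart into the computer, the folder, and the file. The helper name split_url is made up for this illustration, and real servers don’t always map addresses onto literal files this neatly, so treat it as a picture of the idea rather than a rule.

```shell
#!/bin/sh
# Split a web address into the computer (host), the folder, and the file.
# split_url is a hypothetical helper name, not a standard tool.
split_url() {
    url=$1
    rest=${url#*://}               # drop the "http://" part
    host=${rest%%/*}               # everything up to the first slash
    case $rest in
        */*) path=/${rest#*/} ;;   # everything after the host
        *)   path=/ ;;             # no slash at all means the top folder
    esac
    folder=${path%/*}/             # up to and including the last slash
    file=${path##*/}               # whatever follows the last slash
    [ -n "$file" ] || file='(home page)'
    printf 'computer: %s\nfolder: %s\nfile: %s\n' "$host" "$folder" "$file"
}

split_url "http://my-blog.com/posts/123456"
```

For the permalink from the example, this prints my-blog.com as the computer, /posts/ as the folder, and 123456 as the file.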
Now, as many of you already surmised, you could manually go to a website, open the File menu in your Web browser, choose the Save option, give the file a name, put it in a folder, then click the link to the first entry on the web page to load that post, open the File menu in your Web browser, choose the Save option, give the file another name, put it in a folder, and so on and so on until your eyes bled and you went insane from treating yourself in the same dehumanizing way your bosses already treat you at work. Or you could realize that doing the same basic operation many times in quick succession is what computers were invented to do, and you could automate the process of downloading websites like this by using a software program (a tool) designed to do exactly that.
It just so happens that this kind of task is so common that there are dozens of software programs that do exactly this thing.
A sculptor must understand a chisel: Know thy toolbox
I’m not going to go through the many dozens if not hundreds of tools available to automatically download things from the Web. There is almost certainly an “auto-downloader” plugin available for your favorite Web browser. Feel free to find one and give it a try. Instead, I’m going to walk you through how to use simply the best, most efficient, and most powerful of these tools. It’s called wget. It stands for “Web get” and, as the name implies, it “gets stuff from the Web.”
If you’re on Windows, the easiest way to use wget is by using a program called WinWGet, which is actually two programs: it’s the wget program itself, and a point-and-click graphical user interface that gives you a way to use it with your mouse instead of only your keyboard. There’s a good article on Lifehacker about how to use WinWGet to copy an entire website (an act commonly called “mirroring”). If you’re intimidated by a command line, go get WinWGet, because the wget program itself doesn’t have a point-and-click user interface, so you’ll want the extra window dressing WinWGet provides.
If you’re not on Windows, or if you just want to learn how to use wget to copy a website directly, then read on. You may also want to read on to learn more about the relevant options you can enable in wget so it works even under the most hostile conditions (like a flaky Wi-Fi connection).
wget has dozens upon dozens of options, so many that I know of no one who has read the entire wget manual from front to back, but only three of them really matter for our purposes. These are:
- --mirror: This option turns on options suitable for mirroring. In other words, with this option enabled, wget will look at the URL you gave it, copy the page at that URL, and then copy every page it links to whose address also starts with that same URL, following links until there are none left to follow. How handy! ;)
- --convert-links: The manual describes this option better than I could. It reads:
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
So in other words, after the download finishes, all links that originally pointed to “the computer at my-blog.com” will now point to the archived copy of the file wget downloaded for you, so you can click links in your archived copy and they will work just as they did on the original site. Woot!
- --retry-connrefused: This option isn’t strictly necessary, but if you’re on a flaky Wi-Fi network or the server hosting the website you’re trying to download is itself kind of flaky (that is, maybe it goes down every once in a while and you don’t always know when that will be), then adding this option makes wget keep trying to download the pages you’ve told it are there even if it’s not able to make a connection to the website. Basically, this option makes wget totally trust you when you tell it to go download some stuff, even if its first attempt doesn’t get through. I strongly suggest using this option to get archives of my sites.
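To keep the three options straight, here is a small sketch that only prints the resulting command rather than running it. The helper names mirror_command and mirror_command_short are made up for this illustration; the flags themselves come from the wget manual, where -m is the short form of --mirror and -k is the short form of --convert-links.

```shell
#!/bin/sh
# Print (without running) the wget command for mirroring a site.
# mirror_command and mirror_command_short are hypothetical helper names.
mirror_command() {
    # --mirror            : recursive download with time-stamping, no depth limit
    # --convert-links     : rewrite links so the local copy is browsable offline
    # --retry-connrefused : treat "connection refused" as temporary and retry
    printf 'wget --mirror --convert-links --retry-connrefused %s\n' "$1"
}

# The same command using short flags for the first two options.
mirror_command_short() {
    printf 'wget -m -k --retry-connrefused %s\n' "$1"
}

mirror_command "http://my-blog.com/"
mirror_command_short "http://my-blog.com/"
```

Both printed lines describe the same download; the long names are easier to read later, which is why this post sticks with them.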
Okay, with that necessary background explained, let’s move on to actually using wget to copy whole websites.
Get wget if you don’t already have it
If you don’t already have wget, download and install it. For Mac OS X users, the simplest way to install wget is with the installer packages made available by the folks at Rudix. For Windows users, again, you probably want WinWGet. Linux users probably already have wget installed. ;)
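Not sure whether you already have it? Here is a minimal sketch for checking from a command prompt. The helper name have_tool is made up for this illustration; command -v is the standard shell way of asking where a program lives.

```shell
#!/bin/sh
# Report whether a program is available on this computer.
# have_tool is a hypothetical helper name.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool wget; then
    echo "wget is installed: $(command -v wget)"
else
    echo "wget is not installed yet"
fi
```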
Step 1: Make a new folder to keep all the stuff you’re about to download
This is easy. Just make a new folder to keep all the pages you’re going to copy. Yup, that’s it. :)
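If you’d rather do this step from a command prompt too, here is one possible sketch. The helper name make_mirror_folder and the example folder name are only illustrations; use whatever name and location you like.

```shell
#!/bin/sh
# Make a folder for the mirror (if it doesn't exist yet) and move into it.
# make_mirror_folder is a hypothetical helper name.
make_mirror_folder() {
    mkdir -p "$1" && cd "$1"
}

# Example (uncomment to use); the folder name is only an illustration:
# make_mirror_folder "$HOME/Desktop/Mirror of my-blog.com"
```

Paste the function into your terminal rather than running it as a separate script, so that the change of folder sticks in the window you’re typing in.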
Step 2: Run wget with its mirroring options enabled
Now that we have a place to keep all the stuff we’re about to download, we need to let wget do its work for us. So, first, go to the folder you made. If you’ve made a folder called “Mirror of my-blog.com” on your Desktop, then you can go into that folder by typing cd ~/Desktop/"Mirror of my-blog.com" at a command prompt. (Keep the ~ outside the quotes so your shell expands it to your home folder.) Then run:
wget --mirror --convert-links --retry-connrefused http://my-blog.com/
Windows users will have to dig around the WinWGet options panes and make sure the “mirror” and “convert-links” checkboxes are enabled, rather than just typing those options out on the command line. Obviously, replace http://my-blog.com/ with whatever website you want to copy. For instance, replace it with http://days.maybemaimed.com/ to download everything I’ve ever posted to my Tumblr blog. You’ll immediately see a lot of output from your terminal that looks like this:
    wget --mirror --convert-links --retry-connrefused http://days.maybemaimed.com/
    --2015-02-27 15:08:06--  http://days.maybemaimed.com/
    Resolving days.maybemaimed.com (days.maybemaimed.com)... 18.104.22.168, 22.214.171.124
    Connecting to days.maybemaimed.com (days.maybemaimed.com)|126.96.36.199|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: ‘days.maybemaimed.com/index.html’

        [ <=> ] 188,514     --.-K/s   in 0.1s

    Last-modified header missing -- time-stamps turned off.
    2015-02-27 15:08:08 (1.47 MB/s) - ‘days.maybemaimed.com/index.html’ saved
Now just sit back, relax, and let wget work for as long as it needs to (which could take hours, depending on the quality of your Internet connection). Meanwhile, rejoice in the knowledge that you never need to treat yourself like a piece of dehumanized machinery ever again because, y’know, we actually have machines for that.
Even before wget finishes its work, though, you’ll see files start appearing inside the folder you made. You can now drag-and-drop one of those files into your Web browser window to open that file. It will look exactly like the blog web page from which it was downloaded. Voila! Archive successfully made!
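If you’re curious how far along things are, here is a minimal sketch for counting what has been saved so far. The helper name count_saved_files is made up for this illustration, and it assumes wget put everything in a folder named after the site, as the “Saving to” line in the output above suggests.

```shell
#!/bin/sh
# Count how many files wget has saved so far under a mirror folder.
# count_saved_files is a hypothetical helper name.
count_saved_files() {
    # find lists every regular file below the folder; wc -l counts the lines,
    # and tr strips the padding some systems put around the number.
    find "$1" -type f | wc -l | tr -d ' '
}

# Only report if the download folder exists where we are right now.
if [ -d "days.maybemaimed.com" ]; then
    echo "Files saved so far: $(count_saved_files days.maybemaimed.com)"
fi
```

Run it from inside the folder you made in Step 1, in a second terminal window, while wget keeps working in the first.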
Special secret bonuses
The above easily works on any publicly accessible website. These are websites that you don’t need to log into to see. But you can also do the same thing on websites that do require you to log into them, though I’ll leave that as an exercise for the reader. All you have to do is learn a few different wget options, which are all explained in the wget manual. (Hint: The option you want to read up on is the
What I do want to explain, however, is that the above procedure won’t currently work on some of my other blogs because of additional techno-trickery I’m doing to keep the Muggles out, as I mentioned at the start of this post. However, I’ve already created an archive copy of my other (non-Tumblr) sites, so you don’t have to.[1] Still, if you can figure out which bricks to tap, you can create your own archive of my proverbial Diagon Alley.
Anyway, I’m making that other archive available on BitTorrent. Here’s the torrent metafile for an archive of maybemaimed.com. If you don’t already know how to use BitTorrent, this might be a good time to read through my BitTorrent howto guide.
Finally, if data archival and preservation is something that really spins your propeller and you don’t already know about it, consider browsing on over to The Internet Archive at Archive.org. If you live in San Francisco, they offer free lunches to the public every Friday (which are FUCKING CATERED AND DELICIOUS, I’VE BEEN), and they always have need of volunteers.
[1] If you’re just curious, the archive contains every conference presentation I’ve ever given, including video recordings, presentation slides, and so on, as well as audio files of some podcasts and interviews I’ve given, transcripts of every one of these, all pictures uploaded to my site, etc., and weighs in at approximately 1 gigabyte, uncompressed.