I have an old private blog from more than a decade ago that I’m shutting down. It was basically a semi-private journal related to the construction of our home, and it has a lot of photographs on it that are useful to me: in-progress construction shots, design inspiration shots, sketches and renderings, plumbing and electrical rough-ins, and so on. I wanted to archive them.
I tried a few free tools to download the images, including various freeware programs and free Chrome extensions, but for some reason they all stopped at the “thumbnail” image and never crawled through to fetch the full, highest-resolution original.
Nowadays, when there’s a quick “power tool” automated task you’d like to perform, there’s very likely a Python library that’ll help with it. So I was happy to discover the excellent Scrapy library, a spider/crawling framework.
There’s also BeautifulSoup, and, in the .NET world, HTMLAgilityPack, which are very good at scraping pages… but Scrapy comes with a full spidering framework, letting you crawl the website to fetch what you want with minimal code.
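To see what the “scraping” half involves, here’s a minimal sketch of the extraction step using only the Python standard library (the sample HTML and the class name are made up for illustration): collect the href of every <a> tag pointing at a .jpg, which is the full-resolution link, while ignoring the thumbnail <img> nested inside it.

```python
from html.parser import HTMLParser

class ImageLinkParser(HTMLParser):
    """Collect href targets of <a> tags that point at .jpg files."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the full-size link; <img> tags are thumbnails.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "jpg" in href:
                self.links.append(href)

sample = '<a href="/photos/kitchen-full.jpg"><img src="/photos/kitchen-thumb.jpg"></a>'
parser = ImageLinkParser()
parser.feed(sample)
print(parser.links)  # → ['/photos/kitchen-full.jpg']
```

Scrapy (and BeautifulSoup) give you far more convenient selectors than this, plus, in Scrapy’s case, the crawling machinery around it.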
Using Python 3.7+, first create a virtual environment, since Python library conflicts are a pain in the neck:

python -m venv venv
Then activate the virtual environment; on Windows, that’s:

venv\Scripts\activate
Then, it’s as simple as:
pip install scrapy
Then create a Python file in that directory with the spider code. Here’s mine; I called the file “blogimages.py”. Note that I took some quick-and-dirty shortcuts because I just wanted this process to work for this specific one-time task. Obviously it can be generalized further:
import scrapy

class BlogImagesSpider(scrapy.Spider):
    name = 'images'
    start_urls = [
        # the blog's archive/post page URLs go here, one string per page
    ]

    def parse(self, response):
        # find matching image links -- grab all images
        for imgurl in response.xpath("//a[contains(@href,'jpg')]"):
            the_url = imgurl.css("a::attr(href)").get()
            filename = the_url[the_url.rfind("/")+1:]
            filename = filename.replace("%", "")  # get rid of bad characters for filenames
            filename = filename.replace("+", "")
            # request the full-resolution image and hand it off to be saved
            yield scrapy.Request(response.urljoin(the_url),
                                 callback=self.save_image,
                                 cb_kwargs={"filename": filename})

        #next_page = response.css('ul.archive-list li a::attr("href")').get()
        #print("NEXT PAGE ==== ")
        #if next_page is not None:
        #    yield response.follow(next_page, self.parse)

    def save_image(self, response, filename):
        # write the raw image bytes into the "images" subfolder
        with open("images/" + filename, "wb") as f:
            f.write(response.body)
Make a subfolder called “images”, because, as you can see in the hacky code above, the spider saves into that folder.
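If you’d rather not create the folder by hand, a couple of lines near the top of the script will do it; this wasn’t in my original file, just a small convenience:

```python
import os

# Create the output folder if it doesn't already exist; safe to re-run.
os.makedirs("images", exist_ok=True)
```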
To run it, all you do is:
scrapy runspider blogimages.py -o images.json
That’s it! The images should now be in your images folder. You could certainly enhance this spider to find and crawl pagination links automatically; I chose not to because, well, I never put pagination links in the old blog, only an “archive” section with a table of contents. So I simply fed it the list of start_urls at the top of the file to traverse.
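For the curious, the commented-out pagination idea in the spider boils down to “follow the next link until there isn’t one.” Here’s a rough plain-Python sketch of that loop against made-up in-memory pages (no Scrapy, no HTTP), just to show the shape of it:

```python
import re

# Fake "site": URL -> HTML. In the real spider, Scrapy fetches these over HTTP.
pages = {
    "/archive?page=1": '<a class="next" href="/archive?page=2">next</a>',
    "/archive?page=2": '<a class="next" href="/archive?page=3">next</a>',
    "/archive?page=3": "<p>last page</p>",
}

def crawl(start):
    """Visit each page in turn, following the 'next' link until there isn't one."""
    visited = []
    url = start
    while url is not None:
        visited.append(url)
        m = re.search(r'class="next" href="([^"]+)"', pages[url])
        url = m.group(1) if m else None
    return visited

print(crawl("/archive?page=1"))
# → ['/archive?page=1', '/archive?page=2', '/archive?page=3']
```

In Scrapy, that whole loop collapses into the `yield response.follow(next_page, self.parse)` line you can see commented out above.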