I have an old private blog from more than a decade ago that I’m shutting down. It was basically a semi-private journal related to the construction of our home, and it holds a lot of photographs that are useful to me: in-progress construction shots, design inspiration shots, sketches and renderings, plumbing and electrical rough-ins, and so on. I wanted to archive them.

I tried a few free tools to download the images. I found various freeware programs as well as free Chrome extensions, but for some reason they all seemed to stop at the “thumbnail” image and never crawled through to get the full, highest-resolution version.

Enter Scrapy

Nowadays, when there’s a quick “power tool” automated task you’d like to perform, there’s very likely a Python library that will help with it. So I was happy to discover the excellent Scrapy library, a spider/crawling framework.

There’s also BeautifulSoup, and, in the .NET world, HtmlAgilityPack, which are very good at scraping pages… but Scrapy comes with a full spidering framework, letting you crawl the website to fetch what you want with minimal code.
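To illustrate the difference: with BeautifulSoup you parse one page at a time and handle the fetching and link-following yourself. A minimal sketch (assuming a page whose full-size image links end in .jpg, as mine did):

import urllib.request
from bs4 import BeautifulSoup

# fetch and parse a single page
html = urllib.request.urlopen("http://sample-blog.blogspot.com/").read()
soup = BeautifulSoup(html, "html.parser")

# print every anchor on this one page that points at a .jpg;
# crawling additional pages would be entirely up to you
for a in soup.find_all("a", href=True):
    if a["href"].lower().endswith(".jpg"):
        print(a["href"])

With Scrapy, the fetching, scheduling, and following of links is handled for you.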

Using Python 3.7+, first create a virtual environment, since Python library conflicts are a pain in the neck:

virtualenv venv
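If you don’t have the virtualenv tool installed, the standard library’s venv module should do the same job:

python -m venv venv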

Activate the virtual environment; on Windows that’s:

.\venv\Scripts\activate.bat
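On macOS or Linux, the equivalent should be:

source venv/bin/activate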

Then, it’s as simple as:

pip install scrapy

Then, simply create a Python file in that directory with the code. Here’s mine; I called the file “blogimages.py”. Note that I took some quick-and-dirty shortcuts here because I just wanted this process to work for one specific, one-time task. Obviously it could be generalized further:

import scrapy
import urllib.request 

class ImageSpider(scrapy.Spider):
    name = 'images'
    start_urls = [
        'http://sample-blog.blogspot.com/',
        'http://sample-blog.blogspot.com/2005/01/',
        'http://sample-blog.blogspot.com/2005/02/'
    ]

    def parse(self, response):
        # grab every anchor whose href contains 'jpg' -- on this blog,
        # those anchors link to the full-resolution images
        for imgurl in response.xpath("//a[contains(@href,'jpg')]"):
            the_url = imgurl.css("a::attr(href)").extract()[0]
            print(the_url)

            # use everything after the last slash as the local filename
            filename = the_url[the_url.rfind("/")+1:]
            print(filename)
            filename = filename.replace("%","") # get rid of bad characters for filenames
            filename = filename.replace("+","")
            print("======")

            # download the full-size image into the local "images" subfolder
            urllib.request.urlretrieve(the_url, "images\\"+filename)

            yield {
                'url': the_url
            }


        #next_page = response.css('ul.archive-list li a::attr("href")').get()
        #print("NEXT PAGE ==== ")
        #if next_page is not None:
        #    yield response.follow(next_page, self.parse)

Make a subfolder called “images”, because, as you can see in the hacky code above, it’s going to try to save to that folder.
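If you’d rather have the script create the folder for you, adding something like this near the top of blogimages.py (after the imports) should work:

import os
os.makedirs("images", exist_ok=True)  # create the images folder if it doesn't already exist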

To run it, all you do is:

scrapy runspider blogimages.py -o images.json
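The -o flag writes every item the spider yields to images.json, so you also end up with a manifest of the image URLs it found, roughly along these lines:

[
  {"url": "http://sample-blog.blogspot.com/photo-1.jpg"},
  {"url": "http://sample-blog.blogspot.com/photo-2.jpg"}
]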

That’s it! The images should now be in your images folder. You can certainly enhance this spider to automatically find and crawl pagination links; I chose not to, because, well, I never put pagination links in the old blog, only an “archive” section with a table of contents. I simply fed the spider the list of start_urls at the top of the file to traverse.
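If your blog does have pagination or an archive list of links, the commented-out lines in the spider are most of what you need. A rough sketch, assuming the links live in a ul with the class “archive-list” (swap in whatever selector matches your template), would be to add this at the end of parse():

        # follow each archive/pagination link and run parse() on those pages too;
        # Scrapy's built-in duplicate filter keeps already-visited URLs from being re-crawled
        for href in response.css('ul.archive-list li a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)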
