I have an old private blog from more than a decade ago that I’m shutting down. It was basically a semi-private journal related to the construction of our home, and it holds a lot of photographs that are useful to me: in-progress construction shots, design inspiration shots, sketches and renderings, plumbing and electrical rough-ins, and so on. I wanted to archive them.

I tried a few free tools to download the images: various freeware programs as well as free Chrome extensions. For some reason, they all stopped at the thumbnail image and never crawled through to the full, highest-resolution original.

Enter Scrapy

Nowadays, when there’s a quick “power tool” automated task you’d like to perform, there’s very likely a Python library that’ll help with it. So I was happy to discover the excellent Scrapy library, a spider/crawling framework.

There’s also BeautifulSoup and, in the .NET world, the Html Agility Pack, which are very good at scraping pages… but Scrapy comes with a full spidering framework, letting you crawl a website and fetch what you want with minimal code.

Using Python 3.7+, first create a virtual environment, since Python library conflicts are a pain in the neck:

virtualenv venv

Activate the virtual environment. On Windows, that’s:

.\venv\Scripts\activate.bat

(On macOS/Linux, it’s “source venv/bin/activate”.)

Then, it’s as simple as:

pip install scrapy

Then create a Python file in that directory with the spider code. Here’s mine; I called the file “blogimages.py”. Note that I took some quick-and-dirty shortcuts here because I just wanted this process to work for this specific one-time task. Obviously it can be further generalized:

import scrapy
import os
import urllib.request

class ImageSpider(scrapy.Spider):
    name = 'images'
    start_urls = [
        'http://sample-blog.blogspot.com/',
        'http://sample-blog.blogspot.com/2005/01/',
        'http://sample-blog.blogspot.com/2005/02/'
    ]

    def parse(self, response):
        # Grab every link whose href contains "jpg" -- on this blog, each
        # thumbnail <img> sits inside an <a> pointing at the full-size image.
        for imgurl in response.xpath("//a[contains(@href,'jpg')]"):
            the_url = imgurl.attrib['href']
            filename = the_url[the_url.rfind("/") + 1:]
            # Get rid of characters that make bad filenames.
            filename = filename.replace("%", "").replace("+", "")
            urllib.request.urlretrieve(the_url, os.path.join("images", filename))

            yield {
                'url': the_url
            }

        # To follow pagination links automatically, something like this would do:
        #next_page = response.css('ul.archive-list li a::attr("href")').get()
        #if next_page is not None:
        #    yield response.follow(next_page, self.parse)
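As an aside, the replace() calls above simply delete the escape characters, so a link like “kitchen%20rough-in.jpg” comes out as “kitchen20rough-in.jpg”. A slightly sturdier stdlib-only cleaner could decode the percent-escapes first; this is a sketch, and safe_filename is a hypothetical helper, not part of the spider above:

```python
import re
from urllib.parse import unquote

def safe_filename(url):
    # Take everything after the last "/" and decode %xx escapes.
    name = unquote(url.rsplit("/", 1)[-1])
    # Then strip characters Windows won't accept in a filename.
    return re.sub(r'[<>:"/\\|?*]', "", name)

print(safe_filename("http://example.com/photos/kitchen%20rough-in.jpg"))  # kitchen rough-in.jpg
```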

Make a subfolder called “images”, because, as you can see in the hacky code above, it’s going to try to save to that folder.
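(If you’d rather have the script create the folder itself, a couple of stdlib lines near the top of the file would do it; this is a sketch, not part of the spider as written:)

```python
import os

# Create the download target if it doesn't exist yet;
# exist_ok=True makes this safe to run repeatedly.
os.makedirs("images", exist_ok=True)
```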

To run it, all you do is:

scrapy runspider blogimages.py -o images.json
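Incidentally, the -o flag writes every yielded item to images.json as a JSON array, which makes a quick sanity check easy. Here’s a stdlib sketch; unique_urls is a hypothetical helper, shown against inline sample data rather than the real feed:

```python
import json

def unique_urls(feed_text):
    # The -o images.json output is a JSON array of the yielded items.
    return {item["url"] for item in json.loads(feed_text)}

# Inline sample standing in for the real images.json contents.
sample = '[{"url": "http://example.com/a.jpg"}, {"url": "http://example.com/a.jpg"}]'
print(len(unique_urls(sample)))  # the duplicate collapses to 1
```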

That’s it! The images should now be in your images folder. You could easily enhance this spider to find and crawl pagination links automatically; I chose not to, because, well, I never put pagination links in the old blog, only an “archive” section with a table of contents. So I simply fed the spider the list of start_urls at the top to traverse.
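Since Blogger archive pages follow a predictable /YYYY/MM/ pattern, that start_urls list could also be generated instead of typed out by hand. A sketch, where archive_urls is a hypothetical helper and the date range is made up:

```python
def archive_urls(base, start_year, start_month, end_year, end_month):
    # Build one monthly archive URL per month, inclusive of both endpoints.
    urls = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        urls.append(f"{base}/{year}/{month:02d}/")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls

print(archive_urls("http://sample-blog.blogspot.com", 2005, 1, 2005, 3))
# ['http://sample-blog.blogspot.com/2005/01/', 'http://sample-blog.blogspot.com/2005/02/', 'http://sample-blog.blogspot.com/2005/03/']
```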
