I have an old private blog from more than a decade ago that I’m shutting down. It was basically a semi-private journal about the construction of our home. It has a lot of photographs that are useful to me: in-progress construction shots, design inspiration shots, sketches and renderings, plumbing and electrical rough-ins, etc. I wanted to archive them.

I tried a few free tools to download the images. I found various freeware programs as well as free Chrome extensions, but for some reason they all seemed to stop at the “thumbnail” image and did not crawl through to the full, highest-resolution image.

Enter Scrapy

Nowadays, when there’s a quick “power tool” automated task you’d like to perform, there’s very likely a Python library that’ll help with it. So I was happy to discover the excellent Scrapy library, which is a web crawling (spidering) framework.

There’s also BeautifulSoup, and, in the .NET world, HTMLAgilityPack, which are very good at scraping pages… but Scrapy comes with a full spidering framework, letting you crawl the website to fetch what you want with minimal code.

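Using Python 3.7+, first create a virtual environment, since Python library conflicts are a pain in the neck. The built-in venv module handles this; for example, running python -m venv venv creates an environment in a folder named venv (the folder name is arbitrary).

Activate the virtual environment. On Windows, for an environment created as above, that’s venv\Scripts\activate.

Then installing Scrapy is as simple as pip install scrapy.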
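Once the environment is set up, a quick check from Python can confirm that the interpreter is really running inside a virtual environment and that Scrapy can be imported. This is only a small sketch; the function names in_virtualenv and scrapy_installed are made up for illustration.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def scrapy_installed() -> bool:
    """Return True when the scrapy package can be located for import."""
    return importlib.util.find_spec("scrapy") is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("scrapy importable:", scrapy_installed())
```

If either line prints False, the environment was probably not activated in the current shell, or Scrapy was installed into a different interpreter.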

Then, simply create a Python file in that directory with the code. Here’s mine; I called this file “blogimages.py”. Note that I took some quick-and-dirty shortcuts here because I just wanted this to work for this specific one-time task; it could obviously be generalized further:

Make a subfolder called “images”, because, as you can see in the hacky code above, it’s going to try to save to that folder.
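If you would rather not create the folder by hand, the spider file could create it at startup instead. A small sketch using pathlib; ensure_output_dir is a hypothetical helper name, not part of Scrapy:

```python
from pathlib import Path


def ensure_output_dir(path: str = "images") -> Path:
    """Create the output folder if it is missing and return its path."""
    folder = Path(path)
    # exist_ok=True makes this safe to call on every run.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


if __name__ == "__main__":
    print("Saving images to", ensure_output_dir().resolve())
```

Calling this once before the crawl starts avoids a FileNotFoundError on the first attempt to write an image.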

To run it, all you do is point Scrapy’s runspider command at the file, e.g. scrapy runspider blogimages.py.

That’s it! The images should now be in your images folder. You could easily enhance this spider to find and follow pagination links automatically; I chose not to because, well, I never put pagination links in the old blog, only an “archive” section with a table of contents. I simply fed it a list of “start_urls” near the top of the file to traverse.

Author

Steve's an entrepreneur and software leader. Most recently, he founded HipHip.app, a way to create celebration cards easily. He also founded bigthanks.org, helping people discover and share productive ways they can respond in times of crisis. Steve's worked on consumer apps, online travel, games, relational databases, management consulting and telecom. He launched Alignvote in 2019, which helped Seattle voters find their best-match political candidates. Steve founded BigOven, the first recipe app for iPhone, now with more than 15 million downloads, which was purchased in 2018. Steve served as Chairman of Escapia Inc., the leading SaaS solution for the US vacation rental industry, sold to Homeaway, now part of Expedia. In 1997, Steve was cofounder, President, CEO and Chairman of VacationSpot, a pioneer in the online reservation of vacation rentals, bought by Expedia in January 2000. At Expedia, Steve was Vice President of Vacation Packages, leading the vacation package and destination services teams, helping to create two patents on the first-ever dynamic vacation packaging system on the Internet, which now represents billions in annual transactions for Expedia. He has keynoted on several occasions at the Vacation Rental Managers Association (VRMA), and taught a graduate level course on the strategic management of innovation at the University of Washington Foster Business School in Seattle, Washington. Steve worked for Microsoft from 1991 to 1997 in a variety of senior marketing and executive positions, and led the creation of the internet games group, helping develop several products and patents related to online multiplayer gaming. He helped launch Microsoft Access and was involved in the acquisition of Fox Software by Microsoft in 1993. He's worked for IBM, Booz-Allen Hamilton and Bell Communications Research. 
He holds an MS in Computer Science from Stanford University in Symbolic and Heuristic Computation (AI), an MBA from Harvard Business School, where he was named a George F. Baker Scholar (awarded to top 5% of graduating class), and a dual BS in Applied Mathematics / Computer Science and Industrial Management from Carnegie Mellon University (CMU) with University Honors. Steve volunteers when time allows with Habitat for Humanity, University District Food Bank, YMCA Seattle, Technology Access Foundation (TAF) and other organizations in Seattle.
