Basic web scraping – Part 2: thoughts on debugging and Python3

A while ago, I posted a basic tutorial on how to do some simple web scraping in Python using requests and beautifulsoup.

It evidently got picked up by Google, because it drew a decent number of hits. There were, however, a couple of comments highlighting issues with it – Brett and Sean both reported a “TypeError: ‘NoneType’ object is not callable”.

Weird. It seemed that they were running it in Python 3, so perhaps that was the issue. I had a quick look and updated the code to be Python 3 compatible – in fact the only required change was adding brackets to the print statements. Everything else ran just fine. Except for the issue highlighted above – which I was able to reproduce.

But it didn’t look like a version issue – this was something else. Time to debug, then. I could see where it was failing (from the stack trace, but also the console prints) – it got as far as the download_file method and then hit the error. The particular line was the .split call (splitting on a forward slash). At first glance it looked like there was simply no slash to split on.
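As a reminder, all that line is meant to do is split the URL string on forward slashes and keep the last piece as the filename:

```python
# Pulling the filename out of a URL string with split:
url = "https://www.raspberrypi.org/magpi-issues/MagPi81.pdf"
filename = 'Output/' + url.split('/')[-1]  # last path segment
print(filename)  # Output/MagPi81.pdf
```

That only works if url really is a string, of course – which turns out to be the clue.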

Obvious thing to do then? Check what the url variable actually looks like. Print it out to the screen, and it looked like this:

<a class="btn" href="https://www.raspberrypi.org/magpi-issues/MagPi81.pdf" target="_blank">Download Free</a>

That’s obviously not right. Look a little further up the code, and I could see the issue. Can you?

for link in soup.find_all('a'):
    url = link.get('href')
    if url is not None:
        if url.endswith('.pdf'):
            download_file(link)

It’s in there. I’ll give you a moment.

Found it?

Yep, obviously it was

download_file(link)

We should have been passing the url variable into the function, not the link object. As soon as that’s changed, it all works again.
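That also explains the odd “‘NoneType’ object is not callable” message, rather than a plain AttributeError. A BeautifulSoup Tag forwards unknown attribute lookups to find() – so link.split is treated as link.find('split'), which returns None because there’s no child tag called split, and calling None then blows up. A minimal sketch of the trap (at least in the bs4 versions I’ve used):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com/MagPi81.pdf">Download</a>',
                     'html.parser')
link = soup.find('a')

# On a Tag, .split is interpreted as link.find('split') and comes back None...
print(link.split)  # None

# ...so link.split('/') is really None('/') -> TypeError
try:
    link.split('/')
except TypeError as e:
    print(e)  # 'NoneType' object is not callable
```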

Anyway, the point of this post is that programmers face these sorts of minor issues all the time – and being able to debug and figure them out yourself is what differentiates you from someone who just asks the question and waits for an answer.

Here’s the full code for Python3:

import requests
from bs4 import BeautifulSoup

def download_file(url):
	# Creates a filename to write to; assumes we'll put the downloaded files in a folder called Output - make sure you create this folder first...
	# The filename is the last bit of the URL
	print("Download found! Let's smash " + url)	
	filename = 'Output/' + url.split('/')[-1]

	print("Downloading " + url + " to " + filename)

	# Uses requests again to actually grab the file and save it
	r = requests.get(url, stream=True)
	with open(filename, 'wb') as f:
		for chunk in r.iter_content(chunk_size=1024): 
			if chunk: # filter out keep-alive new chunks
				f.write(chunk)
	print("Done!")

def search_pi():
	print("Opening site...")

	f = requests.get('https://www.raspberrypi.org/magpi/issues/')
	soup = BeautifulSoup(f.text,'lxml')

	print("Hunting...")

	for link in soup.find_all('a'):
		url = link.get('href')
		if url is not None:
			if url.endswith('.pdf'):
				download_file(url)

	print("Hunt complete!")

print("Hello, World!")
print("Let's do some mad scraping")
search_pi()
print("Finished")

2 Comments

  1. Perhaps you could simplify this tutorial to just scraping single variables or words from a website that change over time. This would be truly useful.

    Last version had an indent problem as well.

  2. Still not working.

    Here is the result I get running Python 3

    Traceback (most recent call last):
      File "scraper.py", line 38, in <module>
        search_pi()
      File "scraper.py", line 32, in search_pi
        download_file(url)
      File "scraper.py", line 14, in download_file
        with open(filename, 'wb') as f:
    IOError: [Errno 2] No such file or directory: 'Output/MagPi84.pdf'
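For what it’s worth, that IOError is the script tripping over the missing Output folder that the code comments warn about – open() won’t create intermediate directories for you. A sketch of a download_file that creates the folder itself, using the standard library’s os.makedirs (exist_ok=True makes it a no-op when the folder is already there):

```python
import os
import requests

def download_file(url):
    # Create the Output folder if it doesn't exist yet, so open()
    # below doesn't fail with "No such file or directory"
    os.makedirs('Output', exist_ok=True)

    # The filename is the last bit of the URL
    filename = 'Output/' + url.split('/')[-1]
    print("Downloading " + url + " to " + filename)

    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done!")
```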
