Basic web scraping - Part 2: thoughts on debugging and Python3
Update: 2020-04-13. The main article has been overhauled and republished here! Includes Python3 updates, Dropbox uploading and a code repo.
I’m leaving the below in place for info and posterity and for misc useful tidbits.
A while ago, I posted a basic tutorial on how to do some simple web scraping in Python using requests and beautifulsoup.
It evidently got picked up by Google, because the hits were good. There were, however, a couple of comments highlighting issues with it - Brett and Sean both reported “TypeError: ‘NoneType’ object is not callable”.
Weird. It seemed that they were running it in Python3, so perhaps that was the issue. So I had a quick look. I updated the code to be Python3 compatible - and in fact the only required change was to use brackets in the print statements. Everything else runs just fine. Except for the issue highlighted above - which I was able to reproduce.
But it didn’t look like a code version issue - this was something else. Time to debug, then. I could see where it was failing (based on the stack trace, but also the console prints) - it was getting to the download_file function and then hitting an issue. The line of code in particular was the .split call (splitting on a forward slash). At first glance it looked as though there was no slash to split on, but splitting a string that has no slash just returns the whole string rather than raising a TypeError, so the problem had to be with the thing being split.
Obvious thing to do then? What does my url variable look like? Look further up the code, and I can see the issue. Can you?
for link in soup.find_all('a'):
    url = link.get('href')
    if url is not None:
        if url.endswith('.pdf'):
            download_file(link)
It’s in there. I’ll give you a moment.
Yep, obviously it was the download_file(link) call.
We should have been passing the url variable in to the function. As soon as that’s changed, it all works OK again.
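For anyone wondering why the error was "not callable" rather than something more helpful: as far as I understand it, a BeautifulSoup tag treats an attribute it doesn't recognise as a request to find a child tag with that name, and returns None when there isn't one. So link.split comes back as None, and calling None blows up. The sketch below mimics that with a tiny stand-in class so it runs without any installs. FakeTag and the example.com URL are made up for illustration; it is a simplification, not BeautifulSoup's real code.

```python
class FakeTag:
    """Simplified stand-in for a BeautifulSoup Tag (hypothetical, for illustration)."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs

    def get(self, key, default=None):
        # Like Tag.get: read an HTML attribute such as href.
        return self.attrs.get(key, default)

    def __getattr__(self, attr):
        # Python only calls this when normal attribute lookup fails.
        if attr.startswith("__"):
            raise AttributeError(attr)
        # A real Tag would search for a child tag called `attr` here.
        # This stand-in has no children, so the search always comes back empty.
        return None


link = FakeTag("a", {"href": "https://example.com/issues/MagPi01.pdf"})
url = link.get("href")

# Passing the URL string works: str.split is a real method.
print(url.split("/")[-1])

# Passing the tag itself fails: link.split is None, and None cannot be called.
try:
    link.split("/")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

The takeaway is that the stack trace pointed at the right line, but the message only makes sense once you ask what type the variable really is.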
Anyway, the point of this post was to highlight that programmers are typically faced with these sorts of minor issues - and being able to debug and figure them out is what sets someone apart from a person who just asks the question and waits for the answer.
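One habit that would have made this bug quicker to spot is checking what you've been handed before using it. Here's a minimal sketch of that idea, assuming a small helper called filename_for (my name for illustration, not something in the original script) that builds the output filename and complains clearly if it gets anything other than a string.

```python
def filename_for(url, folder="Output"):
    """Build the output path from the last part of the URL.

    Hypothetical helper: raises a readable error when `url` is not a string,
    instead of failing somewhere inside the split call.
    """
    if not isinstance(url, str):
        raise TypeError(
            "expected the URL as a str, got "
            + type(url).__name__
            + ": "
            + repr(url)
        )
    return folder + "/" + url.split("/")[-1]


# A string gives the filename as before.
print(filename_for("https://example.com/issues/MagPi01.pdf"))

# Anything else fails with a message that names the real problem.
try:
    filename_for(42)
except TypeError as exc:
    print("TypeError:", exc)
```

It's a few extra lines, but the error now tells you what was passed in, which is the first thing you'd go looking for anyway.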
Here’s the full code for Python3:
import requests
from bs4 import BeautifulSoup

def download_file(url):
    # Creates a filename to write to; assumes we'll put the downloaded files in a folder called Output - make sure you create this folder first...
    # The filename is the last bit of the URL
    print("Download found! Let's smash " + url)
    filename = 'Output/' + url.split('/')[-1]
    print("Downloading " + url + " to " + filename)
    # Uses requests again to actually grab the file and save it
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done!")

def search_pi():
    print("Opening site...")
    f = requests.get('https://www.raspberrypi.org/magpi/issues/')
    soup = BeautifulSoup(f.text,'lxml')
    print("Hunting...")
    for link in soup.find_all('a'):
        url = link.get('href')
        if url is not None:
            if url.endswith('.pdf'):
                download_file(url)
    print("Hunt complete!")

print("Hello, World!")
print("Let's do some mad scraping")
search_pi()
print("Finished")
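The script above expects the Output folder to exist already. If you'd rather not create it by hand, something like the sketch below could do it for you. prepare_target is a hypothetical helper name and the example.com URL is a placeholder; the demo runs in a temporary directory so it doesn't leave anything behind.

```python
import os
import tempfile


def prepare_target(url, folder="Output"):
    """Make sure the folder exists, then return the path to write to.

    Hypothetical helper, not part of the original script.
    """
    # exist_ok=True means this is safe to call on every download.
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, url.split("/")[-1])


# Demo in a throwaway directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as scratch:
    folder = os.path.join(scratch, "Output")
    target = prepare_target("https://example.com/issues/MagPi01.pdf", folder)
    # Calling it a second time should not raise, even though the folder exists.
    prepare_target("https://example.com/issues/MagPi02.pdf", folder)
    folder_created = os.path.isdir(folder)
    target_name = os.path.basename(target)
    print(folder_created, target_name)
```

In download_file you would then use the returned path in place of the hand-built filename.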