Update: 2019-05-13 – this has been edited to fix a little bug in the code transcript. There is also a follow-up article on working with Python3 here.
I’m a print subscriber of both The MagPi and HackSpace magazine. A recent edition of HackSpace (p129 of issue #12) referenced a project in The MagPi about a Raspberry Pi powered CNC machine. This was something I was really interested in, but my print editions don’t go back that far. So I went to the RPi website, found issue #38 and downloaded it as a PDF. But there was nothing in it about a CNC machine! The site search on the MagPi website didn’t help either, but after some googling I found it – it seems the reference in HackSpace was wrong – it meant page 38 of issue 39. But that’s somewhat by the by.
I’ve been using a Raspberry Pi (model B!) for ages to do some lightweight web scraping. It’s written in Python and uses cron for scheduling. It monitors some local classifieds sites for some keywords that I’m interested in; so that rather than having to laboriously search these sites, it does it automatically twice a day and then notifies me if it finds something matching a search term. This has been really helpful and I’ve picked up some niche woodworking tools that don’t crop up on classifieds all that often (and if they do then they’re usually sold by the time I find them.) With my ‘pyscraper’ I’m now usually first to get in touch and have snared some great stuff. But that’s also somewhat by the by.
I modified my script to target the ‘issues’ page of the MagPi website to find all the PDF download links and then download the PDF file itself. So I now have a complete electronic back catalogue of The MagPi and HackSpace, which, since the PDF is created so beautifully, is fully searchable.
I figured this might be of interest as an introduction to Pi/Python/scraping. The subject of scraping is pretty huge and there are some big frameworks available if you want to do anything particularly involved.
This article is not about that and will barely even scratch the surface. And as with most things, there are 20 different ways to do it – this is just one of them.
We’ll be using:
- A Raspberry Pi (although this is really only for a bit of ironic symbolism – any environment running Python should be just fine. As it happens, I’m doing this entirely on macOS. The real reason to use a Pi is that if you build something you later want to leave running on a scheduler, a Pi is perfect for that, rather than needing a big laptop or desktop running all the time.)
- Python (2 or 3… your shout)
- Requests library (pretty solid web client library)
- Beautiful Soup (a nice DOM parser that makes extracting stuff from HTML / XML a lot easier)
Getting started and some assumptions
I’ll assume that you have a bare installation of Raspbian, whether that’s with the desktop or the terminal interface only.
I’ll also assume you have a basic understanding of how to do things in a Linux environment; in other words, whether you’re directly on the machine in the command line interface or on the desktop, or whether you’re on it remotely over SSH, whatever works for you.
I’ll assume you’re in your home folder, which is /home/pi by default.
I’m going to do everything through the CLI commands; if you’re more comfortable in the desktop environment, then just do it that way.
I’ll assume that you’ll know to use sudo if you need to – i.e., if anything fails, then this is usually resolved by rerunning the command with sudo ahead of it.
I prefer nano as a text editor but if you want to use vim or anything else, you’ll do that.
I’ll assume you know how to install Python modules using either easy_install or pip. I personally prefer pip but there’s not much in it. (If you don’t have it, install it using easy_install. Irony. Love it.)
I’ll assume you’re aware of Python’s requirement for properly indented code. There’s a strong chance that this blog will barf all over indentation so you may need to fix that in your script.
Create a folder called Scraper using mkdir Scraper
Change in to this with cd Scraper
Create a scraper file in this directory with touch scraper.py
I like to verify that everything is working nicely before I get too far into it. So for now we’ll just check that Python is happy. We’ll edit our scraper file with nano scraper.py and put this in it:
print("Hello, World!")
CTRL + X to exit nano, saving when prompted.
Run it with python scraper.py and you should get a beautiful looking Hello, World! on screen.
Whilst Hello, World is clearly awesome, it’s time to do something a tad more interesting and start scraping. To do this, we need some modules – Requests and BeautifulSoup.
pip install Requests
pip install bs4
pip install lxml
Let that do its thing for a while. We’ll quickly verify they installed OK. Modify your file again (nano scraper.py) so that it looks like this:
import requests
from bs4 import BeautifulSoup

print("Hello, World!")
Exit and Save and run it again. This should do exactly as before with no errors. If you get errors then chances are:
- You mis-typed the code
- Something didn’t install correctly
- You need sudo
Check those things and report back when your environment is up and running.
Now for some fun
As I said previously, getting in to scraping is a huge topic and beyond the scope of a basic introductory article. It usually requires a reasonable understanding of HTML and knowing the basic workings of an HTTP web server (in terms of the types of requests that are made back and forth from you to the target server.) Understanding response messages, headers, the importance of cookies and session tokens and so on, are all somewhat advanced topics. So for now, we’ll be keeping it basic.
To begin with, we need a target to scrape. We’ll start with https://www.raspberrypi.org/magpi/issues/ which is the page on the RPi website that lists all the available PDF downloads of The MagPi.
Open that page in your favourite browser (mine is currently Firefox), have a look around and get to grips with how things are structured. Pretty standard here – some rows and columns. Click a button to get a download link. Now, right-click (or equivalent) and Inspect Element on one of the issues and you’ll see the source that we’re working with.
Deconstructing that in detail is beyond this article, but ultimately what we’re looking to do is understand how the web developers have built the site and zero in on the repeating aspects of the page that contain the stuff we’re after.
In the case of our target site, it’s a little interesting. We have the basic HTML that makes up the visual components of the page – i.e., it shows the magazine icon and details – but the ‘Get Issue’ button actually pops up a modal dialog. The HTML for the dialog is also embedded in the source, further down the page, and it contains the link to the PDF download. So we can skip past the visual elements and go straight after the download links. This will obviously vary from site to site, so knowing how to deconstruct HTML, figure out where things live and traverse links are key skills in scraping.
Job one is to just get our script to grab that HTML so that we can start to pick through it programmatically. This is where the Requests library and BeautifulSoup come in to play as they make our lives really easy.
In your scraper.py script, add the following:
print("Opening site...")
f = requests.get('https://www.raspberrypi.org/magpi/issues/')
soup = BeautifulSoup(f.text, 'lxml')
print(soup)
If you run that, your screen should fill up with some lovely HTML. This is doing a few things:
- Prints a lovely “We’ve started” message
- Uses requests to send a GET request to our target URL, with the response stored in a variable called f
- Puts the text of the response into a BS object called soup
- Prints it out
Note: in this case we’re using the lxml parser. This is because the source code on the target URL is actually (somewhat) malformed, and the lxml parser is tolerant of this. For more on this, see here.
Extracting the stuff we want
If we were looking to extract more than just the download links of the PDF magazines, then we’d now need to start inspecting the HTML and using e.g., xpath to traverse the structure to pull out all the stuff we wanted, assign to variables and maybe store away in a database.
However, in this nice simple case, we only care about the download links so we can be a little lazy, and go about this a little differently. And when I say differently… I mean dirtily! Performance concerns aside, all we really care about is an a element on the page that has an href to a PDF file. This is where BS makes our lives easier, using some built-in methods such as ‘find_all’.
for link in soup.find_all('a'):
    # print(link)
    url = link.get('href')
    if url is not None:
        if ".pdf" in url:  # or: if url.endswith('.pdf'):
            # do something...
            pass
In simplest terms, we find every a element in the page, we grab its href attribute, and, if that’s not None, check if the link contains (or ends with) .pdf. If it does, then we’ll do something with it.
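The check in that loop can also be pulled out into a small helper that works on plain href values, which makes the difference between the "contains" and "ends with" tests easier to see. This is only a sketch: is_pdf_link and the sample URLs are made up for illustration and are not part of the original script.

```python
def is_pdf_link(url):
    """Return True if an href value looks like a link to a PDF file."""
    # link.get('href') returns None for <a> tags that have no href attribute
    if url is None:
        return False
    # Ignore any query string or fragment, and compare case-insensitively
    path = url.split('#')[0].split('?')[0]
    return path.lower().endswith('.pdf')


# Hypothetical href values, as link.get('href') might return them
hrefs = [
    'https://example.com/issues/MagPi39.pdf',
    'https://example.com/issues/MagPi40.PDF?download=1',
    'https://example.com/docs/guide.pdf.html',  # contains ".pdf" but is not a PDF
    None,
]

pdf_urls = [u for u in hrefs if is_pdf_link(u)]
print(pdf_urls)
```

A plain ".pdf" in url test would also let the third link through, which is why the ends-with check is probably the safer of the two.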
Once we’ve extracted the URL to the file, we’re going to need some code to download it. There are a few complications with this in terms of handling file size, interruptions and so on, but a nice simple implementation of this looks like:
def download_file(url):
    # Creates a filename to write to; assumes we'll put the downloaded files
    # in a folder called Output - make sure you create this folder first...
    # The filename is the last bit of the URL
    filename = 'Output/' + url.split('/')[-1]
    print("Downloading " + url + " to " + filename)

    # Uses requests again to actually grab the file and save it
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done!")
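Another complication is that an href is not always a full URL, and the last part of a URL can carry a query string that would end up in the filename. Here is a minimal sketch of how the URL and filename could be tidied up before downloading, using only the standard library. The helper names resolve_link and filename_for are hypothetical, and the example URLs are invented.

```python
import os

try:
    from urllib.parse import urljoin, urlsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit  # Python 2


def resolve_link(page_url, href):
    """Turn a possibly relative href into an absolute URL."""
    return urljoin(page_url, href)


def filename_for(url, folder='Output'):
    """Build a local path from the last part of the URL path."""
    # urlsplit(...).path drops any ?query or #fragment from the URL
    name = urlsplit(url).path.split('/')[-1]
    return os.path.join(folder, name)


page = 'https://example.com/magpi/issues/'
full_url = resolve_link(page, '/downloads/MagPi39.pdf?attachment=1')
print(full_url)
print(filename_for(full_url))
```

If the links on your target page are already absolute, as they appear to be in this article, resolve_link simply returns them unchanged.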
Putting it all together
We now have more or less everything we need to do this.
We’ll sort things out a little and your completed file should look a little like this:
import requests
from bs4 import BeautifulSoup


def download_file(url):
    # Creates a filename to write to; assumes we'll put the downloaded files
    # in a folder called Output - make sure you create this folder first...
    # The filename is the last bit of the URL
    filename = 'Output/' + url.split('/')[-1]
    print("Downloading " + url + " to " + filename)

    # Uses requests again to actually grab the file and save it
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done!")


def search_pi():
    print("Opening site...")
    f = requests.get('https://www.raspberrypi.org/magpi/issues/')
    soup = BeautifulSoup(f.text, 'lxml')

    print("Hunting...")
    for link in soup.find_all('a'):
        url = link.get('href')
        if url is not None:
            if url.endswith('.pdf'):
                download_file(url)
    print("Hunt complete!")


print("Hello, World!")
print("Let's do some mad scraping")
search_pi()
print("Finished")
And that is it – a very basic implementation of web scraping in Python with the added bonus of a full electronic back catalogue of The MagPi. This should be enough for you to start going after other reasonably simple targets.
The actual implementation of this included a few other bits:
- Support for scraping more than one site – e.g., HackSpace
- Some graceful error handling
- Support for scheduling and re-scraping – i.e., we don’t want to download everything every time we run this; we want it to work out what’s new since the last run and only download that
- Make it a bit more configurable
- And a few other things to boot
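As a taste of the re-scraping point, one simple way to skip issues that have already been fetched is to check whether the file is already in the Output folder. This is a sketch under that assumption, not the code from my actual implementation. The name new_links is hypothetical, and a half-finished download would wrongly count as done.

```python
import os


def new_links(urls, folder='Output'):
    """Return only the URLs whose file is not already in the folder."""
    fresh = []
    for url in urls:
        # Same naming rule as download_file: the last bit of the URL
        filename = os.path.join(folder, url.split('/')[-1])
        if not os.path.exists(filename):
            fresh.append(url)
    return fresh
```

In search_pi() the PDF URLs could be collected into a list first and passed through new_links() before calling download_file() on each one.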
Drop a comment if you’d like to see a further post with those things included, but in the meantime… enjoy.