Scraping web sites using Python

Lambda web scraper project (week 1)

I'm commencing a new project, where I will be building an application that can monitor prices on websites. I will be trying this on two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (I had originally planned on Microsoft Azure, but found out during the week that MS doesn't support Python yet). I will be using their Function as a Service (FaaS) cloud computing offerings. I'm going to use Python, which has a good library to help parse HTML and is a language I found useful on a couple of occasions when I had to work on very large, gigabyte-sized files.

People take too much time making decisions.

There is a lot of bias out there about which programming language, libraries, patterns and ways of doing things to use. Generally, I never cared too much for any tool I use: it either works, I make it work, or I throw it away. Using Python was not my first choice. I've programmed C# for a very long time now, it is my primary language, and I could have used this opportunity to test .NET Core, but after installing it on my Ubuntu desktop, I noted the updates are huge and slow. Also, when I last used the HTML Agility Pack NuGet package, maybe 12 months ago, I vaguely remember it didn't seem to be getting updated. I can't say I am thrilled about using Python, since I've always struggled to install it properly on my Ubuntu machine, but to try to get off to a good start, I have just reinstalled my laptop and hopefully will be second time lucky.

I want it to be as professional as possible. I'm fed up with people dodging the difficult parts and saying, "this is not best practice, but in this demo, we will do it in a way you should not". I always get annoyed at that and ask myself, how can these be great examples? So, it's time for me to try to do what others won't. Now, chances are that I will get some things wrong; I can hardly program Python, for a start. However, I'm good at DevOps and CI/CD deployments. I think many people warn about their code being a work in progress, along with all the other excuses, for the wrong reasons. Feel free to hit me up if I screw up, by the way. Yes, I am under no illusion that some parts will be ugly. However, I'm going to get something up and running within 6 weeks. So let's get started...

My first step is to choose a system with source control that will allow deployment to a cloud platform like AWS. My go-to choice for this project will be http://gitlab.com. My reasoning is that having continuous integration out of the box will be a huge time saver; I don't want to spend time setting up AWS- or Azure-based proprietary tooling along with an external CI service. You can find the project here: https://gitlab.com/neilspink/aws-lambda-price-grabber (the naming could maybe have been better).

Setting up source control

I started by installing Git on my machine, which I had just reinstalled with Ubuntu 18.04.1 LTS:

:~$ sudo apt install git

Then you need to set up your details, so that you can commit changes to your repository, e.g.

:~$ git config --global user.name "Neil Spink"
:~$ git config --global user.email "neilspink@gmail.com"

If you haven't used GitLab before, you might hit the same access errors I did:

:~$ git clone git@gitlab.com:neilspink/aws-lambda-price-grabber.git
git@gitlab.com: Permission denied (publickey).
fatal: Could not read from remote repository.

You need to set up an SSH key, as documented here:
https://docs.gitlab.com/ee/ssh/README.html
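
If you don't have a key yet, generating one and printing the public part looks roughly like this (use your own email address; the options are just the common defaults from that documentation):

:~$ ssh-keygen -t rsa -b 4096 -C "your.email@example.com"
:~$ cat ~/.ssh/id_rsa.pub

You then copy the printed public key into your GitLab profile under Settings > SSH Keys.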

Have a look at this video for help.

The next step is to create your README file and push the changes back to GitLab. The instructions are provided on the project details page and start with going into the directory where the source code is:
touch README.md
git add README.md
git commit -m "add README"
git push -u origin master
You could edit your Python from the command line; I've done it many times using utilities like Nano or Vi. However, to make life easier I am going to use PyCharm (if you're on Ubuntu, go to the software centre to install it). When you first run the IDE, I would recommend installing Markdown support; you need to maintain a good README file for your project.

The next thing you might want to do is create a .gitignore file for your project. Ignored files are usually build artefacts and machine-generated files that can be derived from your repository source or should otherwise not be committed. I mention this now in particular because as soon as you start working in PyCharm, it will ask if it should add the .idea directory to source control. Your PyCharm settings are yours, therefore I would add the following 2 lines to the ignore file:
.DS_Store
.idea
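
Since this will be a Python project, you will probably also want to keep Python's generated files out of the repository; these two entries are a common starting point:

__pycache__/
*.pyc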
In the next step, I am installing Python:
:~$ sudo apt install python3
I have a new version of Ubuntu, so I can install the Beautiful Soup library we need using the system package manager:
:~$ sudo apt-get install python3-bs4
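
A quick way to check that both Python 3 and Beautiful Soup are in place is to ask for their versions (the exact numbers you see will differ):

:~$ python3 --version
:~$ python3 -c "import bs4; print(bs4.__version__)"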
Apparently, Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of other third-party parsers. The guide I read while setting mine up suggested "lxml" might be useful, so I installed it too; we may find out later whether it is needed:
:~$ sudo apt-get install python-lxml
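
If you do want Beautiful Soup to use lxml rather than the parser built into Python, you pass the parser name as the second argument when creating the soup. A minimal sketch, where the file name is just a placeholder:

from bs4 import BeautifulSoup

with open('saved-page.html', 'r') as myfile:  # any HTML file you saved locally
    data = myfile.read()

soup = BeautifulSoup(data, 'lxml')  # explicitly ask for the lxml parser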
A word of warning if you haven't done any web scraping before. A program can make requests to a web server much faster than a human and can easily get you into trouble in a number of ways. If you repeatedly crawl a site or page, your IP could become blocked (which could be a terrible problem if you're doing this from your work offices). Some websites do not allow you to scrape/scan/crawl them using any utility. For example, ricardo.ch, an online auction site, will recognise you are using a program and tell you to stop crawling. There are tricks and ways around this, but you will likely be breaking the terms and conditions of legitimate use of their website.

The best way to avoid getting into trouble while you figure everything out is to download a complete webpage from the site you want to target, then work on the copy until you have perfected your program.

So, let's get started. Open a browser and get a page you want to test your web scraper on. I want to extract prices from a well-known online shop. In the file menu, you should find an option to save the complete webpage and store it on your computer.
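
One more precaution before pointing anything at a live site: many sites publish a robots.txt file describing what they allow crawlers to fetch, and Python 3's standard library can read it for you. A minimal sketch (the URLs are placeholders, not a site I'm actually targeting):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # placeholder site
rp.read()

# True means the site's rules allow a generic crawler to fetch this page.
print(rp.can_fetch('*', 'https://www.example.com/some-product-page'))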

Here is the code from the program I showed in the video (warning: it's Python 2.7 code):

from bs4 import BeautifulSoup

with open('../test-data/pullover.html', 'r') as myfile:
    data=myfile.read().replace('\n', '')

soup = BeautifulSoup(data)

for EachPart in soup.select('div[class*="h-text h-color-red title-typo h-p-top-m"]'):
    print EachPart.get_text()

The code above extracts the price from a webpage I have saved on my computer. It is searching for <div> tags with a class attribute of "h-text h-color-red title-typo h-p-top-m". You can get more help on this on the Beautiful Soup documentation page by searching for '.select', which is about halfway down the page.
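
Since we installed python3 and python3-bs4 earlier, here is the same logic as a Python 3 sketch; the file path and CSS selector are unchanged from the example above:

from bs4 import BeautifulSoup

# Load the locally saved copy of the product page.
with open('../test-data/pullover.html', 'r') as myfile:
    data = myfile.read()

# html.parser ships with Python; 'lxml' also works if you installed it.
soup = BeautifulSoup(data, 'html.parser')

# Print the text of every <div> whose class attribute contains the price classes.
for each_part in soup.select('div[class*="h-text h-color-red title-typo h-p-top-m"]'):
    print(each_part.get_text())

Run it with python3 from the project directory once you have the saved page in place.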

In my next blog post, I will be testing this code on AWS Lambda and looking at setting up an automated deployment pipeline to the cloud platform.

Neil
