Scraping websites using Python Blog

Lambda web scraper project (week 2)

What Happened

In my first week working on the scraper project, while reading the Beautiful Soup documentation, I came across a note on optimization and performance: if you really need a fast program, or are using a platform where speed and compute usage count, you may be better off using the lxml HTML parser. I gave it a try and found a big difference in performance when I ran my two programs on AWS:

  • Beautiful Soup: 38 MB memory and about 1,000 ms.
  • lxml: 35 MB memory and 500 ms.

lxml ran in roughly half the time, so it is likely to cost about half as much to run, and now I know which library to use.
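
Switching parsers in Beautiful Soup is a one-argument change. Here is a minimal sketch of the comparison, assuming Beautiful Soup 4 and lxml are installed and that pullover.html is the saved test page from week 1:

from bs4 import BeautifulSoup

with open('../test-data/pullover.html', 'r') as myfile:
    data = myfile.read()

# Pure-Python parser shipped with the standard library.
soup_default = BeautifulSoup(data, 'html.parser')

# Same call, but backed by the faster lxml parser.
soup_lxml = BeautifulSoup(data, 'lxml')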

I wanted the week to be all about testing the programs and setting up our CI/CD pipeline, and I also spent some time learning Python. I feel good that I haven't done what I usually do, which is to take in a few hours of video about a subject first. I did spend a bit of time googling for resources, and it does seem everyone is trying to sell an online course these days, but I think it is possible to do this without one. The best source I found so far was https://docs.python.org

One of my first learnings shows how little Python I know: I thought I was running Python 3.6, but it's been 2.7 all along. I vaguely remember taking a course years ago covering the differences, but since I never really used the language, I've forgotten them. When I reinstalled Ubuntu I thought I had installed Python 3, but I didn't realise the two versions are installed side by side. Try these commands:

python --version
python3 --version

When it came to running my programs under version three, I got errors like:

:~$ python3 test1.py
File "test1.py", line 13
print EachPart.get_text()
^
SyntaxError: invalid syntax
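
The cause is that Python 3 turned print from a statement into a function, so the old Python 2 syntax no longer parses. The fix is simply to add parentheses:

print(EachPart.get_text())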

A further thing I got caught out on was thinking Microsoft Azure supported Python for Functions as a Service (FaaS), but it doesn't. This is a mini-project, and I want to compare deployment on at least two cloud platforms, so I have picked Google Cloud Platform (GCP) as the replacement. More about that in the coming weeks.

I'm glad to report that my unit tests are written and that they run automatically when I push my code to the central source code repository on GitLab.

Unit Testing

I don't remember the resource that helped me figure out unit testing, but I had to do a bit of refactoring on the program. If you're interested in how, this video could help you.
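
As a rough idea of what the tests look like, here is a minimal sketch using the unittest module from the standard library. The function name get_prices and the test data path are my own placeholders, not necessarily what the project uses; the point is that the scraping logic has to be pulled out into a function you can call from a test:

import unittest
from bs4 import BeautifulSoup

def get_prices(html):
    # Placeholder for the refactored scraping logic.
    soup = BeautifulSoup(html, 'html.parser')
    selector = 'div[class*="h-text h-color-red title-typo h-p-top-m"]'
    return [part.get_text() for part in soup.select(selector)]

class TestPriceGrabber(unittest.TestCase):
    def test_finds_price_in_saved_page(self):
        # Assumed location of the saved test page.
        with open('../test-data/pullover.html', 'r') as myfile:
            html = myfile.read()
        prices = get_prices(html)
        self.assertTrue(len(prices) > 0)

if __name__ == '__main__':
    unittest.main()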

Auto Deployment Part 1

Really it should be called Continuous Integration (CI). This development practice makes software development faster, safer and more stable: each time source code is checked in, it is verified by an automated build.

Making GitLab test the source code when it's pushed to the central repository was a little daunting for me. I had never used Python or this platform before, and the online documentation was confusing enough that I kept putting off creating the .gitlab-ci.yml file you need for this.

Eventually, I checked YouTube, and Valentin Despa had a perfect video that made sense; I was able to build on his work and make my own CI file. It took me about 45 minutes to figure out the various little intricacies of GitLab, which I think is quite fast compared to other platforms I've used in the past, but we will see in the next phase when I have to work out deployments.
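
To give a feel for what the file involves, here is a minimal sketch of a .gitlab-ci.yml that runs the unit tests on every push. This is not the project's actual configuration; the Docker image, dependencies and test command are assumptions:

image: python:3.6          # assumed base image with Python preinstalled

test:
  stage: test
  script:
    - pip install beautifulsoup4 lxml   # assumed dependencies
    - python -m unittest discover       # assumed test layout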

Next

The next step is automated deployment to AWS. I'm looking forward to getting this phase completed and moving on to the good stuff.

It's a shame that entrepreneurs and business owners will probably never read this blog. "The money is in repetition", so say Kanye West and Harper Reed. The first needs no introduction; Mr Reed, on the other hand, you might never have heard of, and he would say "test, repeat, weaponize". Once we get this system in place, we are ready to start building and changing things fast.

I'm surprised, even today, after so many years developing software and with DevOps being yodelled from every hilltop, that there are so many online shops and applications still being deployed manually. It's the single biggest mistake I see.

Scraping websites using Python Blog

Lambda web scraper project (week 1)

I'm commencing a new project where I will be building an application that can monitor prices on websites. I will be trying this on two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (I found out during the week that Microsoft Azure doesn't support Python for this yet). I will be using their Function as a Service (FaaS) cloud computing services. I'm going to use Python, which has a good library for parsing HTML and is a language I found useful on a couple of occasions when I had to work on very large, gigabyte-sized files.

People take too much time making decisions.

There is a lot of bias out there about which programming language, libraries, patterns and ways of doing things to use. Generally, I never cared too much for any tool I use: it either works, I make it work, or I throw it away. Python was not my first choice. I've programmed C# for a very long time now, it is my primary language, and I could have used this opportunity to test .NET Core, but after installing it on my Ubuntu desktop I noted the updates are huge and slow. Also, when I last used the NuGet HTML Agility Pack, maybe 12 months ago, I vaguely remember it didn't seem to be getting updated. I can't say I am thrilled about using Python, since I've always struggled to install it properly on my Ubuntu machine, but to get off to a good start I just reinstalled my laptop and hopefully will be second time lucky.

I want this to be as professional as possible. I'm fed up with people dodging the difficult parts and saying, "this is not best practice, but in this demo we will do it in a way you should not". I always get annoyed at that and ask myself, how can these be great examples? So, it's time for me to try to do what others won't. Now, chances are that I will get some things wrong; I can hardly program Python for a start. However, I'm good at DevOps and CI/CD deployments. Many people warn about their code being work in progress, along with all the other excuses, for the wrong reasons. Feel free to hit me up if I screw up, by the way. Yes, I am under no illusion that some parts will be ugly. However, I'm going to get something up and running within 6 weeks. So let's get started...

My first step is to choose a system with source control that will allow deployment to a cloud platform like AWS. My go-to choice for this project will be http://gitlab.com. My reasoning is that having continuous integration out of the box will be a huge time saver: I don't want to spend time setting up AWS- or Azure-based proprietary tooling along with an external CI service. You can find the project here: https://gitlab.com/neilspink/aws-lambda-price-grabber (the naming could maybe have been better).

Setting up source control

I started by installing Git on my machine, which I had just reinstalled with Ubuntu 18.04.1 LTS:

:~$ sudo apt install git

Then you need to set up your details so that you can commit changes to your repository, e.g.

:~$ git config --global user.name "Neil Spink"
:~$ git config --global user.email "neilspink@gmail.com"

If you haven't used GitLab before, you might hit the access errors I had:

:~$ git clone git@gitlab.com:neilspink/aws-lambda-price-grabber.git
git@gitlab.com: Permission denied (publickey).
fatal: Could not read from remote repository.

You need to set up an SSH key, as documented here:
https://docs.gitlab.com/ee/ssh/README.html

Have a look at this video for help.
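
If the video isn't available, the short version (per the GitLab documentation above) is to generate a key pair and paste the contents of the public key into your GitLab profile settings under SSH Keys; the email address below is just a placeholder:

:~$ ssh-keygen -t rsa -b 4096 -C "you@example.com"
:~$ cat ~/.ssh/id_rsa.pub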

The next step is to create your README file and push the changes back to GitLab. The instructions are provided on the project details page and start with going into the directory where the source code is:
touch README.md
git add README.md
git commit -m "add README"
git push -u origin master
You could edit your Python from the command line; I've done it many times using utilities like Nano or Vi. However, to make life easier I am going to use PyCharm (if you're on Ubuntu, go to the software centre to install it). When you first run the IDE, I would recommend installing Markdown support, because you need to maintain a good README file for your project.

The next thing you might want to do is create a .gitignore file for your project. Ignored files are usually build artefacts and machine-generated files that can be derived from your repository source or should otherwise not be committed. I mention this now in particular because as soon as you start working in PyCharm, it will ask if it should add the .idea directory to source control. Your PyCharm settings are yours, therefore I would add the following 2 lines to the ignore file:
.DS_Store
.idea
In the next step, I am installing Python
:~$ sudo apt install python3
I have a new version of Ubuntu, so I can install the Beautiful Soup library we need using the system package manager:
:~$ sudo apt-get install python3-bs4
Apparently, Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party parsers. The documentation I read while setting up suggested "lxml" might be useful, so I installed it too; we will find out later whether it pays off:
:~$ sudo apt-get install python-lxml
A word of warning if you haven't done any web scraping before: a program can make requests to a web server much faster than a human and can easily get you into trouble in a number of ways. If you repeatedly crawl a site or page, your IP could become blocked (which could be a terrible problem if you're doing this from your work offices). Some websites do not allow you to scrape, scan or crawl them using any utility. For example, ricardo.ch, an online auction site, will recognise you are using a program and tell you to stop crawling. There are tricks and ways around this, but you would likely be breaking the terms and conditions of legitimate use of their website.

The best way to avoid getting into trouble while you figure everything out is to download a complete webpage from the site you want to target, then work on the copy until you have perfected your program.

So, let's get started. Open a browser and go to a page you want to test your web scraper on. I want to extract prices from a well-known online shop. In the browser's file menu, you should find an option to save the complete webpage and store it on your computer.

Here is the code from the program I showed in the video (warning: it's Python 2.7 code):

from bs4 import BeautifulSoup

# Load the saved copy of the product page and strip the newlines.
with open('../test-data/pullover.html', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# No parser is specified, so Beautiful Soup picks the best one it finds installed.
soup = BeautifulSoup(data)

# Select every <div> whose class attribute contains the price styling classes
# and print its text (Python 2 print statement).
for EachPart in soup.select('div[class*="h-text h-color-red title-typo h-p-top-m"]'):
    print EachPart.get_text()

The code above extracts the price from a webpage I have saved on my computer. It is searching for <div> tags with the attribute class="h-text h-color-red title-typo h-p-top-m". You can get more help on this in the Beautiful Soup documentation by searching for '.select', which is covered about halfway down the page.
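
For reference, here is what an equivalent Python 3 version might look like, the kind of change I ended up making in week 2. This is a sketch that assumes Beautiful Soup 4 and lxml are installed:

from bs4 import BeautifulSoup

with open('../test-data/pullover.html', 'r') as myfile:
    data = myfile.read()

# Explicitly asking for the lxml parser, which proved faster on AWS.
soup = BeautifulSoup(data, 'lxml')

for each_part in soup.select('div[class*="h-text h-color-red title-typo h-p-top-m"]'):
    print(each_part.get_text())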

In my next blog post, I will be testing this code on AWS Lambda and looking at setting up an automated deployment pipeline to the cloud platform.