Scraping web sites using Python

Lambda web scraper project (week 2)

What Happened

In my first week on the scraper project, while reading the Beautiful Soup documentation, I came across a note on optimization and performance: if you really need a fast program, or are running on a platform where speed and compute usage count, you may be better off using the lxml HTML parser. I gave it a try and found a big difference in performance when I ran the two versions of my program on AWS:

  • Beautiful Soup: 38 MB memory, about 1000 ms.
  • lxml: 35 MB memory, about 500 ms.

If lxml cuts the run time in half, it is likely to cost half as much to run, so now I know which library to use.
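Switching parsers is a one-argument change in Beautiful Soup. Here is a minimal sketch of the comparison; the HTML snippet and tag names are illustrative, not from my actual scraper:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Part A</p><p>Part B</p></body></html>"

# Pure-Python parser shipped with the standard library
soup_default = BeautifulSoup(html, "html.parser")

# C-backed lxml parser -- requires `pip install lxml`
soup_lxml = BeautifulSoup(html, "lxml")

# Both produce the same parse tree for well-formed HTML
print([p.get_text() for p in soup_lxml.find_all("p")])  # ['Part A', 'Part B']
```

The rest of the scraping code stays identical, which made the benchmark a fair like-for-like comparison.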

I wanted the week to be all about testing the programs and setting up our CI/CD pipeline. Finally, I spent some time learning Python. I feel good that I didn't do what I usually do, which is to sit through a few hours of video about a subject first. I did spend a bit of time googling for resources, and it does seem everyone is trying to sell an online course these days, but I think it is possible to learn without one. The best source I found so far was

One of my first learnings shows how much Python I know: I thought I was running Python 3.6, but it's been 2.7 all along. I vaguely remember taking a course years ago that covered the differences, but since I never really used the language, I've forgotten them. When I reinstalled Ubuntu I thought I had installed Python 3, but I didn't realise the two versions are installed side by side. Try these commands:

python --version
python3 --version

When it came to running my programs under Python 3, I got errors like:

:~$ python3
File "", line 13
print EachPart.get_text()
SyntaxError: invalid syntax
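That error is the classic Python 2 to 3 trip-up: in Python 2, print is a statement, while Python 3 makes it a regular function that needs parentheses. A tiny stand-in object (not my actual scraper code) shows the fix:

```python
# Python 2 syntax that raises SyntaxError under Python 3:
#     print EachPart.get_text()
# Python 3 makes print a function, so the call needs parentheses.

# A minimal stand-in so the example runs without the scraper:
class Part:
    def get_text(self):
        return "example text"

EachPart = Part()
print(EachPart.get_text())  # the Python 3 form
```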

A further thing I got caught out on was thinking Microsoft Azure supported Python for Functions as a Service (FaaS), but it doesn't. In this project, and it is a mini-project, I want to compare deployment on at least two cloud platforms, so I have settled on Google Cloud Platform (GCP) as the replacement. More about that in the coming weeks.

I'm glad to report that my unit tests are written and that they run automatically when I commit my code to the central source code repository on GitLab.

Unit Testing

I don't remember the resource that helped me figure out unit testing, but I had to do a bit of refactoring on the program. If you're interested in how, then this video could help you.
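I won't reproduce my exact refactor here, but the general shape was to pull the parsing logic out of the download code so a test can feed it canned HTML instead of hitting the network. The function names below are illustrative:

```python
from bs4 import BeautifulSoup

def extract_parts(html):
    """Return the text of every <p> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]

# A unit test -- runnable with pytest, or directly with python
def test_extract_parts():
    html = "<html><body><p>one</p><p>two</p></body></html>"
    assert extract_parts(html) == ["one", "two"]

if __name__ == "__main__":
    test_extract_parts()
```

Because extract_parts is a pure function of its input string, the tests run in milliseconds and never depend on a live website being up.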

Auto Deployment Part 1

Really it should be called Continuous Integration (CI). This development practice makes software development faster, safer and more stable: each time source code is checked in, it is verified by an automated build.

Making GitLab test the source code when it's pushed to the central repository was a little daunting for me. I had never used Python or this platform, and the online documentation was so confusing that it left me not wanting to create the .gitlab-ci.yml file you need to do this.

Eventually, I checked YouTube, and Valentin Despa had a perfect video that made sense; I was able to segue off his work and make my own CI file. It took me about 45 minutes to figure out the various little intricacies of GitLab, which I think is quite fast compared to other platforms I've used in the past, but we will see in the next phase when I have to work out deployments.
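For anyone facing the same blank-page problem, a minimal .gitlab-ci.yml for a Python project can be this small. This is a sketch under assumptions, not my exact file; the image tag and file names are placeholders:

```yaml
# .gitlab-ci.yml -- run the test suite on every push
image: python:3.6

before_script:
  - pip install -r requirements.txt

test:
  script:
    - python -m pytest
```

GitLab picks this file up automatically from the repository root; each push spins up the named Docker image, installs the dependencies, and runs the test job.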


The next step is the automated deployment to AWS. I'm looking forward to getting this phase completed and getting to the good stuff.

It's a shame that entrepreneurs and business owners will probably never read this blog. "The money is in repetition," so say Kanye West and Harper Reed. The first needs no introduction; Mr Reed, on the other hand, you might never have heard of. He would say "test, repeat, weaponize". Once we get this system in place, we are ready to start building and changing things fast.

I'm surprised that even today, after so many years of developing software, with DevOps being yodelled from every hilltop, there are so many online shops and applications still being deployed manually. It's the single biggest mistake I see.