Scraping websites using Python

Lambda web scraper project (week 1)

I'm commencing a new project in which I will build an application that can monitor prices on websites. I will try this on two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (I had originally planned on Microsoft Azure, but found out during the week that it doesn't support Python yet). I will be using their Function-as-a-Service (FaaS) offerings. I'm going to use Python, which has a good library for parsing HTML and which I found useful on a couple of occasions when I had to work on very large, gigabyte-sized files.

People take too much time making decisions.

There is a lot of bias out there about which programming languages, libraries, patterns and ways of doing things to use. Generally, I never cared too much for any tool I use: it either works, I make it work, or I throw it away. Using Python was not my first choice. I've programmed C# for a very long time now, it is my primary language, and I could have used this opportunity to test .NET Core, but after installing it on my Ubuntu desktop I noted the updates are huge and slow. Also, when I last used the NuGet HTML Agility Pack, maybe 12 months ago, I vaguely remember it didn't seem to be getting updated. I can't say I am thrilled about using Python, since I've always struggled to install it properly on my Ubuntu machine, but to try to get off to a good start I just reinstalled my laptop and hopefully will be second time lucky.

I want it to be as professional as possible. I'm fed up with people dodging the difficult parts and saying, "this is not best practice, but in this demo we will do it in a way you should not". I always get annoyed at that and ask myself: how can these be great examples? So, it's time for me to try to do what others won't. Now, chances are that I will get some things wrong; I can hardly program Python, for a start. However, I'm good at DevOps and CI/CD deployments. I think many people warn about their code being a work in progress, along with all the other excuses, for the wrong reasons. Feel free to hit me up if I screw up, by the way. Yes, I am under no illusion that some parts will be ugly. However, I'm going to get something up and running within 6 weeks. So let's get started...

My first step is to choose a source control system that will allow deployment to a cloud platform like AWS. My go-to choice for this project is http://gitlab.com. My reasoning is that having continuous integration out of the box will be a huge time saver; I don't want to spend time setting up AWS- or Azure-based proprietary tooling along with an external CI service. You can find the project here: https://gitlab.com/neilspink/aws-lambda-price-grabber (the naming could maybe have been better).

Setting up source control

I started by installing Git on my machine, on which I had just reinstalled Ubuntu 18.04.1 LTS:

:~$ sudo apt install git

Then you need to set up your details so that you can commit changes to your repository, e.g.

:~$ git config --global user.name "Neil Spink"
:~$ git config --global user.email "neilspink@gmail.com"

If you haven't used GitLab before, you might hit the same access error I did.

:~$ git clone git@gitlab.com:neilspink/aws-lambda-price-grabber.git
git@gitlab.com: Permission denied (publickey).
fatal: Could not read from remote repository.

You need to set up an SSH key, as documented here:
https://docs.gitlab.com/ee/ssh/README.html
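
In short (a rough sketch; the GitLab documentation above covers the details), you generate a key pair, copy the public part into your GitLab profile settings under SSH Keys, and test the connection:

:~$ ssh-keygen -t rsa -b 4096 -C "neilspink@gmail.com"
:~$ cat ~/.ssh/id_rsa.pub
:~$ ssh -T git@gitlab.com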

Have a look at this video for help.

The next step is to create your README file and push the changes back to GitLab. The instructions are provided on the project details page and start with going into the directory where the source code is:
touch README.md
git add README.md
git commit -m "add README"
git push -u origin master
You could edit your Python from the command line; I've done it many times using utilities like Nano or Vi. However, to make life easier I am going to use PyCharm (if you're on Ubuntu, you can install it from the software centre). When you first run the IDE, I would recommend installing Markdown support, because you need to maintain a good README file for your project.

The next thing you might want to do is create a .gitignore file for your project. Ignored files are usually build artefacts and machine-generated files that can be derived from your repository source or should otherwise not be committed. I mention this now in particular because as soon as you start working in PyCharm, it will ask whether it should add the .idea directory to source control. Your PyCharm settings are yours, so I would add the following two lines to the ignore file:
.DS_Store
.idea
In the next step, I am installing Python:
:~$ sudo apt install python3
I have a new version of Ubuntu, so I can install the Beautiful Soup library we need using the system package manager:
:~$ sudo apt-get install python3-bs4
Apparently, Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of other third-party parsers. The setup guide I read suggested "lxml" might be useful, so I installed it too; we may find out later whether we need it:
:~$ sudo apt-get install python3-lxml
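
As a quick sanity check (nothing more than that), you can parse a throwaway snippet with both parsers to confirm the installs worked:

from bs4 import BeautifulSoup

# Both lines should print 79.90 if the parsers are available.
print(BeautifulSoup('<p>79.90</p>', 'html.parser').p.get_text())
print(BeautifulSoup('<p>79.90</p>', 'lxml').p.get_text())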
A word of warning if you haven't done any web scraping before: a program can make requests to a web server much faster than a human and can easily get you into trouble in a number of ways. If you repeatedly crawl a site or page, your IP could become blocked (which could be a terrible problem if you're doing this from your work offices). Some websites do not allow you to scrape, scan or crawl them with any utility. For example, ricardo.ch, an online auction site, will recognise that you are using a program and tell you to stop crawling. There are tricks and ways around this, but you will likely be breaking the terms and conditions of legitimate use of their website.

The best way to avoid getting into trouble while you figure everything out is to download a complete copy of a webpage from the site you want to target, then work on that copy until you have perfected your program. So, let's get started. Open a browser and load a page you want to test your web scraper on; I want to extract prices from a well-known online shop. In the browser's File menu you should find an option to save the complete webpage, which stores it on your computer.
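
If you later want to script that one-off download instead of using the browser, a rough sketch using only the standard library could look like the following; the URL is a placeholder, the file path matches the one used in the code below, and the whole point is to fetch the page once and then work offline:

import urllib.request

# Placeholder URL - replace with the product page you want a copy of.
URL = 'https://www.example.com/some-product-page'

# Identify the script with a User-Agent header and fetch the page a single time.
request = urllib.request.Request(URL, headers={'User-Agent': 'price-grabber-dev'})
with urllib.request.urlopen(request) as response:
    html = response.read().decode('utf-8', errors='replace')

# Save a local copy to develop against, so the site isn't hit repeatedly.
with open('../test-data/pullover.html', 'w') as copy:
    copy.write(html)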

Here is the code from the program I showed in the video (the version in the video is Python 2.7; below it is adjusted for the Python 3 we installed above):

from bs4 import BeautifulSoup

# Read the saved copy of the product page into a single string.
with open('../test-data/pullover.html', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# Specifying the parser avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(data, 'html.parser')

# Select every <div> carrying the price's CSS classes and print its text.
for each_part in soup.select('div[class*="h-text h-color-red title-typo h-p-top-m"]'):
    print(each_part.get_text())

The code above extracts the price from a webpage I have saved on my computer. It searches for <div> tags whose class attribute contains "h-text h-color-red title-typo h-p-top-m". You can get more help on this on the Beautiful Soup documentation page by searching for '.select', which is about halfway down the page.
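
If the full class string ever feels brittle (shops do change their markup), a looser lookup is possible with find_all; this is just an alternative sketch that matches on a single class rather than the whole string:

# Alternative: match any <div> that carries the h-color-red class.
for tag in soup.find_all('div', class_='h-color-red'):
    print(tag.get_text(strip=True))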

In my next blog post, I will be testing this code on AWS Lambda and looking at setting up an automated deployment pipeline to the cloud platform.

Web server maintenance is an indispensable part of hosting. You need preventive security audits and add-ins.

Ubuntu / WordPress Server Maintenance

This post aims to cover the points I currently think are important if you are running your own web server. 

Security Patches

I would place a monthly reminder in your calendar to run updates on your server. This is a task that could perhaps be automated, but I haven't found a good way yet. You should run these commands:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade

Tips For Basic Operational Security

You should not allow the public to access the parts of your web server that show statistics, e.g. Webalizer. I've seen Viagra spam sites ping a site just to get their URL to appear in its statistics, which in turn gets indexed and ranked by Google.

If you are going to install a tool like phpMyAdmin, which I wouldn't, then you should make sure only your IP can access it and/or increase security by password-protecting the directory using Basic Auth.
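
For the Basic Auth part, the htpasswd utility from the apache2-utils package creates the password file (the path and username below are just examples); you then point the directory's Apache configuration at it with AuthType Basic and Require valid-user:

:~$ sudo apt install apache2-utils
:~$ sudo htpasswd -c /etc/apache2/.htpasswd youruser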

If you use WordPress then you should have a security add-in to help harden your site. I use "ShieldSecurity", which I like and would recommend. I find it alarming to see that after 420 days of operating my site there have been 20,000+ blocked login attempts.

20,000+ blocked login attempts in 420 days of operation

When you create administrator accounts for your WordPress site, I would make the username something random. The password needs to be long, and your security plugin should definitely offer reCAPTCHA or other means to slow down automated attacks.

Google ReCaptcha For Securing WordPress Logins
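
If you want a quick way to come up with such credentials, here is a small Python sketch using the standard library's secrets module (the username prefix and lengths are just examples):

import secrets
import string

# Characters to draw from: letters and digits, plus punctuation for the password.
alphabet = string.ascii_letters + string.digits

# A random-looking admin username and a long random password.
username = "wpadmin_" + "".join(secrets.choice(alphabet) for _ in range(8))
password = "".join(secrets.choice(alphabet + string.punctuation) for _ in range(32))

print(username)
print(password)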

Pentest Your Site 

Luckily, there are tools out there that you can try for free. If you have an online shop, this is the best $50 you could ever invest! Head over to https://pentest-tools.com and check your site.

List of some high-risk vulnerabilities on an Apache web server.

If you see any high-risk warnings, you must take immediate action. In this example I was surprised myself, as I had run the apt updates and upgrades on my Ubuntu server. Well, Ubuntu has a six-month release cycle; critical bug fixes do arrive sooner, but they still take time. The solution to this is a Personal Package Archive (PPA). This is a repository service hosted by Canonical (the company behind Ubuntu) that allows developers and enthusiasts to offer up-to-date versions of software to all Ubuntu users.

sudo add-apt-repository ppa:ondrej/apache2
sudo apt update
sudo apt upgrade

Before adding just any PPA, I would suggest investigating it a little first. In this case you'll find Ondrej listed among the maintainers at https://packages.ubuntu.com/cosmic/apache2

Web Server Logs

Seeing evidence in the logs that hackers are trying to break in doesn't mean you have been hacked. I would venture a look every now and then. For an Apache server the logs are typically found under the /var/log/apache2 directory.

If you're on AWS like me, you will need to use PuTTY or SSH to get onto the server and navigate to where the log files are. Maybe you have a lot of log files too, but don't worry, I have a trick to help you scan them.

List of Apache log files in the /var/log/apache2 directory

The following command scans the GZ-compressed files for keywords; in this example it's the wp-admin area.

find -name access.log.\*.gz -print0 | xargs -0 zgrep "wp-admin"

Log of hackers calling /wp-admin/setup-config.php?step=0

You can see someone is testing whether the server will let them run the setup wizard.

A surefire indication that hackers are probing, or have even hacked, your server is seeing loads of entries with referrer URLs ending in .ru, like this

"http://viagra-blah-blah.ru/

General keywords I might look for in my logs include the following (a scripted version follows the list)

  • .cgi
  • wp-admin
  • admin
  • 404
  • passwd
  • .tables
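
If you would rather script that search over the rotated, compressed logs, here is a minimal Python sketch; it assumes the standard /var/log/apache2 layout and simply reuses the keyword list above (note that a plain "404" match will be fairly noisy):

import glob
import gzip

# Keywords worth flagging, taken from the list above.
KEYWORDS = [".cgi", "wp-admin", "admin", "404", "passwd", ".tables"]

# Scan every rotated, gzip-compressed access log for the keywords.
for path in sorted(glob.glob("/var/log/apache2/access.log.*.gz")):
    with gzip.open(path, "rt", errors="replace") as log:
        for line in log:
            if any(keyword in line for keyword in KEYWORDS):
                print(path, line.strip())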

If you want to dig deeper, have a look at the blog post "Looking for hacking activity in Apache Logs".

When you add ShieldSecurity, it will guide you through setup using its wizard. Two important features are automated updates, which ensure your WordPress is running with the latest patches, and the login protection. I chose to add the Google reCAPTCHA.