The HTML query in my AWS Lambda Python-based web scraper needed adaptation.

Lambda web scraper project (week 4)

What Happened Last Week

With an automated deployment system in place for my price scraper, I scheduled the Lambda function to run every 2 hours and quickly discovered I needed alerts whenever it fails to fetch the prices. The CloudWatch alarms were set up rapidly, but unfortunately I haven't found the time to create the CloudFormation scripts for them yet.
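I haven't codified the alarms yet, but as a note to my future self, a CloudFormation resource for an error alarm on a Lambda function could look roughly like this (the resource and function names here are placeholders, not the ones from my actual template):

```json
{
  "GrabPriceErrorAlarm": {
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
      "AlarmDescription": "Alert when the price scraper reports errors",
      "Namespace": "AWS/Lambda",
      "MetricName": "Errors",
      "Dimensions": [
        { "Name": "FunctionName", "Value": "grab_price" }
      ],
      "Statistic": "Sum",
      "Period": 7200,
      "EvaluationPeriods": 1,
      "Threshold": 1,
      "ComparisonOperator": "GreaterThanOrEqualToThreshold"
    }
  }
}
```

The 7200-second period matches the 2-hour schedule, so one evaluation covers one scraping run.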

I improved the query that extracts the information from the HTML page and added a new invoker program that triggers the web scraping. The scraper can now look at prices from any number of pages and sites.

Improving my HTML parsing with lxml

The web scraper was throwing exceptions on the first night of operation.

Couldn't find string in HTML: //div[@class='h-text h-color-black title-typo h-p-top-m']/text(): Exception Traceback (most recent call last): File "/var/task/"...

It turns out the online shop I'm interested in likes to change both the price and the text colour from black to red regularly.

Parsing HTML example: div with class h-text h-color-red
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-red title-typo h-p-top-m">CHF 90.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>
    <span as="span" class="h-text h-color-red detail h-p-right-s  h-p-top-xs">14% sparen</span>
    <span as="span" class="h-text h-color-black detail h-strike h-p-top-s  h-p-right-s">CHF 105.00</span>
Parsing HTML example: div with class h-text h-color-black
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-black title-typo h-p-top-m">CHF 105.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>

Time to look at the HTML again... The solution is the div above the price with the class attribute "h-product-price". This pattern is typical of a lot of online shops and it makes parsing for data easier. We need to query by that tag and get the price out, but how?

I wrote a test program to see what came out of lxml:

import lxml.html

with open('../test-data/shirt.html', 'r') as myfile:  # dev server
    result = myfile.read().replace('\n', '')

doc = lxml.html.document_fromstring(result)
result = doc.xpath("//div[contains(@class, 'h-product-price')]")

print(result[0])

It returned the data:

<Element div at 0x7f004d244158>

To be honest, at this point I had one look at the lxml documentation and, after 5 minutes of searching for terms like wildcard *, I gave up.

The hot tip came after I searched Google with the data my program had output:

Searching Google for lxml help

This came back with Stack Overflow, which needs no introduction. It didn't directly answer my question, BUT the source code I saw gave me the answer. Look at the /div/div/..: it looks just like a directory structure would, e.g. c:/users/neil/My Documents.

lxml xpath is like directories on your computer
Time to unit test the new query. Have a look at the two ways you can do this:
    def test_parse_html_way1(self):
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[@class='h-text h-color-black title-typo h-p-top-m']/text()")

        self.assertEqual(data, u'CHF\xa0105.00')

    def test_parse_html_way2(self):
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[contains(@class, 'h-product-price')]/div/text()")

        self.assertEqual(data, u'CHF\xa0105.00')
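For completeness, here is a minimal sketch of what a parse_html helper like the one the tests exercise might look like. The real bot.Crawler isn't shown in this post, so this is an assumption about its behaviour: take raw HTML plus an XPath expression and return the first match.

```python
import lxml.html


def parse_html(page, query):
    """Parse raw HTML and return the first result of the XPath query, or None."""
    doc = lxml.html.document_fromstring(page)
    results = doc.xpath(query)
    return results[0] if results else None


if __name__ == '__main__':
    html = '<div class="h-product-price"><div>CHF 105.00</div></div>'
    print(parse_html(html, "//div[contains(@class, 'h-product-price')]/div/text()"))
```

Returning the first match keeps the calling code simple, at the cost of silently ignoring duplicates; for a price scraper that trade-off seems fine.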

Changes to the architecture

I spent a few hours on Sunday improving the web scraper by making it possible to crawl multiple web pages. Here is a high-level diagram to help you understand the program.

Price grabber components: Lambda functions, CloudWatch alarms and S3 buckets

The arrows show system dependencies. For example, the grab_invoke program needs the S3 bucket with the list of sites to scan, and it needs the program that grabs the prices too.
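As a rough sketch of how an invoker like grab_invoke could work: read the site list from S3, then fire one asynchronous invocation of the price scraper per site. The payload shape and the helper names below are my illustration, not the actual source code.

```python
import json


def load_site_list(raw):
    """Parse the website-monitor-list: one URL per line, blank lines ignored."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def invoke_grab_price(lambda_client, url):
    """Fire-and-forget: ask the grab_price Lambda to scrape one site."""
    lambda_client.invoke(
        FunctionName='grab_price',
        InvocationType='Event',  # asynchronous, don't wait for the result
        Payload=json.dumps({'url': url}),
    )
```

In the real handler, lambda_client would be a boto3 Lambda client, and the raw list would come from an S3 get_object call on the website-monitor-list bucket.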


I'm beginning to understand my Infrastructure as Code (IaC) scripts, but I realize this is not trivial for someone new. It's best to download the solution if you want to understand this:

We now have 2 serverless functions: grab_invoke & grab_price. I had to extend the CloudFormation script aws-create-lambda.json in the deployment directory, and the GitLab deployment script .gitlab-ci.yml was modified to upload the new website-monitor-list.

I quickly found the GitLab pipeline failing again and had to start using the AWS CLI to test my scripts. I've decided to add these commands to source control in a file in the development directory; I'm fairly sure they will be needed again in the near future.
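For reference, the kind of commands I'm keeping are along these lines (the function name matches my setup; yours will differ):

```shell
# Validate the CloudFormation template before pushing it through the pipeline
aws cloudformation validate-template \
    --template-body file://deployment/aws-create-lambda.json

# Invoke the function directly and inspect its output
aws lambda invoke \
    --function-name grab_invoke \
    --payload '{}' \
    out.json && cat out.json
```

Running validate-template locally catches JSON mistakes in seconds instead of waiting for a full pipeline run to fail.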


Coming Next

I'd like to be able to recreate the solution at the touch of a button, so I need to codify the CloudWatch schedule and alarms. The solution should also save the prices somewhere.