Deploying AWS CloudFormation through GitLab

Lambda web scraper project (week 5)

What Happened Last Week

I've continued building Python-based Lambda functions on Amazon in a GitLab project aws-lambda-price-grabber...

In the first part of the week I created the AWS CloudFormation scripts needed to recreate this solution. I wanted to get the tool to save prices, but creating the template took more time than I expected.

I also noticed I had neglected to unit test some parts and started refactoring so that I could write some new tests. A code coverage utility might have highlighted this earlier! I am learning Python as I go, and quickly ran into a problem with my new unit tests: they were not running. Lesson learnt: not only does the filename of a unit test need to start with test_, the method names must start with it too.
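Here is a minimal sketch of that naming rule using the standard unittest module, whose loader only collects methods starting with "test" by default. The class and method names are made up for illustration and are not taken from the project.

```python
import unittest


class PriceParserTests(unittest.TestCase):
    def test_price_is_parsed(self):
        # Collected: the method name starts with "test".
        self.assertEqual("CHF 105.00".split()[1], "105.00")

    def check_currency_is_parsed(self):
        # Silently skipped by the loader: no "test" prefix.
        self.fail("this never runs")


loader = unittest.TestLoader()

# Ask the loader which methods it considers to be tests.
collected = loader.getTestCaseNames(PriceParserTests)
print(collected)

# Run the suite; only the collected method is executed.
suite = loader.loadTestsFromTestCase(PriceParserTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)
```

The badly named method does not fail, it simply never runs, which is why the problem is easy to miss.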

Things I had done in previous weeks, like downloading the libraries, zipping and manually uploading, are no longer required, so I changed my scripts in the build directory. One of the other things I did was to add an open-source license to the project.

Setting up using the AWS CloudFormation Stack

A stack is a collection of AWS resources that you can manage as a single unit. This allows you to create a template for a solution, which is what I did last week.

There are 2 ways to set this up. You can run my deployment/aws-create-foundation.json template in the CloudFormation console. Or, having created a new user with administrator-level access and access keys, you can do it via the AWS CLI. I recommend the CLI: if you are going to install this, you are likely to want to make changes, and the CLI will help you do everything faster. The first command below creates the stack and the second removes it again:

aws cloudformation create-stack --stack-name grabber-foundation  --template-body file://./deployment/aws-create-foundation.json --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=S3BucketName,ParameterValue=aws-lambda-price-grabber

aws cloudformation delete-stack --stack-name grabber-foundation

The parameter value aws-lambda-price-grabber needs to be changed. Amazon S3 bucket names are globally unique, so no two people can use the same name.

One problem I was not able to solve programmatically is updating the S3 bucket name, which appears in both the CI/CD pipeline and the CloudFormation template for creating the Lambda functions. You'll need to change it there by hand.

Once the above commands are run and the files updated, then you simply need to run the CI/CD pipeline.
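One way the bucket name could be updated in those files is a plain search and replace. This is only a sketch under assumptions: the two file paths are my guess at where the name appears, and my-unique-price-grabber is a made-up bucket name. It is not part of the project.

```python
from pathlib import Path


def replace_bucket_name(paths, old_name, new_name):
    """Replace old_name with new_name in each file; return the files changed."""
    changed = []
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        if old_name in text:
            path.write_text(text.replace(old_name, new_name), encoding="utf-8")
            changed.append(str(path))
    return changed


if __name__ == "__main__":
    # Assumed locations of the bucket name; adjust to your checkout.
    files = [".gitlab-ci.yml", "deployment/aws-create-lambda.json"]
    existing = [f for f in files if Path(f).exists()]
    # "my-unique-price-grabber" is a hypothetical bucket name.
    print(replace_bucket_name(existing, "aws-lambda-price-grabber",
                              "my-unique-price-grabber"))
```

Because this replaces every occurrence of the text, check the diff before committing: the default bucket name is the same as the project name, so unrelated references could be changed too.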


Lessons learnt while creating a CloudFormation template

(1). You don't have to learn to create the CloudFormation (CF) scripts by hand. There is a CloudFormer tool that can reverse engineer the resources you have set up. I decided not to use it because it requires an EC2 machine, which would cost money. In retrospect, it probably would have been more cost efficient to try it, because it took me about 9 hours to figure out everything needed for my stack template.

(2). I found the AWS role documentation confusing, in particular, what the AssumeRolePolicyDocument was all about:

"CFNRole": {
  "Type": "AWS::IAM::Role",
  "Properties": {
    "RoleName": "AssumeRolePolicyDocument",
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "lambda.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  }
}

The content for the CloudFormation AssumeRolePolicyDocument attribute of a role can be found on the IAM screen under the tab "Trust Relationships". Click the blue "edit" button, then copy and paste the policy into your template.


(3). The last lesson I want to share is about linking multiple policies with a role. In my CloudFormation template, I couldn't get that to work!

Over the last couple of weeks, while setting up the role for my Lambda functions, I had been creating a separate policy for each kind of thing I wanted them to do:

  • Read/write to the S3 bucket.
  • Create log entries.
  • Execute other functions.

I liked that because you then know exactly what the purpose of each set of permissions is. Unfortunately, when running my CloudFormation template I found that only one of the policies I wanted was ever hooked up to my new role. So I merged the permissions into one policy.
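A merged policy can still document the purpose of each group of permissions by giving every statement its own Sid. The sketch below builds such a policy document in Python and prints it as JSON. The Sid names, the exact list of actions and the resource ARNs are my assumptions for illustration, not the policy from the project.

```python
import json

# Assumption: replace with your own globally unique bucket name.
BUCKET = "aws-lambda-price-grabber"

merged_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Purpose 1: read/write to the S3 bucket.
            "Sid": "ReadWritePriceBucket",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::%s/*" % BUCKET,
        },
        {
            # Purpose 2: create log entries in CloudWatch Logs.
            "Sid": "WriteLogEntries",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:*:*:*",
        },
        {
            # Purpose 3: execute other Lambda functions.
            # "*" is used for brevity; a function ARN would be tighter.
            "Sid": "InvokeOtherFunctions",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(merged_policy, indent=2)
print(policy_json)
```

The printed JSON is the kind of document that would go under PolicyDocument in the template.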


Coming Next

I want to finish unit testing the Lambda function that invokes the grabber, get the solution to save the prices it is getting, and send alerts.

AWS Lambda Python-based web scraper query needed adaptation

Lambda web scraper project (week 4)

What Happened Last Week

With an automated deployment system in place for my price scraper, I scheduled the Lambda function to run every 2 hours and quickly discovered I needed to get alerts when it is not getting the prices. With CloudWatch alarms this is quick to set up, but unfortunately I haven't found the time to create the CloudFormation scripts for these yet.
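For reference, a 2-hour schedule could be codified roughly like this. It is a sketch, not something from the project: the logical IDs GrabSchedule, GrabSchedulePermission and GrabberFunction are hypothetical, and GrabberFunction is assumed to be an AWS::Lambda::Function defined elsewhere in the same template. The Python code only builds the fragment and prints it as JSON.

```python
import json

# Hypothetical logical ID of the Lambda function resource in the template.
FUNCTION_ID = "GrabberFunction"

schedule_resources = {
    "GrabSchedule": {
        "Type": "AWS::Events::Rule",
        "Properties": {
            "Description": "Run the price grabber every 2 hours",
            "ScheduleExpression": "rate(2 hours)",
            "State": "ENABLED",
            "Targets": [
                {
                    "Arn": {"Fn::GetAtt": [FUNCTION_ID, "Arn"]},
                    "Id": "GrabberTarget",
                }
            ],
        },
    },
    # The rule also needs permission to invoke the function.
    "GrabSchedulePermission": {
        "Type": "AWS::Lambda::Permission",
        "Properties": {
            "FunctionName": {"Ref": FUNCTION_ID},
            "Action": "lambda:InvokeFunction",
            "Principal": "events.amazonaws.com",
            "SourceArn": {"Fn::GetAtt": ["GrabSchedule", "Arn"]},
        },
    },
}

fragment_json = json.dumps(schedule_resources, indent=2)
print(fragment_json)
```

The two entries would go into the Resources section of a template. The alarms are not covered here.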

I improved the query that finds the information in the HTML page and also added a new invoker program which triggers the web scraping. The new program is now able to look at the prices from any number of pages and sites.

Improving my HTML parse with lxml

The web scraper was throwing exceptions on the first night of operation.

Couldn't find string in HTML: //div[@class='h-text h-color-black title-typo h-p-top-m']/text(): Exception Traceback (most recent call last): File "/var/task/lambda_function.py"...

It turns out the online shop I'm interested in likes to change both the price and the text colour, from black to red, regularly.

Parsing HTML example div class h-text h-color-red
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-red title-typo h-p-top-m">CHF 90.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>
    <span as="span" class="h-text h-color-red detail h-p-right-s  h-p-top-xs">14% sparen</span>
    <span as="span" class="h-text h-color-black detail h-strike h-p-top-s  h-p-right-s">CHF 105.00</span>
Parsing HTML example div class h-text h-color-black
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-black title-typo h-p-top-m">CHF 105.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>

Time to look at the HTML again... The solution is the DIV tag above the price with the CLASS attribute "h-product-price". This is typical with a lot of online shops and it makes parsing for data easier. We need to query by that tag and get the price out, but how?

I wrote a test program to see what came out of lxml:

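import lxml.html

# Load the saved product page used as test data
with open('../test-data/shirt.html', 'r') as myfile:
    result = myfile.read().replace('\n', '')

# Parse the page and select every div whose class contains 'h-product-price'
doc = lxml.html.document_fromstring(result)
result = doc.xpath("//div[contains(@class, 'h-product-price')]")

# Parentheses make the print work on both Python 2 and Python 3
print(result[0])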

It returned an element object rather than the price:

<Element div at 0x7f004d244158>

To be honest, at this point I had one look at the developing with lxml documentation, and after 5 minutes of searching for terms like wildcard * I gave up.

The hot tip came after I searched Google with the data my program had output.


This came back with Stack Overflow, which needs no introduction. It didn't directly answer my question, BUT the source code I saw gave me the answer. Look at the /div/div/.. part: it reads just like a directory structure would, e.g. c:/users/neil/My Documents.

Time to unit test the new query. Have a look at the 2 ways you can do this:
    def test_parse_html_way1(self):
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[@class='h-text h-color-black title-typo h-p-top-m']/text()")

        self.assertEqual(data, u'CHF\xa0105.00')

    def test_parse_html_way2(self):
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[contains(@class, 'h-product-price')]/div/text()")

        self.assertEqual(data, u'CHF\xa0105.00')
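To see why the second query survives the colour change, here is a small self-contained sketch that runs both XPath queries against a red and a black price. The HTML strings are shortened versions of the snippets above, and first_price is a helper I made up for this example; it is not the project's parse_html.

```python
import lxml.html

# Shortened sample markup: same structure, price shown in red or black.
RED = """
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-red title-typo h-p-top-m">CHF 90.00
        <span class="h-text h-color-dark-grey detail">inkl. MwSt.</span>
    </div>
</div>
"""
BLACK = """
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-black title-typo h-p-top-m">CHF 105.00
        <span class="h-text h-color-dark-grey detail">inkl. MwSt.</span>
    </div>
</div>
"""

# Way 1: exact match on the full class string (breaks when the colour changes).
EXACT = "//div[@class='h-text h-color-black title-typo h-p-top-m']/text()"
# Way 2: find the stable outer div, then step down one level like a directory.
ROBUST = "//div[contains(@class, 'h-product-price')]/div/text()"


def first_price(html, query):
    """Return the first non-empty text node matched by query, or None."""
    doc = lxml.html.document_fromstring(html)
    texts = [t.strip() for t in doc.xpath(query) if t.strip()]
    return texts[0] if texts else None


for name, html in (("red", RED), ("black", BLACK)):
    print(name, first_price(html, EXACT), first_price(html, ROBUST))
```

The exact-class query only finds the price while it is black, whereas the contains() query finds it in both cases.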

Changes to the architecture

I spent a few hours on Sunday improving the web scraper by making it possible to crawl multiple web pages. Here is a high-level diagram to help you understand the program.

[Diagram: price grabber components: Lambda functions, CloudWatch alarms and S3 buckets]

The arrows show system dependencies. For example, the grab_invoke program needs the S3 bucket with the list of sites to scan and it needs the program to get/grab the prices too.


I'm beginning to understand my Infrastructure as Code (IaC) scripts, but I realize this is not trivial for someone new. It's best to download the solution if you want to understand this: https://gitlab.com/neilspink/aws-lambda-price-grabber

We now have 2 serverless functions: grab_invoke and grab_price. I had to extend the CloudFormation script aws-create-lambda.json in the deployment directory. The GitLab deployment script .gitlab-ci.yml was modified to upload the new website-monitor-list.
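The relationship between the two functions can be sketched as below. This is not the project's code: the S3 key website-monitor-list.json, the JSON list format of that file and the helper name invoke_grabbers are all assumptions for illustration. The boto3 calls used (get_object and invoke) are the standard client methods, and the clients are passed in so the logic can be tried without AWS access.

```python
import json


def invoke_grabbers(s3_client, lambda_client, bucket, key,
                    function_name="grab_price"):
    """Read the site list from S3 and start one asynchronous grab per entry."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    sites = json.loads(body)  # assumed format: a JSON list, one item per page
    for site in sites:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous, do not wait for the result
            Payload=json.dumps(site).encode("utf-8"),
        )
    return len(sites)


def lambda_handler(event, context):
    # Imported here so the helper above can be tested without boto3 installed.
    import boto3

    count = invoke_grabbers(
        boto3.client("s3"),
        boto3.client("lambda"),
        bucket="aws-lambda-price-grabber",  # replace with your bucket name
        key="website-monitor-list.json",    # hypothetical object key
    )
    return {"invoked": count}
```

Invoking with InvocationType "Event" means the invoker does not wait for each page to be scraped, which is one reason the role needs permission to execute other functions.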

I quickly found the GitLab pipeline failing again and had to start using the AWS CLI to test my scripts. I've decided to add these commands to source control in the development directory, in the file aws-cli-commands.md. I'm fairly sure they will be needed again in the near future.


Coming Next

I'd like to be able to recreate the solution at the touch of a button, so I need to codify the CloudWatch schedule and alarms. The solution should also save the prices somewhere.