AWS Lambda Python-based web scraper: query needed adaptation

Lambda web scraper project (week 4)

What Happened Last Week

With an automated deployment system in place for my price scraper, I scheduled the Lambda function to run every 2 hours and quickly discovered I needed alerts for when it isn't getting the prices. CloudWatch alarms were quick to set up, but unfortunately I didn't find the time to create the CloudFormation scripts for these yet.

I improved the query that finds the information in the HTML page and also added a new invoker program, which triggers the web scraping. The scraper is now able to pull prices from any number of pages and sites.

Improving my HTML parsing with lxml

The web scraper was throwing exceptions on the first night of operation.

Couldn't find string in HTML: //div[@class='h-text h-color-black title-typo h-p-top-m']/text(): Exception Traceback (most recent call last): File "/var/task/lambda_function.py"...

It turns out the online shop I'm interested in likes to change both the price and the text colour from black to red regularly.

Parsing HTML example: div with class "h-text h-color-red" (discounted price)
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-red title-typo h-p-top-m">CHF 90.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>
    </div>
    <span as="span" class="h-text h-color-red detail h-p-right-s  h-p-top-xs">14% sparen</span>
    <span as="span" class="h-text h-color-black detail h-strike h-p-top-s  h-p-right-s">CHF 105.00</span>
</div>
Parsing HTML example: div with class "h-text h-color-black" (regular price)
<div class="h-product-price h-m-bottom-xl topSection">
    <div class="h-text h-color-black title-typo h-p-top-m">CHF 105.00
        <span as="span" class="h-text h-color-dark-grey detail h-p-left-s h-normal-weight">inkl. MwSt.</span>
    </div>
</div>

Time to look at the HTML again... The solution is the div tag above the price with the class attribute "h-product-price". This kind of stable wrapper is typical of a lot of online shops, and it makes parsing for data easier. We need to query by that tag and get the price out, but how?

I wrote a test program to see what came out of lxml:

import lxml.html

# Load the saved test page and strip the newlines.
with open('../test-data/shirt.html', 'r') as myfile:
    result = myfile.read().replace('\n', '')

# Query on the stable parent div rather than the colour-specific class.
doc = lxml.html.document_fromstring(result)
result = doc.xpath("//div[contains(@class, 'h-product-price')]")

print result[0]

It returned the following:

<Element div at 0x7f004d244158>

To be honest, at this point I had one look at the lxml documentation on developing, and after 5 minutes of searching for terms like the wildcard *, I gave up.


The hot tip came after I searched Google with the output my program had produced:

Searching Google for lxml help

The top result was Stack Overflow, which needs no introduction. It didn't directly answer my question, BUT the source code I saw there gave me the answer. Look at the /div/div/.. in the XPath; it reads like a directory structure would, e.g. C:/Users/neil/My Documents.

lxml xpath is like directories on your computer
Time to unit test the new query. Have a look at the two ways you can do this:
    def test_parse_html_way1(self):
        # Way 1: match the exact class string on the price div. This is
        # brittle; it breaks when the shop switches the colour class.
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[@class='h-text h-color-black title-typo h-p-top-m']/text()")

        self.assertEqual(data, u'CHF\xa0105.00')

    def test_parse_html_way2(self):
        # Way 2: anchor on the stable parent div "h-product-price" and
        # take the text of its child div. This survives colour changes.
        web = bot.Crawler()
        page = self.test_get_shirt_html()
        data = web.parse_html(page, "//div[contains(@class, 'h-product-price')]/div/text()")

        self.assertEqual(data, u'CHF\xa0105.00')
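
For context, here is a minimal sketch of what a parse_html method like the one exercised above could look like. This is only an illustration assuming lxml and an XPath query that selects text nodes; the actual implementation is in the repository:

import lxml.html

class Crawler(object):
    def parse_html(self, page, query):
        # Parse the raw HTML and run the caller's XPath query.
        doc = lxml.html.document_fromstring(page)
        matches = doc.xpath(query)
        if not matches:
            raise Exception("Couldn't find string in HTML: %s" % query)
        # Assumes the query selects text nodes; trim the whitespace
        # around the price, keeping the non-breaking space inside it.
        return matches[0].strip()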

Changes to the architecture

I spent a few hours on Sunday improving the web scraper by making it possible to crawl multiple web pages. Here is a high-level diagram to help you understand the program.

Price grabber components: Lambda functions, CloudWatch alarms and S3 buckets

The arrows show system dependencies. For example, the grab_invoke program needs the S3 bucket with the list of sites to scan, and it also needs the grab_price program to fetch the prices.
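
As an illustration, a grab_invoke handler along these lines could read the site list from S3 and fan out one invocation per site. The bucket name, key and payload format here are assumptions for the sketch, not the repository's exact code:

import json
import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

def lambda_handler(event, context):
    # Fetch the list of sites/queries to monitor.
    obj = s3.get_object(Bucket='aws-lambda-price-grabber',
                        Key='website-monitor-list')
    sites = json.loads(obj['Body'].read())

    # Trigger the price grabber asynchronously, once per site.
    for site in sites:
        lambda_client.invoke(FunctionName='aws-lambda-price-grabber',
                             InvocationType='Event',
                             Payload=json.dumps(site))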

Deployment

I'm beginning to understand my Infrastructure as Code (IaC) scripts, but I realize this is not trivial for someone new. It's best to download the solution if you want to understand this: https://gitlab.com/neilspink/aws-lambda-price-grabber

We now have two serverless functions: grab_invoke and grab_price. I had to extend the CloudFormation script aws-create-lambda.json in the deployment directory. The GitLab deployment script .gitlab-ci.yml was modified to upload the new website-monitor-list.

I quickly found the GitLab pipeline failing again and had to start using the AWS CLI to test my scripts. I've decided to add these commands to source control in the development directory, under the file aws-cli-commands.md. I'm fairly sure they will be needed again in the near future.

Coming Next

I'd like to be able to recreate the solution at the touch of a button, so I need to codify the CloudWatch schedule and alarms. The solution should also save the prices somewhere.

AWS Lambda Python-based web scraper deployed using GitLab

Lambda web scraper project (week 3)

What Happened Last Week

Having already created a Python crawler in the first week and then written unit tests and automated them with a GitLab CI/CD pipeline (week 2), last week was about deploying updates to the AWS cloud platform. This is quite a long post documenting what I learned in the third week of my project. You should note that I haven't developed much with either Lambda functions or Python, and I had never used GitLab on another project before.

In my first attempts at setting up the deployment jobs, I was uploading all the source code to Amazon S3 without zipping it, only to realize that AWS Lambda wants a ZIP file and that a function does not automatically update when you upload a new ZIP. I still don't get why Amazon suggests using an S3 bucket, then copies the code somewhere else and doesn't watch for updates; that would make it so much easier and more intuitive to use.

I figured out that I needed to run an AWS CloudFormation script to provision and update my Lambda function. Infrastructure as Code (IaC) makes everything repeatable, and you don't want to skimp on this bit: you never know when a data centre will go down or someone will accidentally break your configuration. The CloudFormation script won't be portable to another cloud platform like Google GCP, which I will be looking at in the near future; for that we would need something more generic like Terraform, but that would require a virtual machine and add unnecessary complexity at this time.

Actually, it took me slightly over a week, but I have written my GitLab jobs to package and deploy; you can see the stages in the diagram below.

GitLab pipeline: commit, build, test and deploy to AWS

You can find the source code for this project here and might want to look at the tag for week3:

https://gitlab.com/neilspink/aws-lambda-price-grabber

Package

On GitLab, the CI/CD pipeline jobs are configured in the .gitlab-ci.yml file. Here is the job I added (note that you have to add the name "package" to the stages list at the top of the file). Unwanted unit test files are removed before everything is zipped up, making the program ready to be uploaded to an AWS S3 bucket, from where it can be loaded into a Lambda function.

package:
  image: buildpack-deps
  stage: package
  script:
    - rm source/test_*
    - apt-get update
    - apt-get -y install zip unzip
    - cd source
    - zip -r source.zip *
  artifacts:
    paths:
      - ./

Up to now my jobs have all been using the python:2.7 image; here you can see I'm using buildpack-deps, which lets me install zip to make the package. It took me a while to figure this out, because nobody was writing much about CI/CD deployment pipelines to the AWS cloud and Lambda functions when I searched for help on the subject.

I found that in the GitLab documentation you need to read and scroll down a long way to get to the important information, like what I needed on using Docker images, and I still didn't find it that helpful, as the information is interlaced with other tasks. Docker Hub images are a very cool feature, but they also bring certain security risks; a few of the blog posts I saw used custom images. Would you trust a stranger to carry your money? No, you wouldn't. You need to be very careful with Docker images: there have already been cases of hacked images, and a nice how-to blog is a good way to get people to use them, too.

Before I figured out how to zip my own package in the job, I was downloading the artifacts ZIP that GitLab creates. The tricky part was getting the token to be able to download it; I'm providing it here for posterity, in case you need it:

deploy:
  image: python:latest
  stage: deploy
  script:
    - curl --header "JOB-TOKEN: $CI_JOB_TOKEN" -o artifacts.zip https://gitlab.com/neilspink/aws-lambda-price-grabber/-/jobs/$CI_JOB_NAME/artifacts/download?job=$CI_JOB_NAME
    - pip install awscli
    - aws s3 cp ./source.zip s3://$S3_BUCKET/

Deploy

There are a few steps to infrastructure deployment on AWS using CloudFormation. There are two key skills you need to have or acquire: using the AWS Command Line Interface (CLI) and AWS Identity and Access Management (IAM). The CLI is for running the deployments; IAM you will need in order to give your deployment system the right level of access, because you really shouldn't be giving it full admin access to your cloud.

I spent some time digging around the web trying to find the right templates, wondering whether I needed to learn to make my own and thinking I would lose too much time doing so. In the end, I had no choice but to read the documentation and figure it out myself. I'm going to break it down in the order you need to do things. Note that IAM configuration is iterative; you need to go back again and again. My current IAM access policy, for example, has overly broad permissions, and I will be fixing that in the coming week.

S3 Bucket

As a first step, I needed to upload the source code from GitLab to AWS, so I created an S3 bucket. There is a comprehensive wizard on AWS to guide you through the creation; I chose all the default options. I will be adding this step to my CloudFormation script in a coming week.
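
If you prefer to script this step instead of using the wizard, a boto3 call along these lines would create an equivalent bucket; this is just a sketch, as I used the console defaults:

import boto3

s3 = boto3.client('s3')
# Buckets outside us-east-1 need an explicit location constraint.
s3.create_bucket(Bucket='aws-lambda-price-grabber',
                 CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})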

IAM Roles

This step can be done before or while you create a Lambda function. Even though I documented it in a previous week, I'm listing it here because later you will need the role's ARN for the CloudFormation script. You get the ARN from the top of the IAM role screen.

My decision on what my Lambda function can do on my AWS infrastructure might not be the same as yours. I want it to be able to create and write logs eventually, so I have provided for that. You can use a wizard to create the role, or on the create screen you can copy and paste in this JSON policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
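
The same role could also be created programmatically. Here is a rough boto3 sketch, assuming the policy above is saved as policy.json and reusing the lambda-website-checker role name that appears later in this post:

import json
import boto3

iam = boto3.client('iam')

# The trust policy: Lambda must be allowed to assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(RoleName='lambda-website-checker',
                AssumeRolePolicyDocument=json.dumps(trust))

# Attach the logging policy shown above.
with open('policy.json') as f:
    iam.put_role_policy(RoleName='lambda-website-checker',
                        PolicyName='lambda-logging',
                        PolicyDocument=f.read())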

Lambda Function

Manually creating the function and then using a ZIP from the S3 bucket is an important step before going on to automation. I encourage you to do it before moving on, too.

Pasting an S3 link URL to your lambda function code .zip.

IAM Technical User

Your GitLab deployment pipeline needs access to the AWS cloud. You will also need programmatic access keys if you're going to use the CLI as described in the next steps. I am calling this a technical user. You'll need to go into the IAM screen and create a new user, specifying that it's for programmatic access.

You will also need to create a group, if you don't already have one, for the purpose of deploying stuff. A group needs an access policy, which I found myself updating at every step of the way. Granting the minimum set of permissions is hard if, like me, you haven't done much of this before; I'm sure this task has scared more than a few people off the cloud platform.
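
If you want to script this part too, a boto3 sketch of the user and group setup could look like this; the user and group names are illustrative:

import boto3

iam = boto3.client('iam')

# Technical user for the GitLab pipeline.
iam.create_user(UserName='gitlab-deployer')

# Programmatic access keys, to be stored as GitLab CI/CD variables.
keys = iam.create_access_key(UserName='gitlab-deployer')
print keys['AccessKey']['AccessKeyId']

# Group that carries the deployment policy summarised below.
iam.create_group(GroupName='deployers')
iam.add_user_to_group(GroupName='deployers', UserName='gitlab-deployer')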

Here is my AWS policy, which I'll try to summarise:

  • S3
    • Actions: GetObject, PutObject, DeleteObject, ListBucket
    • Resource: all objects in the bucket aws-lambda-price-grabber
    • Why: to upload the ZIP file from GitLab to Amazon
  • IAM
    • Action: PassRole
    • Resource: the ARN of the role created in the previous step (get it from the IAM screen)
    • Why: the CloudFormation script creates a Lambda function, and it needs the role to execute under
  • CloudFormation
    • Action: currently ALL
    • Resource: restricted to the region Frankfurt (eu-central-1)
    • Why: to create a Lambda function, but the permissions given are too broad

Some permissions group together nicely and others don't, because actions have nuances; these are best seen in the IAM policy editor. This is my current JSON policy; I am unhappy with the CloudFormation part, but at least it works:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "iam:PassRole",
                "cloudformation:DeleteStack",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:cloudformation:eu-central-1:*:stack/*/*",
                "arn:aws:s3:::aws-lambda-price-grabber/*",
                "arn:aws:iam::385753165070:role/lambda-website-checker"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateUploadBucket",
                "lambda:CreateFunction",
                "cloudformation:ListExports",
                "cloudformation:ListStacks",
                "cloudformation:ListImports",
                "lambda:InvokeFunction",
                "lambda:GetFunction",
                "lambda:UpdateFunctionConfiguration",
                "cloudformation:GetTemplateSummary",
                "lambda:UpdateAlias",
                "cloudformation:EstimateTemplateCost",
                "lambda:UpdateFunctionCode",
                "cloudformation:DescribeAccountLimits",
                "lambda:PublishVersion",
                "lambda:DeleteFunction",
                "cloudformation:ValidateTemplate"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::aws-lambda-price-grabber"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "cloudformation:*",
            "Resource": [
                "arn:aws:cloudformation:eu-central-1:*:stack/*/*",
                "arn:aws:cloudformation:eu-central-1:*:stackset/*:*"
            ]
        }
    ]
}

CloudFormation

I started my script using the AWS template designer in the AWS console and figured out the properties and syntax from the Amazon documentation. It still wasn't that easy; I spent quite some time Googling around, reading blog post after how-to post, and while there are loads out there, everyone skims over the subject. I only found blogs preferring to talk about other, maybe cooler but highly technical things, like the Gradual Code Deployment feature of AWS, stuff I simply didn't need in this solution.

To create a Lambda function you'll need to look up the required properties, and the first one is Code, as in the source code for the function. You then need to drill down on Code to figure out what is needed. My current template used in the GitLab pipeline only creates a Lambda function; the following is the YAML version, which is easier to read:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LF4P1F9:
    Properties:
      Code:
        S3Bucket: aws-lambda-price-grabber
        S3Key: source.zip
      Description: Reading the price off the online shop.
      FunctionName: aws-lambda-price-grabber
      Handler: lambda_function.lambda_handler
      Role: arn:aws:iam::385753165070:role/lambda-website-checker
      Runtime: python2.7
    Type: AWS::Lambda::Function

In the Code property I have given the S3 bucket name where my Lambda function's source code is; it's as simple as that.

The Handler property "lambda_function.lambda_handler" is the filename, then a dot, and then the function name.

Visually showing where the Handler property comes from.
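
In code terms, the minimal skeleton that matches this Handler value is a file lambda_function.py containing a function called lambda_handler; the return value here is just a placeholder:

# lambda_function.py  ("lambda_function" is the part before the dot)
def lambda_handler(event, context):
    # "lambda_handler" after the dot is this function's name.
    return {'status': 'ok'}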

You need the ARN code for the role you created for your Lambda function to run under. You go to the IAM roles screen for that, as you can see in this screenshot.

Showing the IAM role where you get the ARN code for cloud formation

For valid values, like those for the Runtime property, you need to look through the AWS Lambda Developer Guide; you'll also save time by looking up all the required properties there.

You can see remnants of the CloudFormation designer, like the name LF4P1F9. I removed much of the metadata the designer tool adds, which makes the script much more challenging to read; I would encourage you to do the same.

I tested my CloudFormation script in the AWS console using my own credentials before moving on to the AWS CLI, which runs under the technical user, has fewer privileges, and might give you some work getting the IAM policy right.

AWS CLI

Initially, I was trying to develop this by running everything from the GitLab job, which is a very silly way to work. You will lose hours; install and set up the AWS CLI on your desktop instead. On Ubuntu it took the following commands:

:~$ pip install awscli
:~$ aws configure set default.region eu-central-1
:~$ aws configure set aws_access_key_id YOUR-SECRET-ID
:~$ aws configure set aws_secret_access_key YOUR-SECRET-KEY

You need to provide a region code and the access key for the IAM technical user you created in the previous step.

Simulating what GitLab has to do to deploy

You need to have a ZIP file ready to upload, the S3 bucket, and the technical user.

Step 1: Uploading the ZIP file to the S3 bucket.

:~$ aws s3 cp ./test.zip s3://aws-lambda-price-grabber/

Step 2: Packaging the CloudFormation script, ready to be deployed (my template).

:~$ aws cloudformation package --template-file ./deployment/aws-create-lambda.json --s3-bucket aws-lambda-price-grabber --output-template template-export.yml

Step 3: Deploying the CloudFormation script, which will create the Lambda function and provide the initial ZIP file as the source code.

:~$ aws cloudformation deploy --template-file template-export.yml --stack-name aws-lambda-price-grabber-stack --capabilities CAPABILITY_IAM

It takes the template-export.yml file we generated in the packaging step. The CAPABILITY_IAM flag acknowledges that the stack may create or modify IAM resources.

You may also want to delete what the CloudFormation script created. I noted this command would fail if the deployment had failed, and I then had to go into the AWS console to remove the stack manually.

:~$ aws cloudformation delete-stack --stack-name aws-lambda-price-grabber-stack

Step 4: The last command I want to share is for updating a Lambda function's source code, which you will need as well.

:~$ aws lambda update-function-code --function-name aws-lambda-price-grabber --s3-bucket aws-lambda-price-grabber --s3-key source.zip

GitLab Deploy Job

In the end, each completed job shows up green in the pipeline. In this section, you'll see my CI script.

GitLab pipeline jobs passing; build, test, package and deploy.

The .gitlab-ci.yml file has now become rather long.

variables:
    S3_BUCKET: aws-lambda-price-grabber
    AWS_DEFAULT_REGION: eu-central-1

stages:
  - build
  - test
  - package
  - deploy

build:
  image: python:2.7
  stage: build
  script:
    - echo "Building"
    - pip install -r source/requirements.txt -t source/
  artifacts:
    paths:
      - source/
      - test-data/

test:
  image: python:2.7
  stage: test
  script:
    - echo "Testing"
    - pip install -r source/requirements.txt
    - pip install nose2
    - nose2 -v

package:
  image: buildpack-deps
  stage: package
  script:
    - rm source/test_*
    - apt-get update
    - apt-get -y install zip unzip
    - cd source
    - zip -r source.zip *
  artifacts:
    paths:
      - ./

deploy:
  image: python:latest
  stage: deploy
  script:
    - pip install awscli
    - aws s3 cp source/source.zip s3://$S3_BUCKET/
    - aws cloudformation package --template-file ./deployment/aws-create-lambda.json --s3-bucket $S3_BUCKET --output-template template-export.yml
    - aws cloudformation deploy --template-file template-export.yml --stack-name aws-lambda-price-grabber-stack --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset
    - aws lambda update-function-code --function-name aws-lambda-price-grabber --s3-bucket $S3_BUCKET --s3-key source.zip
  artifacts:
    paths:
      - ./template-export.yml

The contents of aws-create-lambda.json were discussed above under CloudFormation. If you're really interested in this, you'd best have a look at my source code repository:

https://gitlab.com/neilspink/aws-lambda-price-grabber/tree/master

Next

The plan for the coming week: I want to schedule the function on AWS and get it to store some data in the cloud, too.