
Lambda web scraper project (week 6)

What Happened Last Week

I've been working on unit tests for my Python AWS Lambda functions. Those tests came in particularly handy when I changed from Python 2.7 to Python 3.6. Finally, I also started work on a class for creating log files, but that is unfinished.

The source code can be found here: https://gitlab.com/neilspink/aws-lambda-price-grabber

Why Change from Python 2.7 to Python 3.6

It hasn't bothered me so far, but I got the hint that moving to Python 3 is a must after receiving a newsletter from DigitalOcean about their book 'How To Code in Python'.

Apparently Python 2 is due to lose support in 2020, and generally when creating a solution or program you want it to have a little longer life than a year.

Up to this point, I was not aware that in Python 2 any number typed without decimals is treated as an integer, and dividing two integers does floor division. So a division like 5/2 gives 2 instead of 2.5. My progress has already slowed a little, and I don't need any extra surprises in the next modules of my project, where there will definitely be division going on.
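A quick sketch of the difference (Python 3 shown; the old Python 2 behaviour is in the comments):

```python
# Python 2 floor-divides integers: 5 / 2 evaluates to 2.
# In Python 3, / always returns a float; // is explicit floor division.
print(5 / 2)    # 2.5
print(5 // 2)   # 2  (the old Python 2 result of 5 / 2)
print(5 / 2.0)  # 2.5 (the Python 2 workaround, forcing a float)
```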

The final benefit of going to Python 3 is that it uses Unicode by default, and because web pages usually contain Unicode, this could also save extra development time. I remember in my first week seeing output like u'90.00' for the price, the u'' being Python 2 syntax for a Unicode string.
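A small Python 3 sketch of what changed: the u'' prefix is still accepted for backwards compatibility, but it is redundant, and the repr no longer shows it.

```python
# In Python 3, str is Unicode by default, so the u'' prefix from
# Python 2 is legal but has no effect.
price = u'90.00'
print(repr(price))        # '90.00' — no u prefix in the output
print(price == '90.00')   # True — both are the same str type
```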

What I Did to Upgrade to Python 3

I started by creating a branch in the source code, because I wasn't sure if I'd complete the job.

git checkout -b python3

The next thing I had to do was change the Python interpreter in PyCharm, my Python editor. That setting is found under File -> Settings -> Project -> Project Interpreter.

I installed pip3 and boto3, which I am using for accessing AWS resources like S3 buckets and Lambda functions.

sudo apt install python3-pip

pip3 install boto3

Apart from having to change print statements, e.g. the Python 2 style print 'Hello' to the Python 3 style print('Hello'), I found the import references to other classes had slightly changed.

After all the changes, I was exceptionally happy to have unit tests which proved everything was still working. The final changes were in my AWS CloudFormation template, setting "Runtime": "python3.6", and in my .gitlab-ci.yml, telling GitLab to take image: python:3.6

Time to merge the changes back into the master.

git checkout master
git merge python3
git push

I was amazed it all worked in just 2 commits. You can compare my before and after source code here: Commit af9a61a3 and Commit 5711a10e

Unit Test Code Coverage

Knowing what percentage of the source code gets tested can help you identify parts of a system you forgot to unit test, which is what happened to me while developing AWS Lambda functions that call other Lambda functions. I searched for how to get code coverage on my project and found there is a --with-coverage command line switch for the nose2 utility I've been using.

nose2 --with-coverage

I increased the code coverage from 26% to 65% by testing the invoker class, which calls other AWS Lambda functions. That increase is good enough for me. I'm not a believer in chasing high coverage percentages, because all too often such tests verify little of substance. I prefer a unit test I can learn something about the program from, i.e. a unit test that sets and reads a string parameter is not valuable; one that performs an action or calculation is.
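To illustrate the distinction, here is a hypothetical helper (not code from the project) with the kind of test I do find valuable, because it documents an actual behaviour of the code:

```python
import unittest

def total_price(prices):
    """Hypothetical helper: sum a list of scraped prices."""
    return round(sum(prices), 2)

class TestTotalPrice(unittest.TestCase):
    def test_sums_and_rounds(self):
        # Tells the reader something real: totals are rounded
        # to two decimal places.
        self.assertEqual(total_price([90.00, 9.999]), 100.0)
```

A test that merely assigned a string to a property and read it back would add to the coverage number without telling anyone anything.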


Unit Test Mocking

I'm not sure if you know what mocking is, but in case you haven't heard the term: it is a way to replace parts of the system under test with fake objects that simulate the behaviour of the real ones. In my case, I needed to mock AWS Lambda functions.

I lost a lot of time trying a library called Moto, which I found on GitHub. It looked very promising, but after several hours I gave up on it; my final test before giving up was cloning the library and running nose2, and none of its unit tests passed on my computer.

Luckily I found the documentation on Python.org 😀 https://docs.python.org/3/library/unittest.mock.html, and I got my unit tests working with patch. As I said at the beginning, five weeks ago: I don't think you need to pay for any online courses; the documentation is all there. Although I don't always find it easy to read, it was worth the effort.

Mocking an AWS Lambda Function

A little background on what I was testing. In my web scraper project, I want to get the prices from multiple websites. I have a list of jobs and invoke the grabber to get the prices.

My first step was to move all the code from my Lambda function to a class named Invoker.

from invoker import Invoker

def lambda_handler(event, context):
    if 'source' not in event:
        raise Exception("The 'source' key is missing from the event dictionary.")

    job = Invoker()

    result = job.grab(event)

    print(result)
    return result
In the Invoker class, I moved the AWS Lambda call into a private method called _invoke_lambda. The underscore prefix is just a hint to other programmers that it is intended for internal use.
    def grab(self, event):
        ...
        for site in website_list['sites']:
            ...
            response = self._invoke_lambda(payload)
The AWS Lambda call being:
@staticmethod
def _invoke_lambda(payload):

    client = boto3.client('lambda')

    return client.invoke(FunctionName='grab-price',
                         InvocationType='RequestResponse',
                         Payload=payload)

I just wanted it to return an arbitrary text message, "none". For that I added at the top

from unittest.mock import patch

and used patch.object as a context manager

lambda_result = "none" 

with patch.object(Invoker, '_invoke_lambda', return_value=lambda_result):
    result = Invoker().grab(events)
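Pulling the pieces together, here is a minimal, self-contained sketch of the same patching pattern. The Invoker class below is a stand-in for the project's real class, reduced to the bare minimum so the patch is easy to see:

```python
from unittest.mock import patch

class Invoker:
    """Stand-in for the project's Invoker; not the real implementation."""

    def grab(self, event):
        # The real grab() loops over a website list; here we go straight
        # to the lambda call.
        return self._invoke_lambda(event)

    @staticmethod
    def _invoke_lambda(payload):
        # In production this would call boto3's client.invoke(...)
        raise RuntimeError("would call AWS Lambda")

lambda_result = "none"
with patch.object(Invoker, '_invoke_lambda', return_value=lambda_result):
    result = Invoker().grab({'source': 'example'})

print(result)  # none — the patched method was used, AWS was never called
```

Inside the with block, every call to _invoke_lambda on any Invoker instance returns "none"; once the block exits, the original method is restored automatically.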

Coming Next

I want to finish saving the prices that get scraped from websites, schedule my crawler, and watch it run for a couple of days.


Lambda web scraper project (week 5)

What Happened Last Week

I've continued building Python-based Lambda functions on Amazon in a GitLab project aws-lambda-price-grabber...

In the first part of the week, I created the AWS CloudFormation scripts necessary to recreate this solution. I wanted to get the tool saving prices, but creating the template took up more time than I expected.

I also noticed I had neglected to unit test some parts and started refactoring to be able to write new tests. A code coverage utility might have highlighted that fact earlier! I am learning Python progressively but quickly ran into problems with my new unit tests: they were not running. Lesson learnt: not only does the filename of a unit test need to start with test_, but the method names do too.
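That naming rule can be checked directly with the standard library's test loader, which nose2 builds on. The class and method names below are hypothetical, just to show what gets discovered:

```python
# unittest (and nose2 on top of it) only discovers test methods whose
# names start with "test"; the same applies to the test file's name.
import unittest

class TestPriceParsing(unittest.TestCase):  # file: test_price_parsing.py
    def test_parses_decimal(self):          # discovered: starts with test
        self.assertEqual(float('90.00'), 90.0)

    def parses_decimal(self):               # NOT discovered: no test prefix
        pass

names = unittest.TestLoader().getTestCaseNames(TestPriceParsing)
print(names)  # ['test_parses_decimal']
```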

Stuff I had done in previous weeks, like downloading the libraries, zipping and manually uploading, is no longer required, so I changed my scripts in the build directory. One of the other things I did was to add an open-source license to the project.

Setting up using the AWS CloudFormation Stack

A stack is a collection of AWS resources that you can manage as a single unit. This allows you to create a template for a solution, which is what I did last week.

There are two ways you can set this up: run my deployment/aws-create-foundation.json template using the CloudFormation console, or, having created a new user with administrator-level access and access keys, use the AWS CLI. I recommend the CLI, because if you are going to install the solution you are likely to want to make changes, and the CLI will help you do everything faster:

aws cloudformation create-stack --stack-name grabber-foundation  --template-body file://./deployment/aws-create-foundation.json --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=S3BucketName,ParameterValue=aws-lambda-price-grabber

And to tear everything down again:

aws cloudformation delete-stack --stack-name grabber-foundation

The parameter value aws-lambda-price-grabber needs to be changed: Amazon S3 bucket names are globally unique, so no two people can use the same name.

One problem I was not able to solve programmatically is updating the S3 bucket name that appears in the CI/CD pipeline and in the CloudFormation template for creating the Lambda functions; you'll need to do that by hand.

Once the above commands are run and the files updated, then you simply need to run the CI/CD pipeline.


Lessons learnt while creating a CloudFormation template

(1). You don't have to create the CloudFormation (CF) scripts by hand; there is a CloudFormer tool that can reverse engineer the resources you have set up. I decided not to use it because it requires an EC2 instance, which would cost money. In retrospect, it probably would have been more cost-efficient to try it, because it took me about 9 hours to figure out everything needed to make my stack template.

(2). I found the AWS role documentation confusing, in particular what the AssumeRolePolicyDocument was all about:

"CFNRole": {
  "Type": "AWS::IAM::Role",
  "Properties": {
    "RoleName": "AssumeRolePolicyDocument",
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "lambda.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  }
}

The AssumeRolePolicyDocument content for a role is found on the IAM console under the "Trust Relationships" tab: click the blue "Edit" button, then copy and paste it into your template.


(3). The last lesson I want to share is about linking multiple policies to one role. In my CloudFormation template, I couldn't get that to work!

Over the last couple of weeks while setting up the role for my Lambda functions, I was creating a separate role for each kind of thing I wanted them to do:

  • Read/Write to S3 bucket.
  • Creating log entries.
  • Permissions to execute other functions.

I liked that because you then know exactly what each set of permissions is for. Unfortunately, when running my CloudFormation template, only one of the policies I wanted was ever hooked up to my new role, so I merged the permissions into one policy.


Coming Next

I want to finish unit testing the Lambda function that invokes the grabber, get the solution saving the prices it grabs, and send alerts.