
Lambda web scraper project (week 7)

What Happened Last Week

Battled through a problem with the imports in my Python 3 project. Spent time reading multiple definitive guides to solve the infamous Python “ModuleNotFoundError” exception. It didn't occur until I was 3 classes deep, that is, when a unit test referenced a class which imported another. My vlog this week is just on modules.

I got my log files saved to S3 buckets, and I am quite happy because in the process I think I managed to reduce the complexity. The program was accumulating if statements because I was trying to switch between local hard disk storage during tests and S3 buckets. I started using Abstract Base Classes and everything looks and feels much better.

 

Python Package Structural Changes

The last 6 weeks had been going too smoothly, but within a week of upgrading to Python 3 I hit a very challenging problem. I wrote unit tests which worked but then when I ran the modules from the command line I would get import errors. Fix one way of running the program and the other wouldn't work.

After several days of intense reading, I had the idea to look at some standard library code. I chose the calendar library and quickly realized you don't need to have one class per file, the way I was used to doing it in C#.

Once I merged all my classes into one module/file, the "ImportError: No module named XXXX" basically, poof, disappeared.

The solution now looks very different from 6 weeks ago.

Class diagram of Python web scraper classes

Feature Switch

Two weeks ago I started the logging feature, but the import module problem delayed me and I was unable to finish it. So I added a feature switch to allow the solution to still run when deployed to AWS.
class Log(object):
    """
    Stores latest execution information
    """

    FEATURE_ENABLED = False  # can be removed once S3 bucket is integrated

    def __init__(self, storage: AbstractLogStorage):
        self.storage = storage

    def latest_execution(self, job_name, price):
        if not self.FEATURE_ENABLED:
            return
        self._append_to_job_log(job_name, price)  # each job gets its own file with price history
        self._update_central_job_log(job_name, price)  # goes into the LAST_EXECUTED_FILENAME
...

Abstract Base Classes

In C# we get these plus Interfaces. I was searching for this early in the project when I started thinking about how to mock my S3 bucket and AWS Lambda functions.

class AbstractJobsStorage(ABC):
    """
    Jobs can be loaded from S3 or local hard disk (YML files)
    """

    @abstractmethod
    def load(self, object_name: str) -> dict:
        pass

Inheriting from ABC, together with the @abstractmethod decorator, makes the class abstract. You'll need to add both to the imports at the top of the file too.

from abc import ABC, abstractmethod

Then we can write different implementations. I started with storing data locally, until I had the data structures sorted out.

class JobStorageOS(AbstractJobsStorage):
    """
    Load jobs from local hard disk
    """

    def __init__(self, filepath):
        self.filepath = filepath

    def load(self, object_name):
        with open(self.filepath + object_name, 'r') as jobfile:
            # safe_load only builds plain data types; needs "import yaml" at the top of the file
            return yaml.safe_load(jobfile)

The inheritance is set on the class using (AbstractJobsStorage).
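
To make the effect of @abstractmethod concrete, here is a minimal sketch. IncompleteStorage, DictJobsStorage and can_instantiate are made-up names for illustration only and are not part of the project.

```python
from abc import ABC, abstractmethod


class AbstractJobsStorage(ABC):
    @abstractmethod
    def load(self, object_name: str) -> dict:
        """Return the job definition stored under object_name."""


class IncompleteStorage(AbstractJobsStorage):
    """Hypothetical subclass that forgets to implement load()."""


class DictJobsStorage(AbstractJobsStorage):
    """Hypothetical in-memory implementation that does implement load()."""

    def __init__(self, jobs: dict):
        self.jobs = jobs

    def load(self, object_name: str) -> dict:
        return self.jobs[object_name]


def can_instantiate(cls, *args) -> bool:
    """Return True if cls(*args) succeeds, False if Python raises TypeError."""
    try:
        cls(*args)
        return True
    except TypeError:
        return False
```

Python refuses to create an instance of a class that still has abstract methods, so can_instantiate returns False for AbstractJobsStorage and IncompleteStorage, and True for DictJobsStorage once it is given a dict.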

In my unit test, I used it like this.

storage = LogStorageOS(self.STORAGE_PATH)
log = Log(storage)

If you haven't done this before, my explanation might not be enough. I suggest you clone the solution, run pip3 install -r on the requirements.txt and requirements_unit_test.txt files, then run nose2. If it works and you have the diagram, you'll figure it out. Also read the python.org documentation on ABC.
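
For readers who would rather see the idea in one place, here is a rough sketch of how swapping the storage removes the need for if statements. The append method, the LogStorageMemory class and the simplified Log below are assumptions made for this example. The real interface in the repository may look different.

```python
import os
from abc import ABC, abstractmethod


class AbstractLogStorage(ABC):
    # Hypothetical interface, not copied from the project.
    @abstractmethod
    def append(self, object_name: str, line: str) -> None:
        """Add one line to the log stored under object_name."""


class LogStorageOS(AbstractLogStorage):
    """Appends to files below a local directory."""

    def __init__(self, filepath: str):
        self.filepath = filepath

    def append(self, object_name: str, line: str) -> None:
        with open(os.path.join(self.filepath, object_name), "a") as logfile:
            logfile.write(line + "\n")


class LogStorageMemory(AbstractLogStorage):
    """Keeps lines in a dict, so a test needs neither disk nor S3."""

    def __init__(self):
        self.files = {}

    def append(self, object_name: str, line: str) -> None:
        self.files.setdefault(object_name, []).append(line)


class Log:
    """Simplified stand-in for the Log class shown earlier."""

    def __init__(self, storage: AbstractLogStorage):
        self.storage = storage

    def latest_execution(self, job_name: str, price: str) -> None:
        # No "running locally or on AWS?" branch here:
        # the storage object decides where the line ends up.
        self.storage.append(job_name + ".log", price)
```

A test can pass LogStorageMemory() to Log and inspect its files dict afterwards, while production code would pass an S3-backed implementation of the same abstract class.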

AWS Problems

Deploying Lambda functions doesn't mean they work. I'm thinking of writing a post-deployment test for that. I mention it because it happened a couple of times and I didn't spot it.
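
Such a post-deployment check could look roughly like the sketch below. The function name lambda_is_healthy is invented for this example, and the client is passed in so the check can be tried without AWS. It relies on the boto3 behaviour that a synchronous invoke still returns StatusCode 200 when the handler raises, but then adds a FunctionError key to the response.

```python
def lambda_is_healthy(client, function_name: str, payload: bytes) -> bool:
    """Invoke the function synchronously and report whether it ran cleanly."""
    response = client.invoke(FunctionName=function_name,
                             InvocationType='RequestResponse',
                             Payload=payload)
    # StatusCode alone is not enough: a handler that raised an exception
    # still yields 200, with 'FunctionError' set in the response.
    return response.get('StatusCode') == 200 and 'FunctionError' not in response


# Possible use after a deployment (needs AWS credentials, so left as a comment;
# the payload here is only a placeholder):
#   import boto3
#   ok = lambda_is_healthy(boto3.client('lambda'), 'grab-price', b'{}')
```

A CI job could run this after the deployment step and fail the pipeline when it returns False.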

My solution has 2 AWS CloudFormation templates, and I had to modify the foundation template, which sets up the environment. The Lambda functions needed additional permissions to write to S3 buckets. Pushing the changes to GitLab did not make them appear on AWS, and I realise now that I need to rework this part of the solution because I create part of the stack manually.

Figuring out the policy change for the role that my Lambda functions use was also tricky. I thought just adding the s3:PutObject and s3:ListBucket actions directly to the template would be enough, but when that didn't work I tried the IAM policy editor and found out that s3:ListBucket needed additional restrictions. The good part is that you can copy and paste the policy back into your CloudFormation template.
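
One common reason for this kind of surprise is that the two actions apply to different resources: s3:ListBucket works on the bucket itself, while s3:PutObject works on the objects inside it. The sketch below builds such a policy document as a Python dict. The bucket name is made up, and this is only a plausible shape, not necessarily the exact policy the editor produced here.

```python
import json

BUCKET = "my-price-grabber-logs"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket ARN.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::" + BUCKET,
        },
        {
            # Object-level action: the resource is the keys inside the bucket.
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::" + BUCKET + "/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON has the same structure as the PolicyDocument of a role policy in a JSON CloudFormation template.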

On a final note, Lambda functions recently got a new feature called Layers. It allows functions to share libraries. I haven't bothered implementing it yet.

Coming Next

I have set up AWS to trigger my web scraper every 2 hours and will let it run for a few days. In the meantime, I have already created a new project for the next phase: doing what I did on Amazon, but on Google Cloud Platform (GCP).

https://gitlab.com/neilspink/gcp-price-grabber

I will be figuring out how to get the solution up and running on GCP.


Lambda web scraper project (week 6)

What Happened Last Week

I've been working on my Python unit tests involving AWS Lambda functions. Those unit tests then became particularly handy, as I changed from using Python 2.7 to Python 3.6. Finally, I also started work on a class for creating log files but that is unfinished.

The source code can be found here https://gitlab.com/neilspink/aws-lambda-price-grabber

Why Change from Python 2.7 to Python 3.6

It hasn't bothered me so far, but I got the hint that going to Python 3 is a must after receiving a newsletter from DigitalOcean with their book 'how to code in python'.

Apparently Python 2 is to lose support in 2020, and generally when creating a solution or program you want it to have a longer life than a year.

Up to this point, I was not aware that in Python 2 any number you type without decimals is treated as an integer, and that dividing two integers does floor division. So 5/2 gives 2 instead of 2.5. My progress has already slowed a little, and I don't need any extra programming effort in the next modules of my project, where there will definitely be division going on.
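
A short illustration of the Python 3 behaviour, with made-up prices:

```python
# Python 3: "/" is true division, "//" is floor division.
true_result = 5 / 2     # 2.5
floor_result = 5 // 2   # 2

# Hypothetical example: the average of two prices given as whole numbers.
prices = [90, 95]
average = sum(prices) / len(prices)   # 92.5 in Python 3; Python 2 would give 92

print(true_result, floor_result, average)
```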

The final benefit of going to Python 3 is that it uses Unicode by default, and because web pages usually contain Unicode this could also save development time. I remember in my first week seeing output like u'90.00' for the price, the u prefix being Python 2 syntax for a Unicode string.
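
As a small sketch of what "Unicode by default" means in practice (the label text is invented for the example):

```python
price = '90.00'             # in Python 3 every str literal is already Unicode
label = 'Preis: 90,00 €'    # hypothetical scraped text with a non-ASCII character

encoded = label.encode('utf-8')    # bytes, as they would arrive over the network
decoded = encoded.decode('utf-8')  # back to a str

print(repr(price))  # no u prefix in Python 3
print(decoded)
```

The u'...' prefix is still accepted in Python 3, so u'90.00' == '90.00' is True, but it is no longer needed.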

What I did to upgrade to Python3

I started by creating a branch in the source code, because I wasn't sure if I'd complete the job.

git checkout -b python3

The next thing I had to do was change the Python interpreter in PyCharm, my Python editor. That's found under File -> Settings -> Project -> Project Interpreter.

I installed pip3 and boto3, which I am using for accessing AWS resources like S3 buckets and Lambda functions.

sudo apt install python3-pip

pip3 install boto3

Apart from having to change print statements, e.g. from the Python 2 style print 'Hello' to the Python 3 style print('Hello'), I found that importing other classes had slightly changed.

After all the changes I was exceptionally happy to have unit tests which proved everything was still working. The final changes were in my AWS CloudFormation template making the "Runtime": "python3.6" and in my .gitlab-ci.yml for GitLab to take image: python:3.6

Time to merge the changes back into master.

git checkout master
git merge python3
git push

I was amazed it all worked in just 2 commits. You can compare my before and after source code here: Commit af9a61a3 and Commit 5711a10e

Unit Test Code Coverage

Knowing what percentage of the source code gets tested can help you identify parts of a system you forgot to unit test. That happened to me while developing AWS Lambda functions that were calling other Lambda functions. I started searching for how to get the code coverage on my project and found there is a --with-coverage command line switch for the nose2 utility I've been using.

nose2 --with-coverage

I increased the code coverage of the Invoker class, which was calling other AWS Lambda functions, from 26% to 65%. The increase is good enough for me. I'm not a believer in chasing high coverage percentages, because all too often the tests then end up less meaningful. I prefer it when I can learn something about a program from a unit test, i.e. a unit test setting and reading string parameters is not valuable, one doing an action or calculation is.

Showing code coverage using the nose2 utility with the command line switch --with-coverage

Unit Test Mocking

In case you haven't come across the term, mocking is a way to replace parts of the system under test with fake objects that simulate the behaviour of the real ones. In my case, I needed to mock AWS Lambda functions.

I lost a lot of time trying a library called Moto, which I found on GitHub. It looked very promising, but I gave up on it after spending several hours trying to get it to work. My final test before giving up was cloning the library and running nose2, and none of its unit tests worked on my computer.

Luckily I found the documentation on Python.org 😀 https://docs.python.org/3/library/unittest.mock.html, and I got my unit tests working with patch. I said this at the beginning 5 weeks ago: I don't think you need to pay for any online courses, the documentation is all there. Although I don't always find it easy to read, it was worth the effort.

Mocking an AWS Lambda Function

A little background on what I was testing. In my web scraper project, I want to get the prices from multiple websites. I have a list of jobs and invoke the grabber to get the prices.

As a first step, I moved all the code from my Lambda function to a class named Invoker.

from invoker import Invoker

def lambda_handler(event, context):
    if 'source' not in event:
        raise Exception("The 'source' key is missing from the event dictionary.")

    job = Invoker()

    result = job.grab(event)

    print(result)
    return result

In the Invoker class, I moved the AWS Lambda call into a private method called _invoke_lambda. The underscore prefix is just a hint to other programmers that it is intended for internal use.

    def grab(self, event):
        ...
        for site in website_list['sites']:
            ...
            response = self._invoke_lambda(payload)

The AWS Lambda call itself is:

@staticmethod
def _invoke_lambda(payload):

    client = boto3.client('lambda')

    return client.invoke(FunctionName='grab-price',
                         InvocationType='RequestResponse',
                         Payload=payload)

In my unit test, I just wanted it to return an arbitrary text message, "none". For that I added at the top

from unittest.mock import patch  # part of the standard library in Python 3

and used patch.object as a context manager

lambda_result = "none" 

with patch.object(Invoker, '_invoke_lambda', return_value=lambda_result):
    result = Invoker().grab(events)
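
For anyone who wants to try this without AWS, here is a self-contained sketch of the same technique. The Invoker below is a simplified stand-in and not the class from the repository, and run_grab_with_mock is a helper invented for the example.

```python
from unittest.mock import patch


class Invoker:
    """Simplified stand-in for the real Invoker class."""

    def grab(self, event):
        # The real method loops over a list of sites; one call is enough here.
        return self._invoke_lambda(event)

    @staticmethod
    def _invoke_lambda(payload):
        # Stands in for the boto3 call, which must never run in a unit test.
        raise RuntimeError("would call AWS Lambda")


def run_grab_with_mock(event):
    """Run grab() with _invoke_lambda replaced, return (result, call count)."""
    with patch.object(Invoker, '_invoke_lambda', return_value="none") as fake:
        result = Invoker().grab(event)
    return result, fake.call_count
```

Inside the with block, _invoke_lambda is a mock that returns "none" and records how often it was called. As soon as the block ends, the original method is restored.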

Coming Next

I want to finish saving the prices that get scraped from websites, schedule my crawler and watch it run for a couple of days.