Abstract Image for end of scraper project

Lambda web scraper project (week 8)

What Happened Last Week

In this final week, I was hoping to move my AWS Lambda functions over to Google Cloud Platform (GCP). Unfortunately, at least for me at this time, it didn't work: the GCP Python 3.7 functions offering is currently in beta, which means it's available for testing but not general release.

I didn't just give GCP a try though; I also registered on a new platform, https://binaris.com, which suggests its offering provides functions that invoke in milliseconds and can replace any container- or instance-based service. Here I also unfortunately got stuck, but this time it was more the lack of documentation. Their service is also not fully finished.

Are Functions For Me?

For the moment I say hell YES! It currently costs $0 for me to run them.

While I don't like being tied into AWS and not being able to easily switch to another platform, I haven't had to pay a dime for running my Lambda functions.

No Amount Due

My monthly bills during development show:

  • November $0 for 1.250 Lambda-GB-Second
  • December $0 for 46.375 Lambda-GB-Second
  • January so far $0 for 129.700 Lambda-GB-Second

Now that I am running my crawler every 2 hours on two pages, I estimate that by the end of January it will be about 600 Lambda-GB-Seconds. On AWS Lambda you currently get 400,000 GB-Seconds per month for free.
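If you want to sanity-check your own numbers, the billing maths is just allocated memory in GB multiplied by execution time in seconds. Here is a quick back-of-the-envelope calculation; the memory size and average run time below are assumptions for illustration, not my exact configuration:

# Rough Lambda usage estimate (illustrative numbers, not my actual configuration)
MEMORY_GB = 0.128           # assuming the function is allocated 128 MB
AVG_RUNTIME_SECONDS = 6     # assumed average execution time per invocation
RUNS_PER_DAY = 12 * 2       # every 2 hours, on two pages
DAYS = 31

gb_seconds = MEMORY_GB * AVG_RUNTIME_SECONDS * RUNS_PER_DAY * DAYS
print(f"Estimated usage: {gb_seconds:.0f} GB-Seconds per month")

FREE_TIER_GB_SECONDS = 400_000
print(f"Free tier used: {gb_seconds / FREE_TIER_GB_SECONDS:.3%}")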

You need to be wary if you think you'll become a high-volume user. The service costs could change at any time, and depending on your architecture it might be expensive to get away from their pricing model.


Videos from the project

As part of this 8-week project, I managed to upload over 2.5 hours of video to YouTube and talked extensively about GitLab CI/CD, Python 2 to 3 migration, unit testing and learning Python, as well as writing AWS CloudFormation templates.

https://www.youtube.com/playlist?list=PL67_9ze31skoKRh-h0BMtukzC41Ossj8p

YouTube playlist for the Python web scraper project

What's Next

I'm not sure. Last week I did start playing with porting my solution to Docker, but I know running containers will certainly cost money. I need to sleep on it and will vlog about it next Friday here: https://www.youtube.com/neilspink

Abstract Image for Python Web Scraper

Lambda web scraper project (week 7)

What Happened Last Week

Battled through a problem with the imports on my Python 3 project. Spent time reading multiple definitive guides to solve the infamous Python “ModuleNotFoundError” exception. It didn't occur until I was 3 classes deep: a unit test was referencing a class which imported another. My vlog this week is just on modules.

I got my log files saved to S3 buckets, and I am quite happy because in the process I think I managed to reduce the complexity. The program was accumulating many if statements because I was trying to switch between local hard disk storage during tests and S3 buckets. I started using Abstract Base Classes and everything looks and feels much better.

 

Python Package Structural Changes

The last 6 weeks had been going too smoothly, but within a week of upgrading to Python 3 I hit a very challenging problem. I wrote unit tests which worked, but then when I ran the modules from the command line I would get import errors. Fixing one way of running the program would break the other.

After several days of intense reading I had the idea to look at some code from the standard library. I chose the calendar module and quickly realized you don't need to have one class per file, the way I was used to doing it in C#.

Once I merged all my classes into one module/file, the ImportError: No module named XXXX basically, poof, disappeared.

Actually, the solution now looks much different to how it did 6 weeks ago.

Class diagram of Python web scraper classes

Feature Switch

Two weeks ago I started the logging feature, but the module import problem delayed me and I was unable to finish it. So I added a feature switch to allow the solution to still run when deployed to AWS.
class Log(object):
    """
    Stores latest execution information
    """

    FEATURE_ENABLED = False  # can be removed once S3 bucket is integrated

    def __init__(self, storage: AbstractLogStorage):
        self.storage = storage

    def latest_execution(self, job_name, price):
        if not self.FEATURE_ENABLED:
            return
        self._append_to_job_log(job_name, price)  # each job gets its own file with price history
        self._update_central_job_log(job_name, price)  # goes into the LAST_EXECUTED_FILENAME
...

Abstract Base Classes

In C# we get these plus Interfaces. I was searching for this early in the project when I started thinking about how to mock my S3 bucket and AWS Lambda functions.

class AbstractJobsStorage(ABC):
    """
    Jobs can be loaded from S3 or local hard disk (YML files)
    """

    @abstractmethod
    def load(self, object_name: str) -> dict:
        pass

Passing in (ABC) makes it abstract. You'll need to add it to the imports list at the top of the file too.

from abc import ABC, abstractmethod

Then we can write different implementations. I started with storing data locally, until I had the data structures sorted out.

import yaml  # PyYAML; the snippet needs this import at the top of the file


class JobStorageOS(AbstractJobsStorage):
    """
    Load jobs from local hard disk
    """

    def __init__(self, filepath):
        self.filepath = filepath  # directory path, expected to end with a separator

    def load(self, object_name):
        # Job definitions are YAML files sitting on the local disk
        with open(self.filepath + object_name, 'r') as logfile:
            return yaml.load(logfile)  # newer PyYAML versions prefer yaml.safe_load

The inheritance is set on the class using (AbstractJobsStorage).
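The S3 counterpart ends up looking very similar. What follows is only a minimal sketch of the idea using boto3, with a placeholder bucket name rather than the exact code from my repository:

import boto3
import yaml


class JobStorageS3(AbstractJobsStorage):
    """
    Load jobs from an S3 bucket (sketch; the bucket name is a placeholder)
    """

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name
        self.s3 = boto3.client('s3')

    def load(self, object_name):
        # Fetch the YAML job definition from the bucket and parse it
        response = self.s3.get_object(Bucket=self.bucket_name, Key=object_name)
        return yaml.safe_load(response['Body'].read())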

In my unit test, I used it like this.

storage = LogStorageOS(self.STORAGE_PATH)
log = Log(storage)
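The nice side effect is that mocking becomes trivial: a fake in-memory implementation of the abstract class can stand in for S3 or the disk. This is just an illustrative sketch with made-up names, and it assumes the abstract log storage class exposes save/load style methods:

import unittest


class FakeLogStorage(AbstractLogStorage):
    """In-memory stand-in so the test never touches S3 or the disk."""

    def __init__(self):
        self.saved = {}

    # assuming the abstract class defines save/load style methods
    def save(self, object_name, content):
        self.saved[object_name] = content

    def load(self, object_name):
        return self.saved.get(object_name, '')


class TestLog(unittest.TestCase):
    def test_latest_execution_writes_nothing_while_feature_is_off(self):
        storage = FakeLogStorage()
        log = Log(storage)
        log.latest_execution('job-1', 19.99)
        self.assertEqual({}, storage.saved)  # feature switch is still disabled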

If you haven't done it before, my explanation might not be enough. I suggest quickly cloning the solution, running pip3 on the requirements.txt and requirements_unit_test.txt files, then nose2. If it works and you have the diagram, you'll figure it out. Also read the python.org documentation on the abc module.

AWS Problems

Deploying Lambda functions doesn't mean they work. I'm thinking of writing a post-deployment test for that, and I mention it because it occurred a couple of times and I didn't spot it.
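Such a post-deployment check could be as simple as invoking the function once from the pipeline and failing if it errors. This is only a sketch of the idea; the function name and payload are placeholders, not values from my stack:

import json

import boto3


def smoke_test(function_name='price-grabber', payload=None):
    """Invoke the deployed Lambda once and raise if it did not run cleanly."""
    client = boto3.client('lambda')
    response = client.invoke(
        FunctionName=function_name,
        InvocationType='RequestResponse',
        Payload=json.dumps(payload or {}),
    )
    # A handler that raised an exception comes back with a FunctionError field
    if response.get('FunctionError'):
        raise RuntimeError(response['Payload'].read().decode())
    return response['StatusCode']  # 200 means the synchronous invoke succeeded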

My solution has 2 AWS CloudFormation templates, and I had to make modifications to the foundation template which sets up the environment. The Lambda functions needed additional permissions to write to S3 buckets. Pushing the changes to GitLab did not make them appear on AWS, and I realise now that I need to rework this part of the solution because I create part of the stack manually.

Figuring out the policy change for the role that my Lambda functions use was also tricky. I thought just adding the s3:PutObject and s3:ListBucket actions directly to the template would be enough, but after it didn't work I tried the IAM policy editor and found out that s3:ListBucket needed additional restrictions. The good part here is that you can copy-paste the policy back into your CloudFormation template.
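For anyone hitting the same wall: the usual gotcha is that s3:ListBucket applies to the bucket ARN itself, while s3:PutObject applies to the objects inside it (the /* resource). Here is a minimal sketch of such a policy as a Python dictionary with a placeholder bucket name, attached via boto3; in my case the equivalent statements went back into the CloudFormation template:

import json

import boto3

# Hypothetical bucket and role names, purely for illustration.
BUCKET = 'price-grabber-logs'

policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': 's3:ListBucket',
            'Resource': f'arn:aws:s3:::{BUCKET}',    # the bucket itself
        },
        {
            'Effect': 'Allow',
            'Action': 's3:PutObject',
            'Resource': f'arn:aws:s3:::{BUCKET}/*',  # the objects inside it
        },
    ],
}

boto3.client('iam').put_role_policy(
    RoleName='price-grabber-lambda-role',
    PolicyName='s3-logging-access',
    PolicyDocument=json.dumps(policy),
)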

On a final note, Lambda functions recently got a new feature called Layers, which allows functions to share libraries. I haven't bothered implementing it yet.

Coming Next

I have set up AWS to trigger my web scraper every 2 hours (there's a small sketch of that wiring at the end of this post) and will let it run for a few days. In the meantime, I have already created a new project for the next phase: to do what I did on Amazon, but on Google Cloud Platform (GCP).

https://gitlab.com/neilspink/gcp-price-grabber

I will be figuring out how to get the solution up and running on GCP.
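As a closing technical note, the 2-hourly trigger mentioned above is essentially a scheduled CloudWatch Events rule pointed at the Lambda function. Here is a rough boto3 sketch of that wiring; the names and the ARN are placeholders, and in my project this sort of thing belongs in the CloudFormation template rather than a script:

import boto3

events = boto3.client('events')
awslambda = boto3.client('lambda')

# Create (or update) a rule that fires every 2 hours.
rule_arn = events.put_rule(
    Name='price-grabber-every-2-hours',
    ScheduleExpression='rate(2 hours)',
    State='ENABLED',
)['RuleArn']

# Allow the rule to invoke the function, then point the rule at it.
awslambda.add_permission(
    FunctionName='price-grabber',
    StatementId='allow-cloudwatch-schedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)

events.put_targets(
    Rule='price-grabber-every-2-hours',
    Targets=[{'Id': 'price-grabber-target',
              'Arn': 'arn:aws:lambda:eu-west-1:123456789012:function:price-grabber'}],
)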