Serverless application for scraping and filtering

26.4.2018 | 6 minutes reading time

In this article, I describe an application I wrote for scraping and filtering advertisements from a few different websites. The application uses the serverless framework, runs on AWS, and is written in Python.

Background story

I am looking for a flat to buy. Many adverts recur across different ad websites, and the same flats are advertised multiple times so that they always stay on the first page of results. I therefore thought it would be interesting to write an application which scrapes the advertisements from different sites and compares them. The main goal is to produce a browsable list of unique advertisements.

For the past couple of months, I have been working with serverless technologies, so I decided to implement this application as a serverless application, document it, and share it in a blog post as an intro to the serverless world with AWS and the serverless framework.

The application is simple enough to understand, yet it is not a typical hello world application.

The project is available on my GitLab account.

Concept of serverless

In a nutshell, serverless means that you do not have to think about servers. You just write the code which executes the business logic, and the provider takes care of the rest (spinning up a container, initializing the execution environment, executing the code, scaling, etc.).

This enables fast project setup and efficient development.

Application architecture

The application is designed to be hosted entirely on AWS. Lambdas implement the business logic (fetching, processing and filtering the advertisements) and serve the processed data through an HTTP endpoint using API Gateway. DynamoDB is used to store the data. A public S3 bucket is used to host the frontend.

The architecture of the application is in the picture below.

The architecture consists of the following functions:

scraper – scrapes the advertisements from three different advertising sites. It extracts the data from the ads, formats it, and puts it into the ScrapedAdverts DynamoDB table. This is a scheduled lambda which is executed three times per day.
aggregator – reads the data from the ScrapedAdverts table and processes it. It checks whether a given advertisement already exists in the FilteredAdverts table by performing a similarity check. If the advertisement exists, it is updated. If it does not exist, it is inserted. This lambda is also scheduled and runs several times per day. Each run processes only one chunk of data from ScrapedAdverts (the amount of data returned by a single DynamoDB scan).
adverts_controller – acts as the request handler for API Gateway. It is mapped to the GET /adverts/get?page= call.
db_cleaner – is executed once per day and cleans the ScrapedAdverts table by deleting the entries which are older than 15 days.
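
The similarity check performed by the aggregator is not spelled out in this post, so the following is only a minimal sketch of how such a check could work, not the project's actual algorithm. It assumes adverts are dictionaries with the title and area fields used in the JSON response format, compares titles with difflib from the standard library, and uses a hypothetical threshold of 0.8. The names normalize and is_similar are made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return ' '.join(title.lower().split())


def is_similar(advert_a, advert_b, threshold=0.8):
    """Return True if two adverts probably describe the same flat.

    Assumption: adverts with different areas are never the same flat,
    and titles are compared with a simple sequence-matching ratio.
    """
    if advert_a['area'] != advert_b['area']:
        return False
    ratio = SequenceMatcher(
        None, normalize(advert_a['title']), normalize(advert_b['title'])
    ).ratio()
    return ratio >= threshold
```

With a check like this, the aggregator would update the existing entry in FilteredAdverts when is_similar returns True and insert a new entry otherwise.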

The frontend is a static website (HTML and JS) hosted in a public S3 bucket. It fetches the data from the GET /adverts/get?page= endpoint, which is served by adverts_controller, and visualizes it. The next screenshot shows the frontend.

The data is a list of scraped and filtered adverts in JSON format:

{
   "items":[
      {
         "metadata":[
            "Key1=Value1",
            "Key2=Value2"
         ],
         "location":[
            "Location name 1",
            "Location name 2"
         ],
         "area":55,
         "processed":true,
         "timestamp":1524463412,
         "images":[
            "https://url.to/image.jpg"
         ],
         "text":"Longer description of the property",
         "link":"https://link.to/propery",
         "advertiser":{
            "name":"Advertiser name",
            "phones":[
               "066 1234567",
               "021 1234567"
            ]
         },
         "price":66000,
         "title_hash":"11344e17595d494506e87fa61925018b34443016",
         "title":"Title of advert",
         "similar_adverts":[
            {
               "link":"http://link.to/similar-advert/1",
               "title":"Similar advert 1"
            }
         ]
      }
   ],
   "page":0,
   "number_of_pages":5,
   "count":124,
   "page_count":25
}
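
To make the paging fields in this response more concrete, here is a small sketch of how they could be derived. It is an illustration, not the project's code. It works on an in-memory list instead of DynamoDB, the function name paginate and the default page size of 25 are assumptions, and page_count is assumed to mean the number of items on the returned page.

```python
from math import ceil


def paginate(adverts, page, page_size=25):
    """Build the response envelope for one zero-based page of adverts."""
    count = len(adverts)
    number_of_pages = ceil(count / page_size) if count else 0
    start = page * page_size
    items = adverts[start:start + page_size]
    return {
        'items': items,
        'page': page,
        'number_of_pages': number_of_pages,
        'count': count,
        # Assumption: page_count is the number of items on this page.
        'page_count': len(items),
    }
```

Under these assumptions, 124 adverts with 25 per page give 5 pages, which matches the values in the example response.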

Project structure

The project is structured in a way that every function is in its own folder, and has its own dependencies (requirements.txt).
The exceptions are the following directories:

test – contains the unit tests.
utils – contains common helper functions. The content of this directory is included in every packaged function.
client/dist – contains the frontend code (HTML and JS).

In the root of the project is the file serverless.yml, which is the main entry point. This file configures the serverless framework for this application.

The project has the following structure:

The serverless.yml

The main entry point of the project is the serverless.yml file. This file tells the serverless framework what to deploy and how.

It consists of the following parts:

provider: configures the cloud provider, which is AWS in our case. It defines the runtime, region and other common values which are applied to every function.
package: configures the way of packing the functions.
functions: defines the lambda functions. Each function has its own configuration: handler specifies the method which is called when the function is invoked, and events defines what invokes the function. The global configuration values can be overridden per function.
resources: defines the resources which should be created when the application is deployed. The resources part must use CloudFormation syntax.
plugins: defines the plugins which are used by the serverless framework.
custom: defines the custom variables set by the user and the configuration values for the plugins.

The project’s serverless.yml is available in the GitLab repository.

Below is an example which defines a function, configures an HTTP event for the given function and creates a DynamoDB table:

service: my-sls-service

provider:
  name: aws
  runtime: python3.6
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "dynamoDB:*"
      Resource: "*"

package:
  individually: true

functions:
  my_controller:
    handler: lambda_handler.handle
    module: my_controller
    environment:
      USERS_TABLE: Users
    events:
      - http:
          path: users/get/all
          method: get

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: Users
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S
        KeySchema:
          - AttributeName: email
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 5
          WriteCapacityUnits: 5

plugins:
  - serverless-python-requirements

The lambda handler

On AWS Lambda, when using Python, the function which handles the invocation should have the following signature:

def handler(event, context):
    return

event holds the data which is passed to the function. For example, if the handler handles HTTP events, the request body, the query parameters, the path parameters, etc. are passed in the event object.
context is injected by the AWS Lambda runtime, and it can be used to gather information about the runtime and to interact with it. More information is available in the AWS Lambda documentation.
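
As an illustration of reading data from the event object, here is a minimal sketch which extracts the page query parameter from an API Gateway proxy event. The helper name get_page and the fallback to page 0 are assumptions made for this example and are not taken from the project. With the proxy integration, queryStringParameters is None when the request has no query string, so the sketch guards against that.

```python
def get_page(event):
    """Read the zero-based page number from an API Gateway proxy event."""
    # queryStringParameters is None when the request has no query string.
    params = event.get('queryStringParameters') or {}
    try:
        page = int(params.get('page', 0))
    except (TypeError, ValueError):
        # Assumption: fall back to the first page on invalid input.
        return 0
    return max(page, 0)
```

A handler for GET /adverts/get?page= could call get_page(event) first and then load the matching page of adverts.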

The Boto3 library is the de facto standard in Python for interacting with AWS services. It is available in the AWS Lambda Python runtime.

A handler for the serverless.yml example above could look like this:

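import json
from os import environ as env

import boto3


def handle(event, context):
    # The table name comes from the USERS_TABLE variable set in serverless.yml.
    users_table = boto3.resource('dynamodb').Table(env['USERS_TABLE'])
    items = users_table.scan()['Items']
    # The default lambda-proxy integration expects a statusCode and a string body.
    # default=str serializes the Decimal values DynamoDB returns for numbers.
    return {'statusCode': 200, 'body': json.dumps(items, default=str)}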

Testing

So far, so good: it’s simple and easy to write functions. But what about testing? Testing a lambda function by deploying it, invoking it, and then watching the logs is a bad idea.

Writing unit tests is a crucial step in writing better code. Fortunately, lambda functions are easily testable. Moto is a powerful library for testing lambda functions. It mocks AWS services like DynamoDB, and the mocked service behaves like the real one.

You can find more about testing in the project’s readme file.
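
To give an idea of what a unit test without any AWS access can look like, here is a minimal sketch. Note that it does not use Moto: it swaps in a hand-written stub from the standard library's unittest.mock for the DynamoDB table. The helper list_users and the test data are hypothetical and not taken from the project. With Moto, the stub would be replaced by a mocked DynamoDB service which the code talks to through Boto3.

```python
import json
from unittest.mock import MagicMock


def list_users(table):
    """Hypothetical helper: scan a table and wrap the items in an HTTP response."""
    items = table.scan()['Items']
    return {'statusCode': 200, 'body': json.dumps(items, default=str)}


def test_list_users_returns_all_items():
    # The stub stands in for a boto3 DynamoDB Table object.
    fake_table = MagicMock()
    fake_table.scan.return_value = {'Items': [{'email': 'a@example.com'}]}

    response = list_users(fake_table)

    assert response['statusCode'] == 200
    assert json.loads(response['body']) == [{'email': 'a@example.com'}]
    fake_table.scan.assert_called_once_with()


test_list_users_returns_all_items()
```

Passing the table in as an argument keeps the logic independent of the AWS environment, so the test runs in milliseconds on a developer machine.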

Conclusion

We’ve seen that implementing a simple application is easy with the help of the serverless framework and the AWS stack.

There are several things which can be added/improved:

security: deny all permissions and allow only needed permissions on function level
ability to manage ads: add an authenticated user which can manage the scraped adverts
make the scraper configurable
improve the adverts matching algorithm

Links

Project on my GitLab
Serverless framework
Boto3
Moto
Awesome Serverless

Blog author

Jozef Jung
