How To Make A Web Scraper With AWS Lambda And The Serverless Framework?
Before starting development, you should be familiar with the following:
Node.js and modern JavaScript
NPM
The Document Object Model
Basic Linux command line
Basic donkey care
The idea behind AWS is that Amazon provisions and maintains every aspect of your application, from storage to processing power, in a cloud environment (i.e., on Amazon's computers), allowing you to build cloud-hosted apps that scale automatically. You won't have to deal with setting up or managing servers because Amazon takes care of it. A Lambda function is a cloud-based function that runs on demand, triggered by events or API requests. Using the serverless framework is the recommended way to develop a Lambda function.
Why Use a Scraper?
For instance, you might want to fetch the recipes posted on a particular website. Scraping makes it possible to extract this information directly from the site's pages.
Step 1: Serverless Setup
Read the quick start guide for the serverless framework. Serverless eliminates most of the difficulty of setting up AWS infrastructure, allowing you to develop and test locally before deploying everything to the cloud.
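If you don't have the framework installed yet, it is distributed as an npm package; a global install is the usual approach:
$ npm install -g serverless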
To create a new serverless project:
$ serverless create --template aws-nodejs --path donkeyjob
$ cd donkeyjob
The project starts with a serverless.yml file. YAML is a language commonly used for configuration, and this file holds all of the AWS configuration for the project. For the time being, we can ignore all of the comments and reduce it to the following:
service: donkeyjob
provider:
  name: aws
  runtime: nodejs6.10
functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs
This says we have one function, getdonkeyjobs, and that we will export a function of that name from handler.js. This is the function that will be deployed to AWS and triggered to scrape the job listings data.
Create a basic function in handler.js.
A Lambda handler takes three arguments: an event, a context, and a callback. Let's start with something basic for now; you can remove the rest of the generated file.
module.exports.getdonkeyjobs = (event, context, callback) => {
callback(null, 'Hello world');
};
Test the function locally:
$ serverless invoke local --function getdonkeyjobs
“Hello world” should be the result of the above command.
Step 2: Scraping The Data
Next, we'll build the scraping functionality for The Donkey Sanctuary jobs page, parsing the HTML to produce the list of jobs in the required format:
[
{job: 'Marketing Campaigns Officer', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'Leeds, UK'},
{job: 'Registered Veterinary Nurse', closing: 'Sat Jul 22 2017 00:00:00 GMT+0100', location: 'Manchester, UK'},
{job: 'Building Services Manager', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'London, UK'}
];
Axios is used to request the page contents; the HTML string it returns is then handed to a parsing function that can be tested in isolation. Inside the parsing function, the cheerio library is used to parse the HTML and extract the desired data.
Cheerio is similar to jQuery: you feed it an HTML string (for example, the response you get from a GET request for a page) and it constructs a DOM-like document object for you to traverse and query.
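For instance (a standalone illustration, not part of the scraper itself):
const cheerio = require('cheerio');
const $ = cheerio.load('<ul><li class="job">Farrier</li><li class="job">Groomer</li></ul>');
console.log($('li.job').first().text()); // => 'Farrier'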
Moment is a useful package for dealing with dates; it makes it simple to convert the scraped date into an ISO string.
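All three packages come from npm:
$ npm install axios cheerio moment
As a quick illustration of the Moment call we'll rely on (the sample date here is made up):
const moment = require('moment');
moment('21/07/2017', 'DD/MM/YYYY').toISOString();
// => '2017-07-20T23:00:00.000Z' (the exact value depends on your local timezone)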
// handler.js
const request = require('axios');
const {extractListingsFromHTML} = require('./helpers');
module.exports.getdonkeyjobs = (event, context, callback) => {
request('https://www.thedonkeysanctuary.org.uk/vacancies')
.then(({data}) => {
const jobs = extractListingsFromHTML(data);
callback(null, {jobs});
})
.catch(callback);
};
// helpers.js
const cheerio = require('cheerio');
const moment = require('moment');
function extractListingsFromHTML (html) {
const $ = cheerio.load(html);
const vacancyRows = $('.view-Vacancies tbody tr');
const vacancies = [];
vacancyRows.each((i, el) => {
// Extract information from each row of the jobs table
let closing = $(el).children('.views-field-field-vacancy-deadline').first().text().trim();
let job = $(el).children('.views-field-title').first().text().trim();
let location = $(el).children('.views-field-name').text().trim();
closing = moment(closing.slice(0, closing.indexOf('-') - 1), 'DD/MM/YYYY').toISOString();
vacancies.push({closing, job, location});
});
return vacancies;
}
module.exports = {
extractListingsFromHTML
};
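To sanity-check the parser without hitting the network, we can feed extractListingsFromHTML a hand-written HTML snippet that mirrors the classes the real page uses (the markup below is a hypothetical sample, not copied from the live site):
// test.js - a quick local check of the parser
const { extractListingsFromHTML } = require('./helpers');
const sample = `<div class="view-Vacancies"><table><tbody><tr>
  <td class="views-field-title">Donkey Groomer</td>
  <td class="views-field-name">Leeds, UK</td>
  <td class="views-field-field-vacancy-deadline">21/07/2017 - 17:00</td>
</tr></tbody></table></div>`;
console.log(extractListingsFromHTML(sample));
// => [ { closing: '2017-07-20T23:00:00.000Z', job: 'Donkey Groomer', location: 'Leeds, UK' } ]
// (the exact ISO value depends on your local timezone)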
To use cheerio well, you need to know how to navigate the DOM precisely and select the elements you want. To accomplish this, use your browser's dev tools to study the HTML structure of the site you're scraping, and keep in mind that if that HTML layout changes in the future, your scraper may stop working.
If we run our function now, we should see the array of jobs:
$ serverless invoke local --function getdonkeyjobs
Step 3: Set Up DynamoDB
A Lambda function cannot persist data between invocations; anything it holds is temporary. We will configure DynamoDB as an AWS resource and give the Lambda function permission to interact with it. serverless.yml now looks like this:
service: donkeyjob
provider:
  name: aws
  runtime: nodejs6.10
functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs
resources:
  Resources:
    donkeyjobs:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: donkeyjobs
        AttributeDefinitions:
          - AttributeName: listingId
            AttributeType: S
        KeySchema:
          - AttributeName: listingId
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1
    # A policy is a resource that states one or more permissions. It lists actions, resources and effects.
    DynamoDBIamPolicy:
      Type: AWS::IAM::Policy
      DependsOn: donkeyjobs
      Properties:
        PolicyName: lambda-dynamodb
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:DescribeTable
                - dynamodb:Query
                - dynamodb:Scan
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: arn:aws:dynamodb:*:*:table/donkeyjobs
        Roles:
          - Ref: IamRoleLambdaExecution
In order to create the DynamoDB resource, we'll have to deploy this to AWS. Because a Lambda function is merely a function, we could test it locally before communicating with AWS, but we can't test how the database behaves without actually having one. So we run:
$ serverless deploy
This sends our program to AWS and generates the resources we specified in the configuration file.
Step 4: Interact with DynamoDB
It is now possible to use the database. To do so, we need to install the aws-sdk package (the AWS Software Development Kit), which makes interacting with DynamoDB simple.
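Like our other dependencies, it's installed from npm (note that the SDK is also available by default inside the Lambda environment itself):
$ npm install aws-sdk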
Here are the steps for scraping a new list of jobs:
Fetch yesterday's jobs from the database using the dynamo.scan method. The stored item looks something like this:
{
  jobs: [
    {job: 'Donkey Feeder',
     closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
     location: 'Leeds, UK'},
    {job: 'Chef',
     closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
     location: 'Sheffield, UK'}
  ],
  listingId: 'Fri Jul 21 2017 14:25:35 GMT+0100 (BST)'
}
Work out which jobs are new by comparing yesterday's jobs with today's, using a couple of handy lodash functions, differenceWith and isEqual (see the short example after this list).
Delete yesterday's jobs from the database with dynamo.delete.
Save the new list of jobs in their place with dynamo.put.
Call back with the new jobs.
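As a quick illustration of that lodash comparison (standalone, with made-up job objects):
const { differenceWith, isEqual } = require('lodash');
const yesterdaysJobs = [{ job: 'Chef' }];
const todaysJobs = [{ job: 'Chef' }, { job: 'Donkey Feeder' }];
// differenceWith keeps the items of the first array that have no
// deep-equal counterpart in the second, i.e. the new jobs
console.log(differenceWith(todaysJobs, yesterdaysJobs, isEqual));
// => [ { job: 'Donkey Feeder' } ]
Putting all of these steps together, the handler now looks like this: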
const request = require('axios');
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();
const { differenceWith, isEqual } = require('lodash');
const { extractListingsFromHTML } = require('./helpers');
module.exports.getdonkeyjobs = (event, context, callback) => {
let newJobs, allJobs;
request('https://www.thedonkeysanctuary.org.uk/vacancies')
.then(({ data }) => {
allJobs = extractListingsFromHTML(data);
// Retrieve yesterday's jobs
return dynamo.scan({
TableName: 'donkeyjobs'
}).promise();
})
.then(response => {
// Figure out which jobs are new
let yesterdaysJobs = response.Items[0] ? response.Items[0].jobs : [];
newJobs = differenceWith(allJobs, yesterdaysJobs, isEqual);
// Get the ID of yesterday's jobs which can now be deleted
const jobsToDelete = response.Items[0] ? response.Items[0].listingId : null;
// Delete old jobs
if (jobsToDelete) {
return dynamo.delete({
TableName: 'donkeyjobs',
Key: {
listingId: jobsToDelete
}
}).promise();
} else return;
})
.then(() => {
// Save the list of today's jobs
return dynamo.put({
TableName: 'donkeyjobs',
Item: {
listingId: new Date().toString(),
jobs: allJobs
}
}).promise();
})
.then(() => {
callback(null, { jobs: newJobs });
})
.catch(callback);
};
We can test the function locally by executing:
$ serverless invoke local --function getdonkeyjobs
This time we should expect our callback to include a list of all the positions currently published on The Donkey Sanctuary, because they are all ‘new' to us: there are no jobs in our database from the day before.
If you go to the AWS console now, open DynamoDB, and select the donkeyjobs table, you should see today's data saved there.
If you run the function locally again, the jobs array will be empty. That's because we're comparing the jobs to whatever is already in the database, and nothing has changed (unless a new job was posted in the last few minutes).
Step 5: Sending a Text Using Nexmo
Now that we have a list of new opportunities, let's send an SMS to our users notifying them of all the fascinating donkey jobs they can apply for!
To begin, create an account with Nexmo. It gives you a free $2 credit to play with, which is more than enough. After you sign up, you should be taken to a dashboard showing your API key and secret. You'll need these to send a text message from Nexmo.
We can easily handle the request to send a text using the nexmo npm package. Install it and require it in your handler.js file.
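$ npm install nexmo
With the package required, we can send any text message we like before calling the final callback in our getdonkeyjobs handler: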
// At the top of handler.js: const Nexmo = require('nexmo');
.then(() => {
  if (newJobs.length) {
    // NEXMO_API_KEY, NEXMO_API_SECRET and MY_PHONE_NUMBER are placeholders
    // for your own credentials and destination number
    const nexmo = new Nexmo({
      apiKey: NEXMO_API_KEY,
      apiSecret: NEXMO_API_SECRET
    });
    nexmo.message.sendSms('Donkey Jobs Finder', MY_PHONE_NUMBER, 'Hello, we found a new donkey job!');
  }
  callback(null, { jobs: newJobs });
})
To test this, we'll need to clear the DynamoDB table so that the Lambda thinks there are new jobs; then we can run our function locally once again.
And, with any luck, an SMS message should arrive!
The final step is to improve the formatting of our text message. For this, we can write a new helper function that takes a list of jobs and outputs a formatted message with the deadline, location, and job title of everything that's available.
Remember that whenever we want to test our function, we'll have to keep clearing the table (there are certainly better ways to do this, but for now it's easy enough to just remove the item in the AWS console).
function formatJobs (list) {
return list.reduce((acc, job) => {
return `${acc}${job.job} in ${job.location} closing on ${moment(job.closing).format('LL')}\n\n`;
}, 'We found:\n\n');
}
module.exports = {
extractListingsFromHTML,
formatJobs
};
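To use it, we can swap the placeholder message in the sendSms call for the formatted list (a small sketch, assuming formatJobs is also required from ./helpers into handler.js):
const { extractListingsFromHTML, formatJobs } = require('./helpers');
// ...
nexmo.message.sendSms('Donkey Jobs Finder', MY_PHONE_NUMBER, formatJobs(newJobs));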
And now that we've finished, we can finally deploy our entire application to AWS:
$ serverless deploy
Step 6: Configuring Lambda to Execute Every Day
After we've deployed the function, we can check it in the AWS console to make sure it's working properly. We can also set the function to run once a day automatically: select ‘Add Trigger', choose ‘CloudWatch Events' from the drop-down menu, and fill in the relevant details. To run it daily, we can use the schedule expression rate(1 day).
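Alternatively, the serverless framework can create this trigger for us: adding a schedule event to the function in serverless.yml and redeploying sets up the rule automatically. A minimal sketch:
functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs
    events:
      - schedule: rate(1 day)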
If you have any queries regarding this blog, or if you want any web scraping services, contact 3i Data Scraping or ask for a free quote!
3i Data Scraping is an experienced web scraping services company in the USA, providing a complete range of web scraping, mobile app scraping, data extraction, data mining, and real-time data scraping (API) services, with 11+ years of experience delivering website data scraping solutions to hundreds of customers worldwide.