1.0.5
Main Scraper class
// npm install easy_web_crawler
const Scraper = require('easy_web_crawler')
var scraper = new Scraper();
This is mandatory.
Takes the list of URLs used as the starting point.
// add the URLs as the starting point
scraper.startWithURLs(['www.google.com', 'www.bing.com'])
scraper.startWithURLs('www.google.com')
Takes a non-async callback function as an argument; a URL is added to the processing queue only if the function returns a truthy value.
This is optional.
By default all URLs are accepted into the processing queue.
(function)
// accept only URLs that contain www.google.com
scraper.allowIfMatches(function(url) {
    return url.indexOf('www.google.com') > -1
})
This is an optional setting.
It saves your progress to a file, so you can stop the scraper and restart it from the previous state.
The file is a SQLite database file; you can inspect or modify its contents with any SQLite client.
If no file is specified, the state is stored in memory.
(string)
// state stored in state.db file
scraper.saveProgressInFile("./state.db")
This allows the scraper to automatically collect all the links from each page and add them to the processing queue.
Note that these URLs are still filtered out if the allowIfMatches function returns false.
(boolean)
Pass true to enable.
scraper.enableAutoCrawler(true)
This is the main function. Your scraping logic should be defined in this function.
It is called for each page in the processing queue, with a Puppeteer page object as input.
The page object is enhanced with additional methods to support scraping, detailed below.
(function)
An async function with a single input argument, page.
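For example, a minimal sketch of registering the page-load callback; the title read is illustrative, and saveResult is the page helper documented below.
// register the scraping logic that runs for each queued page
scraper.callbackOnPageLoad(async function(page) {
    // read the page title using standard Puppeteer evaluation
    var title = await page.evaluate(() => document.title)
    // store it as this URL's result
    page.saveResult(title)
})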
This is an optional setting; it sets the wait time between page loads.
scraper.waitBetweenPageLoad(90)
Starts the scraping process. The callbackOnFinish function is called once scraping is completed.
scraper.start()
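A minimal sketch of registering a finish callback before calling start(); the registration signature of callbackOnFinish and the shape of its results argument are assumptions based on the callbackOnPageLoad pattern and the saveResult notes below.
// assumption: callbackOnFinish is registered like callbackOnPageLoad,
// and receives the values stored via page.saveResult
scraper.callbackOnFinish(function(results) {
    console.log('scraping finished', results)
})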
Puppeteer page class, enhanced with the supporting functions detailed below.
Downloads an image from a URL and saves it to the local disk.
scraper.callbackOnPageLoad(async function(page) {
    // find the first image on the page and read its src attribute
    var img = await page.$('img')
    var img_src = await page.evaluate(img => img.getAttribute("src"), img);
    // save the image to the local disk
    page.download_image(img_src, "usr/test/profile.png")
})
Saves a text result; the saved results are returned as input to the callbackOnFinish function.
Each URL can store one result.
(string)
scraper.callbackOnPageLoad(async function(page){
var article = await page.$eval('article', tag => tag.innerText);
page.saveResult(article)
})
Writes text content to a local file.
scraper.callbackOnPageLoad(async function(page) {
    var article = await page.$eval('article', tag => tag.innerText);
    // write the article text to a local file
    page.download_image(article, "usr/test/article.txt")
});
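Putting the documented calls together, a minimal end-to-end sketch; the seed URL, CSS selector, and file path are placeholders rather than part of the library.
const Scraper = require('easy_web_crawler')

var scraper = new Scraper()

// start from a placeholder seed URL
scraper.startWithURLs(['www.example.com'])

// only follow URLs on the same site
scraper.allowIfMatches(function(url) {
    return url.indexOf('www.example.com') > -1
})

// persist crawl state so the run can be resumed
scraper.saveProgressInFile('./state.db')

// follow links found on each page automatically
scraper.enableAutoCrawler(true)

// scraping logic for each queued page
scraper.callbackOnPageLoad(async function(page) {
    var heading = await page.$eval('h1', tag => tag.innerText)
    page.saveResult(heading)
})

scraper.start()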