1.0.5
Main Scraper class
// npm install easy_web_crawler
const Scraper = require('easy_web_crawler')
var scraper = new Scraper();
This is mandatory.
Takes the list of URLs used as the starting point.
// add the URLs as the starting point
scraper.startWithURLs(['www.google.com', 'www.bing.com'])
scraper.startWithURLs('www.google.com')
Takes a non-async callback function as an argument; a URL is added to the processing queue only if the function returns a truthy value.
This is optional.
By default all URLs are accepted into the processing queue.
(function)
// accept only URLs that contain www.google.com
scraper.allowIfMatches(function(url) {
    return url.indexOf('www.google.com') > -1
})
This is an optional setting.
It saves your progress to a file, so you can stop the scraper and restart it from the previous state.
The file is a SQLite database file; you can inspect or modify its contents with any SQLite client.
If no file is specified, the state is stored in memory.
(string)
// state stored in state.db file
scraper.saveProgressInFile("./state.db")
This allows the scraper to automatically collect all the links from each page and add them to the processing queue.
Note that these URLs are still filtered out if the allowIfMatches function returns false.
(boolean)
Pass true to enable.
scraper.enableAutoCrawler(true)
This is the main function. Your scraping logic should be defined in this function.
It is called for each page in the processing queue, with a Puppeteer page object as input.
The page object is enhanced with additional methods to support scraping, detailed below.
(function)
An async function with a single input argument, page.
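For example, a minimal sketch of registering the page-load callback; the title read is illustrative, and saveResult is the page helper documented below.
// register the scraping logic that runs for each queued page
scraper.callbackOnPageLoad(async function(page) {
    // read the page title using standard Puppeteer evaluation
    var title = await page.evaluate(() => document.title)
    // store it as this URL's result
    page.saveResult(title)
})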
This is an optional setting; it sets the wait time between page loads.
scraper.waitBetweenPageLoad(90)
Starts the scraping process. The callbackOnFinish function is called once scraping is completed.
scraper.start()
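A minimal sketch of registering a finish callback before calling start(); the registration signature of callbackOnFinish and the shape of its results argument are assumptions based on the callbackOnPageLoad pattern and the saveResult notes below.
// assumption: callbackOnFinish is registered like callbackOnPageLoad,
// and receives the values stored via page.saveResult
scraper.callbackOnFinish(function(results) {
    console.log('scraping finished', results)
})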
Puppeteer page class, enhanced with the supporting functions detailed below.
Downloads an image from a URL and saves it to the local disk.
scraper.callbackOnPageLoad(async function(page) {
    // find the first image on the page and read its src attribute
    var img = await page.$('img')
    var img_src = await page.evaluate(img => img.getAttribute("src"), img);
    // save the image to the local disk
    page.download_image(img_src, "usr/test/profile.png")
})
Saves a text result; the saved results are returned as input to the callbackOnFinish function.
Each URL can store one result.
(string)
scraper.callbackOnPageLoad(async function(page){
var article = await page.$eval('article', tag => tag.innerText);
page.saveResult(article)
})
Writes text content to a local file.
scraper.callbackOnPageLoad(async function(page) {
    var article = await page.$eval('article', tag => tag.innerText);
    // write the article text to a local file
    page.download_image(article, "usr/test/article.txt")
});
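Putting the documented calls together, a minimal end-to-end sketch; the seed URL, CSS selector, and file path are placeholders rather than part of the library.
const Scraper = require('easy_web_crawler')

var scraper = new Scraper()

// start from a placeholder seed URL
scraper.startWithURLs(['www.example.com'])

// only follow URLs on the same site
scraper.allowIfMatches(function(url) {
    return url.indexOf('www.example.com') > -1
})

// persist crawl state so the run can be resumed
scraper.saveProgressInFile('./state.db')

// follow links found on each page automatically
scraper.enableAutoCrawler(true)

// scraping logic for each queued page
scraper.callbackOnPageLoad(async function(page) {
    var heading = await page.$eval('h1', tag => tag.innerText)
    page.saveResult(heading)
})

scraper.start()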