Module backend.sources
SourcesController
Sources Scraper
The sources scraper is an app that runs as a background task and saves the content of each URL both as text and as a screenshot. The app is designed so that it can be installed on additional servers independently of the main software. It connects to the RAT through the shared PostgreSQL database.
Set up the app
- Change /config/config_db.ini to connect the app to your RAT database
- Change /config/config_sources.ini to your needs using the following parameters
{
"wait_time": 5, // Waiting time in seconds before the content of a page is saved. This is necessary because some web pages need more time to load.
"debug_screenshots": 0, // 0 = No screenshots are stored locally. 1 = Screenshots are stored locally for debugging and analysis.
"timeout": 60, // Time in seconds before the scraper stops trying to scrape a web page.
"headless": 1, // Debug variable. 0 = The Firefox browser opens on the local machine. 1 = The Firefox browser does not open on the local machine.
"job_server": "your server", // Name of your machine / server, used to monitor scraping behavior if more than one server scrapes sources.
"refresh_time": 48, // Some scraping jobs may fail due to technical problems. Time in hours after which such scraping jobs are reset.
"proxy": 0 // Option to activate a proxy server with authentication (user:password). 0 = No proxy (scraping happens from the IP address of the server). 1 = A proxy is used (specify the proxy data in /config/config_proxy.ini).
}
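The parameters above could be loaded and checked along the following lines. This is only a minimal sketch: it assumes the file content is plain JSON without the explanatory comments, and `load_sources_config` and `DEFAULTS` are hypothetical names that are not part of the app.

```python
import json

# Defaults mirror the example values documented above.
DEFAULTS = {
    "wait_time": 5,
    "debug_screenshots": 0,
    "timeout": 60,
    "headless": 1,
    "job_server": "your server",
    "refresh_time": 48,
    "proxy": 0,
}

# Parameters documented as 0/1 switches.
FLAGS = ("debug_screenshots", "headless", "proxy")


def load_sources_config(text):
    """Parse JSON config text and fill in defaults (hypothetical helper)."""
    config = {**DEFAULTS, **json.loads(text)}
    unknown = set(config) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    for key in FLAGS:
        if config[key] not in (0, 1):
            raise ValueError(f"{key} must be 0 or 1")
    return config


config = load_sources_config('{"wait_time": 10, "headless": 0}')
print(config["wait_time"], config["timeout"])  # 10 60
```

Rejecting unknown keys early makes a misspelled parameter visible at start-up instead of silently falling back to a default.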
Run the app
The app is built on a Python background process scheduler, because scraping web pages is time-consuming and resource-intensive.
- To start the app
(sources) >python sources_controller.py --start
- To stop the app
(sources) >python sources_controller.py --stop
- Alternatively, you can simply configure cronjobs to run sources_scraper.py
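The start/stop pattern behind a background scraper can be illustrated with the standard library alone. The sketch below is not the scheduler the app uses; `run_periodically` and the lambda job are hypothetical stand-ins, and `max_runs` exists only so that the example terminates on its own.

```python
import threading


def run_periodically(job, interval_seconds, stop_event, max_runs=None):
    """Call job() repeatedly until stop_event is set or max_runs is reached."""
    runs = 0
    while not stop_event.is_set():
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # wait() returns early once the event is set, so stopping is prompt.
        stop_event.wait(interval_seconds)
    return runs


calls = []
stop = threading.Event()
worker = threading.Thread(
    target=run_periodically,
    args=(lambda: calls.append("scrape"), 0.01, stop),
    kwargs={"max_runs": 3},
)
worker.start()
worker.join()
print(len(calls))  # 3
```

In a long-running setup, `max_runs` would be left at `None` and a separate stop command would call `stop.set()` to end the loop.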
Debugging
The app includes a library that logs progress to sources.log for debugging.
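File logging of this kind can be set up with Python's standard `logging` module. The following is a sketch under assumptions, not the app's own logging library: `make_sources_logger`, the logger name and the message format are hypothetical, and the example writes to a temporary directory so that it does not touch a real sources.log.

```python
import logging
import os
import tempfile


def make_sources_logger(path="sources.log"):
    """Return a logger that appends timestamped messages to the given file."""
    logger = logging.getLogger("sources")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "sources.log")
logger = make_sources_logger(log_path)
logger.info("scraped %s", "https://example.com")
logger.error("timeout while scraping %s", "https://example.org")
```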
Sub-modules
backend.sources.jobs
Functions to run jobs for scraping sources (source code and screenshots of search results) and to reset failed scraping jobs
backend.sources.libs
backend.sources.sources_controller_start
Controller to start and manage the Sources Scraper …
backend.sources.sources_controller_stop
Controller to start and manage the Sources Scraper …
backend.sources.sources_reset
Class to handle the resetting of failed scraping jobs …
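The idea of resetting failed jobs after `refresh_time` hours can be illustrated as follows. This is a sketch only: the job fields (`id`, `done`, `started_at`) and the function `jobs_to_reset` are assumptions made for the example and do not reflect the actual database schema or the reset class.

```python
from datetime import datetime, timedelta


def jobs_to_reset(jobs, now, refresh_hours=48):
    """Return the ids of unfinished jobs that started more than refresh_hours ago."""
    cutoff = now - timedelta(hours=refresh_hours)
    return [
        job["id"]
        for job in jobs
        if not job["done"] and job["started_at"] < cutoff
    ]


now = datetime(2024, 1, 10, 12, 0)
jobs = [
    {"id": 1, "done": False, "started_at": datetime(2024, 1, 7, 12, 0)},  # 72 h old, stuck
    {"id": 2, "done": False, "started_at": datetime(2024, 1, 10, 6, 0)},  # 6 h old, still fresh
    {"id": 3, "done": True, "started_at": datetime(2024, 1, 1, 0, 0)},    # already finished
]
print(jobs_to_reset(jobs, now))  # [1]
```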