How does the Full-text extraction scraper work?

anar

Hi there, I really love the site. I'm working on a similar idea, but focused on anarchism in the Netherlands, at https://forumvooranarchism.nl. Right now we are using RSS feeds, but as you probably already know, they are not always complete and are missing featured images most of the time. So I was wondering if you could share how you are doing the full-text extraction, so that hopefully we can use it too. Please mail us at fva[@]riseup.net if more privacy is preferred. PGP key here

Ungovernable

Hi,

Everything we use is custom-made PHP scripts written by our developers. We use scrapers to fetch the full HTML page of each site we aggregate, then we extract the content by finding the DOM element that contains the text body. After that we run it through a bunch of filters to sanitize the text and remove HTML tags. Libraries like Goutte, Simple HTML DOM and html2text could be useful.
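To make the "find the DOM element, then strip the tags" step concrete, here is a minimal sketch. It is written in Python with only the standard library, not the PHP the site actually runs, and it is not the site's code. The class name `entry-content`, the helper `extract_body_text` and the `BodyTextExtractor` class are all assumptions made up for illustration; in practice each aggregated site would need its own selector. It also assumes reasonably well-formed HTML where non-void tags are closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose contents are code, not readable text.
SKIP_TAGS = {"script", "style"}


class BodyTextExtractor(HTMLParser):
    """Collects the text inside the first element carrying a given class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class  # hypothetical per-site selector
        self.depth = 0        # nesting depth inside the target (0 = outside)
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.done = False     # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1  # entered the text-body element
            return
        self.depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # left the text-body element

    def handle_data(self, data):
        if self.depth > 0 and not self.skip_depth and not self.done:
            self.chunks.append(data)


def extract_body_text(html, target_class="entry-content"):
    """Return the tag-free, whitespace-normalized text of the body element."""
    parser = BodyTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    # Join chunks with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())
```

Calling `extract_body_text(page_html, "entry-content")` on a fetched page returns plain text, or an empty string if no element with that class exists. Malformed HTML with unclosed tags would confuse the depth counter, which is one reason a real DOM library such as the ones named above is a better fit for production.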

Then there are a bunch of other functions to extract the images, save them locally, and then run each image through an algorithm to decide which is the best one to use as the headline image.
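The post does not say how the headline image is chosen, so the following is only one plausible heuristic, sketched in Python: discard images that are too small or too banner-shaped, then prefer the largest remaining one by pixel area. The `ImageCandidate` type, the `score` and `pick_headline_image` functions, and the thresholds `min_side` and `max_aspect` are all hypothetical and would need tuning against real pages.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ImageCandidate:
    """Hypothetical record for an image already saved locally."""
    url: str
    width: int   # pixels
    height: int  # pixels


def score(img: ImageCandidate, min_side: int = 200,
          max_aspect: float = 3.0) -> float:
    """Return 0 for unusable images, otherwise the pixel area."""
    shortest = min(img.width, img.height)
    longest = max(img.width, img.height)
    # Icons, tracking pixels and avatars are usually tiny.
    if shortest <= 0 or shortest < min_side:
        return 0.0
    # Very wide or very tall images tend to be banners or separators.
    if longest / shortest > max_aspect:
        return 0.0
    return float(img.width * img.height)


def pick_headline_image(
        candidates: Iterable[ImageCandidate]) -> Optional[ImageCandidate]:
    """Pick the highest-scoring candidate, or None if none is usable."""
    best = max(candidates, key=score, default=None)
    if best is None or score(best) == 0.0:
        return None
    return best
```

A caller would build one `ImageCandidate` per image found in the extracted article and fall back to a default image when `pick_headline_image` returns `None`.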

How many anarchist websites are there in the Dutch language? Maybe we can work together to create a Dutch news platform based on the system we use here.