Everything we use is custom PHP scripts written by our developers. We use scrapers to fetch the full HTML page of each site we aggregate, then extract the content by finding the DOM element that contains the text body. We then run it through a series of filters to sanitize the text and remove HTML tags. Libraries like Goutte, Simple HTML DOM and html2text could be useful.
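To give a rough idea of the extract-and-sanitize step, here is a minimal sketch. It is not the actual PHP code; it is written in Python with only the standard library, and the same steps would map onto PHP's DOM functions or the libraries mentioned above. The element id `article`, the tag lists, and the names `BodyTextExtractor` and `extract_text` are all assumptions made for illustration. It also assumes reasonably well-formed HTML, since it tracks nesting by counting open and close tags.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Tags whose content is never article text.
SKIP_TAGS = {"script", "style"}
# Tags that should separate words when they are removed.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BodyTextExtractor(HTMLParser):
    """Collects the text inside the element with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0        # > 0 while we are inside the target element
        self.skip_depth = 0   # > 0 while we are inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0 and tag in BLOCK_TAGS:
            self.chunks.append(" ")
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip_depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html, target_id):
    """Return the tag-free, whitespace-normalized text of one element."""
    parser = BodyTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
```

In practice each aggregated site would need its own selector for the body element, since every site structures its pages differently.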
There are also a number of other functions that extract the images, save them locally, and then run each image through an algorithm to decide which one is best to use as the headline image.
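The headline-image selection could look something like the sketch below. This is only one possible heuristic, not the algorithm described above: it assumes the image dimensions are already known, discards images that are too small, and prefers large images close to a 16:9 shape. The thresholds and the names `ImageInfo`, `score` and `pick_headline` are hypothetical. It is written in Python for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageInfo:
    path: str     # where the image was saved locally
    width: int    # pixels
    height: int   # pixels


# Assumed thresholds, chosen only for illustration.
MIN_WIDTH = 300
MIN_HEIGHT = 200
TARGET_RATIO = 16 / 9


def score(img: ImageInfo) -> float:
    """Higher is better; 0.0 means the image is unusable as a headline."""
    if img.width < MIN_WIDTH or img.height < MIN_HEIGHT:
        return 0.0
    ratio = img.width / img.height
    # 1.0 when the shape matches the target ratio, smaller as it diverges.
    ratio_fit = min(ratio, TARGET_RATIO) / max(ratio, TARGET_RATIO)
    return img.width * img.height * ratio_fit


def pick_headline(images: List[ImageInfo]) -> Optional[ImageInfo]:
    """Return the best-scoring image, or None if nothing is usable."""
    best = max(images, key=score, default=None)
    if best is None or score(best) == 0.0:
        return None
    return best
```

A real scorer would probably also look at things like file type, position in the article, or whether the image is a logo or tracking pixel.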
How many anarchist websites are there in Dutch? Maybe we can work together to create a Dutch news platform based on the system we use here.