Spiders

From Freephile Wiki

Spiders, in this context, are programs that crawl and index the web, so you might also call them crawlers or indexers. Once you have indexed the web, there are a lot of interesting things you can do with the result. One use case is to "scrape" data from pages. Technologies such as search rely on indexing.

A long time ago I wrote a spider. If I ever get around to digging up that old code, here is where I might find it. But that was old school, and not really useful for anything non-trivial these days. Lots of other people are making interesting spiders that you can use. Scrapy is a spider framework written in Python. There is a visual frontend for it, Portia, from the folks at Scrapinghub in Cork, Ireland.

Use Case

I want to scrape questions and answers from OKCupid (OKC). The OKC content is generated dynamically on the client side (in the browser) using JavaScript. This means we are NOT dealing with a "rendered" page to scrape; rather, their application sends a bunch of JavaScript to the browser (client). The user interacts via clicks to instruct the JavaScript to perform actions and render new results. Thus we need a browser equivalent that can execute JavaScript and can load additional JavaScript for execution on the "page". Portia doesn't do this. Instead, there are a couple of options: integrate another tool with Portia, or modify Portia's middleware[1].

The OKC login page isn't completely scripted, so Portia should be able to authenticate there. However, I'm getting a Twisted error:

unexpected error response: [Failure instance: Traceback (failure with no frames): : [>] ]

Splash is a JavaScript rendering service implemented in Python using Twisted and Qt. Splash can be scripted in Lua. So, by using Splash with Portia, we should be able to visually scrape OKC.
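
To make the idea of a rendering service concrete, here is a minimal sketch that builds a request URL for Splash's render.html HTTP endpoint, which returns a page's HTML after its JavaScript has run. The helper name splashRenderUrl, the local Splash address, and the target URL are all assumptions for illustration; this is not how Portia itself talks to Splash.

```javascript
// Build a URL for Splash's render.html endpoint.
// Assumption: a Splash instance is reachable at splashBase
// (8050 is Splash's default port).
function splashRenderUrl(splashBase, targetUrl, waitSeconds) {
  const endpoint = new URL('/render.html', splashBase);
  // 'url' is the page to render; 'wait' is how long Splash waits
  // (in seconds) after the page loads, so scripts can finish.
  endpoint.searchParams.set('url', targetUrl);
  endpoint.searchParams.set('wait', String(waitSeconds));
  return endpoint.toString();
}

// Hypothetical usage; the target URL is only a placeholder.
const renderUrl = splashRenderUrl(
  'http://localhost:8050',
  'https://example.com/questions',
  2
);
console.log(renderUrl);
```

Fetching that URL with any HTTP client should return the rendered markup, which a spider can then parse like a static page. The right wait value depends on how long the page's scripts take, so treat 2 seconds as a guess.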


Basic scraping with JavaScript

// Print the text of each element matching the heading selector.
document.querySelectorAll('h3.ud-accordion-panel-heading').forEach(function(e) {
  console.log(e.innerText);
});


document.querySelectorAll("span[data-purpose='item-title']").forEach(function(e) {
  console.log(e.innerText);
});




var subheadings = document.querySelectorAll("span[data-purpose='item-title']");
var subheadingTexts = Array.from(subheadings).map(function(subheading) {
    return subheading.textContent.trim();
});
console.log(subheadingTexts.join("\n"));


var headings = document.querySelectorAll('h3.ud-accordion-panel-heading');
var headingTexts = Array.from(headings).map(function(heading) {
    return heading.textContent.trim();
});
console.log(headingTexts.join('\n'));
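
The snippets above all repeat the same pattern, so they could be folded into one small helper. This is only a sketch: collectTexts is a name I made up, dropping empty strings is an extra step the snippets above don't take, and the fakePage object is a stand-in so the code also runs outside a browser.

```javascript
// Collect the trimmed text of every element under `root` that matches
// `selector`. `root` can be `document` in a browser console, or any
// object that offers querySelectorAll (an assumption for this sketch).
function collectTexts(root, selector) {
  return Array.from(root.querySelectorAll(selector))
    .map(function (el) { return el.textContent.trim(); })
    // Extra step not in the snippets above: skip elements with no text.
    .filter(function (text) { return text.length > 0; });
}

// Stand-in for a real page, used only so the sketch runs anywhere.
const fakePage = {
  querySelectorAll: function (selector) {
    if (selector === 'h3.ud-accordion-panel-heading') {
      return [{ textContent: '  Introduction ' }, { textContent: 'Setup\n' }];
    }
    return [];
  }
};

const sectionTitles = collectTexts(fakePage, 'h3.ud-accordion-panel-heading');
console.log(sectionTitles.join('\n'));
```

In a browser console, calling collectTexts(document, "span[data-purpose='item-title']") should return the item titles as an array instead of printing them one by one.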

References