Difference between revisions of "Spiders"

From Freephile Wiki
[https://github.com/scrapinghub/splash Splash] is a JavaScript rendering service implemented in Python using Twisted and Qt.  Splash can be [http://splash.readthedocs.org/en/latest/scripting-tutorial.html scripted].  So, using Splash with Portia, we should be able to visually scrape OKC.
 
 
== Basic scraping with JavaScript ==
 
<syntaxhighlight lang=javascript>
// Log the text of each accordion panel heading
document.querySelectorAll('h3.ud-accordion-panel-heading').forEach(function(e) {
  console.log(e.innerText);
});

// Log the text of each item title
document.querySelectorAll("span[data-purpose='item-title']").forEach(function(e) {
  console.log(e.innerText);
});

// Alternative: collect the trimmed item titles into an array and print them once
var subheadings = document.querySelectorAll("span[data-purpose='item-title']");
var subheadingTexts = Array.from(subheadings).map(function(subheading) {
  return subheading.textContent.trim();
});
console.log(subheadingTexts.join("\n"));

// Same approach for the panel headings
var headings = document.querySelectorAll('h3.ud-accordion-panel-heading');
var headingTexts = Array.from(headings).map(function(heading) {
  return heading.textContent.trim();
});
console.log(headingTexts.join('\n'));
</syntaxhighlight>
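The two collect-and-print snippets above differ only in their selector, so they could be folded into one small helper. This is a minimal sketch, not part of the original page: the function name <code>collectTexts</code> is hypothetical, and it assumes it is pasted into a browser console on a page that actually contains elements matching these selectors.

```javascript
// Hypothetical helper (name is illustrative): returns the trimmed text of every
// element under `root` that matches `selector`. `root` can be any object that
// exposes querySelectorAll, such as `document` in a browser console.
function collectTexts(root, selector) {
  return Array.from(root.querySelectorAll(selector)).map(function (el) {
    return el.textContent.trim();
  });
}

// Only touch `document` when it exists, so the snippet also loads outside a browser.
if (typeof document !== "undefined") {
  console.log(collectTexts(document, "h3.ud-accordion-panel-heading").join("\n"));
  console.log(collectTexts(document, "span[data-purpose='item-title']").join("\n"));
}
```

Passing the root in as a parameter also makes it possible to scope the search to a single panel element instead of the whole page.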
 
  
 
{{References}}
 
  
 
[[Category:Web]]
 
[[Category:JavaScript]]
 

Revision as of 15:57, 17 February 2015

Spiders, in this context, are programs that crawl and index the web, so you might also call them indexers. Once you have indexed the web, there are a lot of interesting things you can do with the result. One use case is to "scrape" data. Technologies such as search rely on indexing.

A long time ago I wrote a spider. If I ever get around to digging up that old code, here is where I might put it. But that was old school, and not really useful for anything non-trivial these days. Lots of other people are making interesting spiders that you can use. Scrapy is a spider written in Python. There is a visual frontend for it, Portia, from the folks over in Cork, Ireland at Scrapinghub.

Use Case

I want to scrape questions and answers from OKCupid. The OKC content is generated dynamically on the client side (in the browser) using JavaScript. This means we are NOT dealing with a "rendered" page to scrape; instead, their application sends a bunch of JavaScript to the browser, and the user's clicks instruct that JavaScript to perform actions and render new results. Thus we need a browser equivalent that can speak JavaScript and can load additional JavaScript for execution on the "page". Portia doesn't do this. Instead, there are a couple of options: integrate a JavaScript renderer with Portia, or modify Portia's middleware[1].

The OKC login page isn't completely scripted, so Portia should be able to authenticate there. But I'm getting a Twisted error:

unexpected error response: [Failure instance: Traceback (failure with no frames): : [>] ]

Splash is a JavaScript rendering service implemented in Python using Twisted and Qt. Splash can be scripted. So, using Splash with Portia, we should be able to visually scrape OKC.
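
As a rough illustration of what "rendering service" means here, Splash exposes an HTTP endpoint, render.html, that takes a target url (plus an optional wait in seconds) and returns the page's HTML after its JavaScript has run. The sketch below only builds such a request URL. It assumes a Splash instance listening on localhost at its default port 8050; the function name splashRenderUrl and the example.com target are placeholders, not anything from this page.

```javascript
// Build a request URL for Splash's render.html endpoint.
// Assumptions: Splash runs at http://localhost:8050 (its default port) and
// the target URL is a placeholder. splashRenderUrl is a hypothetical name.
function splashRenderUrl(splashBase, targetUrl, waitSeconds) {
  var endpoint = new URL("/render.html", splashBase);
  // `url` is the page to render; `wait` gives its scripts time to finish.
  endpoint.searchParams.set("url", targetUrl);
  endpoint.searchParams.set("wait", String(waitSeconds));
  return endpoint.toString();
}

var renderUrl = splashRenderUrl("http://localhost:8050", "https://example.com/", 2);
console.log(renderUrl);
```

Fetching that URL with any HTTP client should return the rendered markup, which a conventional scraper can then parse as if the page had been static.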

References