Spiders

From Freephile Wiki
 
Spiders, in this context, are things that index the web.  So you might also call them indexers.  Once you index the web, there are a lot of interesting things you can do.  One of the use cases is to "scrape" data.  Technologies such as [[Search]] rely on indexing.
 
A long time ago I wrote a spider.  If I ever get around to digging up that old code, here is where I might find it.  But that was old school, and not really useful for anything non-trivial these days.  Lots of other people are making interesting spiders that you can use.  [http://scrapy.org/ Scrapy] is a Python spider.  There is a visual frontend, [https://github.com/scrapinghub/portia Portia], from the folks over in Cork, Ireland at [http://scrapinghub.com/ Scrapinghub].
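For a flavor of what such an "old school" spider looks like, here is a minimal sketch in plain Python, standard library only. The HTML fragment and URLs are made up for illustration; a real spider would fetch each page with <code>urllib.request</code> and recurse over the links it finds.

<syntaxhighlight lang=python>
# A bare-bones indexing primitive: parse a page and collect its links,
# resolved against the page's base URL. Standard library only.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all links found in `html`, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Illustrative fragment; a real spider would fetch this over HTTP
page = '<a href="/about">About</a> <a href="http://scrapy.org/">Scrapy</a>'
print(extract_links(page, 'http://example.com/'))
</syntaxhighlight>

A crawler is then just a loop: pop a URL from a queue, fetch it, extract its links, and push the unseen ones back onto the queue.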
== Use Case ==
 
I want to scrape questions and answers from OKCupid.  The OKC content is dynamically generated on the client-side (browser) using JavaScript.  This means we are NOT dealing with a "rendered" page to scrape, but rather their application sends a bunch of JavaScript to the browser (client).  The user interacts via clicks to instruct the JavaScript to perform actions and render new results. Thus we need a browser equivalent that can speak JavaScript and can load additional JavaScript for execution on the "page". Portia doesn't do this.  Instead, there are a couple options to integrate with Portia or to modify the middleware<ref>[http://doc.scrapy.org/en/latest/topics/spider-middleware.html Middleware component of Scrapy]</ref> of Portia.
  
The OKC [http://pastebin.com/7B1TgUB9 login] page isn't completely scripted, so Portia should be able to authenticate there.  But I'm getting a Twisted error:
<pre>
unexpected error response: [Failure instance: Traceback (failure with no frames): : [>] ]
</pre>
  
 
[https://github.com/scrapinghub/splash Splash] is a JavaScript rendering service implemented in Python using Twisted and Qt.  Splash can be [http://splash.readthedocs.org/en/latest/scripting-tutorial.html scripted].  So, using Splash with Portia, we should be able to visually scrape OKC.
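Assuming a Splash instance is running on localhost:8050 (the default port in Splash's Docker image), a minimal Python sketch of calling its <code>render.html</code> HTTP endpoint — which returns the page's HTML after JavaScript has executed — might look like:

<syntaxhighlight lang=python>
# Build a request URL for Splash's render.html endpoint.
# Assumes a local Splash instance; the `wait` parameter tells Splash
# how long to let the page's JavaScript run before returning the HTML.
from urllib.parse import urlencode

SPLASH = 'http://localhost:8050/render.html'

def splash_url(target, wait=2.0):
    """Return the Splash URL that renders `target` after `wait` seconds."""
    return SPLASH + '?' + urlencode({'url': target, 'wait': wait})

print(splash_url('https://www.okcupid.com/login'))
# Actually fetching it would be:
#   urllib.request.urlopen(splash_url(...)).read()   # needs Splash running
</syntaxhighlight>

This is the plain HTTP API; a Scrapy/Portia integration would instead route requests through Splash via downloader middleware, as the Scrapy docs describe.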
 
== Basic scraping with JavaScript ==
<syntaxhighlight lang=javascript>
// Log the text of every accordion panel heading on the page
document.querySelectorAll('h3.ud-accordion-panel-heading').forEach(function(e) {
  console.log(e.innerText);
});

// Log the text of every item title
document.querySelectorAll("span[data-purpose='item-title']").forEach(function(e) {
  console.log(e.innerText);
});

// Alternatively, collect the item titles into an array
// and print them all at once, one per line
var subheadings = document.querySelectorAll("span[data-purpose='item-title']");
var subheadingTexts = Array.from(subheadings).map(function(subheading) {
  return subheading.textContent.trim();
});
console.log(subheadingTexts.join("\n"));

// Same approach for the panel headings
var headings = document.querySelectorAll('h3.ud-accordion-panel-heading');
var headingTexts = Array.from(headings).map(function(heading) {
  return heading.textContent.trim();
});
console.log(headingTexts.join('\n'));
</syntaxhighlight>
  
 
{{References}}
 
  
 
[[Category:Web]]
 
[[Category:JavaScript]]

Latest revision as of 20:09, 9 February 2024
