Difference between revisions of "Spiders"

From Freephile Wiki
(Created page with "Spiders, in this context, are things that index the web. So you might also call them indexers. A long time ago I wrote a spider. If I ever get around to digging up that...")
 
Line 1: Line 1:
Spiders, in this context, are things that index the web.  So you might also call them indexers.   
+
Spiders, in this context, are things that index the web.  So you might also call them indexers.  Once you index the web, there are a lot of interesting things you can do.  One of the use cases is to "scrape" data.  Technologies such as [[Search]] rely on indexing.
  
A long time ago I wrote a spider.  If I ever get around to digging up that old code, here is where I might find it.  Lots of other people are making interesting spiders that you can use.
+
A long time ago I wrote a spider.  If I ever get around to digging up that old code, here is where I might find it.  But that was old school, and not really useful for anything non-trivial these days.  Lots of other people are making interesting spiders that you can use. [http://scrapy.org/ Scrapy] is a Python spider.  There is a visual frontend, [https://github.com/scrapinghub/portia Portia], from the folks over in Cork, Ireland at [http://scrapinghub.com/ Scrapinghub]. 
  
[https://github.com/scrapinghub/portia Portia] is one example from the folks over in Cork, Ireland at [http://scrapinghub.com/ Scrapinghub].  I won't repeat their documentation here needlessly, but I will note my experiences with the tools.  I wanted to scrape questions and answers from OKCupid, but so far Portia can't handle the JavaScript login.  I need to deconstruct it more to find out what the solution might be.
+
== Use Case ==
 +
I want to scrape questions and answers from OKCupid (OKC).  The OKC content is generated dynamically on the client side (in the browser) using JavaScript.  This means we are NOT dealing with a "rendered" page to scrape; rather, their application sends a bunch of JavaScript to the browser (client).  The user interacts via clicks to instruct the JavaScript to perform actions and render new results. Thus we need a browser equivalent that can speak JavaScript and can load additional JavaScript for execution on the "page". Portia doesn't do this.  Instead, there are a couple of options: integrate a JavaScript-capable tool with Portia, or modify the middleware<ref>[http://doc.scrapy.org/en/latest/topics/spider-middleware.html Middleware component of Scrapy]</ref> of Portia. 
 +
 
 +
Even the OKC [http://pastebin.com/7B1TgUB9 login] page is scripted. 
 +
 
 +
[https://github.com/scrapinghub/splash Splash] is a JavaScript rendering service implemented in Python using Twisted and Qt.  Splash can be [http://splash.readthedocs.org/en/latest/scripting-tutorial.html scripted].  So, using Splash with Portia, we should be able to visually scrape OKC.
 +
 
 +
{{References}}
  
 
[[Category:Web]]
 

Revision as of 13:15, 17 February 2015

Spiders, in this context, are things that index the web. So you might also call them indexers. Once you index the web, there are a lot of interesting things you can do. One of the use cases is to "scrape" data. Technologies such as Search rely on indexing.
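To make the idea of indexing concrete, here is a minimal sketch of the link-extraction step that a spider repeats for every page it visits. It uses only the Python standard library; the `LinkExtractor` class, the `extract_links` helper, and the sample HTML are all made up for illustration, and a real spider such as Scrapy does far more (fetching, scheduling, deduplication, politeness rules).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would fetch this over HTTP.
SAMPLE = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
print(extract_links(SAMPLE, "http://example.com/"))
```

A spider would add each returned URL to a queue of pages still to visit, which is how it walks from one page to the next.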

A long time ago I wrote a spider. If I ever get around to digging up that old code, here is where I might find it. But that was old school, and not really useful for anything non-trivial these days. Lots of other people are making interesting spiders that you can use. Scrapy is a Python spider. There is a visual frontend, Portia, from the folks over in Cork, Ireland at Scrapinghub.

Use Case

I want to scrape questions and answers from OKCupid (OKC). The OKC content is generated dynamically on the client side (in the browser) using JavaScript. This means we are NOT dealing with a "rendered" page to scrape; rather, their application sends a bunch of JavaScript to the browser (client). The user interacts via clicks to instruct the JavaScript to perform actions and render new results. Thus we need a browser equivalent that can speak JavaScript and can load additional JavaScript for execution on the "page". Portia doesn't do this. Instead, there are a couple of options: integrate a JavaScript-capable tool with Portia, or modify the middleware[1] of Portia.
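The problem can be illustrated with a small sketch. The page below is invented (it is not OKC's actual markup): the question text exists only inside a script, so a scraper that parses the HTML without running JavaScript finds no visible text at all. The `VisibleText` class and `visible_text` helper are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> tags."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_script:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a non-JavaScript scraper would see."""
    parser = VisibleText()
    parser.feed(html)
    return parser.chunks


# Made-up page: the content is only written into the div when the script runs.
SCRIPTED_PAGE = (
    '<html><body><div id="questions"></div>'
    '<script>document.getElementById("questions").textContent = '
    '"Do you like spiders?";</script></body></html>'
)
print(visible_text(SCRIPTED_PAGE))
```

The empty result is the point: the data only appears once something executes the script, which is why a JavaScript-capable renderer is needed.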

Even the OKC login page is scripted.

Splash is a JavaScript rendering service implemented in Python using Twisted and Qt. Splash can be scripted. So, using Splash with Portia, we should be able to visually scrape OKC.
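As a rough sketch of what talking to Splash looks like, the snippet below builds a request for Splash's `render.html` HTTP endpoint, which returns the page's HTML after JavaScript has run. It assumes a Splash instance is listening at `http://localhost:8050` (an assumption about the local setup, not something configured here), and the helper names `splash_render_url` and `fetch_rendered` are made up. It does not deal with the scripted OKC login, which would probably need a Splash script rather than a plain render call.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is running locally on this address.
SPLASH = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, splash=SPLASH):
    """Build the render.html URL asking Splash to load and render page_url."""
    # "wait" is how many seconds Splash waits after load for scripts to run.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash}/render.html?{query}"


def fetch_rendered(page_url, wait=2.0, splash=SPLASH):
    """Return the JavaScript-rendered HTML (requires a running Splash)."""
    with urlopen(splash_render_url(page_url, wait, splash)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here, so this line runs without Splash installed.
print(splash_render_url("http://example.com/"))
```

The HTML returned by `fetch_rendered` could then be handed to an ordinary parser, since by that point the scripts have already filled in the page.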

References

1. Middleware component of Scrapy, http://doc.scrapy.org/en/latest/topics/spider-middleware.html