Spiders
Spiders, in this context, are programs that index the web, so you might also call them indexers. Once you have indexed the web, there are a lot of interesting things you can do; one use case is to "scrape" data. Technologies such as [[Search]] rely on indexing.
A long time ago I wrote a spider. If I ever get around to digging up that old code, here is where I might put it. But that approach was old school, and not really useful for anything non-trivial these days. Lots of other people are building interesting spiders that you can use. [http://scrapy.org/ Scrapy] is a Python spider framework. There is a visual frontend, [https://github.com/scrapinghub/portia Portia], from the folks at [http://scrapinghub.com/ Scrapinghub] in Cork, Ireland.
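For flavor, here is a minimal sketch of the "old school" approach mentioned above: fetch a page, pull out its links, and use them as the frontier for the next round of crawling. This is standard-library-only and uses a hypothetical inline HTML snippet in place of a real fetched page; it is not the original spider's code.

```python
# A minimal "old school" spider step: extract the links from one page.
# Standard library only; sample_page stands in for a fetched document.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real crawl this HTML would come from urllib.request.urlopen(url).read().
sample_page = """
<html><body>
  <a href="/questions">Questions</a>
  <a href="/answers">Answers</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_page)
print(parser.links)  # the frontier of URLs to visit next
```

A full spider would loop: pop a URL from the frontier, fetch it, extract links and content, and repeat — which is exactly the drudgery frameworks like Scrapy handle for you.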
== Use Case ==
I want to scrape questions and answers from OKCupid. The OKC content is generated client-side in the browser using JavaScript: rather than serving a fully rendered page to scrape, their application sends a bunch of JavaScript to the browser, and the user's clicks instruct that JavaScript to perform actions and render new results. So we need a browser equivalent that can execute JavaScript and load additional JavaScript for execution on the "page". Portia doesn't do this by itself. Instead, there are a couple of options: integrate another tool with Portia, or modify Portia's middleware<ref>[http://doc.scrapy.org/en/latest/topics/spider-middleware.html Middleware component of Scrapy]</ref>.
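A toy illustration of why this matters, assuming made-up markup (not actual OKC HTML): the raw source a plain spider downloads contains only an empty container and a script, while the text we want only exists after the browser runs that script.

```python
# Toy comparison: raw server response vs. the browser-rendered result.
# The HTML strings are invented stand-ins, not real OKC markup.
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulate visible text, skipping the contents of script tags."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

def visible_text(html):
    collector = TextCollector()
    collector.feed(html)
    return collector.text

# What the server actually sends: an empty container plus JavaScript.
raw_source = '<div id="questions"></div><script>render(fetchQuestions());</script>'
# What the browser shows after the JavaScript has run.
rendered = '<div id="questions">Do you like scary movies?</div>'

print(visible_text(raw_source))  # [] -- nothing for a plain spider to scrape
print(visible_text(rendered))    # ['Do you like scary movies?']
```

A plain HTTP-fetching spider only ever sees the first form, which is why a JavaScript-capable rendering step has to sit between the fetch and the scrape.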
The OKC [http://pastebin.com/7B1TgUB9 login] page isn't completely scripted, so Portia should be able to authenticate there. But I'm getting a Twisted error:
<pre>
unexpected error response: [Failure instance: Traceback (failure with no frames): : [>] ]
</pre>
[https://github.com/scrapinghub/splash Splash] is a JavaScript rendering service implemented in Python using Twisted and Qt. Splash can be [http://splash.readthedocs.org/en/latest/scripting-tutorial.html scripted], so using Splash with Portia, we should be able to visually scrape OKC.
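Beyond scripting, Splash also exposes an HTTP API: you ask its <code>render.html</code> endpoint for a URL, and it returns the page's HTML after the JavaScript has run. A minimal sketch, assuming a Splash instance is running locally on its default port 8050 and using an illustrative target URL; only the request URL is built here, with the actual fetch left as a comment for a live setup.

```python
# Sketch: building a request to Splash's render.html endpoint, which
# returns a page's HTML *after* its JavaScript has executed.
# Assumes a local Splash instance on the default port 8050.
from urllib.parse import urlencode

def splash_render_url(target_url, splash="http://localhost:8050", wait=0.5):
    """Build the URL for Splash's render.html endpoint.

    'wait' gives the page's JavaScript time to run before the snapshot.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash, query)

# Illustrative target; any JavaScript-heavy page works the same way.
url = splash_render_url("http://www.okcupid.com/questions")
print(url)

# Against a running Splash instance, the fetch itself would be:
# from urllib.request import urlopen
# html = urlopen(url).read()  # rendered HTML, ready for Portia/Scrapy parsing
```

Pointing Portia's requests through an endpoint like this is what lets a visual scraper work on a page that is otherwise just an empty shell plus scripts.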
Revision as of 14:57, 17 February 2015