#1
Hello.

Is there anyone who has successfully found a way to scrape a dynamically generated AJAX web site? If I view the source, it gives me the variables. If I use Firebug to view the DOM, it gives me the actual values. Any ideas?

Thanks.
|
#2
The problem is you need a DOM-aware JavaScript interpreter in your code to execute the JavaScript, manipulate the DOM in the HTML, and then allow you to extract the data you need.

There are projects like Rhino, which is a JavaScript engine you can embed in other apps, but you still won't have the DOM of the page, nor will you be able to manipulate it and then extract the values, at least as far as I understand.

You could use something like Ruby driving some sort of WebKit interface on Mac OS or Linux, but I have no idea where to start. That, to me, seems like the best answer. Maybe even a Ruby-based Cocoa app would do the trick.

On Nov 29, 5:25 pm, Becca Girl <csch> wrote: [..]
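To make the "view source gives me the variables" symptom concrete, here is a minimal sketch on a made-up page. The HTML, the `price` id and the helper name `static_text_for` are all hypothetical. Reading the markup as served, without executing any JavaScript, finds the element but not its value:

```ruby
# Hypothetical page as the server sends it: the <span> is empty and the
# value only appears once a browser runs the script.
RAW_HTML = <<~HTML
  <html><body>
    <span id="price"></span>
    <script>
      var price = fetchPrice();   // filled in by an AJAX call at run time
      document.getElementById("price").innerHTML = price;
    </script>
  </body></html>
HTML

# Return the text inside the element with the given id, reading the static
# markup only (no JavaScript is executed). A regex is good enough for this
# toy page; use a real HTML parser for anything else.
def static_text_for(html, id)
  match = html.match(/<[^>]*\bid="#{Regexp.escape(id)}"[^>]*>(.*?)<\//m)
  match ? match[1].strip : nil
end

# The element exists, but its content is an empty string.
puts static_text_for(RAW_HTML, "price").inspect
```

Any scraper that only downloads the page sees this empty element, which is why the replies in this thread all end up driving a real browser.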
|
#3
On Sat, Nov 29, 2008 at 7:25 PM, Becca Girl <cschall> wrote:
> Hello.
>
> Is there anyone who has successfully found a way to scrape a dynamically
> generated AJAX web site? If I view the source, it gives me the
> variables. If I use Firebug to view the DOM, it gives me the actual
> values. Any ideas?

http://code.google.com/p/firewatir/
|
#4
[Note: parts of this message were removed to make it a legal post.]

scRUBYt! - http://scrubyt.org

e.g. scraping your linkedin contacts:

    require 'rubygems'
    require 'scrubyt'

    property_data = Scrubyt::Extractor.define :agent => :firefox do
      fetch 'https://www.linkedin.com/secure/login'
      fill_textfield 'session_key', '****'
      fill_textfield 'session_password', '****'
      submit
      click_link_and_wait 'Connections', 5

      vcard "//li[@class='vcard']" do
        first_name "//span[@class='given-name']"
        second_name "//span[@class='family-name']"
        email "//a[@class='email']"
      end
    end

    puts property_data.to_xml

Cheers,
Peter
___
http://www.rubyrailways.com
http://scrubyt.org

On 2008.11.30., at 1:25, Becca Girl wrote: [..]
|
#5
On Sat, Nov 29, 2008 at 6:25 PM, Becca Girl <cschall> wrote:
> Hello.
>
> Is there anyone who has successfully found a way to scrape a dynamically
> generated AJAX web site? If I view the source, it gives me the
> variables. If I use Firebug to view the DOM, it gives me the actual
> values. Any ideas?
>
> Thanks.
> --
> Posted via [..].

As gf pointed out, the problem is you need a full DOM and working JavaScript for this, sometimes even working CSS. To really do it properly, you need a full-blown, fully supported web browser.

Short story: use the WATIR library to interact with your browser's DOM to do this.
http://watir.com/

I used to do this all the time for work, in a testing capacity. I tried a number of different solutions, and found WATIR far superior to anything else out there, including the very pricey pay packages. If you cut through all the marketing BS, half the pay packages are functionally the same as WATIR, and the other half are more primitive.

--Kyle
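For readers who want to see what the Watir approach looks like, here is a rough sketch. It assumes the present-day `watir` gem API (`Watir::Browser.new`, `goto`, `element(css: ...)`, `wait_until`), which differs from the 2008 releases discussed in this thread. The helper name `rendered_text`, the URL and the CSS selector are placeholders, not anything from the posts above.

```ruby
# Sketch only: assumes the `watir` gem plus a matching browser driver are
# installed. The URL and CSS selector in the usage notes are placeholders.

# Open +url+ in +browser+, wait for the element matching +css+ to be
# rendered by the page's JavaScript, and return its text.
def rendered_text(browser, url, css)
  browser.goto(url)
  element = browser.element(css: css)
  element.wait_until(&:present?)   # block until the element shows up in the DOM
  element.text
end

# Usage (needs a real browser, so it is left commented out):
#   require 'watir'
#   browser = Watir::Browser.new :firefox
#   begin
#     puts rendered_text(browser, 'http://example.com/', '#price')
#   ensure
#     browser.close
#   end
```

Passing the browser in as an argument keeps the helper independent of which browser is driven, and makes it possible to exercise the logic with a stand-in object.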
|
#6
[Note: parts of this message were removed to make it a legal post.]

Just for completeness' sake: scRUBYt! (since 0.4.05) is using FireWatir as the agent (or mechanize; you can choose whether you want to scrape AJAX or not), so you can do full-blown AJAX scraping, but with a scraping DSL which usually speeds up the scraper creation, especially in the case of complicated scrapers.

Cheers,
Peter
___
http://www.rubyrailways.com
http://scrubyt.org

On 2008.12.01., at 3:39, Kyle Schmitt wrote: [..]
|
#7
On Mon, Dec 1, 2008 at 3:48 AM, Peter Szinek <peter> wrote:
> Just for completeness sake: scRUBYt! (since 0.4.05) is using FireWatir as
> the agent (or mechanize - you can choose whether you want scrape AJAX or
> not) so you can do full blown AJAX scraping - but with a scraping DSL which
> usually speeds up the scraper creation, especially in the case of
> complicated scrapers.
>
> Cheers,
> Peter

Peter,

Neat. I'll have to give that a try next time I need to revisit scraping.
|
#8
Actually, firewatir and scRUBYt! are nice.

But is there a possibility to start firefox with a second profile (so that it circumvents the "one instance" rule) and render to a hidden display? [1][2]

Otherwise, this really hurts testability (as the browser might retain your personal session) and usability on a deployment server.

Regards,
Florian Gilcher

[1]: Preferably a virtual one on a console-only machine.
[2]: Sadly, afaik, firefox has no hidden mode.

On Dec 1, 2008, at 10:48 AM, Peter Szinek wrote: [..]
|
#9
On Mon, Dec 1, 2008 at 10:23 AM, Florian Gilcher <flo> wrote:
> Actually, firewatir and scRUBYt! are nice.
>
> But is there a possibility to start firefox with a second profile (so that
> it circumvents the "one instance"-rule) and rendering to a hidden display?
> [1][2]
>
> Otherwise, this really hurts testability (as the browser might retain your
> personal session) and usability on a deployment server.
>
> Regards,
> Florian Gilcher
>
> [1]: Preferably a virtual one on a console-only machine.
> [2]: Sadly, afaik, firefox has no hidden-mode.

http://coderrr.wordpress.com/2007/10...efox-browsers/
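As a rough illustration of the "second profile" idea, the sketch below builds a Firefox command line using the `-P` (profile name) and `-no-remote` (do not reuse a running instance) options. This is not taken from the linked article; the helper name `firefox_command` and the profile name `scraper` are made up for the example, and the profile is assumed to exist already.

```ruby
# Build the argument list for a separate Firefox instance.
# -P selects a named profile; -no-remote stops Firefox from handing the
# request over to an instance that is already running.
def firefox_command(profile, url = nil)
  cmd = ["firefox", "-no-remote", "-P", profile]
  cmd << url if url
  cmd
end

# Usage (the "scraper" profile is hypothetical and must already exist):
#   pid = Process.spawn(*firefox_command("scraper", "http://example.com/"))
#   Process.detach(pid)
```

Keeping the scraper in its own profile also means it never sees the cookies or sessions of your everyday browsing profile.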
|
#10
>> [1]: Preferably a virtual one on a console-only machine.
>> [2]: Sadly, afaik, firefox has no hidden-mode.

You could try using a virtual frame buffer if you are using Linux or similar.

    Xvfb :99 -ac &
    export DISPLAY=:99

Will
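The same two shell steps can be wrapped in Ruby so the scraper script manages the virtual display itself. This is a sketch under assumptions: Xvfb is installed and on the PATH, display `:99` is free, and a one-second sleep is enough for the server to start. The helper names are invented for the example.

```ruby
# Command line for a virtual framebuffer on the given display number,
# mirroring `Xvfb :99 -ac` from the post above.
def xvfb_command(display_number)
  ["Xvfb", ":#{display_number}", "-ac"]
end

# Start Xvfb, point DISPLAY at it for the duration of the block, then
# restore the previous DISPLAY and shut the server down again.
def with_virtual_display(display_number = 99)
  previous = ENV["DISPLAY"]
  pid = Process.spawn(*xvfb_command(display_number))
  sleep 1   # crude wait for the X server to come up
  ENV["DISPLAY"] = ":#{display_number}"
  yield
ensure
  ENV["DISPLAY"] = previous
  if pid
    Process.kill("TERM", pid)
    Process.wait(pid)
  end
end

# Usage (needs Xvfb, so it is left commented out):
#   with_virtual_display(99) do
#     # start the browser-driven scraper here
#   end
```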
|
#11
[Note: parts of this message were removed to make it a legal post.]

If the site is truly AJAX, i.e. the data is loaded by an HTTP call from JavaScript, you could monitor the HTTP requests made by the browser. On Firefox, I use the LiveHTTPHeaders extension. Just go to View --> Sidebar --> HTTP Headers, load the page with whatever data, and look through the requests for anything interesting.

I used this method to get Facebook contact info and it worked fairly well. As a bonus, any data found with this method is usually in a very machine-understandable format like JSON or RSS. There are Ruby libraries for both.

Dan

On Sat, Nov 29, 2008 at 7:25 PM, Becca Girl <cschall> wrote: [..]
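Once the interesting request has been spotted, calling it directly might look like the sketch below, which uses only Ruby's standard library (`net/http`, `json`). The helper names and the JSON layout (`contacts`, `name`, `email`) are invented for illustration; a real endpoint will have its own structure and may also require cookies or extra headers.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch the body of the AJAX endpoint spotted in the HTTP headers sidebar.
def fetch_body(url)
  response = Net::HTTP.get_response(URI.parse(url))
  raise "HTTP #{response.code} for #{url}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pull name/email pairs out of a JSON payload. The "contacts", "name" and
# "email" keys are made up for this example.
def contacts_from_json(json_text)
  data = JSON.parse(json_text)
  data.fetch("contacts", []).map { |c| [c["name"], c["email"]] }
end

# Demonstration on an inline sample instead of a live request.
sample = '{"contacts": [{"name": "Ann", "email": "ann@example.com"}]}'
p contacts_from_json(sample)
```

In a real run, `contacts_from_json(fetch_body(endpoint_url))` would replace the inline sample, with `endpoint_url` being whatever URL showed up in the captured requests.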
|
#12
On Sat, Dec 20, 2008 at 7:27 AM, Will Simpson <will1> wrote:
>>> [1]: Preferably a virtual one on a console-only machine.
>>> [2]: Sadly, afaik, firefox has no hidden-mode.

I've never used it, but Celerity appears to have JavaScript support:
http://celerity.rubyforge.org/

> You could try using a virtual frame buffer if you are using Linux or
> similar.
>
> Xvfb :99 -ac &
> export DISPLAY=:99

Or, start a vncserver with xstartup set to launch the scraper script.