#1
Say people would like to log into their Hotmail, Yahoo and Gmail accounts and "keep an eye" on some text or part of a site.

I think something like that should be out there, since not all sites provide RSS feeds, nor are they really interested in providing consistent and informative content (which is what we (almost) all are looking for).

I have been mostly programming Java lately. This is how I see such an API could, very basically indeed, be implemented:

1. Get the HTML text.
2. Run it through an HTML to XML/XHTML cleanser (tidy nicely fits the bill, but I truly hate how it changes character entities whichever way it sees fit, without giving you an option to leave them as you coded them. I haven't thoroughly checked JTidy, though).
3. Parse the output of step 2 using a SAX parser and handle the callbacks it produces, based on
4. some XPath-like metadata that is kept for the page, plus some more metadata on how it should be processed ...

I know XPath might not be the right technology, since it uses the DOM and it might get a little taxing when you are processing many pages ...

I recall there was some Java project called HTMLCLient, but I wonder what happened to it.

I think search engines use similar algorithms, and I was wondering how the masters do it.

Thanks
onetitfemme
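For what it's worth, steps 3 and 4 above could be sketched roughly like this. It is only an illustration, not code from any of the projects mentioned in this thread: the class name `PathExtractor`, the slash-separated path syntax and the sample page are all made up, and it assumes step 2 has already produced well-formed XHTML. It tracks the open elements in a stack during the SAX callbacks, so no DOM is built.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of every element whose path from the root equals a watched path. */
class PathExtractor extends DefaultHandler {
    private final String watchedPath;                       // e.g. "/html/body/div/span" (made-up syntax)
    private final Deque<String> open = new ArrayDeque<>();  // currently open elements, innermost first
    private final List<String> matches = new ArrayList<>();
    private StringBuilder buffer;                           // non-null only while inside a matching element

    PathExtractor(String watchedPath) {
        this.watchedPath = watchedPath;
    }

    /** Builds "/root/child/..." from the stack of open elements. */
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        // push() adds at the front, so the descending iterator runs from the root inwards
        for (Iterator<String> it = open.descendingIterator(); it.hasNext();) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
        if (buffer == null && currentPath().equals(watchedPath)) {
            buffer = new StringBuilder();   // start collecting text for this match
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer != null && currentPath().equals(watchedPath)) {
            matches.add(buffer.toString().trim());
            buffer = null;
        }
        open.pop();
    }

    /** Parses already-cleaned XHTML and returns the text found at the given path. */
    static List<String> extract(String xhtml, String path) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PathExtractor handler = new PathExtractor(path);
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div><span>3 new messages</span></div></body></html>";
        System.out.println(extract(page, "/html/body/div/span"));
    }
}
```

Because the handler only keeps the stack of open element names, memory use stays small however large the page is, which is the concern raised above about DOM-based XPath.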
#2
onetitfemme wrote:
> Say, people would like to log into their hotmail, yahoo and gmail
> accounts and "keep an eye" on some text/part of a site
> .
> I think something like that should be out there, since not all sites
> provide RSS feeds nor are they really interested in providing
> consistent and informative content (what we (almost) all are looking
> for).
> .
> I have been mostly programming java lately. THis is how I see such an
> API could -very basically indeed- be implemented:

> I recall there was some java project called HTMLCLient, but I wonder
> what appened to it
> .
> I think search engines use similar algorithms and I was wondering
> about how the masters do it

There is a long list of software here: [url down]

Arne
#3
You can try SWExplorerAutomation SWEA (http:\\webunittesting.com).

SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. SWEA works with DHTML pages, HTML dialogs, dialogs (alerts) and frames. SWEA is a .NET API, but you can use J# for the development.

onetitfemme wrote: [..]
#4
onetitfemme wrote:
> Say, people would like to log into their hotmail, yahoo and gmail
> accounts and "keep an eye" on some text/part of a site
> .
> I think something like that should be out there, since not all sites
> provide RSS feeds nor are they really interested in providing
> consistent and informative content (what we (almost) all are looking
> for).
> .
> I have been mostly programming java lately. THis is how I see such an
> API could -very basically indeed- be implemented:

And then every time a provider changes the layout of its screen--then what?

[...]

> I recall there was some java project called HTMLCLient, but I wonder
> what appened to it
> .
> I think search engines use similar algorithms and I was wondering
> about how the masters do it

A search engine reads the pages it finds without knowing in advance what they contain or where to find the different pieces. That's very different from knowing in advance the structure of some page, knowing what you want to extract from that page, and writing a program to extract that information.
#5
> And then every time a provider changes the layout of its screen--then what?

otf: Well, this, as they say, is where the rubber meets the road ;-)

I think such scraping APIs should have provisions for these cases, or don't they? Which of these APIs (in the long list) do that?

I also see a way to reset the page context in a more or less automatic way. If the scraper notices incompatible changes in the page, it simply opens the page to the fleshy, slick end users (those sinner ones, you know) and lets them deal with it while recording the actions the user took ... ;-) While doing so, it transmits the information to a distributing server, so that the many other users of this scraper/HTML context pages can update their "request contexts" after some technical supervision ... This way, the people responsible for the server end would have to change their pages so crazily and constantly that it might even be counterproductive for themselves.

I think this is technically feasible, and easily so, but do you see other issues lurking in there?

I could imagine some people wouldn't like this kind of stuff. But I think true freedom means they should be free to dump all their crud on us, and we should be free to selectively filter in the type of crud we deem appropriate.

It amazes me how many people are very careful about what they eat and then sit for hours watching CNN and Hollywood crap, even happily so ;-)

otf
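As a rough illustration of the "scraper notices incompatible changes" idea above, one very simple rule would be: if the watched location yields exactly one piece of text, compare it with the last value seen; anything else is treated as a layout change that needs a human. This is only a sketch under that assumption. The names `WatchCheck`, `Status` and `classify` are invented here, and none of the APIs mentioned in this thread is claimed to work this way.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical check run after each scrape of a watched page location. */
class WatchCheck {
    enum Status { UNCHANGED, CONTENT_CHANGED, LAYOUT_BROKEN }

    /**
     * @param previous the text seen at the watched location on the last visit
     * @param matches  the text found at the watched location on this visit
     */
    static Status classify(String previous, List<String> matches) {
        // Zero or several hits means the stored location no longer identifies one spot:
        // assume the page layout changed and hand the page over to the user.
        if (matches.size() != 1) {
            return Status.LAYOUT_BROKEN;
        }
        return matches.get(0).equals(previous) ? Status.UNCHANGED : Status.CONTENT_CHANGED;
    }

    public static void main(String[] args) {
        System.out.println(classify("3 new messages", Arrays.asList("3 new messages")));
        System.out.println(classify("3 new messages", Arrays.asList("5 new messages")));
        System.out.println(classify("3 new messages", Collections.<String>emptyList()));
    }
}
```

A real scraper would need a smarter test than this (a redesigned page can still happen to yield one match), which is exactly where the user-assisted update described above would come in.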