Re: which example can i use to access this webpage
Hi Scott,

The xmlReader processes the HTML as a hierarchical tree, and its
supporting subprocedures can give you a lot of information. One example
is the xmlAddrInner/xmlSizeInner block that kicks in when the reader
meets an end node such as </order>. The reader tries to do the same when
reading HTML, but it automatically disables these functions if bad
coding drives the xPath tree to a depth of about 1000. It still
continues to read properly formatted nodes one by one as markup, and
when it is put in HTML mode it bypasses <script> and <style> sections,
since those sections are uncontrollable.

On the one hand, writing a perfect XML engine isn't so hard. On the
other hand, inspired by all the workarounds for bad or not well-formed
HTML that the HTML engines actually perform in order to present a web
page, I decided to do the same in my xmlReader so that I can retrieve
data from not well-formed web pages that I can't change.

This is not a 100% solution, but since I control the xmlReader, I can
build in workarounds as I go along, and if I am presented with a
problem I can adjust the code.
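
The lenient reading described above can be illustrated with a small
sketch. This is not the powerEXT xmlReader and makes no claim about how
it is implemented; it is a minimal Python example built on the standard
html.parser module, and the names LenientReader, max_depth and nodes are
hypothetical. It assumes three rules taken from the description: tags
without an end tag (such as <img>) do not open a level, <script> and
<style> content is bypassed, and path tracking is switched off once the
tree gets deeper than a limit while text is still read.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED = {"script", "style"}


class LenientReader(HTMLParser):
    """Hypothetical sketch: read text nodes with their path, tolerating bad HTML."""

    def __init__(self, max_depth=1000):
        super().__init__(convert_charrefs=True)
        self.max_depth = max_depth
        self.stack = []        # names of the currently open elements
        self.tracking = True   # False once the depth limit has been hit
        self.skipping = False  # True while inside <script> or <style>
        self.nodes = []        # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping = True
        elif tag not in VOID_TAGS and self.tracking:
            self.stack.append(tag)
            if len(self.stack) > self.max_depth:
                # Too deep: stop maintaining paths, but keep reading text.
                self.tracking = False
                self.stack.clear()

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skipping = False
        elif self.tracking and tag in self.stack:
            # Pop back to the matching open tag; unclosed children are dropped.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            path = "/" + "/".join(self.stack) if self.tracking else None
            self.nodes.append((path, text))


doc = """<html><body>
<img src="test.jpg">
<script>var x = "<p>not data</p>";</script>
<div><b>Bold<br>text</div>
<p>After</p>
</body></html>"""

reader = LenientReader()
reader.feed(doc)
reader.close()
for path, text in reader.nodes:
    print(path, text)
```

The unclosed <img> and <b> tags and the markup inside <script> do not
stop the reader; the text after them is still reported with a usable
path.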
On Fri, May 18, 2012 at 7:10 PM, Scott Klement <[1]sk@xxxxxxxxxxxxxxxx>
wrote:
hi Henrik,
What happens when the HTML isn't well-formed (by XML rules)? Does your
tool have a way to handle that?
That's always been my problem with using an XML parser to read HTML.
I'd have something like this:
<html>
<body>
<img src="test.jpg">
</body>
</html>
And of course, there's no ending tag for the <img> tag, and it causes
the XML parser to say the document isn't well-formed, and give up.
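
The behaviour described above is easy to reproduce with any strict XML
parser. As an illustration only, this sketch uses Python's standard
xml.etree.ElementTree (not the parser discussed in this thread); the
helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

html = """<html>
<body>
<img src="test.jpg">
</body>
</html>"""


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html))                       # False: <img> is never closed
print(is_well_formed(html.replace('">', '"/>')))  # True: <img .../> is self-closed
```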
On 5/18/2012 6:23 AM, Henrik Rützou wrote:
> Hi Tim and Scott,
>
> may I suggest that you combine Scott's example with the xmlReader in
> powerEXT Core that reads HTML as XML.
>
> I have made a little example program that reads an HTML result page
> from the search on the site:
>
> [1][2]http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
>
> The only change needed to Scott's code is to change the second post so
> it stores the result in a temp IFS file.
>
> On Fri, May 18, 2012 at 2:35 AM, Scott Klement <[2][3]sk@xxxxxxxxxxxxxxxx>
> wrote:
>
>   Okay. I've attached an example that I hope will point you in the
>   right direction.
>   This type of coding is hard, because this site isn't intended to be
>   called by a computer program -- it's intended to be called by a web
>   browser.  Accessing a web site (as opposed to a web service)
>   requires you to have a pretty strong knowledge of how a programmer
>   wrote the web page.  And, figuring out how to read the output is
>   challenging, because the output is designed to dictate a screen
>   layout, it's not designed to identify what each field is and what
>   it's for (as would be the case with a web service.)  So what
>   you're looking for is possible, but it's hard.  Not because of the
>   tool, but because the site just wasn't meant to be used this way.
>   But, the attached example does work.  It's just harder than it
>   would be if it were a web service.
>   1) You connect to the initial web page, and it sets cookies that it
>   uses to identify your browser session.  HTTPAPI will manage the
>   cookie for you -- but make sure you're running version 1.24 or
>   newer, because there have been bugs fixed recently in the cookie
>   support.
>   2) You create a web form containing the fields in the <input> tags
>   in the HTML.  Web sites can potentially modify this stuff using
>   JavaScript on the page, so the <input> tags are a good starting
>   point, but you shouldn't rely on them 100%.  Instead, use a tool
>   like the "Live HTTP Headers" plugin for Firefox to see exactly
>   what's sent/received, then copy that in HTTPAPI.
>   3) After submitting the login form, the site receives your session
>   cookie and your login credentials (user/pass) and validates them.
>   Once that's done, it sets your session ID's status (stored in a
>   file on the server) to "logged in".  From here on, you must
>   re-submit the cookie with each request, or it won't know you're
>   logged in.  That's okay, though, HTTPAPI manages the cookies and
>   resubmits them as long as you're still running in the same
>   activation group.
>   4) The server redirects you to a new page by sending a 302 HTTP
>   response, and a new URL.  Your code can call http_redir_loc to get
>   the new URL, and one of the http_get routines to follow the
>   redirect.  You'll see that in the sample code.  I always like to
>   limit the number of redirects to prevent the program getting stuck
>   in a loop if the redirect points to another redirect, and so on.
>   5) Submit the form containing the zip code query.  I coded the
>   program to take the zip code as a parameter and send it as a query.
>   Again, I looked at the <input> html tags on the page, and used
>   Live HTTP Headers to make sure I was sending the right things.  The
>   only thing that I made a variable is the zip code, and you supply it
>   like this:
>       CALL PGM(MYFIRTEST) PARM(71635)    (where 71635 is the zip code)
>   6) Finally, the response is received (as an HTML document,
>   explaining how to format data on the browser's screen) containing
>   the list of foreclosures.  I simply displayed the raw HTML on the
>   screen -- I'll leave it up to you to figure out how to get the data
>   you need out of that page (by %scan, %subst, etc)
>   Good luck!
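
Steps 2, 4 and 5 above can be sketched outside of RPG as well. The
following is a minimal Python illustration, not Scott's attached HTTPAPI
example: it collects the <input> fields of a page as a starting point
for the form data, and follows redirects with an upper limit. All names
(InputCollector, build_form, follow_redirects, the fetch callable and
the field names "token" and "zip") are hypothetical, and the network
call is replaced by a caller-supplied function so the logic can be
tried without a real site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags (a starting point only)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form(html, overrides):
    """Return URL-encoded form data: page defaults plus the caller's values."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return urlencode({**collector.fields, **overrides})


def follow_redirects(fetch, url, max_redirects=5):
    """Follow redirects, giving up after max_redirects hops.

    fetch(url) is a hypothetical callable returning (status, location, body).
    """
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, body
        url = urljoin(url, location)  # the new location may be relative
    raise RuntimeError("too many redirects")


# Made-up page with a hidden field and an empty zip code field.
page = '<form><input type="hidden" name="token" value="abc"><input name="zip"></form>'
print(build_form(page, {"zip": "71635"}))  # token=abc&zip=71635

# Fake site: /a redirects to /b, which returns the final page.
hops = {
    "http://example.test/a": (302, "/b", ""),
    "http://example.test/b": (200, None, "done"),
}
print(follow_redirects(lambda u: hops[u], "http://example.test/a"))
```

As the steps point out, JavaScript can change what a browser really
sends, so the collected fields should be checked against a captured
request before relying on them.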
>
> On 5/17/2012 6:00 PM, [3][4]tim.dclinc@xxxxxxxxx wrote:
>
>   The site in question is [4][5]http://www.myfir.com/myFir/login.asp
>   You can use [5][6]tim.dclinc@xxxxxxxxx as user, and "password" as
>   password.
>   It's a public site which anyone can join... I just wanted to
>   programmatically "check" the site.
>
>   -----------------------------------------------------------------------
>   This is the FTPAPI mailing list.  To unsubscribe, please go to:
>   [6][7]http://www.scottklement.com/mailman/listinfo/ftpapi
>   -----------------------------------------------------------------------
>
> --
> Regards,
> Henrik Rützou
>
> [7][8]http://powerEXT.com
>
> References
>
>   1. [9]http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
>   2. mailto:[10]sk@xxxxxxxxxxxxxxxx
>   3. mailto:[11]tim.dclinc@xxxxxxxxx
>   4. [12]http://www.myfir.com/myFir/login.asp
>   5. mailto:[13]tim.dclinc@xxxxxxxxx
>   6. [14]http://www.scottklement.com/mailman/listinfo/ftpapi
>   7. [15]http://powerext.com/
>
>
>
>
> -----------------------------------------------------------------------
> This is the FTPAPI mailing list.  To unsubscribe, please go to:
> [16]http://www.scottklement.com/mailman/listinfo/ftpapi
> -----------------------------------------------------------------------
-----------------------------------------------------------------------
This is the FTPAPI mailing list.  To unsubscribe, please go to:
[17]http://www.scottklement.com/mailman/listinfo/ftpapi
-----------------------------------------------------------------------
--
Regards,
Henrik Rützou
[18]http://powerEXT.com
References
1. mailto:sk@xxxxxxxxxxxxxxxx
2. http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
3. mailto:sk@xxxxxxxxxxxxxxxx
4. mailto:tim.dclinc@xxxxxxxxx
5. http://www.myfir.com/myFir/login.asp
6. mailto:tim.dclinc@xxxxxxxxx
7. http://www.scottklement.com/mailman/listinfo/ftpapi
8. http://powerEXT.com/
9. http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
10. mailto:sk@xxxxxxxxxxxxxxxx
11. mailto:tim.dclinc@xxxxxxxxx
12. http://www.myfir.com/myFir/login.asp
13. mailto:tim.dclinc@xxxxxxxxx
14. http://www.scottklement.com/mailman/listinfo/ftpapi
15. http://powerext.com/
16. http://www.scottklement.com/mailman/listinfo/ftpapi
17. http://www.scottklement.com/mailman/listinfo/ftpapi
18. http://powerext.com/