
Re: which example can i use to access this webpage



   Hi Scott,

   The xmlReader processes the HTML as a hierarchical tree and can give
   you a lot of information through its supporting subprocedures, such as
   the xmlAddrInner/xmlSizeInner block that kicks in when the reader meets
   an end node such as </order>. It tries to do the same when reading
   HTML, but it automatically disables these functions if the reader
   reaches a depth of about 1000 in the xPath tree because of bad coding.
   It still continues to read properly formatted nodes one by one as
   markup, and when the reader is put in HTML mode it also bypasses
   <script> and <style> sections, which are uncontrollable sections.
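
   To illustrate the idea outside of RPG: the sketch below is not the
   powerEXT xmlReader, only a minimal Python approximation of the
   behaviour described above, built on the standard html.parser module.
   The names TolerantReader and read_html, the set of void tags and the
   MAX_DEPTH constant are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never take an end tag in HTML (assumed subset for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
SKIP_TAGS = {"script", "style"}   # sections whose content is not markup
MAX_DEPTH = 1000                  # hypothetical limit, mirroring "about 1000"


class TolerantReader(HTMLParser):
    """Collects (path, text) pairs while tolerating unclosed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []          # currently open elements
        self.skipping = 0        # > 0 while inside <script> or <style>
        self.track_path = True   # turned off once the tree gets too deep
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skipping += 1
            return
        if tag in VOID_TAGS or not self.track_path:
            return
        self.stack.append(tag)
        if len(self.stack) > MAX_DEPTH:
            # Too deep, most likely broken markup: stop tracking the tree
            # but keep reading text node by node.
            self.track_path = False
            self.stack.clear()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skipping = max(0, self.skipping - 1)
            return
        if self.track_path and tag in self.stack:
            # Pop back to the matching open tag, dropping unclosed children.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.items.append(("/" + "/".join(self.stack), text))


def read_html(source):
    """Parse an HTML string and return the collected (path, text) pairs."""
    reader = TolerantReader()
    reader.feed(source)
    reader.close()
    return reader.items
```

   Calling read_html() on a page with an unclosed <img> tag still returns
   the text nodes with their paths, instead of stopping at the first
   well-formedness error.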

   On the one hand, writing a perfect XML engine isn't so hard, but
   inspired by all the workarounds for bad or not well-formed HTML that
   HTML engines actually perform in order to present a web page, I decided
   to do the same in my xmlReader, so that I can retrieve data from
   not well-formed web pages that I can't change.

   This is not a 100% solution, but since I control the xmlReader, I am
   able to build in workarounds as I go along, and if I am presented with
   a problem I can adjust the code.

   On Fri, May 18, 2012 at 7:10 PM, Scott Klement <[1]sk@xxxxxxxxxxxxxxxx>
   wrote:

     Hi Henrik,
     What happens when the HTML isn't well-formed (by XML rules)? Does
     your tool have a way to handle that?
     That's always been my problem with using an XML parser to read HTML.
     I'd have something like this:
     <html>
     <body>
     <img src="test.jpg">
     </body>
     </html>
     And of course, there's no ending tag for the <img> tag, and it
     causes the XML parser to say the document isn't well-formed, and
     give up.
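
   The failure described above can be reproduced with any strict XML
   parser. The following is a small Python sketch using the standard
   xml.etree.ElementTree module; the helper name parses_as_xml is made up
   for this example and is not part of HTTPAPI or powerEXT.

```python
import xml.etree.ElementTree as ET

# The HTML from the message above: valid HTML, but not well-formed XML.
HTML_SNIPPET = """<html>
<body>
<img src="test.jpg">
</body>
</html>"""


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The unclosed <img> makes the strict parser give up.
strict_ok = parses_as_xml(HTML_SNIPPET)

# Self-closing the tag is enough to make the same document well-formed.
fixed_ok = parses_as_xml(HTML_SNIPPET.replace('.jpg">', '.jpg"/>'))
```

   Here strict_ok is False and fixed_ok is True, which is why a reader
   that is meant for real-world HTML needs to treat tags like <img> as
   empty elements on its own.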

   On 5/18/2012 6:23 AM, Henrik Rützou wrote:
   > Hi Tim and Scott,
   >
   > may I suggest that you combine Scott's example with the xmlReader in
   > powerEXT Core that reads HTML as XML
   >
   > I have made a little example program that reads a HTML result page
   > from the search on the site:
   >
   > [1][2]http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
   >
   > The only change needed to Scott's code is to change the second post
   > so it stores the result in a temp IFS file
   >
   > On Fri, May 18, 2012 at 2:35 AM, Scott Klement
   > <[2][3]sk@xxxxxxxxxxxxxxxx> wrote:
   >
   >   Okay. I've attached an example that I hope will point you in the
   >   right direction.
   >
   >   This type of coding is hard, because this site isn't intended to
   >   be called by a computer program -- it's intended to be called by a
   >   web browser. Accessing a web site (as opposed to a web service)
   >   requires you to have a pretty strong knowledge of how a programmer
   >   wrote the web page. And, figuring out how to read the output is
   >   challenging, because the output is designed to dictate a screen
   >   layout, it's not designed to identify what each field is and what
   >   it's for (as would be the case with a web service.) So what you're
   >   looking for is possible, but it's hard. Not because of the tool,
   >   but because the site just wasn't meant to be used this way.
   >
   >   But, the attached example does work. It's just harder than it
   >   would be if it were a web service.
   >
   >   1) You connect to the initial web page, and it sets cookies that
   >   it uses to identify your browser session. HTTPAPI will manage the
   >   cookie for you -- but make sure you're running version 1.24 or
   >   newer, because there have been bugs fixed recently in the cookie
   >   support.
   >
   >   2) You create a web form containing the fields in the <input> tags
   >   in the HTML. Web sites can potentially modify this stuff using
   >   JavaScript on the page, so the <input> tags are a good starting
   >   point, but you shouldn't rely on them 100%. Instead, use a tool
   >   like the "Live HTTP Headers" plugin for Firefox to see exactly
   >   what's sent/received, then copy that in HTTPAPI.
   >
   >   3) After submitting the login form, the site receives your session
   >   cookie and your login credentials (user/pass) and validates them.
   >   Once that's done, it sets your session ID's status (stored in a
   >   file on the server) to "logged in". From here on, you must
   >   re-submit the cookie with each request, or it won't know you're
   >   logged in. That's okay, though, HTTPAPI manages the cookies and
   >   resubmits them as long as you're still running in the same
   >   activation group.
   >
   >   4) The server redirects you to a new page by sending a 302 HTTP
   >   response, and a new URL. Your code can call http_redir_loc to get
   >   the new URL, and one of the http_get routines to follow the
   >   redirect. You'll see that in the sample code. I always like to
   >   limit the number of redirects to prevent the program getting stuck
   >   in a loop if the redirect points to another redirect, and so on.
   >
   >   5) Submit the form containing the zip code query. I coded the
   >   program to take the zip code as a parameter and send it as a
   >   query. Again, I looked at the <input> html tags on the page, and
   >   used Live HTTP Headers to make sure I was sending the right
   >   things. The only thing that I made a variable is the zip code, and
   >   you supply it like this:
   >
   >       CALL PGM(MYFIRTEST) PARM(71635)    (where 71635 is the zip code)
   >
   >   6) Finally, the response is received (as an HTML document,
   >   explaining how to format data on the browser's screen) containing
   >   the list of foreclosures. I simply displayed the raw HTML on the
   >   screen -- I'll leave it up to you to figure out how to get the
   >   data you need out of that page (by %scan, %subst, etc.)
   >
   >   Good luck!
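
   The redirect limit mentioned in step 4 can be sketched independently
   of HTTPAPI. The Python below is only an illustration of the technique:
   the function follow_redirects, the injected fetch callable and the
   limit of 5 are assumptions for this example, and the original post
   does not state which limit the attached RPG program uses.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # assumed limit; the original post does not name a number
REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(fetch, url, max_redirects=MAX_REDIRECTS):
    """Follow HTTP redirects until a final page is reached.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers, body),
    where headers is a dict. Raises RuntimeError if the chain is too long.
    """
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status not in REDIRECT_CODES:
            return url, status, body
        location = headers.get("Location")
        if location is None:
            raise ValueError("redirect without a Location header")
        # The Location header may be relative to the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

   Passing the fetch routine in as a parameter keeps the loop separate
   from the transport, so the same limit logic applies whether the
   request is made by a test stub or by a real HTTP client that also
   resends the session cookie.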
   >
   > On 5/17/2012 6:00 PM, [3][4]tim.dclinc@xxxxxxxxx wrote:
   >
   >   The site in question is [4][5]http://www.myfir.com/myFir/login.asp
   >   you can use [5][6]tim.dclinc@xxxxxxxxx as user, and "password" as
   >   password.
   >   It's a public site which anyone can join...I just wanted to
   >   programmatically "check" the site.
   >
   >   -----------------------------------------------------------------------
   >   This is the FTPAPI mailing list. To unsubscribe, please go to:
   >   [6][7]http://www.scottklement.com/mailman/listinfo/ftpapi
   >   -----------------------------------------------------------------------
   >
   > --
   > Regards,
   > Henrik Rützou
   >
   > [7][8]http://powerEXT.com
   >
   > References
   >
   > 1. [9]http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
   > 2. mailto:[10]sk@xxxxxxxxxxxxxxxx
   > 3. mailto:[11]tim.dclinc@xxxxxxxxx
   > 4. [12]http://www.myfir.com/myFir/login.asp
   > 5. mailto:[13]tim.dclinc@xxxxxxxxx
   > 6. [14]http://www.scottklement.com/mailman/listinfo/ftpapi
   > 7. [15]http://powerext.com/
   >

   --
   Regards,
   Henrik Rützou
   [18]http://powerEXT.com

References

   1. mailto:sk@xxxxxxxxxxxxxxxx
   2. http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
   3. mailto:sk@xxxxxxxxxxxxxxxx
   4. mailto:tim.dclinc@xxxxxxxxx
   5. http://www.myfir.com/myFir/login.asp
   6. mailto:tim.dclinc@xxxxxxxxx
   7. http://www.scottklement.com/mailman/listinfo/ftpapi
   8. http://powerEXT.com/
   9. http://89.239.242.111:6382/pextcgiCOR/readhtml.pgm
  10. mailto:sk@xxxxxxxxxxxxxxxx
  11. mailto:tim.dclinc@xxxxxxxxx
  12. http://www.myfir.com/myFir/login.asp
  13. mailto:tim.dclinc@xxxxxxxxx
  14. http://www.scottklement.com/mailman/listinfo/ftpapi
  15. http://powerext.com/
  16. http://www.scottklement.com/mailman/listinfo/ftpapi
  17. http://www.scottklement.com/mailman/listinfo/ftpapi
  18. http://powerext.com/