
Re: [Ftpapi] GSK_WOULD_BLOCK issue after upgrade to 7.4



Hi Scott,

We finally got a reply from IBM. (They had earlier recommended that we apply a PTF for this case, and we asked what kind of change was buried in that fix.) The reply follows. Does it make any sense to you?

Feedback from our development team.

SI71871 -
 In the HTTP code, we call "rc = poll(&pfd, 1, timeout);". When rc > 0, it returns APR_SUCCESS, which means the socket has TLS data. At that point we call gsk_secure_soc_read, but get nothing, so we return "ssl read failed" (502). Although TLS data is present, it is not yet readable. So we changed the process: it now loops on the read until data can be read or the timeout expires.

In regards to the customer program:

The issue is described by usage note #3 in the documentation. The note refers to an untimed blocking socket; however, a non-blocking socket with select() has the same issue, because a partial TLS record on the socket is enough to trigger select().  https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_74/apis/gsk_secure_soc_read.htm

The recommendation is to change the socket to a "blocking" socket with a timeout on the gsk_secure_soc_read() call. The program already has code to handle GSK_IBMI_ERROR_TIMED_OUT if it is returned. The read timeout can be set using gsk_attribute_set_numeric_value() with attribute GSK_IBMI_READ_TIMEOUT (6993).  https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_74/apis/gsk_attribute_set_numeric_value.htm

What is happening now is a hard loop of gsk_secure_soc_read() / select(), running as fast as the thread can go until the 2nd part of the TLS record arrives. select() returns true immediately because the 1st part of the TLS record is sitting on the socket receive queue. TLS can only pull full records off the socket queue, and select/poll are TLS-unaware.
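
The effect described above can be reproduced without TLS. The following is only a sketch under stated assumptions: it uses a made-up framing (a 1-byte length prefix followed by the payload) in place of real TLS records, and the helper names poll_says_readable() and full_record_available() are hypothetical, not part of GSKit or HTTPAPI. It shows that poll() reports the socket as readable as soon as any bytes arrive, even though a reader that needs a whole record still has nothing to return.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy framing (an assumption for illustration, NOT the TLS record format):
 * one length byte, followed by that many payload bytes. */

/* Returns 1 if poll() reports the descriptor readable within timeout_ms.
 * poll() only knows that bytes are queued; it knows nothing about records. */
static int poll_says_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    return poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN);
}

/* Returns 1 if a complete toy record is queued on fd, 0 otherwise.
 * Uses MSG_PEEK so nothing is consumed, and MSG_DONTWAIT so it never blocks. */
static int full_record_available(int fd)
{
    unsigned char buf[256];   /* 1 length byte + up to 255 payload bytes */
    ssize_t n = recv(fd, buf, sizeof buf, MSG_PEEK | MSG_DONTWAIT);
    if (n < 1)
        return 0;             /* nothing queued, or an error */
    return n >= 1 + (ssize_t)buf[0];
}
```

With a socketpair(), writing the length byte plus only part of the payload makes poll_says_readable() return 1 while full_record_available() still returns 0. A loop that alternates between the two without pausing is the hard loop IBM describes.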

If the program cannot be changed to use a blocking call with a timeout, then some retry logic should be used to reduce the CPU consumption caused by the current hard loop, which exceeds 10 iterations but is fewer than 100. The retry could pause with a sleep (I think 1 second is the smallest value allowed) or with a select() without a descriptor (since with a descriptor it returns immediately) for a sub-second timeout. 7.4 isn't slower; rather, it now gets through the hard loop of 10 faster than the TCP stack can place the 2nd part of the record on the socket receive queue after it comes in on the wire.
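
A minimal sketch of the "select() without a descriptor" pause, wrapped in a bounded retry loop. Everything named here is an assumption made for illustration: pause_ms(), read_with_backoff(), the try_read_fn callback convention (0 meaning "would block") and the countdown_read() stand-in are all hypothetical. In a real program the callback would wrap gsk_secure_soc_read() and map GSK_WOULD_BLOCK to 0.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Pause for roughly ms milliseconds by calling select() with no descriptors.
 * Returns select()'s result: 0 when the timeout expires, -1 on error. */
static int pause_ms(long ms)
{
    struct timeval tv;
    tv.tv_sec  = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return select(0, NULL, NULL, NULL, &tv);
}

/* Hypothetical callback convention: >0 data was read, 0 would block, <0 error. */
typedef int (*try_read_fn)(void *ctx);

/* Calls try_read up to max_tries times, pausing pause between attempts
 * instead of spinning. Returns the first non-zero result, or 0 if every
 * attempt reported "would block". */
static int read_with_backoff(try_read_fn try_read, void *ctx,
                             int max_tries, long pause)
{
    for (int i = 0; i < max_tries; i++) {
        int rc = try_read(ctx);
        if (rc != 0)
            return rc;
        pause_ms(pause);
    }
    return 0;
}

/* Stand-in reader for demonstration only: reports "would block" until the
 * counter that ctx points to reaches zero, then reports success. */
static int countdown_read(void *ctx)
{
    int *left = ctx;
    if (*left > 0) {
        (*left)--;
        return 0;
    }
    return 1;
}
```

For example, with a counter starting at 3, read_with_backoff(countdown_read, &counter, 10, 10) succeeds on the 4th attempt after three pauses of about 10 ms each. The pause length and retry limit are tuning choices, not values from IBM's reply.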

Let me know if you have further questions and if the issue can be closed.

------------------------------------------ < eof IBM >----------------------------------------------------------------

Thanks,
-- 
No trees were killed in the sending of this message, but a large number of electrons were terribly upset.

Stefan Tageson
+46 732 369934
stefan@xxxxxxxxxx



