Error Handling with Sockets

In last week's newsletter, I provided an introduction to writing RPG programs that communicate over a TCP/IP network. In this article, I show you how to check for errors and how to time out connections when they're not responding.

Checking Errors
In last week's article, I explained that whenever one of the socket APIs encounters an error, it returns -1. For instance, if the following call to the connect() API can't connect to a server, it returns -1, and the program checks for this value so that it knows that the connection attempt has failed:

     D connto          ds                  likeds(sockaddr_in)
         .
         .
          connto = *ALLx'00';
          connto.sin_family = AF_INET;
          connto.sin_addr   = addr;
          connto.sin_port   = port;

          if ( connect(s: %addr(connto): %size(connto)) = -1 );
             callp close(s);
             msg = 'Connect failed!';
             // report error to user
             return;
          endif;

The -1 tells you that the connection failed, but it doesn't tell you why. Did it fail because the IP address was wrong? Maybe TCP/IP isn't running. Or maybe the computer that you tried to connect to doesn't have a program listening on the port that you supplied. Your program needs to know which error occurred so that it can give the user a meaningful message.

Whenever a Unix-type API (including the sockets APIs) fails, it indicates the cause of the error by storing an error number in a special variable called errno.

The errno variable is a 4-byte (10-digit) binary integer that's part of the ILE C runtime library. Fortunately, that does not preclude you from using it from ILE RPG. You can get a pointer to errno by calling an API called __errno().

Here's the prototype for the __errno() API, followed by an errno variable that's based on the pointer that the API returns:

     D get_errno       pr              *   ExtProc('__errno')
     D errno           s             10I 0 based(p_errno)

The following code demonstrates retrieving the value of errno and letting the user know about it:

          if ( connect(s: %addr(connto): %size(connto)) = -1 );
             // Retrieve errno before calling any other API,
             //  because close() may overwrite its value.
             p_errno = get_errno();
             msg = 'Connect failed with errno=' + %char(errno);
             callp close(s);
             // report error to user
             return;
          endif;

That's certainly an improvement. Now the user not only knows that something went wrong, but he or she also has a number to use to look up what went wrong.

To look up what the error number means, you can look for a corresponding CPE message in the QCPFMSG message file. For example, if errno was set to 3425, you could type the following command to see what 3425 means:

  DSPMSGD RANGE(CPE3425) MSGF(QCPFMSG)

That works, but it's not a nice thing to do to your users! They won't like receiving an obscure error number, and they'll like having to look it up in a message file even less.

The Set Pointer to Runtime Error Message (strerror) API makes it relatively easy to get a human-readable error message that you can show to your user. You pass strerror() the value of errno, and it returns a pointer to a C-style null-terminated string that explains what went wrong.

RPG has a built-in function (BIF) called %str() that converts a C-style string into an ordinary RPG field. The following code snippet demonstrates using it with the strerror() API to get a message that you can display to your user:

     D strerror        PR              *   ExtProc('strerror')
     D    errnum                     10I 0 value
          .
          .
          if ( connect(s: %addr(connto): %size(connto)) = -1 );
             // Build the message before close() can overwrite errno.
             p_errno = get_errno();
             msg = %str(strerror(errno));
             callp close(s);
             // Report error to user.
             return;
          endif;
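
Because every error check repeats the same two steps (retrieve errno, then convert it to text), the steps can be wrapped in a small subprocedure. The following is only a sketch: the name last_error and its 256-byte return length are hypothetical and aren't part of the downloadable code. It repeats the get_errno and strerror prototypes shown earlier so that it stands on its own.

```rpg
     D get_errno       pr              *   ExtProc('__errno')
     D strerror        PR              *   ExtProc('strerror')
     D    errnum                     10I 0 value

      * Hypothetical helper (not part of the article's download):
      * returns the message text and number for the current errno.
     D last_error      PR           256A   varying

     P last_error      B
     D last_error      PI           256A   varying
     D p_err           s               *
     D err             s             10I 0 based(p_err)
     D saved           s             10I 0
      /free
          p_err = get_errno();
          // Copy errno right away: later API calls may change it.
          saved = err;
          return %str(strerror(saved)) + ' (errno=' + %char(saved) + ')';
      /end-free
     P                 E
```

With a helper like this, the error branch can set msg = last_error(); as its first statement, before it closes the socket.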

To see the error conditions that the API can return, look up one of the sockets APIs in the Information Center. The following are some of the conditions that it lists for the connect() API:

[EAFNOSUPPORT] 	The type of socket is not supported in this protocol family.
[EALREADY] 	Operation already in progress.
[EBADF] 	Descriptor not valid.
[ECONNREFUSED] 	The destination socket refused an attempted connect operation.

The names that it lists in the first column, such as EALREADY and ECONNREFUSED, are named constants that correspond to values that errno can have. I've included a copy book named ERRNO_H in the downloadable code for this article, and it contains the definitions of all these named constants.

The following code is an excerpt from that copy book:

     D EAFNOSUPPORT    C                   3422
      *  Operation already in progress.
     D EALREADY        C                   3423
      *  Connection ended abnormally.
     D ECONNABORTED    C                   3424
      *  A remote host refused an attempted connect operation.
     D ECONNREFUSED    C                   3425
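
To make these constants available to a program, the copy book has to be pulled in with a /copy directive. A minimal sketch follows. The source file name QRPGLESRC is an assumption, so adjust it to wherever you placed the ERRNO_H member:

```rpg
      * Assumption: member ERRNO_H was uploaded to a source file
      * named QRPGLESRC in a library on the library list.
      /copy QRPGLESRC,ERRNO_H
```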

In addition to helping you understand the notes in the Information Center, these constants are also useful when checking for errors in your program.

For example, I find that many users don't understand the description of the ECONNREFUSED error, so I like to provide a message that I think they'll understand. I do that by checking for the ECONNREFUSED value of errno, as follows:

          if ( connect(s: %addr(connto): %size(connto)) = -1 );
             p_errno = get_errno();
             if ( errno = ECONNREFUSED );
                msg = 'No program is listening for connections '
                    + 'on port ' + %char(port);
             else;
                msg = %str(strerror(errno));
             endif;
             // Close the socket only after errno has been examined.
             callp close(s);
             // Report error to user
             return;
          endif;

Timeouts
There's one important rule whenever you're writing a communications program: Never trust the communications link. In this case, the communications link is the TCP/IP network.

For example, when a router or remote computer detects a problem with a TCP session, it typically reports the problem by sending an ICMP datagram back to your iSeries. Your iSeries sees that an error has occurred, sets errno to the error that it received, and returns -1 to your program. That works great in a perfect world. Unfortunately, the world isn't perfect.

For security reasons, many network admins configure their firewall software to block ICMP datagrams. Eeek! Now, when something goes wrong, you get no error messages.

Even worse, what if the router catches fire? Or a cable gets cut? Or the power goes out on the computer that you're talking to? In any of these situations, no ICMP datagram can be sent to you, so nothing can tell you that something went wrong!

That's why you should always write communications programs that time out when no activity has been received for a reasonable period. If you want your programs to be as robust as possible, every operation that waits on the network needs a timeout.

In sockets programming, there are two ways to perform timeouts. You can do it with signals, or you can do it with the select() API.

Timeouts with Signals
In Unix programming, programs need a way to notify one another of events. These event notifications are called signals. When a program receives a signal, it stops what it's doing and runs a special subprocedure called a signal handler. This subprocedure can then determine what to do about the signal.

For example, you can send a signal to tell a program to end, to pause, or to continue running, or to tell it that an error occurred that it needs to handle.

In any case, if that program has designated a subprocedure for the signal that you've sent, it runs that subprocedure immediately, interrupting whatever it's currently doing.

In this case, I use the SIGALRM (alarm) signal. I designate a subprocedure that should be called whenever the program receives this signal. Then, I tell the operating system to send me a signal after 30 seconds.

If the connect() API is still running when the signal is received, it returns a -1 to indicate failure, and it sets errno to the EINTR (Interrupted by signal) error number.

To enable my program to receive signals, I run the following subprocedure:

     P init_signals    B
     D init_signals    PI
     D act             ds                  likeds(sigaction_t)
      /free
          Qp0sEnableSignals();
          sigemptyset(act.sa_mask);
          sigaddset(act.sa_mask: SIGALRM);
          act.sa_handler   = %paddr(got_alarm);
          act.sa_flags     = 0;
          act.sa_sigaction = *NULL;
          sigaction(SIGALRM: act: *omit);
      /end-free
     P                 E

It calls the Qp0sEnableSignals() API to turn signals on for the current job. Next, it fills in a fresh sigaction_t (signal action) data structure. The sa_mask subfield of that structure is the set of signals that the system holds back while the signal handler is running. I call the sigemptyset() API to make sure that no signals are already in that set, and then I call the sigaddset() API to add SIGALRM, so I know that it's the only signal in the set.

Finally, I designate the got_alarm() subprocedure as the one that should be called, and I call the sigaction() API to register it with the operating system as the handler for the SIGALRM signal.

Now that the system knows that it should call the got_alarm() subprocedure when a SIGALRM signal is received, I can call the alarm() API to tell the operating system to send me that signal after a given number of seconds.
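
The code in this article calls alarm() without showing its prototype, which presumably comes from one of the copy members in the download. As a sketch, assuming it mirrors the C declaration unsigned int alarm(unsigned int seconds), an RPG prototype could look like this:

```rpg
      * Sketch of a prototype for the alarm() API, modeled on the
      * C declaration: unsigned int alarm(unsigned int seconds).
      * The copy member in the download may define it differently.
     D alarm           PR            10U 0 ExtProc('alarm')
     D   seconds                     10U 0 value
```

Calling alarm(30) schedules a SIGALRM signal 30 seconds from now, and calling alarm(0) cancels an alarm that's still pending. The return value is the number of seconds that were left on a previously scheduled alarm, which the examples here don't need.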

The sockets APIs are designed to stop what they're doing and let your program take control whenever they receive a signal. They return -1 to indicate an error has occurred, and they set errno to EINTR (interrupted by signal) so that your program knows what the error was.

Because of this helpful behavior, it's not necessary for the got_alarm() subprocedure to do anything. My got_alarm() subprocedure looks like this:

     P got_alarm       B
     D got_alarm       PI
     D   signo                       10I 0 value
      /free
         // Do nothing. The connect() API will return
         //  EINTR ("interrupted by signal") when the
         //  signal is received.
      /end-free
     P                 E

I've written a subprocedure called tconnect() that calls the connect() API and uses signals to make it time out. Consider the code for the tconnect() subprocedure:

     P tconnect        B
     D tconnect        PI            10I 0
     D   sock                        10I 0 value
     D   addr                          *   value
     D   size                        10I 0 value
     D   timeout                     10I 0 value
     D rc              s             10I 0
      /free
          alarm(timeout);
          rc = connect(sock: addr: size);
          alarm(0);
          return rc;
      /end-free
     P                 E

This subprocedure calls the connect() API but lets you specify a timeout interval. It calls the alarm() API to tell the operating system to send this program a SIGALRM signal after a given number of seconds. For example, if you pass the number 30 in the timeout parameter, it sends a SIGALRM signal in 30 seconds.

It then calls the connect() API to create a TCP connection. If that connection completes in less than 30 seconds, great. Everything is set to go. The alarm() API is called with a zero parameter to cancel the pending alarm, so that the operating system won't send the signal after all.

However, if there's a problem and the program gets stuck on the connect() API for more than 30 seconds, the operating system sends the SIGALRM signal, and the connect() API aborts with -1 and errno set to EINTR.

The routine that calls the tconnect() subprocedure can check for the error by doing the following:

          if ( tconnect(s: %addr(connto): %size(connto): 30) = -1 );
             // Examine errno before close() can overwrite it.
             p_errno = get_errno();
             select;
              when  ( errno = ECONNREFUSED );
                msg = 'No program is listening for connections '
                      + 'on port ' + %char(port);
              when  ( errno = EINTR );
                msg = 'Connection attempt timed out!';
              other;
                msg = 'connect(): ' + %str(strerror(errno));
             endsl;
             callp close(s);
             // Report error to user
             return;
          endif;

This way, you can be sure to never get stuck indefinitely waiting for a connection to complete.

Although the preceding example demonstrates a timeout on the connect() API, the same technique also works with the send() and recv() APIs:

     P trecv           B
     D trecv           PI            10I 0
     D   sock                        10I 0 value
     D   data                          *   value
     D   size                        10I 0 value
     D   timeout                     10I 0 value
     D rc              s             10I 0
      /free
          alarm(timeout);
          rc = recv(sock: data: size: 0);
          alarm(0);
          return rc;
      /end-free
     P                 E

     P tsend           B
     D tsend           PI            10I 0
     D   sock                        10I 0 value
     D   data                          *   value
     D   size                        10I 0 value
     D   timeout                     10I 0 value
     D rc              s             10I 0
      /free
          alarm(timeout);
          rc = send(sock: data: size: 0);
          alarm(0);
          return rc;
      /end-free
     P                 E
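
A caller of trecv() has three outcomes to tell apart: data arrived, the other side closed the connection (recv() returns 0), or the call failed, possibly because the alarm went off. The following sketch shows one way to separate them. The buf and len fields and the 60-second interval are hypothetical. The names s, msg, p_errno, errno, and EINTR are assumed to be defined as in the earlier examples.

```rpg
     D buf             s            512A
     D len             s             10I 0
      /free
          len = trecv(s: %addr(buf): %size(buf): 60);
          select;
          when ( len > 0 );
             // %subst(buf: 1: len) holds the bytes received.
          when ( len = 0 );
             msg = 'The other side closed the connection.';
          other;
             // len = -1: check errno before calling another API.
             p_errno = get_errno();
             if ( errno = EINTR );
                msg = 'Nothing received within 60 seconds!';
             else;
                msg = 'recv(): ' + %str(strerror(errno));
             endif;
          endsl;
      /end-free
```

The same three-way check applies to tsend(), except that a return value greater than zero is the number of bytes actually sent, which can be less than the number requested.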

Timeouts with the select() API
The select() API is another way to handle timeouts in socket programming. I've discussed this in the following article:
Timing Out Sockets

Code Download
I've included a sample SMTP client program that uses the techniques that I presented in this article. You can download the sample code, as well as the SOCKET_H and ERRNO_H copy members, from the following link:
http://www.scottklement.com/rpg/socktimeout/TcpProg2.zip