## Wednesday, December 22, 2010

### Small projects

There was a post recently in one of the forums I frequent about a new developer wanting to know what to do for a project. This is hardly a new question; the net is littered with remnants of first year CS students enamored with their new found abilities looking for ways to expand. In general, I suggest that a good way to build skills is to get involved with an existing open source program. The road map I layout is along the lines of:
• Pick something that interests you
• Start small (read and document some code)
• Tackle increasingly difficult bug tickets
• Develop portions (or all) of a small feature request
• Rinse and Repeat
I'll also occasionally mention that an alternative is to pick something that you do not like about your everyday computer experience and make that better somehow. It is this last point that I'd like to expand on here. Not that the suggestion is altogether vague, but an example always helps to illuminate a point.

In my reply to the post mentioned above I made the comment that I own a MacBook and that one of the features that I really liked about it was the dashboard dictionary/thesaurus widget. In case you are not familiar, the widget allows you to enter a word and get what is basically a dictionary listing for that word. This is directly available from the desktop environment by way of the dashboard.

I also mentioned that I frequently do work in a Linux environment and that I miss the availability of this type of tool there. My experience to define a word in Linux is to load up a browser and search for the definition on the web. I explained that this is the type of thing that you could use as a starting point for a project to hone your skills. In my case, I decided to write my own little command-line tool that would define a word for me. This is a far cry from a translucent, fly-in, dashboard widget but it gives me the immediate satisfaction that I'm looking for on my Linux machine. For example:

ezpz@sandbox:define > define "Ignoratio elenchi"
the logical fallacy of supposing that an argument proving an irrelevant point
has proved the point at issue

Ignoratio elenchi (also known as irrelevant conclusion or irrelevant thesis)
is the informal fallacy of presenting an argument that may in itself be
valid, but does not address the issue in question. ...


The result, I claimed, was that I got added goodness in my Linux environment and learned a few things along the way. In fact, I found that not adding 'Connection: close' to an HTTP request can have differing effects depending on whether or not you sit behind a proxy.

It's important when you are starting out to set meaningful, attainable goals - even if they are redundant (i.e. another tool does something very similar). The meaningful means that you get something tangible (or at least relevant) out of the experience but the attainable ensures that you do not bury yourself in something you can not hope to achieve.

Sure, there exist tools that do what my little command-line tool does. Yes, they are way more sophisticated. That misses the point. I worked to better my experience, learned something along the way and now I'm a better developer because of it. And as for taking my own advise on the rinse and repeat - I've recently purchased a machine with Windows 7 installed and there is no dictionary gadget included by default. Looks like I'm gonna be learning some Javascript to build me a new gadget.

## Wednesday, December 15, 2010

### The Moment of Trust

I was recently propositioned by Bloomberg with a lead engineering position.  Truth be told I really didn’t do anything in particular to precipitate this event; I just got a keyword match on my online resume. This means, of course, that I did not deal with Bloomberg directly at first but through their hired recruiting firm.

This firm was a little different than those I have dealt with in the past; they were hired directly and had a specific process for weeding out candidates before they ever got to the Bloomberg campus. The claim by the recruiting firm was that the hiring process was so strenuous that most people would fold without preparation. My interpretation was that Bloomberg is big enough to not want to waste time with unqualified applicants.

In either case, the process with this agency was fairly drawn out. There was an initial phone screen; then an extended session where I did some live coding online; and finally there was a detailed technical phone interview. Most of the initial portion (the first two sessions) dealt with my knowledge of C++ topics and how well I wrote optimized production code on-the-fly (because that is relevant). The final screen dove into logic problems, questions with dynamic rules, and the like. The reasoning offered to me was that Bloomberg’s interview process was modeled on that of Microsoft and Google and I could expect these types of things.

Ultimately, I passed these stages and was set for the actual Bloomberg interview. According to the recruiting firm Bloomberg only interviewed as long as the found it relevant which meant I would know quick if there would be no offer since they would just show me the door with thanks. Specifically, I was told that less than 2 hours was not good and 3 to 4 hours meant I’d made it the full way through.

The interview process was set up to incrementally increase in difficulty. It started with what would be fellow engineers asking pointed technical questions and progressed to middle-management asking design questions and more vague logic questions. Then, upper-management asked questions about how I deal with conflict and people management related stuff. The process climaxed with me meeting the director of operations and discussing the role I’d play and compensation. All told, I was in the facility for more than 5 hours. Things had gone well.

My wife was ecstatic. After all, it was her idea to consider the opportunity in the first place. In addition to that nugget, there was a significant amount of money on the table – I stood to make more than 150% of my current salary. After talking with the recruiting firm on the phone they assured me that Bloomberg had been in contact and that they would be preparing an offer.

I took the long way home from that interview. I ran over the good points and bad points and how I could have done better – I don’t think I’ll ever be satisfied with the progress I make.  When I got home, my wife and I had a long talk about what I would gain and lose by accepting an offer and I decided to write the recruiting firm and tell them to convey my thanks to Bloomberg but that I would not be accepting any offer.

In that decision, and my wife’s reaction to it, I knew that I’d found true happiness. I asked her what she thought about my decision and – with everything we stood to lose – she said: “I trust you.”

## Sunday, December 12, 2010

I was recently pulled into a project that needed a developer for a short sprint to make a portable version of a set of already existing tools. The functional version of this suite consists mostly of C and Perl code written to one of three platforms: FreeBSD, Linux, or Windows. The Windows support was provided via Cygwin and the version I was to develop intended to remove this and other crutches. It's also relevant to mention that I would only be converting the client side code; the server would remain untouched.

The port itself was relatively straightforward. Moving Perl and C to Python is not terribly taxing in and of itself, it's more of a burden ensuring that you do not break any existing functionality. These tools essentially provide a remote query/response mechanism across multiple platforms which simplified the testing approach. I would just iterate through all supported queries, store the output, and do a diff on each target platform.

I had two weeks to deliver. At the end of the second week the C and Perl files were converted to Python and I was testing the functionality of the system but was getting a 50% fail rate. A random set of commands were only getting a portion of the response - no error codes, just partial responses.

From trace output in each version the behavior looked sane. From tcpdump output I could tell that the handshake and shutdown was happening properly it just looked as if the server had less to send my version of the client. Because it can be easier to elicit salient information through Wireshark's interface I plugged the pcap traces in and looked a them side-by-side. There it was: a single byte was different between the two. The original query:

And the version I was sending:

The difference is a single space (the 0x20 sitting at the end of the first image). It threw me the first time since I was actually looking for non-printable characters in the ASCII-formatted output. Non-printable characters show up as '.' but 0x20 is a space character. Grrr. For whatever reason, the server will only return a full response if the command has a trailing space character. It's intriguing that a partial response is returned for an invalid request without warning. My guess is that there is an off-by-one error somewhere in the server command parsing code.

I pushed a bug up to the server group though I doubt it will see the light of day since I've already covered the ground necessary to make the new tool work. If it ain't broke costing us money, don't fix it.

The moral of the story is that you should know the tools that can help you track down problems. You should also understand the strengths of different versions of those tools (Wireshark vs. tcpdump). Tried and true tools are invaluable to getting good work done and it is your responsibility as a developer to have your set at the ready at all times.

## Saturday, December 4, 2010

### Intercepting Linux system calls: Part III

[Update] This code can now be found on github (ntrace)

In the previous post in this series, I described the similarities between strace and LD_PRELOAD and why you may opt to use one over the other. This is the final post in the series where I provide a situation where strace is inadequate and LD_PRELOAD becomes a more appropriate tool.

The contrived example provided earlier was trivial but this example is one that I have run across and was the catalyst for this set of posts. The situation I faced is that I had to understand how an application was interacting with the host system to communicate with the network. I needed a way to instrument this application to provide concise details about how changes being made at the transport layer were affecting the upper-layer protocols (i.e. performance).

This sounds straightforward considering the explanation of strace so far - and for the most part it is... but there are caveats. Let's consider the following piece of code:

/*
* reads from the net, writes to a pipe
*/
void read_client (int sock, int fd) {
ssize_t n = 0;
char buff[2048] = {0};
while ((n = recv (sock, buff, 2048, 0)) > 0) {
/* write data to the pipe */
write (fd, buff, n);
}
}


The setup for this call is that there is a server that handles connections by setting up a pipe, forking a new process, and allowing the child process to read from the connection and write that data to the pipe. The parent then handles the data from the pipe and not directly from the network connection. The above call receives the network connection (sock) and write end of the pipe (fd) as parameters. So far, there is nothing that precludes us from using strace with the proper flags to monitor the behavior of this process. For example:

strace -f -e trace=network -o trace.log ./recv_example


provides the following bit of output

13328 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
13328 listen(3, 10)                     = 0
13362 recv(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2048, 0) = 2048
13362 recv(4, "", 2048, 0)              = 0
13328 --- SIGCHLD (Child exited) @ 0 (0) ---


We can see that strace only caught the network-related calls, followed the fork, realized the recv call in the child, and that the server received 2048 bytes from the network. This would be enough to accomplish the goal... but there are caveats.

As it turns out, not everyone follows the same approach when reading/writing to network sockets. For instance, the function described above could have been written using the read system call instead and would perform in exactly the same way. Consider the following change:

--- read_example.c 2010-12-03 10:44:32.000000000 -0500
+++ recv_example.c 2010-12-03 10:44:43.000000000 -0500
@@ -18,7 +18,7 @@
void read_client (int sock, int fd) {
ssize_t n = 0;
char buff[2048] = {0};
-    while ((n = read (sock, buff, 2048)) > 0) {
+    while ((n = recv (sock, buff, 2048, 0)) > 0) {
/* write data to the pipe */
write (fd, buff, n);
}


Now our strace approach gets slightly more complicated. Using the exact same flags as above our log now contains the following after a run:

14126 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
14126 listen(3, 10)                     = 0
14126 --- SIGCHLD (Child exited) @ 0 (0) ---


Notice the absence of the read call? According to -e trace=network, read is not a network-related system call. Fine. Easy fix, we just add the read call to the list of calls to trace and be done with it.

strace -f -e trace=network,read -o trace.log ./recv_example


Unfortunately, that only further muddies the water.

14130 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\257\274\0004\0\0\0"..., 512) = 512
14130 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
14130 listen(3, 10)                     = 0
14132 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2048) = 2048
14132 read(4, "", 2048)                 = 0
14130 --- SIGCHLD (Child exited) @ 0 (0) ---
14130 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2048) = 2048
14130 read(5, "", 2048)                 = 0


Now we are getting all of the read calls including those related to loading ELF binaries, reading from the pipe in the parent process, and the network call. This magnifies as you consider the number of ways these calls can be arranged.

What this is really telling us is that the information we desire is buried in context - it dependent upon state.

This is where LD_PRELOAD with a shim library becomes appropriate. What we'd really like to do is take every way that a developer can interact with a network file descriptor and monitor those calls but with the added benefit of knowing something about the file descriptor itself. In the previous output we could build the state from the trace - parse out the return values for the system calls, associate them with a particular behavior, etc.  That's not entirely true since, at that point, we would need to trace all calls (pipe, for instance) but it would get us close. That is error prone and tedious. If the format of the output changes or the flags to strace weren't just so we'd need to adjust.

It is much better to have that state build up with the execution of the program and remain resident until the program exits. The solution presented here is well beyond the needs of this small example but it exposes how we can overcome the limitations of strace when the occasion calls for it.

If we use this shim library tool on the second program above (the one using read instead of recv) we get two files - one for each process: 14228.log and 14230.log. In this case, 14228 is the parent process and reads from the pipe while 14230 is reading from the network and writing to the pipe. These files recap the overall activity of each process and solve the issue exactly. We now see what operations are related to the network - the top of each file provides call-by-call details - and we don't need to worry about changes to external flags or output format to ensure proper operation.

14228.log
call,fd,pid,seq,time,size,recv_so_far,send_so_far

Network totals: (0, 0)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 0 (0)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)

Pipe:14228:8 (2048,0)
Pipe totals: (2048, 0)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 0 (0)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)

Disk totals: (0, 0)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 0 (0)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)


14230.log
call,fd,pid,seq,time,size,recv_so_far,send_so_far

Network:14230:7 (2048,0)
Network totals: (2048, 0)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 0 (0)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)

Pipe:14230:9 (0,2048)
Pipe totals: (0, 2048)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 1 (2048)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)

Disk totals: (0, 0)
recv : 0 (0)
recvfrom : 0 (0)
recvmsg : 0 (0)
write : 0 (0)
send : 0 (0)
sendto : 0 (0)
sendmsg : 0 (0)


We can also get an strace-like output that allows us to find out exactly what is happening during execution. This is helpful to understand the internals of the program and is exactly what strace provides but we get it (and the added state) for free. Notice in the output below that each operation on a file descriptor is classified (NET vs. PIPE).

** DEBUG (14228) ** Initialized the debug channel
** DEBUG (14228) **  Created process state 0x8e51008. pid: 14228, ppid: 4285, thid: -1208371520
** DEBUG (14228) **  Setting up traffic on fd 0
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Disk oriented
** DEBUG (14228) **  Setting up traffic on fd 1
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Disk oriented
** DEBUG (14228) **  Setting up traffic on fd 2
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Disk oriented
** DEBUG (14228) ** socket (2, 1, 0) = 6
** DEBUG (14228) **  Setting up traffic on fd 6
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Network oriented
** DEBUG (14228) ** bind (6, 0xbfeeae20, 16) = 0
** DEBUG (14228) ** accept (6, 0xbfeeae10, 16) = 7
** DEBUG (14228) **  Setting up traffic on fd 7
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Network oriented
** DEBUG (14228) ** pipe ([8, 9]) = 0
** DEBUG (14228) **  Setting up traffic on fd 8
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Pipe oriented
** DEBUG (14228) **  Setting up traffic on fd 9
** DEBUG (14228) **   Details:
** DEBUG (14228) **   Pipe oriented
** DEBUG (14230) ** Initialized the debug channel
** DEBUG (14230) **  Copying process details from parent 14228 (getppid: 14228)
** DEBUG (14230) **  Rehashing fd 0 (disk)
** DEBUG (14230) **  Rehashing fd 1 (disk)
** DEBUG (14230) **  Rehashing fd 2 (disk)
** DEBUG (14230) **  Rehashing fd 6 (net)
** DEBUG (14230) **  Rehashing fd 7 (net)
** DEBUG (14230) **  Rehashing fd 8 (pipe)
** DEBUG (14230) **  Rehashing fd 9 (pipe)
** DEBUG (14230) ** close (8) = 0
** DEBUG (14230) **  (Exiting fd 8)
** DEBUG (14230) **   network     : no
** DEBUG (14230) **   disk        : no
** DEBUG (14230) **   pipe        : yes
** DEBUG (14230) **   active      : yes
** DEBUG (14230) **   has_hb      : no
** DEBUG (14230) ** [ NET] read (7, 0xbfee9dc0, 2048) = 2048
** DEBUG (14230) **  fd 7 used for the first time
** DEBUG (14230) **  spawing thread for fd 7
** DEBUG (14230) **  Detail
** DEBUG (14230) **   delay (in usec)    : 1000000
** DEBUG (14230) **  thread created: -1208374368
** DEBUG (14230) ** [PIPE] write (9, 0xbfee9dc0, 2048) = 2048
** DEBUG (14230) ** [ NET] read (7, 0xbfee9dc0, 2048) = 0
** DEBUG (14230) ** close (9) = 0
** DEBUG (14230) **  (Exiting fd 9)
** DEBUG (14230) **   network     : no
** DEBUG (14230) **   disk        : no
** DEBUG (14230) **   pipe        : yes
** DEBUG (14230) **   active      : yes
** DEBUG (14230) **   has_hb      : no
** DEBUG (14230) **  [atexit]
** DEBUG (14230) ** _exit (0) = ...
** DEBUG (14230) **  Cleanup. Stopping any active threads
** DEBUG (14230) **  < joining with thread -1208374368 ... >
** DEBUG (14228) ** close (9) = 0
** DEBUG (14228) **  (Exiting fd 9)
** DEBUG (14228) **   network     : no
** DEBUG (14228) **   disk        : no
** DEBUG (14228) **   pipe        : yes
** DEBUG (14228) **   active      : yes
** DEBUG (14228) **   has_hb      : no
** DEBUG (14230) **  < ... join complete >
** DEBUG (14228) ** [PIPE] read (8, 0xbfeea600, 2048) = 2048
** DEBUG (14228) ** [PIPE] read (8, 0xbfeea600, 2048) = 0
** DEBUG (14228) ** close (8) = 0
** DEBUG (14228) **  (Exiting fd 8)
** DEBUG (14228) **   network     : no
** DEBUG (14228) **   disk        : no
** DEBUG (14228) **   pipe        : yes
** DEBUG (14228) **   active      : yes
** DEBUG (14228) **   has_hb      : no
** DEBUG (14228) ** close (7) = 0
** DEBUG (14228) **  (Exiting fd 7)
** DEBUG (14228) **   network     : yes
** DEBUG (14228) **   disk        : no
** DEBUG (14228) **   pipe        : no
** DEBUG (14228) **   active      : yes
** DEBUG (14228) **   has_hb      : no
** DEBUG (14228) ** close (6) = 0
** DEBUG (14228) **  (Exiting fd 6)
** DEBUG (14228) **   network     : yes
** DEBUG (14228) **   disk        : no
** DEBUG (14228) **   pipe        : no
** DEBUG (14228) **   active      : yes
** DEBUG (14228) **   has_hb      : no
** DEBUG (14228) **  [atexit]
** DEBUG (14228) ** _exit (0) = ...
** DEBUG (14228) **  Cleanup. Stopping any active threads


## Wednesday, December 1, 2010

### Intercepting Linux system calls: Part II

[Update] This code can now be found on github (ntrace)

Continuing from the first part of this discussion, I thought it would be best to first show what strace does and how LD_PRELOAD compares to that before continuing on to a full-blown example of using one over the other.

strace
strace is a utility that either attaches to an already running process or spawns an entirely new process and monitors that process' system calls and signals. This is done via ptrace and thus provides an almost scary level of control over the process being monitored. This is precisely why I prefer to use strace over LD_PRELOAD in most cases.

LD_PRELOAD is an environment variable that can contain a custom library to be loaded before others when invoking a program. This allows for behaviors such as intercepting library calls (e.g. libc) and invoking custom behavior around or instead of the intended target code. LD_PRELOAD does not allow a user (without extended effort) to poke at the registers and data sections of a running process or to receive signals destined to the process.

To give an idea of how you might use each of these in context, lets say that you have a program that will print updates to the screen whenever it receives events internally. As a matter of black-boxing that application, you would like to know how often these events are occurring but you do not have access to the source code and there is no timing information other than some text output to the screen. For example:

ezpz@sandbox:~/blog > ./event_handler
updated again
updated again
updated again
updated again
...


All you can see is a stream of notifications. What you'd like to see, however, is some indication of time with (or instead of) the meaningless update. strace is perfect for this as it gets notified each time a write is about to occur and prints a timestamp along with that data. Here is an example that prefixes traced calls with a seconds/microseconds timestamp and only responds to the write system call.

The command is:

strace -ttt -e trace=write -o strace.log ./event_handler


That command produces a log file with the following contents:

1291244916.592714 write(1, "updated again\n"..., 14) = 14
1291244917.024000 write(1, "updated again\n"..., 14) = 14
1291244917.217090 write(1, "updated again\n"..., 14) = 14
1291244917.354294 write(1, "updated again\n"..., 14) = 14
...


Now you can easily see when each event fires and with a simple script could determine with what frequency these events most often occur.

Similarly, you can use LD_PRELOAD to do this. In this particular case there is significantly more work to be done (another reason to opt for strace first) but bear with me for the illustration.

In the LD_PRELOAD case you need to provide an shared library that can be loaded and used in place of the libraries you want to modify. Let's set that up: first, the interception code:

#define _GNU_SOURCE
#include <sys/time.h>
#include <stdio.h>
#include <dlfcn.h>

/*
* Use the same signature as libc's puts when defining our interceptor
* and function pointer
*  int puts(const char *s);
*/
typedef int (*libc_puts)(const char *);

/*
* Provide our version of puts here
*/
int puts (const char *s) {
/* grab the actual puts */
libc_puts original = dlsym (RTLD_NEXT, "puts");

/* prefix the output with our timestamp */
struct timeval tv;
gettimeofday (&tv, 0);
printf ("[%ld.%06ld] ", tv.tv_sec, tv.tv_usec);

/* pass the arguments through to the libc call */
return original(s);
}


Once that is built you can use the shim by

ezpz@sandbox:~/blog > LD_PRELOAD=./shim.so ./event_handler


which yields

ezpz@sandbox:~/blog > LD_PRELOAD=./shim.so ./event_handler
[1291246609.278321] updated again
[1291246609.709342] updated again
[1291246609.902186] updated again
[1291246610.039169] updated again
...


So you have an equally viable solution in either case. The point here was not necessarily to show why either of the two of these methods is better than the other in this contrived example but to expose the fact that you can do similar things with each. The strace solution here does not allow for injecting your own code into a running process but it does capture the exact thing you want without the overhead of having to build a shared library to get it done.

I'll continue next time with why you might need to modify program behavior to get data that is just not available with strace.

### wu riddles - for loops

I like logic problems and code puzzles. I find that the naive solution generally comes without effort and I can manage better attempts as I invest more time and effort. Unfortunately, I wasn't born with an 'it just clicks' mind; I've actually got to work at my solutions.

I found a little teaser yesterday that I enjoyed and was able to get fairly quickly. This (and much, much more) can be found at wuriddles. This particular riddle lives in the CS section.

The original problem provides a C for loop that, as it is constructed, prints an infinite number of '-' characters. The code is:

/* Original code provided */
int i, n=20;
for (i=0; i < n; i--) {printf("-");};


The goal of the challenge is to modify that code by either adding or modifying only a single character to make the loop print the '-' character exactly 20 times. There are three ways to modify the above code to meet the requirements of the challenge. Give it a shot before looking at the solutions below.

Here are the solutions I was able to come up with.

/* Modifications */
for (i=0; i < n; n--) {printf("-");};
for (i=0; i + n; i--) {printf("-");};
for (i=0;-i < n; i--) {printf("-");};