splot — visualizing behavior of many concurrent processes
=========================================================

**splot** draws a single Gantt-like chart with a birds-eye view of the activity of a number of concurrent processes, each of which may be in one of several states at each moment (e.g. processing one of several jobs, or being in a particular stage of processing).

The input to **splot**, in its basic form, is a sequence of events *activity started* and *activity finished*, each specifying a timestamp and the process (track) it refers to. Every event also includes a color with which to render the activity.

Simple example
--------------

This example has two tracks ``thread-1`` and ``thread-2`` and demonstrates most of the basic features::

    2010-10-21 16:45:12.014 >thread-2 blue
    2010-10-21 16:45:13.329 >thread-1 blue
    2010-10-21 16:45:13.635 <thread-1 blue
    2010-10-21 16:45:13.810 >thread-2 orange
    2010-10-21 16:45:14.010 >thread-1 orange
    2010-10-21 16:45:14.258 <thread-2 orange
    2010-10-21 16:45:15.138 >thread-1 orange
    2010-10-21 16:45:15.319 >thread-1 blue
    2010-10-21 16:45:15.502 <thread-1 blue

Input format
------------

* ``TIME >TRK COLOR`` — Start an activity of color COLOR on track TRK at time TIME (if there already is an activity on this track, finish it and start a new one instead). **Example:** ``2010-10-21 16:45:09,431 >r2b3.t5 blue``
* ``TIME <TRK COLOR`` — Finish the current activity on track TRK at time TIME.

For example::

    2010-10-21 16:45:12.014 >worker-2 blue
    2010-10-21 16:45:13.329 >worker-1 blue
    2010-10-21 16:45:13.635 <worker-1 blue
    2010-10-21 16:45:13.810 >worker-2 orange
    2010-10-21 16:45:14.010 >worker-1 orange
    2010-10-21 16:45:14.258 <worker-2 orange
    2010-10-21 16:45:15.138 >worker-1 orange
    2010-10-21 16:45:15.319 >worker-1 blue
    2010-10-21 16:45:16.512 <worker-2 blue
    2010-10-21 16:45:20.112 <worker-1 blue

When the beginning ``>`` is missing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes you process logs which start in the middle of the program's execution, so the log does not catch the beginning events of activities that were already active at the moment logging started (however, the log *does* catch their finishing events).
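Event streams in the ``>``/``<`` format above are typically produced by instrumenting a program's logging. Here is a minimal, hedged sketch in Python — the helper names (``emit``, ``log_activity``) are hypothetical and not part of splot; only the output line format comes from the description above:

```python
# Hedged sketch: producing splot's ">"/"<" events from an instrumented
# program. The helper names here are hypothetical and not part of splot;
# only the output line format comes from the documentation above.
import sys
from datetime import datetime, timedelta

def emit(out, ts, marker, track, color):
    # One event per line: timestamp, ">" or "<", track name, color.
    # %f gives microseconds; [:-3] trims them to milliseconds.
    out.write(f"{ts.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]} {marker}{track} {color}\n")

def log_activity(out, start, track, color, duration_ms):
    # A started activity is finished by a matching "<" event later.
    emit(out, start, ">", track, color)
    emit(out, start + timedelta(milliseconds=duration_ms), "<", track, color)

if __name__ == "__main__":
    t0 = datetime(2010, 10, 21, 16, 45, 12, 14000)
    log_activity(sys.stdout, t0, "thread-2", "blue", 700)
```

The resulting lines can be written to any log file and fed to **splot** directly, as long as the timestamp format matches the ``-tf`` option (see the reference below).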
**splot** can display such "phantom" activities in a color of your choice, using the ``-phantom COLOR`` option. Specifically, if a track starts with a ``<`` event instead of a ``>`` event, **splot** assumes that there was a ``>`` event with color COLOR on this track in the past.

Consider the same example as above, but cut at the beginning, as if we had a truncated log::

    2010-10-21 16:45:14.010 >worker-1 orange
    2010-10-21 16:45:14.258 <worker-2 orange
    2010-10-21 16:45:15.138 >worker-1 orange
    2010-10-21 16:45:15.319 >worker-1 blue
    2010-10-21 16:45:16.512 <worker-2 blue
    2010-10-21 16:45:20.112 <worker-1 blue

Here the first event on track ``worker-2`` is a ``<`` event, so with, say, ``-phantom gray`` **splot** draws a gray phantom activity on that track from the start of the picture up to ``16:45:14.258``.

**Automatic color generation.** Sometimes the set of possible states is not known in advance — for example, when the "state" is the ID of the client currently being serviced. In that case you can put an arbitrary string where the color would be, and **splot** will generate a color for it. Consider a log from a distributed program with many worker processes servicing tasks of several clients::

    2011-07-27 02:42:03.495 >RACK2UNIT067.4ca8 8610
    2011-07-27 02:42:03.495 >RACK4UNIT075.5b15 8610
    2011-07-27 02:42:03.496 >RACK4UNIT067.ec72 877B
    2011-07-27 02:42:03.496 >RACK4UNIT067.f3fe 877B
    2011-07-27 02:42:03.496 >RACK5UNIT017.0c21 8610
    2011-07-27 02:42:03.498 >RACK2UNIT030.9e7a 071C
    ...

So, we use track names of the form ``MACHINE.WORKERID``, and instead of specifying a color at ``>`` events we specify the client ID, asking **splot** to color-code clients for us. Here red denotes tasks that completed unsuccessfully, and the other colors (e.g. green and blue; red is excluded from the default color scheme precisely for situations like this) encode different clients. *(Yes, as this rather crazy picture might hint, the program* **did** *have lots of problems — exposing them is what* **splot** *is for.)*

.. figure:: pics/splot/splot-color-gen.png
   :width: 50%
   :align: center

   Automatic color generation with **splot**

**Color schemes.** Assume the same case as above, with worker processes servicing different clients. Assume also that worker processes can be in two modes: *regular* and *workstealing* (if they have nothing to do for their current client, they try to service tasks of some other client). We wish to depict regular tasks and tasks picked up by work-stealing in different shades — e.g. regular tasks in bright colors and work-stolen ones in pale colors — while still using color generation for both.
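Client-keyed input like the RACK log above can be produced by the same kind of instrumentation as before. A minimal Python sketch — the helper name is hypothetical, not part of splot — emitting events whose color field is a client ID, for splot's automatic color generation:

```python
# Hedged sketch (hypothetical helper, not part of splot): instead of a
# fixed color, each event carries the ID of the client being serviced,
# so splot's automatic color generation assigns one color per client,
# as in the RACK... log above.
from datetime import datetime, timedelta

def client_events(start, track, client_id, duration_ms):
    """Start/finish event lines for one task, keyed by client ID."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    end = start + timedelta(milliseconds=duration_ms)
    return [
        f"{start.strftime(fmt)[:-3]} >{track} {client_id}",   # task started
        f"{end.strftime(fmt)[:-3]} <{track} {client_id}",     # task finished
    ]

t = datetime(2011, 7, 27, 2, 42, 3, 495000)
print("\n".join(client_events(t, "RACK2UNIT067.4ca8", "8610", 250)))
```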
This can be achieved by using color schemes. Specifically, we declare two color schemes (e.g. ``bright`` and ``pale``) with the ``-colorscheme`` option described in the reference below, and color-code regular tasks by ``TIMESTAMP >WORKER CLIENT`` events using one scheme and work-stolen tasks using the other.

**Track sorting.** Tracks are sorted by name by default; with ``-sort time`` they are instead sorted by the time of their first event (see ``-sort`` in the reference below).

**Expiring activities.** Sometimes the finishing event of an activity is lost, which would produce an unrealistically long bar. The ``-expire MILLIS`` option asks **splot** to consider an activity finished if it does not finish within MILLIS milliseconds.

Reference
^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option
     - Meaning
     - Default
   * - ``-tf FORMAT``
     - Format of timestamps: as in ``strptime``, but with fractional seconds supported via ``%OS`` — will parse ``12.4039`` or ``12,4039``. Also, ``%^[+-][N]s`` will parse seconds since the epoch; for example, ``%^-3s`` is milliseconds since the epoch (N can only be 1 digit)
     - ``%Y-%m-%d %H:%M:%OS``
   * - ``-tickInterval MILLIS``
     - Ticks on the X axis will be placed every MILLIS milliseconds
     - ``1000``
   * - ``-largeTickFreq N``
     - Every N-th tick will be larger than the others
     - ``10``
   * - ``-sort SORT``
     - Sort tracks by time of first event (``-sort time``) or by track name (``-sort name``) — see "track sorting" above
     - ``name``
   * - ``-expire MILLIS``
     - Expire activities that do not finish within MILLIS milliseconds — see "expiring activities" above
     - *none (don't expire)*
   * - ``-phantom COLOR``
     - Set the phantom color, used when the first event on a track is ``<`` — see "phantom color" above
     - *none (no phantom color)*
   * - ``-fromTime TIME``
     - Clip the picture on the left (time in the format of ``-tf``, i.e. the same as in the input)
     - *none (don't clip)*
   * - ``-toTime TIME``
     - Clip the picture on the right (time in the format of ``-tf``, i.e. the same as in the input)
     - *none (don't clip)*
   * - ``-numTracks N``
     - Explicitly specify the number of tracks, for better performance on very large inputs (see section "Performance" below)
     - *none (compute from input)*
   * - ``-colorscheme SCHEME COLORS``
     - Declare a color scheme (see "Color schemes" above); can be used multiple times. SCHEME is an arbitrary string, e.g. ``pale`` or ``bright``. COLORS is a space-separated list of colors in SVG or hex notation, e.g. ``'red green 0x0000FF'``
     - *none*
   * - ``-legendWidth WIDTH``
     - Allocate WIDTH pixels for a legend to the left of the plot area
     - *none*

Gallery
^^^^^^^

The next figures show a number of pictures produced by **splot** in various real-life situations.
Most of them look really creepy and expose various kinds of performance problems in the programs whose behavior they depict. The author no longer remembers the precise causes of the problems — think of it as a horror museum exposition, and use your imagination.

.. figure:: pics/splot/gradient.png
   :width: 50%
   :align: center

   Blue: working, white: waiting. The task queue's performance was gradually becoming worse and worse, until the job completed and another job started.

.. figure:: pics/splot/creepy-startup.png
   :width: 50%
   :align: center

   Orange: working, gray: starting, white: waiting. The first time a program loads, it takes a long time; loads on the same machine take the same time — .NET DLL caching (here the sort-by-track-name option is used, currently absent until reimplemented). The gray "gradient" is actually just a moire pattern.

.. figure:: pics/splot/4-rabbits.png
   :width: 50%
   :align: center

   Colors encode which shard of a task queue was being used. Initially the shards are fast, then they run slower but smoothly. Everything's fine.

.. figure:: pics/splot/rabbit-collision.png
   :width: 50%
   :align: center

   Colors encode which shard of a task queue was being used. Green and purple have problems, yellow and orange don't. It turned out that green and purple corresponded to the same physical queue server, which had to sustain double load.

.. figure:: pics/splot/workstealing.png
   :width: 50%
   :align: center

   Gray: normal tasks, blue: tasks picked up by workstealing. After some period, workstealing begins and many more workers start processing the job — slowing down the queue, but increasing total throughput.

.. figure:: pics/splot/creepy-prefetch.png
   :width: 50%
   :align: center

   Orange: working, red: preempted and lost work. The task queue's prefetch feature was broken, leading to very strange patterns of task queue utilization.
.. figure:: pics/splot/simple-2stage-job.png
   :width: 50%
   :align: center

   A simple two-stage job: a couple of insufficiently parallel data preparation steps, then a long stream of tasks utilizing the cluster well. Two worker machines died in the process.

.. figure:: pics/splot/8threads.png
   :width: 50%
   :align: center

   Bars correspond to a web service being called from different threads. Apparently there are periods when calls take longer, and periods when the service is not called at all.

.. figure:: pics/splot/njobs-then-one.png
   :width: 50%
   :align: center

   Several jobs run concurrently and saturate the cluster. Then all but one finish, and the remaining one runs in bursts of tasks that are not parallel enough to saturate the cluster (I recall it was a 480-core cluster and 160-task bursts).

.. figure:: pics/splot/very-slow-memcached.png
   :width: 50%
   :align: center

   Orange: working, blue: fetching from memcached. We see that early calls to memcached take a ridiculous amount of time. Later calls sometimes take long too, but not that long (I recall the cause was a broken retry mechanism).

.. figure:: pics/splot/spray.png
   :width: 50%
   :align: center

   There are 1900 tracks (cluster cores) here. The tasks put into the shared queue are by far too short, so the queue (and perhaps the task producer) becomes the bottleneck. We also see that the queue feeds tasks to workers in round-robin order, and that it slows down over time. Small pauses in the queue or the task producer cost an awful amount of computing time, because everyone is waiting for them.