...where time becomes a loop
Last year, I
mentionned that it was time for the Python community to embrace
dtrace. I've gotten questions left and right, at user groups, through email etc as to what is dtrace and how it ties in with Python.
This week, a few posts on the Argentinian and Venezuelan Python lists on debugging Python and a total absence of a mention of dtrace and I knew I had to do a writeup. But before we get into the details, let's step back a bit.
Party like it's 1999 2004
Back in the 1990s I was using
Povray (there is a Python API) to do photo quality rendering of
made to order products. Eventually, I had to switch to OpenGL, C++, Sun Studio and hardware acceleration in order to keep up with the demand (over
20,000 during normal business hours and there are less than 30,000 seconds during that period of time). A few years later, at peak hours I was serving on the web over
100 renders per second.
Even if I had switched to OpenGL on that particular system, I continued working with Povray in other areas, particularly to design optical systems and build stuff that required visual quality over quantity. While Povray under Windows was fast enough, it felt much slower under Solaris and Linux (
whereas my own code ran much faster on Solaris than Windows).
dtrace: First Contact
I posted the following to the solarisx86 Yahoo group in November of 2004:
I finally ran Povray 3.5 benchmark 1.02 on the exact same hardware and
here are the results:
Hardware:
Dell GX260
Pentium 4 2.4GHz
512MB ram
Hitachi 40 GB 7200 rpm
ATI Radeo 7500
Under Windows 2000 SP4, official Povray 3.5 win32 release:
Time for Parse: 2 seconds
Time for Photon: 54 seconds
Time for Trace: 36 min 47 seconds
Total time: 37 min 43 seconds
Under Solaris 10 B69, Blastwave Povray 3.5 x86
Time for Parse: 7 seconds
Time for Photon: 1 min 12 seconds
Time for Trace: 53 min 14 seconds
Total time: 54 min 33 seconds
Al Hopper suggested
dtrace. I knew what it was (
at the time, a Solaris only feature, now also available on Mac OS/X, FreeBSD, SmartOS, OpenIndiana and other IllumOS based OSes, and now Linux with dtrace4linux), but I hadn't taken the time to use it in real world cases. So I looked into it.
Instrumentation TNG

Which one is my blood pressure?? 
Here is what I posted back then:
I finally had a few minutes to play around with povray and dtrace this
afternoon. I followed the suggestion made by Adam Leventhal on his Blog
to run:
# dtrace n 'pid$target:::entry{ @[probefunc] = count() }' p <processid>
(replace <processid> by the pid of povray)
So what I did is run povray, get its process id with ps, then run the
above. Once it rendered line 1, I ctrlc. dtrace then spit out what I
needed to know. I dont have any reference to compare as to what
optimisation was done exactly on the windows build. However, running the
above dtrace command on that process does reveal something:
I know it is spending a lot of time in DNoise / Noise, because it's
called a bazillion times. Actually almost 10million times for the pair 
the only other call that is called as much is memcpy, haven't
investigated yet from where, but there might be an opportunity to
combine memcpy. It also points out that a fast FSB and faster memory will definitely pull in front for equal cpu.
Anyway, back to Povray, looking at texture.cpp (line 169 and following):
/*****************************************************************************/
/* Platform specific faster noise functions
support */
/* (Profiling revealed that the noise functions can take up to 50%
of */
/* all the time required when rendering and current compilers
cannot */
/* easily optimise them efficiently without some help from
programmers!) */
/*****************************************************************************/
#if USE_FASTER_NOISE
#include "fasternoise.h"
#ifndef FASTER_NOISE_INIT
#define FASTER_NOISE_INIT()
#endif
#else
#define OriNoise Noise
#define OriDNoise DNoise
#define FASTER_NOISE_INIT()
#endif
Haha! Fasternoise.h (only found in the Windows source, not the Unix
source) includes
/*****************************************************************************/
/* Intel SSE2
support */
/*****************************************************************************/
#ifndef WIN_FASTERNOISE_H
#define WIN_FASTERNOISE_H
#ifdef USE_INTEL_SSE2
int SSE2ALREADYDETECTED = 0 ;
DBL OriNoise(VECTOR EPoint, TPATTERN *TPat) ;
void OriDNoise(VECTOR result, VECTOR EPoint) ;
#include "emmintrin.h"
#include "intelsse2.h"
#undef ALIGN16
#define ALIGN16 __declspec(align(16))
#endif
#endif
BTW, with DTrace it only took me a few minutes total (including running
povray for 3 minutes) to identify the culprit. Under windows it would
have taken me hours.
So on Solaris x86, need to add USE_INTEL_SSE2 and USE_FASTER_NOISE as compile switches and add fasternoise and the various other includes from
the windows source to the unix source.
And that is how a one liner dtrace script helped in debugging my performance problem with Povray. The nice thing is that you didn't even need the source to know what exactly what was going on in the code, in real time, without step by step debugging.
Instrumenting Python
So, what is the Python connection? When running a dtrace equipped version of Python, you can use dtrace on your Python scripts. On Solaris, this has been available
since 2007 (in OpenSolaris).
Unfortunately on other OSes, it hasn't been
the case. Jesus Cea has been
trying to get John Levon's patches integrated with CPython since before 2.7 and 3.2 came out. Not sure what is needed to make this happen, but it is
long overdue. At least some distro builders should have the patched Python available, but it really needs to be incorporated into the official build.
Anyway if you look at a Solaris 11 system, it has Python 2.6 as the default, and it is ready for dtrace:
$ dtrace lP python*
ID PROVIDER MODULE FUNCTION NAME
1044 python1905 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1045 python1905 libpython2.6.so.1.0 dtrace_return functionreturn
1046 python1939 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
1047 python1939 libpython2.6.so.1.0 dtrace_entry functionentry
1048 python1939 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1049 python1939 libpython2.6.so.1.0 dtrace_return functionreturn
1050 python1945 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
1083 python7640 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1084 python7640 libpython2.6.so.1.0 dtrace_return functionreturn
1100 python11695 libpython2.6.so.1.0 dtrace_entry functionentry
1101 python11695 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1102 python11695 libpython2.6.so.1.0 dtrace_return functionreturn
1103 python11699 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
1214 python11693 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
1215 python11693 libpython2.6.so.1.0 dtrace_entry functionentry
1219 python11699 libpython2.6.so.1.0 dtrace_entry functionentry
1220 python11699 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1221 python11699 libpython2.6.so.1.0 dtrace_return functionreturn
1226 python11693 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
1227 python11693 libpython2.6.so.1.0 dtrace_return functionreturn
1228 python11695 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
2248 python1905 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
2249 python1905 libpython2.6.so.1.0 dtrace_entry functionentry
2250 python23832 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
2251 python23832 libpython2.6.so.1.0 dtrace_entry functionentry
2260 python7640 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
2261 python7640 libpython2.6.so.1.0 dtrace_entry functionentry
2315 python23832 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
2316 python23832 libpython2.6.so.1.0 dtrace_return functionreturn
7670 python2936 libpython2.6.so.1.0 dtrace_entry functionentry
7671 python2936 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
7672 python2936 libpython2.6.so.1.0 dtrace_return functionreturn
7750 python14523 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
7751 python14523 libpython2.6.so.1.0 dtrace_entry functionentry
7752 python14523 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
7753 python14523 libpython2.6.so.1.0 dtrace_return functionreturn
12339 python1945 libpython2.6.so.1.0 dtrace_entry functionentry
12340 python1945 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
12341 python1945 libpython2.6.so.1.0 dtrace_return functionreturn
12345 python2936 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
12347 python1219 libpython2.6.so.1.0 PyEval_EvalFrameEx functionentry
12348 python1219 libpython2.6.so.1.0 dtrace_entry functionentry
12349 python1219 libpython2.6.so.1.0 PyEval_EvalFrameEx functionreturn
12350 python1219 libpython2.6.so.1.0 dtrace_return functionreturn
The probes are specific to Python. In the original dtrace call I had used for debugging a C++ binary (Povray), I was using the
pid provider and
entry probe:
# dtrace n 'pid$target:::entry{ @[probefunc] = count() }' p <processid>
But we are no longer in 2004 and I'm not interested in the performance of Povray right now, I just want to figure out how many checksums my
pingpong.py script is doing (a little module to keep a tab on my machines and their latencies), and what else is going on at the top.
So, this time I'll modify it to use the
python provider instead of the
pid provider and using the
functionentry probe (still doing a count):
# dtrace qZn 'python$target:::functionentry{ @[copyinstr ( arg1 )] = count() }' c ./pingpong.py
[...]
register 10
abstractmethod 15
__new__ 17
<genexpr> 35
<module> 48
exists 49
S_IFMT 53
S_ISDIR 53
isdir 56
makepath 100
normcase 100
abspath 113
isabs 113
join 113
normpath 113
ping_pong 1000
checksum 4000
close 4000
do_one 4000
fileno 4000
receive_one_ping 4000
send_one_ping 4000
__init__ 4007
With 1000 ping_pong() I was expecting 3000 checksums. Ah, I see I'm sending 4000 pings, so apparently I have a
for that is not properly bounded (on purpose, to illustrate an example of what even something as simple as this can tell us).
To infinity and beyond
But that is not even touching the tip of the iceberg. How about creating
heatmaps (node.js and dtrace in this case)? To do something like that with Python and dtrace, a good starting point is
Brian Cantrill's Thijs Metsch's
pythondtrace on Pypi (
sorry for the wrong attribution, Brian's name was at the top of the page on Pypi).
You should also check out the following tutorial
http://dtracehol.com/#Exercise_11 that was part of a tutorial session on dtrace at Java One last year (yep, Python at Java One).
François
@f_dion