Recently I've been messing around with Python bytecode, and as a result I now know approximately how the Python virtual machine works at the bytecode level. I've found this quite interesting, but other than satisfying my own curiosity I've had no benefit from this knowledge. Until today, that is. I was wondering which code would be more efficient:
    a += 1
    if b is not None:
        a += 2

    if b is None:
        a += 1
    else:
        a += 3
From a style point of view I'd prefer to write the first: I find it slightly more readable since it has fewer indented blocks and it takes up one line fewer on my screen. But my gut feeling tells me the second is faster, particularly if b is not None, because in that case the first sample does two add operations instead of one. But now I can verify my gut feeling! All I need to do is compile both fragments and use the disassembler to investigate:
    >>> co1 = compile(sample1, '<stdin>', 'exec')
    >>> dis.dis(co1)
      2           0 LOAD_NAME                0 (a)
                  3 LOAD_CONST               0 (1)
                  6 INPLACE_ADD
                  7 STORE_NAME               0 (a)

      3          10 LOAD_NAME                1 (b)
                 13 LOAD_CONST               2 (None)
                 16 COMPARE_OP               9 (is not)
                 19 POP_JUMP_IF_FALSE       35

      4          22 LOAD_NAME                0 (a)
                 25 LOAD_CONST               1 (2)
                 28 INPLACE_ADD
                 29 STORE_NAME               0 (a)
                 32 JUMP_FORWARD             0 (to 35)
            >>   35 LOAD_CONST               2 (None)
                 38 RETURN_VALUE
    >>> co2 = compile(sample2, '<stdin>', 'exec')
    >>> dis.dis(co2)
      2           0 LOAD_NAME                0 (b)
                  3 LOAD_CONST               2 (None)
                  6 COMPARE_OP               8 (is)
                  9 POP_JUMP_IF_FALSE       25

      3          12 LOAD_NAME                2 (a)
                 15 LOAD_CONST               0 (1)
                 18 INPLACE_ADD
                 19 STORE_NAME               2 (a)
                 22 JUMP_FORWARD            10 (to 35)

      5     >>   25 LOAD_NAME                2 (a)
                 28 LOAD_CONST               1 (3)
                 31 INPLACE_ADD
                 32 STORE_NAME               2 (a)
            >>   35 LOAD_CONST               2 (None)
                 38 RETURN_VALUE
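To reproduce the session above, a self-contained script could look roughly like this. It is only a sketch: the names sample1 and sample2 come from the session, but their definitions were not shown, so the leading newline in each string (which would explain why the listing starts at line 2) is an assumption, and the run helper is made up for this example. Newer CPython versions use different opcode names and offsets, so the output will not match the listing above exactly.

```python
import dis

# Assumed definitions: the leading newline puts the first statement on
# line 2, which matches the line numbers in the listing above.
sample1 = """
a += 1
if b is not None:
    a += 2
"""

sample2 = """
if b is None:
    a += 1
else:
    a += 3
"""

co1 = compile(sample1, "<stdin>", "exec")
co2 = compile(sample2, "<stdin>", "exec")


def run(code, b):
    """Execute a compiled fragment with a = 0 and the given b; return a."""
    namespace = {"a": 0, "b": b}
    exec(code, namespace)
    return namespace["a"]


if __name__ == "__main__":
    dis.dis(co1)
    dis.dis(co2)
    # Both fragments compute the same values; only the bytecode differs.
    print(run(co1, None), run(co2, None))  # 1 1
    print(run(co1, 42), run(co2, 42))      # 3 3
```

The run helper is just a sanity check that the two fragments really are interchangeable before comparing their bytecode.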
My analysis of this is pretty simple: count the number of instructions for when b is None and for when b is not None.
| | b is None | b is not None |
| sample1 | 10 | 15 |
| sample2 | 11 | 10 |
So ultimately the best performance depends on whether b will be None or not. However, when b is None the difference is only one instruction in favour of the first sample (10 versus 11), while when b is not None the difference is a whole mighty 5 instructions in favour of the second (15 versus 10, counting every instruction executed in the listings above)! This would seem to confirm my gut feeling: the ugly code is better. It also makes me wonder if this is not the sort of optimisation a compiler should be doing: create the bytecode for sample2 regardless of the source code. (I'm not a compiler guy and do realise it might not be as simple as that, since Python is a dynamic language which allows you to change stuff, including executed code, at runtime, yada yada.)
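The instruction counting can also be done mechanically rather than by eye. The sketch below transcribes the two listings by hand into (offset, opcode, jump target) tuples and walks each path. The count_executed function and the tuple layout are made up for this illustration, and every instruction on the path is counted, including JUMP_FORWARD and the final LOAD_CONST and RETURN_VALUE.

```python
# (offset, opcode name, jump target or None), transcribed by hand from
# the two listings above.
SAMPLE1 = [
    (0, "LOAD_NAME", None), (3, "LOAD_CONST", None),
    (6, "INPLACE_ADD", None), (7, "STORE_NAME", None),
    (10, "LOAD_NAME", None), (13, "LOAD_CONST", None),
    (16, "COMPARE_OP", None), (19, "POP_JUMP_IF_FALSE", 35),
    (22, "LOAD_NAME", None), (25, "LOAD_CONST", None),
    (28, "INPLACE_ADD", None), (29, "STORE_NAME", None),
    (32, "JUMP_FORWARD", 35),
    (35, "LOAD_CONST", None), (38, "RETURN_VALUE", None),
]
SAMPLE2 = [
    (0, "LOAD_NAME", None), (3, "LOAD_CONST", None),
    (6, "COMPARE_OP", None), (9, "POP_JUMP_IF_FALSE", 25),
    (12, "LOAD_NAME", None), (15, "LOAD_CONST", None),
    (18, "INPLACE_ADD", None), (19, "STORE_NAME", None),
    (22, "JUMP_FORWARD", 35),
    (25, "LOAD_NAME", None), (28, "LOAD_CONST", None),
    (31, "INPLACE_ADD", None), (32, "STORE_NAME", None),
    (35, "LOAD_CONST", None), (38, "RETURN_VALUE", None),
]


def count_executed(listing, test_is_true):
    """Walk one path through a listing and count the instructions on it.

    test_is_true is the outcome of the COMPARE_OP: POP_JUMP_IF_FALSE
    jumps only when it is False, JUMP_FORWARD always jumps.
    """
    index_of = {offset: i for i, (offset, _, _) in enumerate(listing)}
    i = 0
    count = 0
    while True:
        _, opname, target = listing[i]
        count += 1
        if opname == "RETURN_VALUE":
            return count
        if opname == "JUMP_FORWARD" or (
            opname == "POP_JUMP_IF_FALSE" and not test_is_true
        ):
            i = index_of[target]
        else:
            i += 1


if __name__ == "__main__":
    # sample1 tests "b is not None", sample2 tests "b is None".
    for b_is_none in (True, False):
        print(
            "b is None:" if b_is_none else "b is not None:",
            count_executed(SAMPLE1, not b_is_none),
            count_executed(SAMPLE2, b_is_none),
        )
```

Run as a script, this prints 10 and 11 for the b is None case and 15 and 10 for the b is not None case.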
There's one more catch though: I seriously doubt every Python instruction takes the same time to execute! So let's use timeit to actually verify this. I'm omitting the trivial code, but this is the result:
    $ python3 test.py
    sample1, b is None: 0.128307104111
    sample2, b is None: 0.128338098526
    sample1, b is not None: 0.244062900543
    sample2, b is not None: 0.12109208107
To be honest, the result is exactly as speculated: the first sample is slower when b is not None, and all the others are pretty much the same. The one odd thing is that the time pretty much doubles in the bad case. This suggests the Python virtual machine is spending most of its time on the INPLACE_ADD instruction, while all the others are probably very quick.
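Since the contents of test.py are not shown, here is a minimal sketch of what such a timing script could look like. Everything in it is an assumption rather than the original code: the time_sample helper, the starting values for a and b, and the iteration count are all made up. Note also that timeit wraps the statement in a function, so a and b become local variables instead of the module-level names seen in the disassembly, and the absolute numbers will differ from machine to machine.

```python
import timeit

SAMPLE1 = """\
a += 1
if b is not None:
    a += 2
"""

SAMPLE2 = """\
if b is None:
    a += 1
else:
    a += 3
"""


def time_sample(stmt, b_setup, number=1_000_000):
    """Return the seconds taken to run stmt `number` times."""
    # The setup runs once before the timed loop: it initialises a and
    # assigns b according to b_setup (for example "b = None").
    return timeit.timeit(stmt, setup="a = 0; " + b_setup, number=number)


if __name__ == "__main__":
    cases = (("b is None", "b = None"), ("b is not None", "b = 1"))
    samples = (("sample1", SAMPLE1), ("sample2", SAMPLE2))
    for label, b_setup in cases:
        for name, stmt in samples:
            print(f"{name}, {label}: {time_sample(stmt, b_setup)}")
```

Taking the minimum of several runs with timeit.repeat would reduce noise further, but a single timeit.timeit call per case keeps the sketch short.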
Anyway, in conclusion I guess you can speculate about performance and get an idea by knowing what bytecode will be generated. But at the end of the day you'll get a simpler and better answer by just using timeit. So knowing something about Python bytecode still hasn't gained me any benefit. It was still an interesting exercise though.