I can’t find any references to anything stating that a multiply is any
more than 1 instruction on a PII/PIII/P4. Not only that, you haven’t
considered that you’ll stall one of the pipelines if the add relies on the
multiply (which it will in this case).
Well, check out
First go to page 22 (2-15) ad lookup the table 2-3. It clearly states that
multiplication has latency of 4 cycles while adds and shifts have only 1
Secondly, check the page 47 (3-23) for information on which instructions can
be paired (thus executed in paralel).
Those are the only two bits of information that I based my message on. I
don’t refer here to any specific code example. You’re right that in case of
y*w+x, the addition must wait for the mul to finish, just as it needs to
wait in case of y<<p+x. However, in a tight and unrolled loop, the compiler
might optimize it in such a way that the shift will take place at the same
time as the addition from previous iteration.
Of course it does not make sense to design a generic approach to substitute
each mul with a lots of branching, shifts and additions. If you do however
have a previous knowledge about your code, and you know that a given mul
will always be by power of 2, just use it (even though nonimmediate shift is
not pairable just as mul, it has lower latency). Shifts are cheaper by
definition and this way you give your compiler a chance to optimize things
better (even if current compiler does provide any difference, the future one
FWIW, I compiled my emulator that does lots of raster graphics, removed
all throttling, and changed shifts to multiplies. It made zero difference
(and yes, I shut off optimization and did a disassembly of the code to
ensure it wasn’t turning my multiplies into shifts) between multiplies and
shifts. This is on a 1Ghz PIII, and the graphics processing takes 60% of
overall execution time.
I agree with your results, especially as you removed optimizations. That
means in neither case did compiler try to take advantage of OoO execution,
pairing of instructions etc. For such a code there really is no difference
between shift and mul.
It is just a matter of providing the optimization routines with the
knowledge about your code that only you as a programmer possess. Writing:
You say: "this is an operation that can be done by shifting value of x"
If you write:
You say: "this is a multiplication of x by any possible number p"
If however in your code n would always be a power of 2 (it is let’s say the
GL texture size), you essentially withold this information from your
compiler. It may not make a difference on your machine, with your compiler
and on your operating system. If however sometime later somebody decides to
compile your code on some embeded system that must put a systemwide trap on
each multiplication (don’t ask me why, just imagine), this small piece of
information may provide to be crucial for the speed of the code.
There are more factors than just the instructions themselves. Depending
upon the shift, it may require reloading the cl register which can do who
knows what to the optimization capabilities of the compiler. At this point
it’s a wash.
And isn’t reloading a register a simple matter of renaming now ? I thought
it is since P4. So different part of the code occupying the pipeline may
even see different versions of CL and be happy with it. But I agree with
you saying that there is no way of telling what would optimizations do with
your code. That means for me that you should program things in such a way
that the optimizer will have the most available knowledge about your code.
If it thinks that flushing CL is a bad idea, it can always express any shift
with appropriate multiplication (I know, it’s additional xor, and setbit,
but you suggested a disaster because of grabbing CL, so it will be worth
it). It is not possible to efficiently change mul into shift without external
And anyway, we’re way out of topic here
JacekOn Wed, Jul 17, 2002 at 02:12:13PM -0700, Neil Bradley wrote:
|from: J.C.Wojdel |
| J.C.Wojdel at cs.tudelft.nl |