Slowing down animation

I am using a 3-clip animation, which uses a frame counter to select the clip to render, like this:

Code:

m_frame++;
// Loop the animation
if( m_frame >= 3 )
{
    m_frame = 0;
}
Draw( offSet, 100, indicator, screen, &clips[ m_frame ] );

My program runs at 25 fps, and so does this animation, which is too fast.
How can I set it up to animate at, for example, one clip every half a second?

dekyco


For one frame every 500 milliseconds:

m_frame = (SDL_GetTicks() / 500) % 3;
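The one-liner can be wrapped in a small helper so the period and clip count are not hard-coded. This is only a sketch: `frame_from_ticks`, `period_ms` and `frame_count` are names invented here, not part of SDL, and the `ticks` argument is assumed to come from `SDL_GetTicks()` in a real program.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an SDL function): choose the animation frame
 * from elapsed milliseconds. In an SDL program, pass SDL_GetTicks(). */
unsigned frame_from_ticks(uint32_t ticks, uint32_t period_ms,
                          unsigned frame_count)
{
    assert(period_ms > 0 && frame_count > 0);
    return (unsigned)((ticks / period_ms) % frame_count);
}
```

With a 500 ms period and 3 clips, the frame stays at 0 until tick 499, becomes 1 at tick 500, and wraps back to 0 at tick 1500.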

Without getting too complicated, you could add another variable that
controls when you increment the animation:

// frameIndex ensures you only change the animation every 1/2 second
// (about 13 frames at 25 fps)
frameIndex++;
if (frameIndex > 12)
{
    frameIndex = 0;

    // m_frame now just tells you which animation state is being rendered
    m_frame++;
    if (m_frame > 2) m_frame = 0;
}
Draw(…, &clips[m_frame]);

Make sense?

John


SDL mailing list
SDL at lists.libsdl.org
http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org


while (!done)
{
    // Poll for keypresses for 30 milliseconds
    // and then update the screen; this should
    // give us a frame rate of roughly 33 fps (1000 / 30).
    //
    // For different frame rates, adjust the '30'
    // to be the number of milliseconds you desire.
    //
    ticks = SDL_GetTicks();
    ticks += 30;
    while (SDL_GetTicks() < ticks)
    {
        // Process keyPresses etc
        // if ESC pressed, done = true
    }
    Draw(clips[m_frame]);
    m_frame = ++m_frame > 2 ? 0 : m_frame;
}


What you have here still depends on a specific framerate. If you run this
code on two different computers, you’ll have the animation run at two
different speeds.
You really do need to use a timer or clock ticks.
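To make the framerate dependence concrete, here is a rough sketch that counts how many animation steps each approach takes in a given time. The function names are made up for illustration, and the counter version assumes a step every 13 rendered frames as in the code above.

```c
#include <assert.h>

/* Counter approach: one animation step per `frames_per_step` rendered
 * frames, so the result depends on how fast the machine renders. */
unsigned steps_by_counter(unsigned fps, unsigned seconds,
                          unsigned frames_per_step)
{
    assert(frames_per_step > 0);
    unsigned rendered = fps * seconds;   /* frames drawn so far */
    return rendered / frames_per_step;
}

/* Clock approach: one animation step per `period_ms` of elapsed time;
 * the rendering rate does not appear at all. */
unsigned steps_by_clock(unsigned elapsed_ms, unsigned period_ms)
{
    assert(period_ms > 0);
    return elapsed_ms / period_ms;
}
```

After one second, the counter version has taken 1 step at 25 fps but 3 steps at 50 fps, while the clock version has taken 2 steps regardless of framerate.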

This will work just fine on any computer that isn’t ridiculously slow:
m_frame = (SDL_GetTicks() / 500) % 3;
Draw( offSet, 100, indicator, screen, &clips[ m_frame ] );

Or, if you want to avoid drawing it more often than necessary, do this:
m_frame = (SDL_GetTicks() / 500) % 3;
if (m_frame != old_m_frame) {
    Draw( offSet, 100, indicator, screen, &clips[ m_frame ] );
}
old_m_frame = m_frame;

Try this instead:

ticks = SDL_GetTicks(); /* find our starting point */
while (running) { /* avoid a ! by using “running” instead of “done” */

    ticks += 30; /* find when the next frame needs to be drawn */

    /* handle events until it’s time to draw the next frame */
    while ((now = SDL_GetTicks()) < ticks &&
           SDL_WaitEventTimeout(&event, ticks - now)) {

        /* handle various close signals... */
        switch (event.type) {
        case SDL_WINDOWEVENT:
            if (event.window.event == SDL_WINDOWEVENT_CLOSE) {
                running = false;
            }
            break;
        case SDL_QUIT:
            running = false;
            break;
        case SDL_KEYDOWN:
            if (event.key.keysym.sym == SDLK_ESCAPE) {
                running = false;
            }
            break;
        }

    }

    /* draw the frame */
    Draw(clips[m_frame]);
    m_frame = (m_frame + 1) % 3; /* ?: is expensive, so are ifs. Use % instead. */
}



I don’t know what kind of microprocessor you’re using, but I’m fairly
confident you can’t write a modulus function in machine language that’s
going to outclock a simple if, then, else structure. And since all code
boils down to machine language, I sincerely question the validity of
this statement:
/* ?: is expensive, so are ifs. Use % instead. */

Yes, this one line does the trick:

Code:

m_frame = (SDL_GetTicks() / 500) % 3;

Thank you very much Kenneth!

dekyco

?: and if are branching statements. % is not.

They may run in fewer cycles (though not likely), but they may also
cause the processor to dump any cached instructions depending on the
result of the associated condition. Also, % always takes the same
amount of time to run, a branching statement takes more or less time
depending on the condition (though this doesn’t matter much since
timing is handled elsewhere).

Anyway, at least on IA32, 80x86, etc. % is a single instruction.
It’s handled by the DIV opcode.

If you want to argue about something, a better target might be running
vs. !done, since any decent compiler will produce the same number of
opcodes, and use the same number of cycles to run them. I use running
just in case you’re dealing with a monumentally stupid compiler in
debug mode.



I’m confident that the whole thing can be written in high level C code
using if statements and the entire logic sequence will be complete
before the modulus operator even returns.

> If you want to argue about something, a better target might be running
> vs. !done, since any decent compiler will produce the same number of
> opcodes, and use the same number of cycles to run them. I use running
> just in case you’re dealing with a monumentally stupid compiler in
> debug mode.

In this case, you’re also barking up a non existent tree since both
logical constructs will produce nearly identical code syntax wise, and
totally identical code clock cycle wise.

I will give you style points for moving the initial assignment of
’ticks’ outside the loop. This will generate efficiency and lock the
frame rate to exactly 30 ms (though it may suffer from the fact that
absolute granularity to the ms is not guaranteed, so it’s possible that
the loop may fall behind such that the process becomes event starved and
never catches up).

It’s also not a bad thing to remember, this is not about efficiency,
it’s about slowing the framerate.
It will be good for you to realize that the modulus operator is very
expensive, if’s and ?: scream in comparison.

FYI - SDL_gfx has a “framerate manager” for such tasks:
http://www.ferzkopp.net/Software/SDL_gfx-2.0/Docs/html/_s_d_l__framerate_8c.html

–Andreas


> I’m confident that the whole thing can be written in high level C code using
> if statements and the entire logic sequence will be complete before the
> modulus operator even returns.
>
> In this case, you’re also barking up a non existent tree since both logical
> constructs will produce nearly identical code syntax wise, and totally
> identical code clock cycle wise.

That is what I just said, yes.

> I will give you style points for moving the initial assignment of ’ticks’
> outside the loop. This will generate efficiency and lock the frame rate to
> exactly 30 ms (though it may suffer from the fact that absolute granularity
> to the ms is not guaranteed, so it’s possible that the loop may fall behind
> such that the process becomes event starved and never catches up).

This depends on how much you’re trying to draw per frame. If you’re
not drawing much then it won’t matter since the CPU will easily keep
pace, but when making large changes under heavy load it can bog down.
When CPU load drops off, it will catch up, but this isn’t pretty.
You can add an additional check to make sure it only increments ticks
if now-ticks < 30 which will eliminate this issue.

> It’s also not a bad thing to remember, this is not about efficiency, it’s
> about slowing the framerate.
> It will be good for you to realize that the modulus operator is very
> expensive, if’s and ?: scream in comparison.

Let’s compare the code you posted to what I posted myself:

m_frame = ++m_frame > 2 ? 0 : m_frame;

vs.

m_frame = (m_frame + 1) % 3;

First, let’s expand these to one instruction per line in C:

++m_frame;
if (m_frame > 2) {
    m_frame = 0;
}
else {
    m_frame = m_frame; /* hopefully optimized out… */
}

vs.

register unsigned tmp;
tmp = m_frame + 1;
tmp %= 3;
m_frame = tmp;

/* this could be optimized to:
++m_frame;
m_frame %= 3;
*/

now let’s get the instruction count for each:

5 instructions, 4 processed per branch (or 4 instructions one third of
the time and 3 instructions two thirds of the time if optimized)

vs.

3 instructions (or 2 instructions if optimized)

but this is in C, we can’t get cycles and latency unless we use
assembly language, so in IA32 assembly…

INC [m_frame]
CMP [m_frame], 2
JLE else_label
MOV [m_frame], 0
JMP end_label
else_label:
MOV [m_frame], [m_frame] ; this is basically a NOP, though it sets some flags
end_label:
;…

; optimized, this would be…
; MOV EAX, [m_frame] ; we’ll ignore this instruction since m_frame is
;                    ; likely already in a register
; CMP EAX, 2
; JLE else_label
; XOR EAX, EAX
; MOV [m_frame], EAX ; ditto
; else_label:
; ;…

vs.

MOV EAX, [m_frame]
INC EAX
XOR EDX,EDX
MOV EBX,3
DIV EBX
MOV [m_frame], EDX

; this really doesn’t optimize much, unless EDX/EBX have previously
; been set to appropriate values, but we can trim off one instruction:
; three:
; dd 3
; ; …
; MOV EAX,[m_frame]
; INC EAX
; XOR EDX,EDX
; DIV [three]
; MOV [m_frame], EDX
;
; or if m_frame is already in EAX:
; three:
; dd 3
; ; …
; INC EAX
; XOR EDX,EDX
; DIV [three]
; MOV EAX,EDX

now the instruction count:

6 instructions, with 4 executed two thirds of the time and 5 executed
one third of the time (optimized, this is 5 instructions, with 3
executed two thirds of the time and 5 executed one third of the time,
or 3 instructions, with 2 executed two thirds of the time and 3
executed one third of the time if m_frame is in EAX)

vs.

6 instructions (optimized this is 5 instructions, or 4 instructions if
m_frame is in EAX)

At this point, you’re winning, but we haven’t figured out cycles and
latency yet…

A conditional jump takes longer when it actually jumps than when
execution falls through to the next instruction. There’s two reasons
for this:

  1. the CPU has to do more work to do a jump than a nop
  2. jumping will clear any cached instructions, so the CPU will have
    to wait for the cache to refill.

On old CPUs (8086, 8088, etc), the difference is 1 cycle vs. 4 cycles.
On newer CPUs, the difference is much larger and depends on the size
of the cache, though much of this won’t show up right away, but will
instead result in the instructions following the jump executing slower
than they otherwise would.

Unfortunately, for modern processors, it’s no longer possible to
simply look up the cycle count for each instruction and add them up
since the time each instruction takes depends not only on the
instruction itself, but also the instructions before it. For example,
XOR EAX,EAX followed by INC EAX will execute slower than XOR EDX,EDX
followed by INC EAX, since in the second case the two instructions are
independent and can be executed in parallel.

So undetermined? Maybe, but the result really depends on what code is
executed next. If your next instruction clears the cache anyway, then
your conditional jump doesn’t matter quite so much, otherwise avoid it
and use DIV (or AND) instead.

Actually, just do this:
ticks += (((SDL_GetTicks() - ticks) / 30) + 1) * 30;

Or this (a 32 ms period, so a bit mask can do the rounding):
ticks += ((SDL_GetTicks() - ticks) & ~31) + 32;

Or this:
ticks = (SDL_GetTicks() & ~31) + 32;

Instead of this:
ticks += 30;
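As a rough illustration of the first formula, the helper below computes the next deadline from the one that was just reached. The name `next_deadline` and its parameters are hypothetical; the sketch assumes `now` is at or past the old deadline and ignores the wraparound of the 32-bit millisecond counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the deadline that was just reached (`ticks`)
 * and the current time (`now`), both in milliseconds, return the next
 * deadline on the same period_ms grid that lies strictly after `now`. */
uint32_t next_deadline(uint32_t ticks, uint32_t now, uint32_t period_ms)
{
    assert(period_ms > 0);
    assert(now >= ticks);   /* assumption: the old deadline has passed */
    return ticks + ((now - ticks) / period_ms + 1) * period_ms;
}
```

If the loop is on time (deadline 100, now 110, period 30), the next deadline is 130, the same as `ticks += 30`. If it has fallen behind (deadline 100, now 200), the next deadline jumps to 220, so the missed frames are skipped rather than replayed in a burst.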



The line of code was just to show the concept. As programmers, we don’t
keep concepts; we optimize as we go.

The higher clock counts assume we’re working with memory, so let’s go
with those:

INC m_frame      (1/3 clock)
CMP m_frame, 2   (1/2 clock)
JLE FINIS        (1 clock on no branch, 3 on branch)
MOV m_frame, 0   (1 clock)
FINIS:

Worst case performance: 9 clock cycles. All scenarios: 2 @ 9 + 1 @ 7 = 25
clock cycles consumed to traverse all three states.

vs.


We’ll take your heavily optimized version:

INC EAX        (1 clock)
XOR EDX, EDX   (1 clock)
DIV [three]    (40 clocks) (why use ‘three’ as an offset into the data
               segment?)
MOV EAX, EDX   (2 clocks)

A constant 44 clocks, or 132 clocks to traverse all three states.

Any way you slice it (and I don’t know why you avoid immediate values),
the DIV instruction alone consumes 40 clock cycles, so in the worst
case, the previous code will traverse all three states before DIV
completes.


You can do that if you really think slow code without branches
outperforms fast code with branches, but in this case we really don’t
need to guess, you can actually execute each algorithm a half million
times and check the stop watch.

> The line of code was just to show the concept, as programmers, we don’t keep
> concepts, we optimize as we go
>
> The higher clock counts assume we’re working with memory, so let’s go with
> those:

What I’ve been talking about is instruction cache, not memory cache.

> INC m_frame      (1/3 clock)
> CMP m_frame, 2   (1/2 clock)
> JLE FINIS        (1 clock on no branch 3 on branch)
> mov m_frame, 0   (1 clock)
> FINIS:
>
> Worst case performance, 9 clock cycles. all scenarios 2 @ 9 + 1 @ 7 25 clock
> cycles consumed to traverse all three states.

Each instruction is at least one cycle unless run asynchronously,
which cannot be done here.
Again, for modern processors, we can’t just count clock cycles. They
don’t quite work that way anymore.

> We’ll take your heavily optimized version
> INC EAX        (1 clock)
> XOR EDX, EDX   (1 clock)
> DIV [three]    (40 clocks) (why using ‘three’ as an offset into the data
>                segment?)
> MOV EAX, EDX   (2 clock)
>
> constant 44 clocks or 132 clocks to traverse all three states.
>
> Any way you slice it (and I don’t know why you avoid immediate values) The
> DIV instruction alone consumes 40 clock cycles, so in the worst case, the
> previous code will traverse all three states times before DIV completes.

There is no immediate version of DIV. Anyway, using [three] isn’t
really very efficient. The register version probably would be faster
despite the additional instruction.
Also, last I checked, for processors old enough that you actually can
count clock cycles, DIV was 4 cycles, not 40. Might you be thinking
of FDIV or DIVPD, etc?

> > So undetermined? Maybe, but the result really depends on what code is
> > executed next. If your next instruction clears the cache anyway, then
> > your conditional jump doesn’t matter quite so much, otherwise avoid it
> > and use DIV (or AND) instead.
>
> You can do that if you really think slow code without branches outperforms
> fast code with branches, but in this case we really don’t need to guess, you
> can actually execute each algorithm a half million times and check the stop
> watch.

True. You have to be careful, though: with the obvious way of testing
it, your loop would clear the instruction cache anyway. You’d need a
fairly sizable chunk of code following the line you’re testing.
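A stopwatch test along those lines might look like the sketch below. The function names are invented for illustration, and the numbers it produces depend heavily on the compiler, optimization flags and CPU, so treat it as a starting point rather than a verdict. Calling through a function pointer adds overhead of its own, and the coarse resolution of `clock()` may require far more than half a million iterations.

```c
#include <assert.h>
#include <time.h>

/* The two ways of advancing the frame index discussed in this thread. */
unsigned advance_branch(unsigned frame)
{
    ++frame;
    if (frame > 2)
        frame = 0;
    return frame;
}

unsigned advance_mod(unsigned frame)
{
    return (frame + 1) % 3;
}

/* Run `step` `iterations` times and return the CPU time used, in seconds.
 * The volatile variable keeps the compiler from deleting the loop. */
double time_steps(unsigned (*step)(unsigned), unsigned long iterations)
{
    assert(step != NULL);
    volatile unsigned frame = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; ++i)
        frame = step(frame);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_steps(advance_branch, n)` with `time_steps(advance_mod, n)` for a large `n` gives a first impression; both functions produce the same frame sequence, so only the timing differs.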


The first computers ever built were tested against a Chinese man with an
abacus. The computer could add, subtract and multiply faster than the
Chinese man, but try as they might, the man with the abacus could
consistently divide faster than the computer.

70 years later, computers can still add subtract and multiply many times
over while still waiting for a division to complete. Division in 4 clock
cycles? Intel wishes.

I’m not worried about the cache. My concern is that when you give a
fellow programmer advice like /* ?: is expensive, so are ifs. Use %
instead. */, you owe it to them (if not yourself) not to have your facts
backwards.

Actually, after some checking, you’re right.
On Itanium and up, Intel apparently recommends either using a bit
shifting algorithm for 8-bit or converting to floating point,
dividing, then converting back for larger values. Odds are pretty
good that the compiler won’t actually generate a DIV instruction at
all. The algorithms used for division on both AMD and Intel
(Goldschmidt and Newton-Raphson respectively) are both rather time
consuming and not really well suited to integer operations.
It’s likely that the code used for % or / will itself contain
conditional expressions or loops (like these ones:
http://www.bearcave.com/software/divide.htm
ftp://download.intel.com/software/opensource/divsqrt.pdf ).
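One common way a compiler can avoid a divide instruction when the divisor is a constant is to multiply by a precomputed reciprocal and shift. The sketch below shows the idea for % 3 on 32-bit unsigned values; whether a particular compiler emits exactly this sequence is an assumption, and `mod3_by_multiply` is a name made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* n % 3 for 32-bit unsigned n without a divide instruction.
 * 0xAAAAAAAB is (2^33 + 1) / 3, so (n * 0xAAAAAAAB) >> 33 equals n / 3
 * for every 32-bit n. */
uint32_t mod3_by_multiply(uint32_t n)
{
    uint32_t quotient = (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);
    uint32_t result = n - quotient * 3u;
    assert(result < 3u);
    return result;
}
```

The remainder then costs a widening multiply, a shift, a small multiply and a subtract, with no branch and no division.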

So, my remaining argument is “readability”, which more or less boils
down to “mine’s prettier”. At the very least you should add
parentheses around the condition, and you really should split it into
two statements to avoid m_frame = m_frame;. ie:

m_frame = (++m_frame > 2) ? 0 : m_frame;

or, better:

++m_frame;
if (m_frame > 2) m_frame = 0;



You’re right that the ternary expression is tougher to read and
suffers from an unnecessary self assignment. If it were my project, I’d
go with:

// For readability:
//
Draw(Clips[m_frame++]);
if (m_frame > 2)
    m_frame = 0;

// For best performance:
//
Draw(Clips[++m_frame]);
if (m_frame > 1)
    m_frame = -1;

m_frame = 0; optimizes better than m_frame = -1; (XOR instead of MOV),
and you’re better off using unsigned where possible.

You would get at least slightly better performance like this:

Draw(Clips[m_frame]);
++m_frame;
if (m_frame > 2)
m_frame = 0;

Though your readable version would probably still be slower than your
performance version due to the temporary value from the post
increment.
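To check that the post-increment and pre-increment variants draw the same clips, the sketch below records the index each would pass to Draw. The function names are hypothetical, and Draw itself is replaced by a store into an array. Note that the pre-increment variant only works with a signed counter that starts at -1.

```c
#include <assert.h>
#include <stddef.h>

/* Record which clip index each variant would draw over `count` frames.
 * `out` must have room for `count` entries. */
void indices_post_increment(int *out, int count)
{
    assert(out != NULL && count >= 0);
    int m_frame = 0;
    for (int i = 0; i < count; ++i) {
        out[i] = m_frame++;        /* Draw(Clips[m_frame++]); */
        if (m_frame > 2)
            m_frame = 0;
    }
}

void indices_pre_increment(int *out, int count)
{
    assert(out != NULL && count >= 0);
    int m_frame = -1;              /* must start at -1 for this variant */
    for (int i = 0; i < count; ++i) {
        out[i] = ++m_frame;        /* Draw(Clips[++m_frame]); */
        if (m_frame > 1)
            m_frame = -1;
    }
}
```

Both fill the array with 0, 1, 2, 0, 1, 2, and so on, so the choice between them is about style and the signed/unsigned trade-off rather than behaviour.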


Kenneth Bull wrote:

> True. Gotta be careful though since with the obvious way of testing
> that your loop would clear the instruction cache anyway. You’d need a
> fairly sizable chunk of code following the line you’re testing.

There’s been mention of these branches (conditions) causing loss of some
cache, which can be partly true depending how loosely you use the term
cache (and how far you get from typical practice). But it is bordering
on totally wrong to think that a missed branch would cause clearing or
flushing of what is commonly known as the instruction cache (which is
many thousands of instructions in size usually). In simple modern
processors, the penalty for a missed branch is a loss of some of the
pipelined instructions. This is nowhere near the same as purging an
instruction cache.
In the modern x86 case there are penalties as dispatched and
speculatively executed micro-ops get purged. These penalties aren’t
usually more than a few cycles and don’t involve purging any cache.
Finally, these penalties are mitigated by branch prediction. It’s
incorrect to say that a branch penalty is paid when the jump (branch) is
taken (as seems to have been suggested earlier). A branch penalty is
paid when the branch is mispredicted.

> A conditional jump takes longer when it actually jumps than when
> execution falls through to the next instruction. There’s two reasons
> for this:

This just isn’t correct by my reading.

> 1. the CPU has to do more work to do a jump than a nop

The OOO CPU may have taken the jump in the normal course of matters,
assisted by a branch target buffer. If the branch is not mispredicted,
it makes no difference which way the branch went.

> 2. jumping will clear any cached instructions, so the CPU will have
>    to wait for the cache to refill.

I think it is misleading or at least confusing to use the term "caching"
for pipelining. On modern OOO implementations (certainly x86), the
various buffers do not get purged wholesale for mispredictions either.

Also, as has been mentioned, while multiplication can be parallelized
quite easily (at increasing chip real estate), division cannot.