Somewhat OT: need help with assembly

Is there anybody with assembly/MMX experience who can help optimize my SDL bump-mapping routine? I have most of it implemented; just one part is too slow. Sorry that this is somewhat off-topic. Please reply privately.

Here’s the code if you want to see it (it’s a mess):
http://www.wolsi.com/~dwl/bumptest.cc
--
@Daniel_W_Lemon