performance - Fast sine/cosine for ARMv7+NEON: looking for testers… -


Could somebody with access to an iPhone 3GS or a Pandora please test the following assembly routine that I just wrote?

It is supposed to compute sines and cosines really fast on the NEON vector FPU. I know that it compiles properly, but without suitable hardware I cannot test it. If you could compute a few sines and cosines and compare the results with those of sinf() and cosf(), it would really help.

Thank you!

    #include <math.h>

    /// Computes the sine and cosine of two angles.
    /// in:  angles  = two angles, expressed in radians, in the [-PI, PI] range.
    /// out: results = vector containing
    ///      [sin(angles[0]), cos(angles[0]), sin(angles[1]), cos(angles[1])]
    static inline void vsincos(const float angles[2], float results[4]) {
        static const float constants[] = {
        /* q1 */  0,                M_PI_2,           0,                M_PI_2,
        /* q2 */  M_PI,             M_PI,             M_PI,             M_PI,
        /* q3 */  4.f/M_PI,         4.f/M_PI,         4.f/M_PI,         4.f/M_PI,
        /* q4 */ -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI),
        /* q5 */  2.f,              2.f,              2.f,              2.f,
        /* q6 */  .225f,            .225f,            .225f,            .225f
        };
        asm volatile(
            // Load q0 with [angle1, angle1, angle2, angle2]
            "vldmia %1, { d3 }\n\t"
            "vdup.f32 d0, d3[0]\n\t"
            "vdup.f32 d1, d3[1]\n\t"
            // Load q1-q6 with constants
            "vldmia %2, { q1-q6 }\n\t"
            // cos(x) = sin(x + PI/2), so
            // q0 = [angle1, angle1 + PI/2, angle2, angle2 + PI/2]
            "vadd.f32 q0, q0, q1\n\t"
            // If angle + PI/2 >= PI, subtract 2*PI:
            // q0 -= (q0 >= PI) ? 2*PI : 0
            "vcge.f32 q1, q0, q2\n\t"
            "vand.f32 q1, q1, q2\n\t"
            "vmls.f32 q0, q1, q5\n\t"
            // q0 = (4/PI)*q0 - q0*abs(q0)*4/(PI*PI)
            "vabs.f32 q1, q0\n\t"
            "vmul.f32 q1, q0, q1\n\t"
            "vmul.f32 q0, q0, q3\n\t"
            "vmul.f32 q1, q1, q4\n\t"
            "vadd.f32 q0, q0, q1\n\t"
            // q0 += .225*(q0*abs(q0) - q0)
            "vabs.f32 q1, q0\n\t"
            "vmul.f32 q1, q0, q1\n\t"
            "vsub.f32 q1, q0\n\t"
            "vmla.f32 q0, q1, q6\n\t"
            "vstmia %0, { q0 }\n\t"
            :: "r"(results), "r"(angles), "r"(constants)
            : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5", "q6"
        );
    }

Just tested it on my BeagleBoard. As noted in the comments, it has the same CPU.

Your code is about 15 times faster than the C library. Well done!

I measured 82 cycles for each call of your implementation and 1260 cycles for the four C library calls. Note that I compiled with the soft-float ABI and my OMAP3 is early silicon, so each call in the C library version incurs a NEON stall of at least 40 cycles.

I have zipped the results together.

The performance-counter code will not work on most iPhones.

Hope this is what you were looking for.

