I ran a performance test on my computer (Windows 10, Intel Xeon CPU E5-1620 v3 @ 3.5 GHz) and obtained results similar to a Raspberry Pi's performance. My beard grew while waiting.

I obtained:

integer sums: 2184 Mops (megaoperations/second) as expected
double divisions: 15.6 - 18.32 Mops
double multiplications: 344 -430 Mops
double sums: 881 - 1178 Mops
float divisions: 17.3 - 19.1 Mops

Update: I tested on an i5, and divisions were still below 21 Mops.

The question is: does the Intel E5 have a coprocessor?
Can I use a compiler directive to make it run faster?
Would it run faster on an i7 processor?

What I have tried:

This is my code. Please run it on any computer; it runs fine as-is:

#include <iostream>
#include <time.h>	//clock(), time_t
#pragma warning(disable:4996) //disable deprecateds
using namespace std;


time_t start, stop;

//Call timer() with no arguments to reset the start time:
void timer(const char *title = "", int data_size = 1)
{
	stop = clock();
	if (*title)
		cout << title << " time = " << (double)(stop - start) / (double)CLOCKS_PER_SEC
		     << " = " << 1e-6 * data_size / ((double)(stop - start) / (double)CLOCKS_PER_SEC)
		     << " Mops/s" << endl;
	start = clock();
}


int main()
{
	cout << "Perform test in Release mode. Results will be wrong in debug mode" <<endl;
	int isum=0,size=100*1024*1024;
	timer();//void timer resets timer!
	for (int i=0;i<size;i++)
		isum+=i;
	timer("Time for 100 Mega int sums       ",size);
	double dsum=1.0;
	for (int i=0;i<size;i++)
		dsum=dsum/1.1111;
	timer("Time for 100 Mega double divisions",size);
	double d2=1.111;
	dsum+=0.1;
	for (int i=0;i<size;i++)
		dsum/=d2;
	timer("Time for 100 Mega double divisions-2",size);
	for (int i=0;i<size;i++)
		dsum=dsum*d2;
	timer("Time for 100 Mega double multiplications",size);
	for (int i=0;i<size;i++)
		dsum=dsum+d2;
	timer("Time for 100 Mega double sums           ",size);

	float fsum=1.0f;
	for (int i=0;i<size;i++)
		fsum=fsum/1.1111f;
	timer("Time for 100 Mega float  divisions",size);

	cout<<endl<<" Reject the following line of data (printed to force the for loops to survive compiler optimizations):"<<endl;
	cout<<isum<<dsum<<fsum<<endl;//to force for() be done on isum
	cout<<"=== END ==="<<endl;
	getchar();
	return 0;
}
Posted
Updated 25-Sep-17 20:23pm
v2
Comments
Richard MacCutchan 25-Sep-17 7:45am    
Have you looked at the Intel documentation for this processor?
Kornfeld Eliyahu Peter 25-Sep-17 9:06am    
This processor has no F16C extension, and that may explain the slow floating-point calculations...
A 4th-generation i7 has it, but so does the E5-1650, so both will probably outperform the E5-1620...
If you are looking for a CPU for a computation-intensive setup, you must do serious research (and not only in Intel's tables)...
Javier Luis Lopez 25-Sep-17 9:06am    
I am not an expert in Intel architecture; I only ran the tests. The E5 v3 has
3 integer ALUs and 2 vector ALUs (used by AVX). It also has a Sandy Bridge microarchitecture with schedulers that can parallelize vector operations, as can be seen here:
https://www.realworldtech.com/includes/images/articles/sandy-bridge-5.png
The full diagram can be seen here: https://www.realworldtech.com/includes/images/articles/sandy-bridge-7.png?x51911
So I think (I may be wrong) that Intel parallelized more vector operations at the cost of making division run serially, as in the first historic processors.

1 solution

Quote:
The question is: does the Intel E5 have a coprocessor?
Yes. All modern x86-based CPUs have a built-in x87 FPU and vector units (SSE, AVX).
Quote:
Can I use a compiler directive to run it faster?
Yes, but it depends on the compiler and on whether you can accept reduced error handling and non-strict IEEE compliance. Most compilers have some kind of fast-math option for this purpose. Depending on the CPU used, you can also enable the use of scalar SSE instructions instead of the x87 FPU.
Quote:
Would it run faster on an i7 processor?
That depends on the clock rate of the x87 FPU / x86 CPU (for SSE). Each instruction requires a defined number of clock cycles.

Floating-point divisions require far more clock cycles than additions or multiplications (8 to 20 times more than multiplications). This applies to all kinds of FPUs, not only x86 types. They should be avoided when high performance is required (e.g. by multiplying by the reciprocal value inside loops).
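The reciprocal trick mentioned above can be sketched like this (hypothetical helper functions, not part of the original benchmark; results may differ in the last bits, because x * (1/d) is not an exact IEEE division):

```cpp
#include <cassert>
#include <cmath>

// Dividing every element by the same value: one division per element.
double sum_div(const double *v, int n, double d)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i] / d;        // slow: a division inside the loop
    return sum;
}

// Same result (up to rounding) with the division hoisted out of the loop.
double sum_mul(const double *v, int n, double d)
{
    double inv = 1.0 / d;       // one division, computed once
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i] * inv;      // fast: multiplications only
    return sum;
}
```

With fast-math enabled, many compilers perform this transformation automatically; without it, they must preserve the exact IEEE result and keep the division.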

From the Intel® 64 and IA-32 Architectures Optimization Reference Manual
Quote:
Assembly/Compiler Coding Rule 4. (M impact, M generality) Favor SSE floating-point instructions over x87 floating point instructions.
Assembly/Compiler Coding Rule 5. (MH impact, M generality) Run with exceptions masked and the DAZ and FTZ flags set (whenever possible).
Tuning Suggestion 5. Use the perfmon counters MACHINE_CLEARS.FP_ASSIST to see if floating exceptions are impacting program performance
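The DAZ/FTZ suggestion from the manual can be applied from C++ through the SSE control register (a minimal sketch, assuming an x86 target with the standard intrinsic headers; the helper name is mine):

```cpp
#include <cassert>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (DAZ)

// Set the FTZ and DAZ bits in the SSE control register (MXCSR), as the
// Intel optimization manual suggests. Subnormal inputs and results are
// then treated as zero, avoiding slow microcode assists, at the cost of
// strict IEEE behaviour near the subnormal range.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

Note that these flags only affect SSE/AVX arithmetic (the default for doubles on x86-64), not the legacy x87 FPU.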
Comments
Javier Luis Lopez 26-Sep-17 5:40am    
But in older processors multiplications, sums and divisions required the same number of clock cycles, because the "old" floating-point ALUs performed everything in one cycle.
The impact now is:
Multiplications: 2 times slower than sums
Divisions: 100 times slower!!
It seems that divisions are performed in software, as in the first processors (which needed coprocessors).
This puts the processor's floating-point division performance below that of a $40 Raspberry Pi.
Some engineering programs could become unusable.

So, are there any processors whose floating-point division performance is similar to that of sums?
Jochen Arndt 26-Sep-17 7:04am    
"But in older processors multiplications, sums and divisions required the same clock cycles as long as "old" floating point ALUs performed all in one cycle."

Please tell me which CPU.

It was always roughly the same relation between add, mul, and div (with some deviation) for floating point. There was never (and probably never will be) an FPU that requires the same number of cycles for add/mul and div.

It applies to the Pi too (it has an FPU, but it provides only the basic operations and has no exponential or trigonometric functions).
Throughput and latency for double precision with the ARM used by the Pi 1:
FADD, FSUB: 1 + 8
FMUL, FMAC: 2 + 9
FDIV, FSQRT: 29 + 33

Using a software implementation for div would make it even slower. Division is just not as simple as add and mul.

"Then, there are any processors with floating point divisions similar to sums performance?"

I don't know one and I think there is none and never will be.
Javier Luis Lopez 26-Sep-17 10:38am    
Unfortunately I lost my data from 10 years ago, so I have to withdraw my claim about old CPUs.
The problem is perhaps that in my code the divide operations cannot be pipelined, because each division must finish before the next one can start.

I also tested this line and got 520 Mops:
for (int i=0;i<size/20;i++)
dsum=1.1/(dsum+2.2/(dsum+2.3/(dsum+2.4/(dsum+2.5/(dsum+2.6/(dsum+2.7/(dsum+2.8/(dsum+2.9/(dsum+2.1/(dsum+2.2/(dsum+2.3/(dsum+2.4/(dsum+2.5/(dsum+2.6/(dsum+2.7/(dsum+2.8/(dsum+2.9/(dsum+3.1/(dsum)))))))))))))))))));

I also read about the Xeon architecture; it says that division performance IS VERY POOR, as can be seen in this recommended article: https://gmplib.org/~tege/x86-timing.pdf
And also here: https://stackoverflow.com/questions/4125033/floating-point-division-vs-floating-point-multiplication in Peter Cordes's reply (though it explains only the integer case).
The best of all was the AMD Zen architecture, with double the throughput.
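The dependency-chain point can be illustrated with a sketch (hypothetical loop bodies, not the original benchmark): a single accumulator serializes the divisions, while several independent accumulators let a partially pipelined divider overlap them, on CPUs whose divide throughput is better than its latency:

```cpp
#include <cassert>

// Serial chain: each division consumes the previous result, so the
// divider sits idle waiting out the full latency of every operation.
double div_chain(int n)
{
    double s = 1.0e9;
    for (int i = 0; i < n; i++)
        s /= 1.0001;
    return s;
}

// Four independent chains: successive divisions do not depend on each
// other, so the divider can overlap them (throughput-bound instead).
double div_parallel(int n)
{
    double a = 1.0e9, b = 1.0e9, c = 1.0e9, d = 1.0e9;
    for (int i = 0; i < n; i += 4) {
        a /= 1.0001;
        b /= 1.0001;
        c /= 1.0001;
        d /= 1.0001;
    }
    return a + b + c + d;
}
```

Timing the two variants with the timer() from the question should show the gap between division latency and division throughput on a given CPU.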
Javier Luis Lopez 26-Sep-17 13:21pm    
" Please tell me which CPU "
At least I found results in an old drive backup:
Pentium 2 4 GHz: double prec: (d+d)/d+cte*d 49.6 Mops
Pentium D 3Ghz : 59.8 Mops

I think the result was very fast, perhaps because the operations could be pipelined.
Jochen Arndt 27-Sep-17 2:48am    
My question regarding the CPU referred to the claim of the same clock cycles for mul and div, and the single clock cycle.

All I can suggest (like Intel) is using SSE instead of the FPU.
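One SSE-based option for floats, sketched below (the function name is mine; this assumes an x86 target): the reciprocal estimate instruction gives about 12 bits of precision, and one Newton-Raphson step brings it near full single precision, at far higher throughput than a true divide.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE: _mm_rcp_ss, _mm_mul_ss, _mm_sub_ss

// Approximate a / b via the SSE reciprocal estimate (rcpss, ~12 bits)
// refined by one Newton-Raphson step (~22 bits). Not an exact IEEE
// division: the last few bits of the result may differ.
float fast_div(float a, float b)
{
    __m128 vb = _mm_set_ss(b);
    __m128 r  = _mm_rcp_ss(vb);     // r ~ 1/b (rough estimate)
    // One Newton-Raphson refinement: r = r * (2 - b*r)
    r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vb, r)));
    return a * _mm_cvtss_f32(r);
}
```

The packed form (_mm_rcp_ps) applies the same idea to four floats at once; there is no double-precision reciprocal estimate in SSE, so doubles still need a real divide.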

Thank you for accepting my solution, and sorry for the late reply (my DSL at home has been down since Friday).

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
