View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0024285 | Open CASCADE | OCCT:Foundation Classes | public | 2013-10-24 09:47 | 2015-10-16 17:01 |
Reporter | Assigned To | bugmaster | |||
Priority | high | Severity | minor | ||
Status | closed | Resolution | fixed | ||
Product Version | 6.6.0 | ||||
Target Version | 6.9.0 | Fixed in Version | 6.9.0 | ||
Summary | 0024285: Updates of PLib::EvalPolynomial for code acceleration | ||||
Description | Changing the loop nesting in PLib::EvalPolynomial may make the code execute faster. | ||||
Steps To Reproduce | perf bspline intersect boolean bcut_complex Q9 boolean bsection P4 | ||||
Tags | No tags attached. | ||||
Test case number | Not needed | ||||
|
Branch CR24285 added |
|
No remarks, please test |
|
Dear BugMaster,
Branch CR24285 (and products from GIT master) was compiled on Linux and Windows platforms and tested.
SHA-1: 1f4674920d52faedb25eb1dce34facd985de5e2e
Number of compiler warnings:
occt component: Linux: 319 (323 on master), Windows: 0 (0 on master)
products component: Linux: 189 (189 on master), Windows: 287 (287 on master)
Regressions/Differences: No regressions/differences
Testing cases: Not needed
Testing on Linux:
Total MEMORY difference: 353301752 / 356295124
Total CPU difference: 42535.01000000177 / 40672.21000000072
Testing on Windows:
Total MEMORY difference: 407352400 / 406499704
Total CPU difference: 27747.078125 / 34773.5
No differences in images were found by testdiff. |
|
Waiting for OLA |
|
Branch CR24285_3 has been created by azv. SHA-1: e1eaf5e85f3443e7e9338d7b4f48fe45cf731a2b Detailed log of new commits: Author: azv Date: Tue Jan 20 17:06:03 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration 1. Loop nesting in functions PLib::EvalPolynomial and PLib::NoDerivativeEvalPolynomial was changed 2. Avoided pointer arithmetic 3. Modern compilers are given an opportunity to generate SSE instructions automatically |
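The loop reordering mentioned in item 1 can be sketched as follows. This is only an illustration under stated assumptions, not the code from the branch: the function name `evalPolyDegreeOuter`, the `std::vector` interface, and the degree-major coefficient layout (`coeffs[k * dim + d]` holds the coefficient of u^k for component d) are all hypothetical. With the degree loop outside, the inner loop over components has no dependency between its iterations and uses plain indexing, which is the kind of loop a compiler may be able to vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the OCCT signature): evaluates `dim` polynomials
// of the same degree at parameter `u` with Horner's scheme.
// Assumed layout: coeffs[k * dim + d] is the coefficient of u^k for component d.
inline void evalPolyDegreeOuter(double u, int degree, int dim,
                                const std::vector<double>& coeffs,
                                std::vector<double>& result)
{
  assert(degree >= 0 && dim > 0);
  const std::size_t n = static_cast<std::size_t>(dim);
  assert(coeffs.size() == static_cast<std::size_t>(degree + 1) * n);

  // Start from the highest-degree coefficients.
  const std::size_t top = static_cast<std::size_t>(degree) * n;
  result.assign(n, 0.0);
  for (std::size_t d = 0; d < n; ++d) {
    result[d] = coeffs[top + d];
  }

  // Outer loop: degree. Inner loop: components, independent of each other.
  for (int k = degree - 1; k >= 0; --k) {
    const std::size_t base = static_cast<std::size_t>(k) * n;
    for (std::size_t d = 0; d < n; ++d) {
      result[d] = result[d] * u + coeffs[base + d];
    }
  }
}
```

For example, with coefficients {1, 2, 3, 4, 5, 6}, degree 2 and dimension 2, the two polynomials are 1 + 3u + 5u^2 and 2 + 4u + 6u^2, and evaluating at u = 2 fills `result` with 27 and 34.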
|
Results of the performance test (perf/bspline/intersect) for different compilation parameters are given below.
=====
Machine configuration:
Hardware: Intel Core i5-3470 3.2GHz, 8GB RAM
OS: Windows 7 Professional x64 SP1
Compiler: Microsoft Visual C++ 17.00.50727.1 x64 (Visual Studio 2012)
=====
Abbreviations:
Build No1 (BM1): master (version Dec, 26, 2014), disabled optimization (/Od) for project TKMath
Build No2 (BM2): master (version Dec, 26, 2014), maximize speed (/O2) for TKMath
Build No3 (BM3): master (version Dec, 26, 2014), maximize speed (/O2) and AVX enabled (/arch:AVX) for TKMath
Build No4 (BO1): branch CR24285_3, disabled optimization (/Od) for project TKMath
Build No5 (BO2): branch CR24285_3, maximize speed (/O2) for TKMath
Build No6 (BO3): branch CR24285_3, maximize speed (/O2) and AVX enabled (/arch:AVX) for TKMath
=====
Results (sec.): the perf/bspline/intersect test case was run 5 times and the average time was calculated
Run | BM1 | BM2 | BM3 |
---|---|---|---|
1 | 91.7753883 | 38.9690498 | 36.6290348 |
2 | 91.0421836 | 37.4558401 | 36.6446349 |
3 | 91.0421836 | 37.3622395 | 36.5354342 |
4 | 91.0889839 | 37.7210418 | 36.7226354 |
5 | 91.1825845 | 37.6898416 | 36.8006359 |
Avg | 91.22626478 | 37.83960256 | 36.66647504 |

Run | BO1 | BO2 | BO3 |
---|---|---|---|
1 | 78.936506 | 36.8474362 | 34.3982205 |
2 | 79.1393073 | 36.8318361 | 33.2438131 |
3 | 79.7789114 | 36.7538356 | 33.1190123 |
4 | 79.092507 | 36.9566369 | 33.4622145 |
5 | 79.0613068 | 36.5822345 | 33.2126129 |
Avg | 79.2017077 | 36.79439586 | 33.48717466 |
=====
Relative average speedup:
BM2/BM1: 58.52%
BM3/BM2: 3.1%
BO2/BO1: 53.54%
BO3/BO2: 8.99%
BO1/BM1: 13.18%
BO2/BM2: 2.76%
BO3/BM3: 8.67% |
|
Dear Andrey, Mikhail, Please review branch CR24285_3. |
|
Have you been able to observe any acceleration on this branch? As for me, the first test showed a performance degradation (59 min vs. 56 on master). Will try more... |
|
I suppose you compiled the x86 version, didn't you? I have observed that the performance of the x86 build (with SSE2 enabled) fluctuates too much. During testing you may get different results: on my machine I get an average speedup of only 0.42% (on the perf/bspline/intersect test case). So, I propose using the x64 build for performance checking because it gives more stable results. The results of full non-regression testing of both x86 and x64 builds did not change much on my machine. |
|
No, I built the x64 version, naturally, with the vc10 compiler. Sorry, I overlooked your performance report above... I will check the results on my system more carefully and report here this evening. |
|
Sorry, I was wrong about the degradation: it looks like a time variation on one test run, and subsequent runs showed almost the same time as master. That is, I see no variation of time in general. On test perf bspline intersect, the times are: master: 28.8 sec CR24285_2: 25.8 sec CR24285_3: 26.5 sec |
|
I would like to know what share of that test's time is spent in PLib::EvalPolynomial. |
|
According to Intel Amplifier, PLib::EvalPolynomial takes 25% of the total time of the perf/bspline/intersect test on the current master. |
|
Well, 25% of 28.8 = 7.2 sec. Gain is 28.8-26.5 = 2.3 sec. So, the method was sped up by 2.3/7.2 = 32%. According to the statements of Roman and Istvan, they sped it up by 1000% and more. So, the target of this improvement was not hit. |
|
Mikhail, your calculations look very strange to me. If you calculate acceleration as a ratio to the original value (as you did), then you get 32%, the fraction of the original time spent by the accelerated algorithm. That is, three times less time is spent (68% acceleration). Note that with that approach you cannot get more than 100% acceleration. If you calculate acceleration as a ratio to the new value (where you can get 1000% or more), you should calculate it as 7.2/2.3 = 310%. |
|
You are right. All in all, 310% is not 1000%. |
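One possible way to restate the arithmetic from this exchange is sketched below. It rests on two assumptions taken from the notes above: that the 25% profiler share applies to the 28.8 sec master run, and that the whole 2.3 sec gain comes from PLib::EvalPolynomial. The struct and function names are hypothetical and exist only for this example.

```cpp
#include <cassert>

// Hypothetical helper for illustration only.
struct FunctionSpeedup {
  double secondsBefore;     // time spent in the function before the change
  double secondsAfter;      // time spent in the function after the change
  double reductionPercent;  // time saved, relative to secondsBefore
  double ratio;             // secondsBefore / secondsAfter
};

// totalBefore, totalAfter: whole-test CPU time in seconds.
// share: fraction of totalBefore spent in the function (0 < share <= 1).
// Assumption: the entire difference in total time is due to this function.
inline FunctionSpeedup estimateFunctionSpeedup(double totalBefore,
                                               double totalAfter,
                                               double share)
{
  assert(totalBefore > 0.0 && share > 0.0 && share <= 1.0);
  FunctionSpeedup s;
  s.secondsBefore = totalBefore * share;
  s.secondsAfter = s.secondsBefore - (totalBefore - totalAfter);
  assert(s.secondsAfter > 0.0);
  s.reductionPercent =
      100.0 * (s.secondsBefore - s.secondsAfter) / s.secondsBefore;
  s.ratio = s.secondsBefore / s.secondsAfter;
  return s;
}
```

With the figures quoted in the notes (28.8 sec on master, 26.5 sec on CR24285_3, 25% share), this gives about 7.2 sec before and 4.9 sec after, that is, a reduction of roughly 32% or a before/after ratio of roughly 1.47 under these assumptions.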
2015-02-14 19:28 manager |
eval_poly_template_metaprogram.zip (4,544 bytes) |
|
I suggest we try the variant of the evaluator implemented by Istvan Csanady (see http://dev.opencascade.org/index.php?q=node/1043 , archive with his sources is attached, eval_poly_template_metaprogram.zip). It uses template metaprogramming and, according to Istvan, yields higher acceleration than plain C++ code. In order to try this on desktop, the intrinsics used in the code need to be adapted for the platform. |
|
Branch CR24285_azn has been created by azn. SHA-1: 6b46853f59c2907d55bfa825bdd4ca9297a4cf41 Detailed log of new commits: Author: azn Date: Tue Feb 17 14:39:04 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration SSE2 & template implementation have been integrated into PLib::EvalPolynomial. |
|
Branch CR24285_azn has been updated by azn. SHA-1: ecfb84ee8479ee8bed8b787a78e7eba5c178132f Detailed log of new commits: Author: azn Date: Thu Mar 5 14:23:15 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration - Full template refactoring has been performed. - AVX instructions have been integrated. |
|
Branch CR24285_azn has been updated by azn. SHA-1: fc22d2906a133ba08d76fa4db51def5458d5d31e Detailed log of new commits: Author: azn Date: Wed Mar 11 09:00:14 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration - Change order of loops. - Using SSE2 instructions. |
|
I have compared three versions of the improvement (from branches CR24285, CR24285_3, and CR24285_azn) on current master, running the test bspline intersect 3 times on each build. Here are the CPU times, sec:
master: 29.; 29.2; 29.1
CR24285: 25.6, 25.8, 25.7
CR24285_3: 25.8; 25.7; 25.8
CR24285_azn: 27.; 26.75;
Then I additionally used OSD_PerfMeter to see how many calls are made to EvalPolynomial vs. NoDerivEvalPolynomial:
NoDerivEvalPolynomial : 102788324 4.35 0.04
EvalPolynomial : 129180814 10.36 0.08
Note that (a) this measurement is made on version _azn (based on templates made by Istvan Csanady), and (b) the measurement severely affects performance (total time grows up to 42 sec). Nevertheless, it shows that both functions are called with almost equal frequency.
Finally, I propose that version 3 should be taken. |
|
Statistics on calls of EvalPolynomial with different values of DerivativeRequest and Dimension, on test bspline intersect:
Perf meter results | enters | seconds | microsec/enter |
---|---|---|---|
EP:1,g | 39003481 | 2.84 | 0.07 |
EP:1,3 | 47607306 | 1.78 | 0.04 |
EP:1,1 | 25502944 | 0.83 | 0.03 |
EP:1,9 | 8730145 | 0.47 | 0.05 |
EP:1,6 | 9200905 | 0.47 | 0.05 |
EP:2,g | 5435 | 0.02 | 2.87 |
EP:2,3 | 4868 | 0.00 | 0.00 |
EP:2,1 | 2370 | 0.00 | 0.00 |
EP:2,9 | 973 | 0.00 | 0.00 |
EP:2,6 | 900 | 0.00 | 0.00 |

and for test boolean bsection P4 which is accelerated by 25% by this fix:
Perf meter results | enters | seconds | microsec/enter |
---|---|---|---|
EP:1,g | 4924084 | 1.62 | 0.33 |
EP:1,3 | 7812651 | 0.42 | 0.05 |
EP:1,6 | 189456 | 0.00 | 0.00 |
EP:2,g | 2149101 | 1.06 | 0.49 |
EP:2,3 | 2168582 | 0.16 | 0.07 |
EP:2,6 | 38663 | 0.00 | 0.00 |
|
Branch CR24285_4 has been created by abv. SHA-1: 7c38fa1081ab47fba268f0006e114ebc4bf3bf65 Detailed log of new commits: Author: azv Date: Tue Jan 20 17:06:03 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration Functions PLib::EvalPolynomial and PLib::NoDerivativeEvalPolynomial are refactored to allow generation of faster code: 1. Iteration by degree is made in outer loop 2. Avoided pointer arithmetic 3. Recursive templates are used to expand loop by dimension in specific cases (1-15) |
|
I have pushed a new version of the fix to branch CR24285_4, please review. This version is based on CR24285_3, with the fragments specialized for a particular dimension of the vector implemented by a recursive template. Use of templates allows covering more special cases, and also optimizing function NoDerivativeEvalPolynomial() in the same way, while keeping the same overall amount of code.
Result of my testing vs. master is (for most notable cases):
CPU boolean bcut_complex Q9: 6.6768428 / 10.1244649 [-34.05%]
CPU boolean bsection P4: 6.3648408 / 9.9684639 [-36.15%]
CPU bugs moddata_2 bug6862_3: 10.3116661 / 12.1212777 [-14.93%]
CPU bugs moddata_2 bug6862_4: 10.2648658 / 12.0588773 [-14.88%]
CPU bugs moddata_2 bug6862_6: 24.6793582 / 30.0769928 [-17.95%]
CPU de step_3 F2: 313.3748088 / 361.5791178 [-13.33%]
CPU de step_4 I1: 217.8865967 / 253.1428227 [-13.93%]
...
CPU of test perf bspline intersect is 25.1-25.4 sec, i.e. better than other versions. Overall CPU difference by all cases vs. master is reported as [-2%] |
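The idea of expanding the loop over the dimension with a recursive template can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from branch CR24285_4: all names (`HornerStep`, `evalPolyFixedDim`, `evalPolyGeneric`, `evalPoly`) are hypothetical, the coefficients are assumed to be stored degree-major (`coeffs[k * dim + d]` is the coefficient of u^k for component d), derivatives are not computed, and only dimensions 1 to 3 are dispatched to the specialized code, while the note above mentions a wider range of special cases.

```cpp
#include <cassert>
#include <cstddef>

// One Horner step for components 0..N-1. The recursion on N is resolved at
// compile time, so no runtime loop over the dimension remains.
template <int N>
struct HornerStep {
  static void apply(double u, const double* coeffs, std::size_t base,
                    double* result) {
    HornerStep<N - 1>::apply(u, coeffs, base, result);
    result[N - 1] = result[N - 1] * u + coeffs[base + N - 1];
  }
};

// End of the recursion: nothing left to do.
template <>
struct HornerStep<0> {
  static void apply(double, const double*, std::size_t, double*) {}
};

// Evaluation for a dimension known at compile time; degree is the outer loop.
template <int Dim>
void evalPolyFixedDim(double u, int degree, const double* coeffs,
                      double* result) {
  const std::size_t top = static_cast<std::size_t>(degree) * Dim;
  for (int d = 0; d < Dim; ++d) {
    result[d] = coeffs[top + static_cast<std::size_t>(d)];
  }
  for (int k = degree - 1; k >= 0; --k) {
    HornerStep<Dim>::apply(u, coeffs, static_cast<std::size_t>(k) * Dim,
                           result);
  }
}

// Fallback with an ordinary inner loop for any other dimension.
inline void evalPolyGeneric(double u, int degree, int dim,
                            const double* coeffs, double* result) {
  const std::size_t n = static_cast<std::size_t>(dim);
  const std::size_t top = static_cast<std::size_t>(degree) * n;
  for (std::size_t d = 0; d < n; ++d) result[d] = coeffs[top + d];
  for (int k = degree - 1; k >= 0; --k) {
    const std::size_t base = static_cast<std::size_t>(k) * n;
    for (std::size_t d = 0; d < n; ++d) {
      result[d] = result[d] * u + coeffs[base + d];
    }
  }
}

// Runtime dispatch to the specialized versions.
inline void evalPoly(double u, int degree, int dim, const double* coeffs,
                     double* result) {
  assert(degree >= 0 && dim > 0);
  switch (dim) {
    case 1: evalPolyFixedDim<1>(u, degree, coeffs, result); break;
    case 2: evalPolyFixedDim<2>(u, degree, coeffs, result); break;
    case 3: evalPolyFixedDim<3>(u, degree, coeffs, result); break;
    default: evalPolyGeneric(u, degree, dim, coeffs, result); break;
  }
}
```

Because the same template serves every fixed dimension, adding another special case costs one more `case` line rather than another hand-written function, which is in line with the remark above about keeping the overall amount of code the same.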
|
Reviewed without remarks. Please test. |
|
When reporting results of testing, please include results of testdiff on the test cases listed in 0024285:0040031 |
|
I have tested the effect of the fix in Debug mode. Surprisingly, performance has also improved:
Test case | master | CR24285_4 |
---|---|---|
boolean bcut_complex Q9 | 56.4 sec | 44.5 sec |
boolean bsection P4 | 54.5 sec | 12.9 sec |
bugs moddata_2 bug6862_6 | 236 sec | 38 sec |
|
Dear BugMaster,
Branch CR24285_4 from occt git-repository (and master from products git-repository) was compiled on Linux, MacOS and Windows platforms and tested in Release mode.
SHA-1: 7c38fa1081ab47fba268f0006e114ebc4bf3bf65
Number of compiler warnings:
occt component: Linux: 18 (18 on master), Windows: 0 (0 on master)
products component: Linux: 4 (4 on master), Windows: 0 (0 on master)
Regressions/Differences: No regressions/differences
Testing cases: Not needed
Testing on Linux:
occt component:
Total MEMORY difference: 95201625 / 94523880 [+0.72%]
Total CPU difference: 52416.46999999925 / 52229.55999999946 [+0.36%]
CPU boolean bcut_complex Q9: 24.92 / 31.98 [-22.08%]
CPU boolean bsection P4: 24.69 / 28.68 [-13.91%]
CPU de step_3 F2: 1282.14 / 1454.54 [-11.85%]
CPU de step_4 I1: 887.69 / 982.29 [-9.63%]
products component:
Total MEMORY difference: 23754993 / 23685903 [+0.29%]
Total CPU difference: 17585.62 / 17782.939999999973 [-1.11%]
Testing on Windows:
occt component:
Total MEMORY difference: 57155174 / 57165566 [-0.02%]
Total CPU difference: 15625.855765098884 / 16283.477980599077 [-4.04%]
CPU boolean bcut_complex Q9: 7.9092507 / 12.2616786 [-35.50%]
CPU boolean bsection P4: 7.4412477 / 11.8248758 [-37.07%]
CPU bugs moddata_2 bug6862_3: 12.2148783 / 14.2584914 [-14.33%]
CPU bugs moddata_2 bug6862_4: 12.1836781 / 14.3364919 [-15.02%]
CPU bugs moddata_2 bug6862_6: 29.2033872 / 36.0050308 [-18.89%]
CPU de step_3 F2: 372.6551888 / 433.4019782 [-14.02%]
CPU de step_4 I1: 258.6652581 / 304.6387528 [-15.09%]
products component:
Total MEMORY difference: 15561014 / 15564896 [-0.02%]
Total CPU difference: 6210.836612799964 / 6585.207812599981 [-5.69%]
There are no differences in images found by testdiff. |
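The bracketed percentages in the report appear to be the relative difference of the first value against the second. The sketch below reproduces that arithmetic under the assumption that the first number is the tested branch and the second is master; the function name is hypothetical.

```cpp
#include <cassert>

// Relative difference in percent; negative means the branch used less
// CPU time (or memory) than master.
// Assumption: values are reported as "branch / master".
inline double relativeDifferencePercent(double branch, double master)
{
  assert(master > 0.0);
  return 100.0 * (branch - master) / master;
}
```

For instance, `relativeDifferencePercent(7.9092507, 12.2616786)` gives about -35.5, which agrees with the value reported for boolean bcut_complex Q9 on Windows.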
|
Some remarks: I had the biggest performance gains in cases where the polynomials I was evaluating were of low order (degree 4-6). I believe that was because the compiler was able to optimize and place most or all of the data in the SIMD units/registers, thus eliminating almost every memory access. Also try compiling with optimization set to -Ofast (fastest, aggressive optimizations) with Clang; in my case that allowed the compiler to optimize a lot more compared to -O3. |
|
Branch CR24285 has been deleted by inv. SHA-1: 1f4674920d52faedb25eb1dce34facd985de5e2e |
|
Branch CR24285_1 has been deleted by inv. SHA-1: 751c824ac9c6055a38f7e67f8e01298ae4e809fe |
|
Branch CR24285_2 has been deleted by inv. SHA-1: 09aaf89911592e15edd6b72d57a05b6c1d58d27c |
|
Branch CR24285_3 has been deleted by inv. SHA-1: e1eaf5e85f3443e7e9338d7b4f48fe45cf731a2b |
|
Branch CR24285_4 has been deleted by inv. SHA-1: 7c38fa1081ab47fba268f0006e114ebc4bf3bf65 |
|
Branch CR24285_azn has been deleted by inv. SHA-1: fc22d2906a133ba08d76fa4db51def5458d5d31e |
|
Branch CR24285_azn_test has been created by azn. SHA-1: 3ab7cd619bb5af261d05a21cbacdaf4336ca0ef9 Detailed log of new commits: Author: azn Date: Tue Feb 17 14:39:04 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration SSE2 & template implementation (by Istvan Csanady) have been integrated into PLib::EvalPolynomial. |
|
Branch CR24285_Istvan has been created by azn. SHA-1: 3ab7cd619bb5af261d05a21cbacdaf4336ca0ef9 No new revisions were added by this update. |
|
Branch CR24285_azn has been created by azn. SHA-1: 881bc61f39e7bac99b76e36898283b2db38363ba Detailed log of new commits: Author: azn Date: Tue Feb 17 14:39:04 2015 +0300 0024285: Updates of PLib::EvalPolynomial for code acceleration - Template implementation. - Using SSE2 instructions. |
|
Branch CR24285_Istvan has been deleted by kgv. SHA-1: 3ab7cd619bb5af261d05a21cbacdaf4336ca0ef9 |
|
Branch CR24285_azn has been deleted by kgv. SHA-1: 881bc61f39e7bac99b76e36898283b2db38363ba |
|
Branch CR24285_azn_test has been deleted by kgv. SHA-1: 3ab7cd619bb5af261d05a21cbacdaf4336ca0ef9 |
occt: master d721c8eb 2015-01-20 14:06:03
Committer: bugmaster Details Diff |
0024285: Updates of PLib::EvalPolynomial for code acceleration Functions PLib::EvalPolynomial and PLib::NoDerivativeEvalPolynomial are refactored to allow generation of faster code: 1. Iteration by degree is made in outer loop 2. Avoided pointer arithmetic 3. Recursive templates are used to expand loop by dimension in specific cases (1-15) |
Affected Issues 0024285 |
|
mod - src/PLib/PLib.cxx | Diff File | ||
add - tests/perf/bspline/intersect | Diff File | ||
mod - tests/perf/grids.list | Diff File |
Date Modified | Username | Field | Change |
---|---|---|---|
2013-10-24 09:47 | | New Issue | |
2013-10-24 09:47 | | Assigned To | => azv |
2013-10-24 09:51 | | Note Added: 0026248 | |
2013-10-24 09:51 | | Assigned To | azv => abv |
2013-10-24 09:51 | | Status | new => resolved |
2013-10-28 18:07 | | Note Added: 0026308 | |
2013-10-28 18:07 | | Assigned To | abv => bugmaster |
2013-10-28 18:07 | | Status | resolved => reviewed |
2013-10-29 06:51 | | Assigned To | bugmaster => mkv |
2013-10-29 11:44 | | Assigned To | mkv => azv |
2013-10-29 11:44 | | Status | reviewed => assigned |
2013-10-29 11:57 | | Assigned To | azv => mkv |
2013-10-29 11:57 | | Status | assigned => resolved |
2013-10-29 11:57 | | Status | resolved => reviewed |
2013-10-30 08:15 | | Note Added: 0026345 | |
2013-10-30 08:16 | | Test case number | => Not needed |
2013-10-30 08:16 | | Assigned To | mkv => bugmaster |
2013-10-30 08:16 | | Status | reviewed => tested |
2013-12-19 11:50 | | Note Added: 0027258 | |
2013-12-19 11:50 | | Status | tested => feedback |
2013-12-19 11:53 | | Assigned To | bugmaster => abv |
2014-04-04 17:42 | | Target Version | => 6.8.0 |
2014-09-11 09:23 | | Target Version | 6.8.0 => 7.1.0 |
2015-01-20 17:12 | git | Note Added: 0036272 | |
2015-01-20 17:12 | | Assigned To | abv => azv |
2015-01-20 17:12 | | Status | feedback => assigned |
2015-01-20 17:37 | | Note Added: 0036274 | |
2015-01-20 17:39 | | Note Edited: 0036274 | |
2015-01-20 17:41 | | Note Edited: 0036274 | |
2015-01-20 17:43 | | Note Added: 0036276 | |
2015-01-20 17:43 | | Assigned To | azv => abv |
2015-01-20 17:43 | | Status | assigned => resolved |
2015-01-20 17:43 | | Steps to Reproduce Updated | |
2015-01-21 07:14 | | Note Added: 0036283 | |
2015-01-21 07:33 | | Note Added: 0036284 | |
2015-01-21 07:38 | | Note Edited: 0036284 | |
2015-01-21 08:17 | | Note Added: 0036285 | |
2015-01-23 06:51 | | Note Added: 0036424 | |
2015-01-23 09:23 | | Note Added: 0036428 | |
2015-01-27 07:11 | | Note Added: 0036682 | |
2015-01-27 10:47 | | Note Added: 0036688 | |
2015-01-27 11:28 | | Note Added: 0036692 | |
2015-01-27 11:30 | | Note Edited: 0036692 | |
2015-01-27 13:23 | | Note Added: 0036703 | |
2015-02-14 19:28 | | File Added: eval_poly_template_metaprogram.zip | |
2015-02-14 19:32 | | Note Added: 0037543 | |
2015-02-14 19:32 | | Assigned To | abv => azn |
2015-02-14 19:32 | | Status | resolved => assigned |
2015-02-17 12:04 | | Note Edited: 0037543 | |
2015-02-20 15:02 | git | Note Added: 0037748 | |
2015-03-05 14:24 | git | Note Added: 0038151 | |
2015-03-12 16:26 | git | Note Added: 0038299 | |
2015-04-18 22:04 | | Note Added: 0040028 | |
2015-04-19 07:39 | | Note Added: 0040029 | |
2015-04-19 09:44 | | Steps to Reproduce Updated | |
2015-04-19 18:35 | git | Note Added: 0040030 | |
2015-04-19 19:02 | | Note Added: 0040031 | |
2015-04-19 19:02 | | Assigned To | azn => azv |
2015-04-19 19:02 | | Priority | low => high |
2015-04-19 19:02 | | Status | assigned => resolved |
2015-04-19 19:02 | | Target Version | 7.1.0 => 6.9.0 |
2015-04-20 08:16 | | Note Added: 0040034 | |
2015-04-20 08:16 | | Assigned To | azv => bugmaster |
2015-04-20 08:16 | | Status | resolved => reviewed |
2015-04-20 08:52 | | Note Added: 0040035 | |
2015-04-20 11:58 | | Project | Internal => Open CASCADE |
2015-04-20 13:51 | | Note Added: 0040046 | |
2015-04-20 16:16 | | Relationship added | related to 0026110 |
2015-04-21 16:30 | | Assigned To | bugmaster => mkv |
2015-04-22 17:08 | | Note Added: 0040113 | |
2015-04-22 17:08 | | Assigned To | mkv => bugmaster |
2015-04-22 17:08 | | Status | reviewed => tested |
2015-04-25 17:37 | bugmaster | Changeset attached | => occt master d721c8eb |
2015-04-25 17:37 | bugmaster | Status | tested => verified |
2015-04-25 17:37 | bugmaster | Resolution | open => fixed |
2015-04-28 11:25 | | View Status | private => public |
2015-04-28 12:01 | Istvan Csanady | Note Added: 0040321 | |
2015-05-14 15:28 | | Status | verified => closed |
2015-05-14 15:30 | | Fixed in Version | => 6.9.0 |
2015-05-14 16:25 | git | Note Added: 0040941 | |
2015-05-14 16:25 | git | Note Added: 0040942 | |
2015-05-14 16:25 | git | Note Added: 0040943 | |
2015-05-14 16:25 | git | Note Added: 0040944 | |
2015-05-14 16:25 | git | Note Added: 0040945 | |
2015-05-14 16:25 | git | Note Added: 0040946 | |
2015-05-20 11:04 | git | Note Added: 0041300 | |
2015-05-20 11:05 | git | Note Added: 0041302 | |
2015-05-20 11:40 | git | Note Added: 0041305 | |
2015-10-16 17:01 | git | Note Added: 0047109 | |
2015-10-16 17:01 | git | Note Added: 0047110 | |
2015-10-16 17:01 | git | Note Added: 0047111 |