Comments by gilgamash (Top 3 by date)
gilgamash, 21-Sep-11 4:06am (Deleted)
Hi,
I have to strongly disagree here. Profile, for instance, code compiled at -O2, which you would assume the compiler has optimized well, with Intel's profiling tools, and you will be surprised at the number of cache misses, branch mispredictions, and so on. And this
"They either have no insight in optimization techniques, or they haven't upgraded their compilers since the last millennium"
is arrogant and wrong.
Best regards,
G.
gilgamash, 12-Apr-11 7:36am (Deleted)
This short paper explains it well and thoroughly:
http://symbolaris.com/course/Compilers/23-cachedep.pdf
Best regards,
G.
P.S.: My comment number 4 was a rather garbled sentence, so I edited it :-)
gilgamash, 12-Apr-11 6:29am (Deleted)
You can make it faster still:
1) Replace the divisions by defining a constant 1/255 before the loops and multiplying by it. That saves at least around 20 clock cycles per iteration. Keep it in a register to avoid cache misses!
2) On Intel/AMD: SSE2 and later instructions could help a lot when computing rDest, gDest, and bDest, since those are all independent and a perfect fit for SIMD.
3) The variables av and rem are perfect candidates for cache misses; you might want to consider register variables for those, too.
4) Using loop variables inside the loop body frequently causes heavy cache misses as well; reordering where necessary might reduce that.
Otherwise: Nice and quick!
G.