Fixed precision math question

Question

I'm working in C++, but the language doesn't really matter here.

My brain isn't bending the right way. I'm thinking through a fixed-point numeric library and I can't figure out whether I can simply get away with a fixed shift to position the fractional part. I don't know if the math works.

C++

uint32_t val=1;
val<<=16; // turn into fixed point 16.16
uint32_t res = val>>16; // turn back into an int

If I do that, can I perform mul/div and add/sub on this number in fixed-point form? What should I watch out for?

What I have tried:

I started studying other fixed point libraries but the ones I've found simply aren't very clear.

Posted 28-Nov-22 22:20pm

honey the codewitch

Updated 21-Dec-22 6:03am

Comments

Peter_in_2780 29-Nov-22 4:32am

Add/subtract is easy. Mul/div, you need to keep track of scaling. Lots of 16 bit shifts before/after operations. Otherwise it's just like regular integer arithmetic, with the same fun and games around overflow, divide by zero, etc.

honey the codewitch 29-Nov-22 5:00am

Thanks! I've kind of answered my own question since I asked it - funny how that seems to work so often, but now I'm interested in what others have to say. =)

CPallini 29-Nov-22 7:09am

I see a connection:
https://www.codeproject.com/Messages/5909708/Re-Gosh-I-messed-up-equals

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

John M. Dlugosz · Answer 1 · 2022-12-21T06:03:00

Well, you need to encapsulate your value inside a class, so then you can provide your own operator overloads to work with it. I'd suggest even making it a template, so you can choose the underlying representation size and the number of bits to shift; e.g.

C++

#include <cstddef>

template <typename Valtype, std::size_t ShiftCount>
class fixed {
    Valtype value;  // raw scaled integer: real value * 2^ShiftCount
        ⋮
};

Addition and subtraction between identical fixed types just work by adding/subtracting their underlying values. For mixed fixed types, e.g. if a has a shift count of 16 bits and b has a shift count of 8 bits, you have to align them before adding, and consider what the result type should be. I avoid the latter issue by not allowing mixed types in operator+, but I do allow them in operator+=, where it is explicit that the result must have the type of the left operand.
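A minimal sketch of that arrangement, assuming an unsigned underlying type and a stripped-down stand-in for the class (the public `value` member and the operator bodies are illustrative assumptions, not the answer's actual code). Same-type `+` and `-` touch only the raw values, while the mixed-type `+=` shifts the right operand to the left operand's scale first:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stripped-down type, for illustration only.
template <typename Valtype, std::size_t ShiftCount>
struct fixed {
    Valtype value;  // raw scaled integer: real value * 2^ShiftCount

    // Same type on both sides: the scales already match, so add the raw values.
    friend constexpr fixed operator+(fixed a, fixed b) {
        return fixed{static_cast<Valtype>(a.value + b.value)};
    }
    friend constexpr fixed operator-(fixed a, fixed b) {
        return fixed{static_cast<Valtype>(a.value - b.value)};
    }

    // Mixed shift counts are allowed here because the result type is
    // unambiguous: it is the type of the left operand.
    template <typename V2, std::size_t S2>
    constexpr fixed& operator+=(fixed<V2, S2> rhs) {
        if constexpr (S2 > ShiftCount) {
            // rhs has more fractional bits: drop its extra low bits.
            value += static_cast<Valtype>(rhs.value >> (S2 - ShiftCount));
        } else {
            // rhs has fewer (or equal) fractional bits: scale it up.
            value += static_cast<Valtype>(static_cast<Valtype>(rhs.value)
                                          << (ShiftCount - S2));
        }
        return *this;
    }
};
```

Overflow is not checked here, exactly as with plain integer arithmetic.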

For multiplication, the result has a shift count that is the sum of the arguments' shift counts.
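The reason is that the raw product of a·2^S1 and b·2^S2 is (a·b)·2^(S1+S2). A minimal sketch, assuming 32-bit unsigned operands and a 64-bit result type so the full product always fits (the stand-in struct and the choice of storage types are assumptions for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stripped-down type, for illustration only.
template <typename Valtype, std::size_t ShiftCount>
struct fixed {
    Valtype value;  // raw scaled integer: real value * 2^ShiftCount
};

// No shifting is needed: the result type simply carries S1 + S2
// fractional bits. Widening to 64 bits before multiplying keeps the
// whole product.
template <std::size_t S1, std::size_t S2>
constexpr fixed<std::uint64_t, S1 + S2>
operator*(fixed<std::uint32_t, S1> a, fixed<std::uint32_t, S2> b) {
    return {static_cast<std::uint64_t>(a.value) * b.value};
}
```

For example, 1.5 with 16 fractional bits times 2.5 with 8 fractional bits yields 3.75 with 24 fractional bits.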

Now division is the hard one. I don't have a general solution in my own code, but have arranged things to suit the specific needs of the code that uses it. When you divide, how many fractional bits do you want in the answer?
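One possible way to make that choice explicit is to let the caller name the number of fractional bits wanted. Dividing the raw values gives a quotient with S1 - S2 fractional bits, so the dividend can be pre-shifted to hit the requested count. This is only a sketch under assumptions: `divide` and its parameter `R` are hypothetical names, the operands are 32-bit unsigned, and neither a zero divisor nor a quotient too large for 32 bits is handled.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stripped-down type, for illustration only.
template <typename Valtype, std::size_t ShiftCount>
struct fixed {
    Valtype value;  // raw scaled integer: real value * 2^ShiftCount
};

// R = fractional bits wanted in the quotient (chosen by the caller).
template <std::size_t R, std::size_t S1, std::size_t S2>
constexpr fixed<std::uint32_t, R>
divide(fixed<std::uint32_t, S1> a, fixed<std::uint32_t, S2> b) {
    if constexpr (R + S2 >= S1) {
        // Pre-shift the dividend so that the quotient ends up with R bits.
        static_assert(R + S2 - S1 <= 32, "pre-shift would overflow 64 bits");
        const std::uint64_t num =
            static_cast<std::uint64_t>(a.value) << (R + S2 - S1);
        return {static_cast<std::uint32_t>(num / b.value)};
    } else {
        // The plain quotient already has more than R bits: drop the extras.
        static_assert(S1 - S2 - R < 32, "shift count out of range");
        return {static_cast<std::uint32_t>((a.value / b.value) >> (S1 - S2 - R))};
    }
}
```

With this shape, `divide<16>(a, b)` and `divide<8>(a, b)` on the same operands return the same quotient at different precisions.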

If you don't generalize to a template that allows different shift counts, then multiplication must chop off the extra fractional bits, and division returns your (only) type. The operators then need more work internally: a wider intermediate value and a corrective shift to bring the result back to the common scale.
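For the 16.16 format from the question, that single-type case might look like the sketch below. The names `fix16`, `fix16_mul` and `fix16_div` are hypothetical, the values are unsigned, and the caller is assumed to rule out a zero divisor and results that do not fit in 32 bits.

```cpp
#include <cstdint>

using fix16 = std::uint32_t;  // hypothetical alias: unsigned 16.16 raw value

constexpr fix16 fix16_mul(fix16 a, fix16 b) {
    // The 64-bit product has 32 fractional bits; shifting right by 16
    // chops it back to 16 (truncating, not rounding).
    return static_cast<fix16>((static_cast<std::uint64_t>(a) * b) >> 16);
}

constexpr fix16 fix16_div(fix16 a, fix16 b) {
    // Pre-shift the dividend by 16 so the quotient keeps 16 fractional bits.
    return static_cast<fix16>((static_cast<std::uint64_t>(a) << 16) / b);
}
```

Without the 64-bit intermediate, the product would overflow for any pair of values whose true result is 1.0 or more.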

If you do allow different shift counts, then you can allow promotion implicitly while rejecting implicit conversions that lose precision. The lossy conversions can still be made available through an explicit conversion.
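One way this could be expressed is with a pair of constrained converting constructors, one implicit and one `explicit`. The sketch assumes "promotion" means moving to more fractional bits in the same underlying type, and that the integer part still fits after the shift; `from_raw` and `raw` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

template <typename Valtype, std::size_t ShiftCount>
class fixed {
    Valtype value_{};  // raw scaled integer: real value * 2^ShiftCount
public:
    constexpr fixed() = default;
    static constexpr fixed from_raw(Valtype raw) {
        fixed f;
        f.value_ = raw;
        return f;
    }
    constexpr Valtype raw() const { return value_; }

    // Implicit: the source has fewer fractional bits, so no precision is lost.
    template <std::size_t S2, std::enable_if_t<(S2 < ShiftCount), int> = 0>
    constexpr fixed(fixed<Valtype, S2> other)
        : value_(other.raw() << (ShiftCount - S2)) {}

    // Explicit only: the source has more fractional bits, which get truncated.
    template <std::size_t S2, std::enable_if_t<(S2 > ShiftCount), int> = 0>
    constexpr explicit fixed(fixed<Valtype, S2> other)
        : value_(other.raw() >> (S2 - ShiftCount)) {}
};
```

With this in place, assigning an 8-bit-fraction value to a 16-bit-fraction variable compiles as is, while the reverse direction needs a `static_cast`.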

Of course, it's also good to provide ways to view these values: have a to_string function, make it work with ostream, and make it work with the fmt library. But also consider making the underlying conversion code follow the more efficient model introduced in C++17: std::to_chars lets the caller supply a local buffer, which is practical since the maximum length is known. This makes implementing the ostream and fmt output functions more efficient.
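A rough sketch of that model for an unsigned 16.16 value, assuming plain decimal output with no rounding or padding. `fix16_to_chars` and `fix16_to_string` are hypothetical names; only `std::to_chars` for integers comes from the standard library (`<charconv>`).

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Writes raw (unsigned 16.16) as decimal text into [first, last),
// mirroring the std::to_chars interface. Hypothetical helper.
inline std::to_chars_result fix16_to_chars(char* first, char* last,
                                           std::uint32_t raw) {
    // Integer part: reuse the standard integer overload.
    std::to_chars_result res = std::to_chars(first, last, raw >> 16);
    if (res.ec != std::errc{}) return res;

    std::uint32_t frac = raw & 0xFFFFu;
    if (frac == 0) return res;  // no fractional digits to print

    char* p = res.ptr;
    if (p == last) return {last, std::errc::value_too_large};
    *p++ = '.';
    // Multiply the fraction by 10 and peel off one digit per step.
    // A 16-bit binary fraction terminates after at most 16 decimal digits.
    while (frac != 0) {
        if (p == last) return {last, std::errc::value_too_large};
        frac *= 10;
        *p++ = static_cast<char>('0' + (frac >> 16));
        frac &= 0xFFFFu;
    }
    return {p, std::errc{}};
}

// Worst case: 5 integer digits + '.' + 16 fraction digits = 22 characters,
// so a fixed local buffer is enough and no error check is needed here.
inline std::string fix16_to_string(std::uint32_t raw) {
    char buf[22];
    const std::to_chars_result res = fix16_to_chars(buf, buf + sizeof buf, raw);
    return std::string(buf, res.ptr);
}
```

An ostream `operator<<` or an fmt formatter could call the same buffer-based function and avoid allocating a temporary string.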