Mistake on this page?
Report an issue in GitHub or email us
ieee754.h
1 /*==============================================================================
2  ieee754.c -- floating point conversion between half, double and single precision
3 
4  Copyright (c) 2018-2019, Laurence Lundblade. All rights reserved.
5 
6  SPDX-License-Identifier: BSD-3-Clause
7 
8  See BSD-3-Clause license in README.md
9 
10  Created on 7/23/18
11  ==============================================================================*/
12 
13 #ifndef ieee754_h
14 #define ieee754_h
15 
16 #include <stdint.h>
17 
18 
19 
20 /*
21  General comments
22 
23  This is a complete in that it handles all conversion cases
24  including +/- infinity, +/- zero, subnormal numbers, qNaN, sNaN
25  and NaN payloads.
26 
27  This confirms to IEEE 754-2008, but note that this doesn't
28  specify conversions, just the encodings.
29 
30  NaN payloads are preserved with alignment on the LSB. The
31  qNaN bit is handled differently and explicity copied. It
32  is always the MSB of the significand. The NaN payload MSBs
33  (except the qNaN bit) are truncated when going from
34  double or single to half.
35 
36  TODO: what does the C cast do with NaN payloads from
37  double to single?
38 
39 
40 
41  */
42 
43 /*
44  Most simply just explicilty encode the type you want, single or double.
45  This works easily everywhere since standard C supports both
46  these types and so does qcbor. This encoder also supports
47  half precision and there's a few ways to use it to encode
48  floating point numbers in less space.
49 
50  Without losing precision, you can encode a single or double
51  such that the special values of 0, NaN and Infinity encode
52  as half-precision. This CBOR decodoer and most others
53  should handle this properly.
54 
55  If you don't mind losing precision, then you can use half-precision.
56  One way to do this is to set up your environment to use
57  ___fp_16. Some compilers and CPUs support it even though it is not
58  standard C. What is nice about this is that your program
59  will use less memory and floating point operations like
60  multiplying, adding and such will be faster.
61 
62  Another way to make use of half-precision is to represent
63  the values in your program as single or double, but encode
64  them in CBOR as half-precision. This cuts the size
65  of the encoded messages by 2 or 4, but doesn't reduce
66  memory needs or speed because you are still using
67  single or double in your code.
68 
69 
70  encode:
71  - float as float
72  - double as double
73  - half as half
74  - float as half_precision, for environments that don't support a half-precision type
75  - double as half_precision, for environments that don't support a half-precision type
76  - float with NaN, Infinity and 0 as half
77  - double with NaN, Infinity and 0 as half
78 
79 
80 
81 
82  */
83 
84 
85 
86 /*
87  Convert single precision float to half-precision float.
88  Precision and NaN payload bits will be lost. Too large
89  values will round up to infinity and too small to zero.
90  */
91 uint16_t IEEE754_FloatToHalf(float f);
92 
93 
94 /*
95  Convert half precision float to single precision float.
96  This is a loss-less conversion.
97  */
98 float IEEE754_HalfToFloat(uint16_t uHalfPrecision);
99 
100 
101 /*
102  Convert double precision float to half-precision float.
103  Precision and NaN payload bits will be lost. Too large
104  values will round up to infinity and too small to zero.
105  */
106 uint16_t IEEE754_DoubleToHalf(double d);
107 
108 
109 /*
110  Convert half precision float to double precision float.
111  This is a loss-less conversion.
112  */
113 double IEEE754_HalfToDouble(uint16_t uHalfPrecision);
114 
115 
116 
117 // Both tags the value and gives the size
118 #define IEEE754_UNION_IS_HALF 2
119 #define IEEE754_UNION_IS_SINGLE 4
120 #define IEEE754_UNION_IS_DOUBLE 8
121 
122 typedef struct {
123  uint8_t uSize; // One of IEEE754_IS_xxxx
124  uint64_t uValue;
125 } IEEE754_union;
126 
127 
128 /*
129  Converts double-precision to single-precision or half-precision if possible without
130  loss of precisions. If not, leaves it as a double. Only converts to single-precision
131  unless bAllowHalfPrecision is set.
132  */
133 IEEE754_union IEEE754_DoubleToSmallestInternal(double d, int bAllowHalfPrecision);
134 
135 /*
136  Converts double-precision to single-precision if possible without
137  loss of precision. If not, leaves it as a double.
138  */
139 static inline IEEE754_union IEEE754_DoubleToSmall(double d)
140 {
141  return IEEE754_DoubleToSmallestInternal(d, 0);
142 }
143 
144 
145 /*
146  Converts double-precision to single-precision or half-precision if possible without
147  loss of precisions. If not, leaves it as a double.
148  */
149 static inline IEEE754_union IEEE754_DoubleToSmallest(double d)
150 {
151  return IEEE754_DoubleToSmallestInternal(d, 1);
152 }
153 
154 /*
155  Converts single-precision to half-precision if possible without
156  loss of precision. If not leaves as single-precision.
157  */
158 IEEE754_union IEEE754_FloatToSmallest(float f);
159 
160 
161 #endif /* ieee754_h */
Important Information for this Arm website

This site uses cookies to store information on your computer. By continuing to use our site, you consent to our cookies. If you are not happy with the use of these cookies, please review our Cookie Policy to learn how they can be disabled. By disabling cookies, some features of the site will not work.