2 | <HTML>
|
---|
3 |
|
---|
4 | <HEAD>
|
---|
5 | <TITLE>Berkeley TestFloat General Documentation</TITLE>
|
---|
6 | </HEAD>
|
---|
7 |
|
---|
8 | <BODY>
|
---|
9 |
|
---|
10 | <H1>Berkeley TestFloat Release 3e: General Documentation</H1>
|
---|
11 |
|
---|
12 | <P>
|
---|
13 | John R. Hauser<BR>
|
---|
14 | 2018 January 20<BR>
|
---|
15 | </P>
|
---|
16 |
|
---|
17 |
|
---|
18 | <H2>Contents</H2>
|
---|
19 |
|
---|
20 | <BLOCKQUOTE>
|
---|
21 | <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
|
---|
22 | <COL WIDTH=25>
|
---|
23 | <COL WIDTH=*>
|
---|
24 | <TR><TD COLSPAN=2>1. Introduction</TD></TR>
|
---|
25 | <TR><TD COLSPAN=2>2. Limitations</TD></TR>
|
---|
26 | <TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
|
---|
27 | <TR><TD COLSPAN=2>4. What TestFloat Does</TD></TR>
|
---|
28 | <TR><TD COLSPAN=2>5. Executing TestFloat</TD></TR>
|
---|
29 | <TR><TD COLSPAN=2>6. Operations Tested by TestFloat</TD></TR>
|
---|
30 | <TR><TD></TD><TD>6.1. Conversion Operations</TD></TR>
|
---|
31 | <TR><TD></TD><TD>6.2. Basic Arithmetic Operations</TD></TR>
|
---|
32 | <TR><TD></TD><TD>6.3. Fused Multiply-Add Operations</TD></TR>
|
---|
33 | <TR><TD></TD><TD>6.4. Remainder Operations</TD></TR>
|
---|
34 | <TR><TD></TD><TD>6.5. Round-to-Integer Operations</TD></TR>
|
---|
35 | <TR><TD></TD><TD>6.6. Comparison Operations</TD></TR>
|
---|
36 | <TR><TD COLSPAN=2>7. Interpreting TestFloat Output</TD></TR>
|
---|
37 | <TR>
|
---|
38 | <TD COLSPAN=2>8. Variations Allowed by the IEEE Floating-Point Standard</TD>
|
---|
39 | </TR>
|
---|
40 | <TR><TD></TD><TD>8.1. Underflow</TD></TR>
|
---|
41 | <TR><TD></TD><TD>8.2. NaNs</TD></TR>
|
---|
42 | <TR><TD></TD><TD>8.3. Conversions to Integer</TD></TR>
|
---|
43 | <TR><TD COLSPAN=2>9. Contact Information</TD></TR>
|
---|
44 | </TABLE>
|
---|
45 | </BLOCKQUOTE>
|
---|
46 |
|
---|
47 |
|
---|
48 | <H2>1. Introduction</H2>
|
---|
49 |
|
---|
50 | <P>
|
---|
51 | Berkeley TestFloat is a small collection of programs for testing that an
|
---|
52 | implementation of binary floating-point conforms to the IEEE Standard for
|
---|
53 | Floating-Point Arithmetic.
|
---|
54 | All operations required by the original 1985 version of the IEEE Floating-Point
|
---|
55 | Standard can be tested, except for conversions to and from decimal.
|
---|
56 | With the current release, the following binary formats can be tested:
|
---|
57 | <NOBR>16-bit</NOBR> half-precision, <NOBR>32-bit</NOBR> single-precision,
|
---|
58 | <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
|
---|
59 | double-extended-precision, and/or <NOBR>128-bit</NOBR> quadruple-precision.
|
---|
60 | TestFloat cannot test decimal floating-point.
|
---|
61 | </P>
|
---|
62 |
|
---|
63 | <P>
|
---|
64 | Included in the TestFloat package are the <CODE>testsoftfloat</CODE> and
|
---|
65 | <CODE>timesoftfloat</CODE> programs for testing the Berkeley SoftFloat software
|
---|
66 | implementation of floating-point and for measuring its speed.
|
---|
67 | Information about SoftFloat can be found at the SoftFloat Web page,
|
---|
68 | <A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></NOBR></A>.
|
---|
69 | The <CODE>testsoftfloat</CODE> and <CODE>timesoftfloat</CODE> programs are
|
---|
70 | expected to be of interest only to people compiling the SoftFloat sources.
|
---|
71 | </P>
|
---|
72 |
|
---|
73 | <P>
|
---|
74 | This document explains how to use the TestFloat programs.
|
---|
75 | It does not attempt to define or explain much of the IEEE Floating-Point
|
---|
76 | Standard.
|
---|
77 | Details about the standard are available elsewhere.
|
---|
78 | </P>
|
---|
79 |
|
---|
80 | <P>
|
---|
81 | The current version of TestFloat is <NOBR>Release 3e</NOBR>.
|
---|
82 | This version differs from earlier releases 3b through 3d in only minor ways.
|
---|
83 | Compared to the original <NOBR>Release 3</NOBR>:
|
---|
84 | <UL>
|
---|
85 | <LI>
|
---|
86 | <NOBR>Release 3b</NOBR> added the ability to test the <NOBR>16-bit</NOBR>
|
---|
87 | half-precision format.
|
---|
88 | <LI>
|
---|
89 | <NOBR>Release 3c</NOBR> added the ability to test a rarely used rounding mode,
|
---|
90 | <I>round to odd</I>, also known as <I>jamming</I>.
|
---|
91 | <LI>
|
---|
92 | <NOBR>Release 3d</NOBR> modified the code for testing C arithmetic to
|
---|
93 | potentially include testing newer library functions <CODE>sqrtf</CODE>,
|
---|
94 | <CODE>sqrtl</CODE>, <CODE>fmaf</CODE>, <CODE>fma</CODE>, and <CODE>fmal</CODE>.
|
---|
95 | </UL>
|
---|
96 | This release adds a few more small improvements, including modifying the
|
---|
97 | expected behavior of rounding mode <CODE>odd</CODE> and fixing a minor bug in
|
---|
98 | the all-in-one <CODE>testfloat</CODE> program.
|
---|
99 | </P>
|
---|
100 |
|
---|
101 | <P>
|
---|
Compared to Release 2c and earlier, the set of TestFloat programs, as well as
the programs’ arguments and behavior, changed somewhat with
<NOBR>Release 3</NOBR>.
|
---|
105 | For more about the evolution of TestFloat releases, see
|
---|
106 | <A HREF="TestFloat-history.html"><NOBR><CODE>TestFloat-history.html</CODE></NOBR></A>.
|
---|
107 | </P>
|
---|
108 |
|
---|
109 |
|
---|
110 | <H2>2. Limitations</H2>
|
---|
111 |
|
---|
112 | <P>
|
---|
113 | TestFloat output is not always easily interpreted.
|
---|
114 | Detailed knowledge of the IEEE Floating-Point Standard and its vagaries is
|
---|
115 | needed to use TestFloat responsibly.
|
---|
116 | </P>
|
---|
117 |
|
---|
118 | <P>
|
---|
119 | TestFloat performs relatively simple tests designed to check the fundamental
|
---|
120 | soundness of the floating-point under test.
|
---|
121 | TestFloat may also at times manage to find rarer and more subtle bugs, but it
|
---|
122 | will probably only find such bugs by chance.
|
---|
123 | Software that purposefully seeks out various kinds of subtle floating-point
|
---|
124 | bugs can be found through links posted on the TestFloat Web page,
|
---|
125 | <A HREF="http://www.jhauser.us/arithmetic/TestFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/TestFloat.html</CODE></NOBR></A>.
|
---|
126 | </P>
|
---|
127 |
|
---|
128 |
|
---|
129 | <H2>3. Acknowledgments and License</H2>
|
---|
130 |
|
---|
131 | <P>
|
---|
132 | The TestFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
|
---|
133 | <NOBR>Release 3</NOBR> of TestFloat was a completely new implementation
|
---|
134 | supplanting earlier releases.
|
---|
135 | The project to create <NOBR>Release 3</NOBR> (now <NOBR>through 3e</NOBR>) was
|
---|
136 | done in the employ of the University of California, Berkeley, within the
|
---|
137 | Department of Electrical Engineering and Computer Sciences, first for the
|
---|
138 | Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
|
---|
139 | The work was officially overseen by Prof. Krste Asanovic, with funding provided
|
---|
140 | by these sources:
|
---|
141 | <BLOCKQUOTE>
|
---|
142 | <TABLE>
|
---|
143 | <COL>
|
---|
144 | <COL WIDTH=10>
|
---|
145 | <COL>
|
---|
146 | <TR>
|
---|
147 | <TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
|
---|
148 | <TD></TD>
|
---|
149 | <TD>
|
---|
150 | Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
|
---|
151 | (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
|
---|
152 | NVIDIA, Oracle, and Samsung.
|
---|
153 | </TD>
|
---|
154 | </TR>
|
---|
155 | <TR>
|
---|
156 | <TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
|
---|
157 | <TD></TD>
|
---|
158 | <TD>
|
---|
159 | DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
|
---|
160 | ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
|
---|
161 | Oracle, and Samsung.
|
---|
162 | </TD>
|
---|
163 | </TR>
|
---|
164 | </TABLE>
|
---|
165 | </BLOCKQUOTE>
|
---|
166 | </P>
|
---|
167 |
|
---|
168 | <P>
|
---|
169 | The following applies to the whole of TestFloat <NOBR>Release 3e</NOBR> as well
|
---|
170 | as to each source file individually.
|
---|
171 | </P>
|
---|
172 |
|
---|
173 | <P>
|
---|
174 | Copyright 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 The Regents of the
|
---|
175 | University of California.
|
---|
176 | All rights reserved.
|
---|
177 | </P>
|
---|
178 |
|
---|
179 | <P>
|
---|
180 | Redistribution and use in source and binary forms, with or without
|
---|
181 | modification, are permitted provided that the following conditions are met:
|
---|
182 | <OL>
|
---|
183 |
|
---|
184 | <LI>
|
---|
185 | <P>
|
---|
186 | Redistributions of source code must retain the above copyright notice, this
|
---|
187 | list of conditions, and the following disclaimer.
|
---|
188 | </P>
|
---|
189 |
|
---|
190 | <LI>
|
---|
191 | <P>
|
---|
192 | Redistributions in binary form must reproduce the above copyright notice, this
|
---|
193 | list of conditions, and the following disclaimer in the documentation and/or
|
---|
194 | other materials provided with the distribution.
|
---|
195 | </P>
|
---|
196 |
|
---|
197 | <LI>
|
---|
198 | <P>
|
---|
199 | Neither the name of the University nor the names of its contributors may be
|
---|
200 | used to endorse or promote products derived from this software without specific
|
---|
201 | prior written permission.
|
---|
202 | </P>
|
---|
203 |
|
---|
204 | </OL>
|
---|
205 | </P>
|
---|
206 |
|
---|
207 | <P>
|
---|
208 | THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”,
|
---|
209 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
---|
210 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
|
---|
211 | DISCLAIMED.
|
---|
212 | IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
|
---|
213 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
|
---|
214 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
---|
215 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
---|
216 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
|
---|
217 | OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
|
---|
218 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
---|
219 | </P>
|
---|
220 |
|
---|
221 |
|
---|
222 | <H2>4. What TestFloat Does</H2>
|
---|
223 |
|
---|
224 | <P>
|
---|
225 | TestFloat is designed to test a floating-point implementation by comparing its
|
---|
226 | behavior with that of TestFloat’s own internal floating-point implemented
|
---|
227 | in software.
|
---|
228 | For each operation to be tested, the TestFloat programs can generate a large
|
---|
229 | number of test cases, made up of simple pattern tests intermixed with weighted
|
---|
230 | random inputs.
|
---|
231 | The cases generated should be adequate for testing carry chain propagations,
|
---|
232 | and the rounding of addition, subtraction, multiplication, and simple
|
---|
233 | operations like conversions.
|
---|
234 | TestFloat makes a point of checking all boundary cases of the arithmetic,
|
---|
235 | including underflows, overflows, invalid operations, subnormal inputs, zeros
|
---|
236 | (positive and negative), infinities, and NaNs.
|
---|
237 | For the interesting operations like addition and multiplication, millions of
|
---|
238 | test cases may be checked.
|
---|
239 | </P>
|
---|
240 |
|
---|
241 | <P>
|
---|
242 | TestFloat is not remarkably good at testing difficult rounding cases for
|
---|
243 | division and square root.
|
---|
244 | It also makes no attempt to find bugs specific to SRT division and the like
|
---|
245 | (such as the infamous Pentium division bug).
|
---|
246 | Software that tests for such failures can be found through links on the
|
---|
247 | TestFloat Web page,
|
---|
248 | <A HREF="http://www.jhauser.us/arithmetic/TestFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/TestFloat.html</CODE></NOBR></A>.
|
---|
249 | </P>
|
---|
250 |
|
---|
251 | <P>
|
---|
252 | NOTE!<BR>
|
---|
253 | It is the responsibility of the user to verify that the discrepancies TestFloat
|
---|
254 | finds actually represent faults in the implementation being tested.
|
---|
255 | Advice to help with this task is provided later in this document.
|
---|
256 | Furthermore, even if TestFloat finds no fault with a floating-point
|
---|
257 | implementation, that in no way guarantees that the implementation is bug-free.
|
---|
258 | </P>
|
---|
259 |
|
---|
260 | <P>
|
---|
261 | For each operation, TestFloat can test all five rounding modes defined by the
|
---|
262 | IEEE Floating-Point Standard, plus possibly a sixth mode, <I>round to odd</I>
|
---|
263 | (depending on the options selected when TestFloat was built).
|
---|
264 | TestFloat verifies not only that the numeric results of an operation are
|
---|
265 | correct, but also that the proper floating-point exception flags are raised.
|
---|
266 | All five exception flags are tested, including the <I>inexact</I> flag.
|
---|
267 | TestFloat does not attempt to verify that the floating-point exception flags
|
---|
268 | are actually implemented as sticky flags.
|
---|
269 | </P>
|
---|
270 |
|
---|
271 | <P>
|
---|
272 | For the <NOBR>80-bit</NOBR> double-extended-precision format, TestFloat can
|
---|
273 | test the addition, subtraction, multiplication, division, and square root
|
---|
274 | operations at all three of the standard rounding precisions.
|
---|
275 | The rounding precision can be set to <NOBR>32 bits</NOBR>, equivalent to
|
---|
276 | single-precision, to <NOBR>64 bits</NOBR>, equivalent to double-precision, or
|
---|
277 | to the full <NOBR>80 bits</NOBR> of the double-extended-precision.
|
---|
278 | Rounding precision control can be applied only to the double-extended-precision
|
---|
279 | format and only for the five basic arithmetic operations: addition,
|
---|
280 | subtraction, multiplication, division, and square root.
|
---|
281 | Other operations can be tested only at full precision.
|
---|
282 | </P>
|
---|
283 |
|
---|
284 | <P>
|
---|
285 | As a rule, TestFloat is not particular about the bit patterns of NaNs that
|
---|
286 | appear as operation results.
|
---|
287 | Any NaN is considered as good a result as another.
|
---|
288 | This laxness can be overridden so that TestFloat checks for particular bit
|
---|
289 | patterns within NaN results.
|
---|
290 | See <NOBR>section 8</NOBR> below, <I>Variations Allowed by the IEEE
|
---|
291 | Floating-Point Standard</I>, plus the <CODE>-checkNaNs</CODE> and
|
---|
292 | <CODE>-checkInvInts</CODE> options documented for programs
|
---|
293 | <CODE>testfloat_ver</CODE> and <CODE>testfloat</CODE>.
|
---|
294 | </P>
|
---|
295 |
|
---|
296 | <P>
|
---|
297 | TestFloat normally compares an implementation of floating-point against the
|
---|
298 | Berkeley SoftFloat software implementation of floating-point, also created by
|
---|
299 | me.
|
---|
300 | The SoftFloat functions are linked into each TestFloat program’s
|
---|
301 | executable.
|
---|
302 | Information about SoftFloat can be found at the Web page
|
---|
303 | <A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></NOBR></A>.
|
---|
304 | </P>
|
---|
305 |
|
---|
306 | <P>
|
---|
307 | For testing SoftFloat itself, the TestFloat package includes a
|
---|
308 | <CODE>testsoftfloat</CODE> program that compares SoftFloat’s
|
---|
309 | floating-point against <EM>another</EM> software floating-point implementation.
|
---|
310 | The second software floating-point is simpler and slower than SoftFloat, and is
|
---|
311 | completely independent of SoftFloat.
|
---|
312 | Although the second software floating-point cannot be guaranteed to be
|
---|
313 | bug-free, the chance that it would mimic any of SoftFloat’s bugs is low.
|
---|
314 | Consequently, an error in one or the other floating-point version should appear
|
---|
315 | as an unexpected difference between the two implementations.
|
---|
316 | Note that testing SoftFloat should be necessary only when compiling a new
|
---|
317 | TestFloat executable or when compiling SoftFloat for some other reason.
|
---|
318 | </P>
|
---|
319 |
|
---|
320 |
|
---|
321 | <H2>5. Executing TestFloat</H2>
|
---|
322 |
|
---|
323 | <P>
|
---|
324 | The TestFloat package consists of five programs, all intended to be executed
|
---|
325 | from a command-line interpreter:
|
---|
326 | <BLOCKQUOTE>
|
---|
327 | <TABLE>
|
---|
328 | <TR>
|
---|
329 | <TD>
|
---|
330 | <A HREF="testfloat_gen.html"><CODE>testfloat_gen</CODE></A><CODE> </CODE>
|
---|
331 | </TD>
|
---|
332 | <TD>
|
---|
333 | Generates test cases for a specific floating-point operation.
|
---|
334 | </TD>
|
---|
335 | </TR>
|
---|
336 | <TR>
|
---|
337 | <TD>
|
---|
338 | <A HREF="testfloat_ver.html"><CODE>testfloat_ver</CODE></A>
|
---|
339 | </TD>
|
---|
340 | <TD>
|
---|
341 | Verifies whether the results from executing a floating-point operation are as
|
---|
342 | expected.
|
---|
343 | </TD>
|
---|
344 | </TR>
|
---|
345 | <TR>
|
---|
346 | <TD>
|
---|
347 | <A HREF="testfloat.html"><CODE>testfloat</CODE></A>
|
---|
348 | </TD>
|
---|
349 | <TD>
|
---|
350 | An all-in-one program that generates test cases, executes floating-point
|
---|
351 | operations, and verifies whether the results match expectations.
|
---|
352 | </TD>
|
---|
353 | </TR>
|
---|
354 | <TR>
|
---|
355 | <TD>
|
---|
356 | <A HREF="testsoftfloat.html"><CODE>testsoftfloat</CODE></A><CODE> </CODE>
|
---|
357 | </TD>
|
---|
358 | <TD>
|
---|
359 | Like <CODE>testfloat</CODE>, but for testing SoftFloat.
|
---|
360 | </TD>
|
---|
361 | </TR>
|
---|
362 | <TR>
|
---|
363 | <TD>
|
---|
364 | <A HREF="timesoftfloat.html"><CODE>timesoftfloat</CODE></A><CODE> </CODE>
|
---|
365 | </TD>
|
---|
366 | <TD>
|
---|
367 | A program for measuring the speed of SoftFloat (included in the TestFloat
|
---|
368 | package for convenience).
|
---|
369 | </TD>
|
---|
370 | </TR>
|
---|
371 | </TABLE>
|
---|
372 | </BLOCKQUOTE>
|
---|
373 | Each program has its own page of documentation that can be opened through the
|
---|
374 | links in the table above.
|
---|
375 | </P>
|
---|
376 |
|
---|
377 | <P>
|
---|
378 | To test a floating-point implementation other than SoftFloat, one of three
|
---|
379 | different methods can be used.
|
---|
380 | The first method pipes output from <CODE>testfloat_gen</CODE> to a program
|
---|
381 | that:
|
---|
382 | <NOBR>(a) reads</NOBR> the incoming test cases, <NOBR>(b) invokes</NOBR> the
|
---|
383 | floating-point operation being tested, and <NOBR>(c) writes</NOBR> the
|
---|
384 | operation results to output.
|
---|
385 | These results can then be piped to <CODE>testfloat_ver</CODE> to be checked for
|
---|
386 | correctness.
|
---|
387 | Assuming a vertical bar (<CODE>|</CODE>) indicates a pipe between programs, the
|
---|
388 | complete process could be written as a single command like so:
|
---|
389 | <BLOCKQUOTE>
|
---|
<PRE>
testfloat_gen ... &lt;<I>type</I>&gt; | &lt;<I>program-that-invokes-op</I>&gt; | testfloat_ver ... &lt;<I>function</I>&gt;
</PRE>
|
---|
393 | </BLOCKQUOTE>
|
---|
394 | The program in the middle is not supplied by TestFloat but must be created
|
---|
395 | independently.
|
---|
396 | If for some reason this program cannot take command-line arguments, the
|
---|
397 | <CODE>-prefix</CODE> option of <CODE>testfloat_gen</CODE> can communicate
|
---|
398 | parameters through the pipe.
|
---|
399 | </P>
|
---|
400 |
|
---|
401 | <P>
|
---|
402 | A second method for running TestFloat is similar but has
|
---|
403 | <CODE>testfloat_gen</CODE> supply not only the test inputs but also the
|
---|
404 | expected results for each case.
|
---|
405 | With this additional information, the job done by <CODE>testfloat_ver</CODE>
|
---|
406 | can be folded into the invoking program to give the following command:
|
---|
407 | <BLOCKQUOTE>
|
---|
<PRE>
testfloat_gen ... &lt;<I>function</I>&gt; | &lt;<I>program-that-invokes-op-and-compares-results</I>&gt;
</PRE>
|
---|
411 | </BLOCKQUOTE>
|
---|
412 | Again, the program that actually invokes the floating-point operation is not
|
---|
413 | supplied by TestFloat but must be created independently.
|
---|
414 | Depending on circumstance, it may be preferable either to let
|
---|
415 | <CODE>testfloat_ver</CODE> check and report suspected errors (first method) or
|
---|
416 | to include this step in the invoking program (second method).
|
---|
417 | </P>
|
---|
418 |
|
---|
419 | <P>
|
---|
420 | The third way to use TestFloat is the all-in-one <CODE>testfloat</CODE>
|
---|
421 | program.
|
---|
422 | This program can perform all the steps of creating test cases, invoking the
|
---|
423 | floating-point operation, checking the results, and reporting suspected errors.
|
---|
424 | However, for this to be possible, <CODE>testfloat</CODE> must be compiled to
|
---|
425 | contain the method for invoking the floating-point operations to test.
|
---|
426 | Each build of <CODE>testfloat</CODE> is therefore capable of testing
|
---|
427 | <EM>only</EM> the floating-point implementation it was built to invoke.
|
---|
428 | To test a new implementation of floating-point, a new <CODE>testfloat</CODE>
|
---|
429 | must be created, linked to that specific implementation.
|
---|
430 | By comparison, the <CODE>testfloat_gen</CODE> and <CODE>testfloat_ver</CODE>
|
---|
431 | programs are entirely generic;
|
---|
432 | one instance is usable for testing any floating-point implementation, because
|
---|
433 | implementation-specific details are segregated in the custom program that
|
---|
434 | follows <CODE>testfloat_gen</CODE>.
|
---|
435 | </P>
|
---|
436 |
|
---|
437 | <P>
|
---|
438 | Program <CODE>testsoftfloat</CODE> is another all-in-one program specifically
|
---|
439 | for testing SoftFloat.
|
---|
440 | </P>
|
---|
441 |
|
---|
442 | <P>
|
---|
443 | Programs <CODE>testfloat_ver</CODE>, <CODE>testfloat</CODE>, and
|
---|
444 | <CODE>testsoftfloat</CODE> all report status and error information in a common
|
---|
445 | way.
|
---|
446 | As it executes, each of these programs writes status information to the
|
---|
447 | standard error output, which should be the screen by default.
|
---|
448 | In order for this status to be displayed properly, the standard error stream
|
---|
449 | should not be redirected to a file.
|
---|
450 | Any discrepancies that are found are written to the standard output stream,
|
---|
451 | which is easily redirected to a file if desired.
|
---|
452 | Unless redirected, reported errors will appear intermixed with the ongoing
|
---|
453 | status information in the output.
|
---|
454 | </P>
|
---|
455 |
|
---|
456 |
|
---|
457 | <H2>6. Operations Tested by TestFloat</H2>
|
---|
458 |
|
---|
459 | <P>
|
---|
460 | TestFloat can test all operations required by the original 1985 IEEE
|
---|
461 | Floating-Point Standard except for conversions to and from decimal.
|
---|
462 | These operations are:
|
---|
463 | <UL>
|
---|
464 | <LI>
|
---|
465 | conversions among the supported floating-point formats, and also between
|
---|
466 | integers (<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>, signed and unsigned) and
|
---|
467 | any of the floating-point formats;
|
---|
468 | <LI>
|
---|
469 | for each floating-point format, the usual addition, subtraction,
|
---|
470 | multiplication, division, and square root operations;
|
---|
471 | <LI>
|
---|
472 | for each format, the floating-point remainder operation defined by the IEEE
|
---|
473 | Standard;
|
---|
474 | <LI>
|
---|
475 | for each format, a “round to integer” operation that rounds to the
|
---|
476 | nearest integer value in the same format; and
|
---|
477 | <LI>
|
---|
478 | comparisons between two values in the same floating-point format.
|
---|
479 | </UL>
|
---|
In addition, TestFloat can test
|
---|
481 | <UL>
|
---|
482 | <LI>
|
---|
483 | for each floating-point format except <NOBR>80-bit</NOBR>
|
---|
484 | double-extended-precision, the fused multiply-add operation defined by the 2008
|
---|
485 | IEEE Standard.
|
---|
486 | </UL>
|
---|
487 | </P>
|
---|
488 |
|
---|
489 | <P>
|
---|
490 | More information about all these operations is given below.
|
---|
491 | In the operation names used by TestFloat, <NOBR>16-bit</NOBR> half-precision is
|
---|
492 | called <CODE>f16</CODE>, <NOBR>32-bit</NOBR> single-precision is
|
---|
493 | <CODE>f32</CODE>, <NOBR>64-bit</NOBR> double-precision is <CODE>f64</CODE>,
|
---|
494 | <NOBR>80-bit</NOBR> double-extended-precision is <CODE>extF80</CODE>, and
|
---|
495 | <NOBR>128-bit</NOBR> quadruple-precision is <CODE>f128</CODE>.
|
---|
496 | TestFloat generally uses the same names for operations as Berkeley SoftFloat,
|
---|
497 | except that TestFloat’s names never include the <CODE>M</CODE> that
|
---|
498 | SoftFloat uses to indicate that values are passed through pointers.
|
---|
499 | </P>
|
---|
500 |
|
---|
501 | <H3>6.1. Conversion Operations</H3>
|
---|
502 |
|
---|
503 | <P>
|
---|
504 | All conversions among the floating-point formats and all conversions between a
|
---|
505 | floating-point format and <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers
|
---|
506 | can be tested.
|
---|
507 | The conversion operations are:
|
---|
508 | <BLOCKQUOTE>
|
---|
509 | <PRE>
|
---|
510 | ui32_to_f16 ui64_to_f16 i32_to_f16 i64_to_f16
|
---|
511 | ui32_to_f32 ui64_to_f32 i32_to_f32 i64_to_f32
|
---|
512 | ui32_to_f64 ui64_to_f64 i32_to_f64 i64_to_f64
|
---|
513 | ui32_to_extF80 ui64_to_extF80 i32_to_extF80 i64_to_extF80
|
---|
514 | ui32_to_f128 ui64_to_f128 i32_to_f128 i64_to_f128
|
---|
515 |
|
---|
516 | f16_to_ui32 f32_to_ui32 f64_to_ui32 extF80_to_ui32 f128_to_ui32
|
---|
517 | f16_to_ui64 f32_to_ui64 f64_to_ui64 extF80_to_ui64 f128_to_ui64
|
---|
518 | f16_to_i32 f32_to_i32 f64_to_i32 extF80_to_i32 f128_to_i32
|
---|
519 | f16_to_i64 f32_to_i64 f64_to_i64 extF80_to_i64 f128_to_i64
|
---|
520 |
|
---|
521 | f16_to_f32 f32_to_f16 f64_to_f16 extF80_to_f16 f128_to_f16
|
---|
522 | f16_to_f64 f32_to_f64 f64_to_f32 extF80_to_f32 f128_to_f32
|
---|
523 | f16_to_extF80 f32_to_extF80 f64_to_extF80 extF80_to_f64 f128_to_f64
|
---|
524 | f16_to_f128 f32_to_f128 f64_to_f128 extF80_to_f128 f128_to_extF80
|
---|
525 | </PRE>
|
---|
526 | </BLOCKQUOTE>
|
---|
527 | Abbreviations <CODE>ui32</CODE> and <CODE>ui64</CODE> indicate
|
---|
528 | <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> unsigned integer types, while
|
---|
529 | <CODE>i32</CODE> and <CODE>i64</CODE> indicate their signed counterparts.
|
---|
530 | These conversions all round according to the current rounding mode as relevant.
|
---|
531 | Conversions from a smaller to a larger floating-point format are always exact
|
---|
532 | and so require no rounding.
|
---|
533 | Likewise, conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
|
---|
534 | double-precision or to any larger floating-point format are also exact, as are
|
---|
535 | conversions from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
|
---|
536 | double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision.
|
---|
537 | </P>
|
---|
538 |
|
---|
539 | <P>
|
---|
540 | For the all-in-one <CODE>testfloat</CODE> program, this list of conversion
|
---|
541 | operations requires amendment.
|
---|
542 | For <CODE>testfloat</CODE> only, conversions to an integer type have names that
|
---|
543 | explicitly specify the rounding mode and treatment of inexactness.
|
---|
544 | Thus, instead of
|
---|
545 | <BLOCKQUOTE>
|
---|
<PRE>
&lt;<I>float</I>&gt;_to_&lt;<I>int</I>&gt;
</PRE>
|
---|
549 | </BLOCKQUOTE>
|
---|
550 | as listed above, operations converting to integer type have names of these
|
---|
551 | forms:
|
---|
552 | <BLOCKQUOTE>
|
---|
<PRE>
&lt;<I>float</I>&gt;_to_&lt;<I>int</I>&gt;_r_&lt;<I>round</I>&gt;
&lt;<I>float</I>&gt;_to_&lt;<I>int</I>&gt;_rx_&lt;<I>round</I>&gt;
</PRE>
|
---|
557 | </BLOCKQUOTE>
|
---|
The <CODE>&lt;<I>round</I>&gt;</CODE> component is one of
‘<CODE>near_even</CODE>’, ‘<CODE>near_maxMag</CODE>’,
‘<CODE>minMag</CODE>’, ‘<CODE>min</CODE>’, or
‘<CODE>max</CODE>’, and selects the rounding mode.
Any other indication of rounding mode is ignored for these operations.
|
---|
563 | The operations with ‘<CODE>_r_</CODE>’ in their names never raise
|
---|
564 | the <I>inexact</I> exception, while those with ‘<CODE>_rx_</CODE>’
|
---|
565 | raise the <I>inexact</I> exception whenever the result is not exact.
|
---|
566 | </P>
|
---|
567 |
|
---|
568 | <P>
|
---|
569 | TestFloat assumes that conversions from floating-point to an integer type
|
---|
570 | should raise the <I>invalid</I> exception if the input cannot be rounded to an
|
---|
571 | integer representable in the result format.
|
---|
572 | In such a circumstance:
|
---|
573 | <UL>
|
---|
574 |
|
---|
575 | <LI>
|
---|
576 | <P>
|
---|
577 | If the result type is an unsigned integer, TestFloat normally expects the
|
---|
578 | result of the operation to be the type’s largest integer value.
|
---|
579 | In the case that the input is a negative number (not a NaN), a zero result may
|
---|
580 | also be accepted.
|
---|
581 | </P>
|
---|
582 |
|
---|
583 | <LI>
|
---|
584 | <P>
|
---|
585 | If the result type is a signed integer and the input is a number (not a NaN),
|
---|
586 | TestFloat expects the result to be the largest-magnitude integer with the same
|
---|
587 | sign as the input.
|
---|
When a NaN is converted to a signed integer type, TestFloat allows either the
largest positive or largest-magnitude negative integer to be returned.
|
---|
590 | </P>
|
---|
591 |
|
---|
592 | </UL>
|
---|
593 | Conversions to integer types are expected never to raise the <I>overflow</I>
|
---|
594 | exception.
|
---|
595 | </P>
|
---|
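<P>
The acceptance rules above for invalid conversions can be restated as two small
C predicates for a single-precision input and <NOBR>32-bit</NOBR> integer
results.
This is only a sketch of the rules as worded in this section, not code taken
from TestFloat; the function names are hypothetical, the caller is assumed to
have already determined that the conversion is invalid, and a zero result for a
negative input is accepted unconditionally here even though the text says only
that it <EM>may</EM> be accepted.
</P>

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Invalid conversion to an unsigned 32-bit integer: the largest value is
   normally expected; zero is also accepted when the input is a negative
   number (not a NaN). */
bool invalid_ui32_result_ok(float input, uint32_t result)
{
    if (result == UINT32_MAX) return true;
    return !isnan(input) && signbit(input) && result == 0;
}

/* Invalid conversion to a signed 32-bit integer: a number must give the
   largest-magnitude integer with the input's sign; a NaN may give either
   extreme. */
bool invalid_i32_result_ok(float input, int32_t result)
{
    if (isnan(input)) return result == INT32_MAX || result == INT32_MIN;
    return result == (signbit(input) ? INT32_MIN : INT32_MAX);
}
```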
596 |
|
---|
597 | <H3>6.2. Basic Arithmetic Operations</H3>
|
---|
598 |
|
---|
599 | <P>
|
---|
600 | The following standard arithmetic operations can be tested:
|
---|
601 | <BLOCKQUOTE>
|
---|
602 | <PRE>
|
---|
603 | f16_add f16_sub f16_mul f16_div f16_sqrt
|
---|
604 | f32_add f32_sub f32_mul f32_div f32_sqrt
|
---|
605 | f64_add f64_sub f64_mul f64_div f64_sqrt
|
---|
606 | extF80_add extF80_sub extF80_mul extF80_div extF80_sqrt
|
---|
607 | f128_add f128_sub f128_mul f128_div f128_sqrt
|
---|
608 | </PRE>
|
---|
609 | </BLOCKQUOTE>
|
---|
610 | The double-extended-precision (<CODE>extF80</CODE>) operations can be rounded
|
---|
611 | to reduced precision under rounding precision control.
|
---|
612 | </P>
|
---|
613 |
|
---|
614 | <H3>6.3. Fused Multiply-Add Operations</H3>
|
---|
615 |
|
---|
616 | <P>
|
---|
617 | For all floating-point formats except <NOBR>80-bit</NOBR>
|
---|
618 | double-extended-precision, TestFloat can test the fused multiply-add operation
|
---|
619 | defined by the 2008 IEEE Floating-Point Standard.
|
---|
620 | The fused multiply-add operations are:
|
---|
621 | <BLOCKQUOTE>
|
---|
622 | <PRE>
|
---|
623 | f16_mulAdd
|
---|
624 | f32_mulAdd
|
---|
625 | f64_mulAdd
|
---|
626 | f128_mulAdd
|
---|
627 | </PRE>
|
---|
628 | </BLOCKQUOTE>
|
---|
629 | </P>
|
---|
630 |
|
---|
631 | <P>
|
---|
632 | If one of the multiplication operands is infinite and the other is zero,
|
---|
633 | TestFloat expects the fused multiply-add operation to raise the <I>invalid</I>
|
---|
634 | exception even if the third operand is a quiet NaN.
|
---|
635 | </P>
|
---|
636 |
|
---|
<H3>6.4. Remainder Operations</H3>

<P>
For each format, TestFloat can test the IEEE Standard’s remainder
operation.
These operations are:
<BLOCKQUOTE>
<PRE>
f16_rem
f32_rem
f64_rem
extF80_rem
f128_rem
</PRE>
</BLOCKQUOTE>
The remainder operations are always exact and so require no rounding.
</P>

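<P>
As a small illustration of why no rounding is needed, the following Python
sketch (not part of TestFloat) compares the standard library’s
<CODE>math.remainder</CODE>, which computes an IEEE-style remainder on
<NOBR>64-bit</NOBR> double-precision values, against the same quantity
recomputed in exact rational arithmetic.
The helper name <CODE>ieee_remainder_exact</CODE> and the sample operands are
assumptions made for this example only.
</P>

```python
import math
from fractions import Fraction

def ieee_remainder_exact(x: float, y: float) -> Fraction:
    """Exact x - n*y, where n is the integer nearest x/y (ties to even)."""
    quotient = Fraction(x) / Fraction(y)   # exact rational quotient
    n = round(quotient)                    # Fraction rounds ties to the even integer
    return Fraction(x) - n * Fraction(y)

# Hypothetical sample operands; (5.0, 2.0) exercises the tie case.
for x, y in [(5.0, 3.0), (7.5, 2.0), (5.0, 2.0), (1e300, 3.0), (0.1, 0.03)]:
    r = math.remainder(x, y)
    # The double-precision result equals the exact value, so nothing was rounded.
    assert Fraction(r) == ieee_remainder_exact(x, y)
```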
<H3>6.5. Round-to-Integer Operations</H3>

<P>
For each format, TestFloat can test the IEEE Standard’s round-to-integer
operation.
For most TestFloat programs, these operations are:
<BLOCKQUOTE>
<PRE>
f16_roundToInt
f32_roundToInt
f64_roundToInt
extF80_roundToInt
f128_roundToInt
</PRE>
</BLOCKQUOTE>
</P>

<P>
Just as for conversions to integer types (<NOBR>section 6.1</NOBR> above), the
all-in-one <CODE>testfloat</CODE> program is again an exception.
For <CODE>testfloat</CODE> only, the round-to-integer operations have names of
these forms:
<BLOCKQUOTE>
<PRE>
&lt;<I>float</I>&gt;_roundToInt_r_&lt;<I>round</I>&gt;
&lt;<I>float</I>&gt;_roundToInt_x
</PRE>
</BLOCKQUOTE>
For the ‘<CODE>_r_</CODE>’ versions, the <I>inexact</I> exception
is never raised, and the <CODE>&lt;<I>round</I>&gt;</CODE> component specifies
the rounding mode as one of ‘<CODE>near_even</CODE>’,
‘<CODE>near_maxMag</CODE>’, ‘<CODE>minMag</CODE>’,
‘<CODE>min</CODE>’, or ‘<CODE>max</CODE>’.
The usual indication of rounding mode is ignored.
In contrast, the ‘<CODE>_x</CODE>’ versions accept the usual
indication of rounding mode and raise the <I>inexact</I> exception whenever the
result is not exact.
This irregular system follows the IEEE Standard’s particular
specification for the round-to-integer operations.
</P>

<H3>6.6. Comparison Operations</H3>

<P>
The following floating-point comparison operations can be tested:
<BLOCKQUOTE>
<PRE>
f16_eq      f16_le      f16_lt
f32_eq      f32_le      f32_lt
f64_eq      f64_le      f64_lt
extF80_eq   extF80_le   extF80_lt
f128_eq     f128_le     f128_lt
</PRE>
</BLOCKQUOTE>
The abbreviation <CODE>eq</CODE> stands for “equal” (=),
<CODE>le</CODE> stands for “less than or equal” (≤), and
<CODE>lt</CODE> stands for “less than” (&lt;).
</P>

<P>
The IEEE Standard specifies that, by default, the less-than-or-equal and
less-than comparisons raise the <I>invalid</I> exception if either input is any
kind of NaN.
The equality comparisons, on the other hand, are defined by default to raise
the <I>invalid</I> exception only for signaling NaNs, not for quiet NaNs.
For completeness, the following additional operations can be tested if
supported:
<BLOCKQUOTE>
<PRE>
f16_eq_signaling      f16_le_quiet      f16_lt_quiet
f32_eq_signaling      f32_le_quiet      f32_lt_quiet
f64_eq_signaling      f64_le_quiet      f64_lt_quiet
extF80_eq_signaling   extF80_le_quiet   extF80_lt_quiet
f128_eq_signaling     f128_le_quiet     f128_lt_quiet
</PRE>
</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the standard
operations except that the <I>invalid</I> exception should be raised for any
NaN input.
Similarly, the <CODE>quiet</CODE> comparison operations should be identical to
their counterparts except that the <I>invalid</I> exception is not raised for
quiet NaNs.
</P>

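<P>
The flag expectations for the six comparison variants can be summarized as a
small predicate.
The following Python sketch is illustrative only and is not TestFloat code; the
function name <CODE>expects_invalid</CODE> and its Boolean parameters are
assumptions made for the example.
</P>

```python
def expects_invalid(op: str, any_signaling_nan: bool, any_quiet_nan: bool) -> bool:
    """Return True if the invalid flag is expected for comparison suffix `op`.

    `op` is the part of the operation name after the format prefix,
    for example "eq", "le_quiet", or "eq_signaling".
    """
    known = ("eq", "le", "lt", "eq_signaling", "le_quiet", "lt_quiet")
    if op not in known:
        raise ValueError(f"unknown comparison: {op}")
    if any_signaling_nan:
        return True            # a signaling NaN raises invalid in every variant
    if op in ("le", "lt", "eq_signaling"):
        return any_quiet_nan   # these variants raise invalid for any kind of NaN
    return False               # eq, le_quiet, lt_quiet ignore quiet NaNs
```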
<P>
Comparison operations never require rounding, so any rounding mode specified
is ignored.
</P>


<H2>7. Interpreting TestFloat Output</H2>

<P>
The “errors” reported by TestFloat programs may or may not really
represent errors in the system being tested.
For each test case tried, the results from the floating-point implementation
being tested could differ from the expected results for several reasons:
<UL>
<LI>
The IEEE Floating-Point Standard allows for some variation in how a conforming
floating-point implementation behaves.
Two implementations can sometimes give different results without either being
incorrect.
<LI>
The trusted floating-point emulation could be faulty.
This could be because there is a bug in the way the emulation is coded, or
because a mistake was made when the code was compiled for the current system.
<LI>
The TestFloat program may not work properly, reporting differences that do not
exist.
<LI>
Lastly, the floating-point implementation being tested could actually be
faulty.
</UL>
It is the responsibility of the user to determine the causes for the
discrepancies that are reported.
Making this determination can require detailed knowledge about the IEEE
Standard.
Assuming TestFloat is working properly, any differences found will be due to
either the first or last of the reasons above.
Variations in the IEEE Standard that could lead to false error reports are
discussed in <NOBR>section 8</NOBR>, <I>Variations Allowed by the IEEE
Floating-Point Standard</I>.
</P>

<P>
For each reported error (or apparent error), a line of text is written to the
default output.
If a line would be longer than 79 characters, it is divided.
The first part of each error line begins in the leftmost column, and any
subsequent “continuation” lines are indented with a tab.
</P>

<P>
Each error reported is of the form:
<BLOCKQUOTE>
<PRE>
&lt;<I>inputs</I>&gt; =&gt; &lt;<I>observed-output</I>&gt; expected: &lt;<I>expected-output</I>&gt;
</PRE>
</BLOCKQUOTE>
The <CODE>&lt;<I>inputs</I>&gt;</CODE> are the inputs to the operation.
Each output (observed or expected) is shown as a pair: the result value first,
followed by the exception flags.
</P>

<P>
For example, two typical error lines could be
<BLOCKQUOTE>
<PRE>
-00.7FFF00 -7F.000100 =&gt; +01.000000 ...ux expected: +01.000000 ....x
+81.000004 +00.1FFFFF =&gt; +01.000000 ...ux expected: +01.000000 ....x
</PRE>
</BLOCKQUOTE>
In the first line, the inputs are <CODE>-00.7FFF00</CODE> and
<CODE>-7F.000100</CODE>, and the observed result is <CODE>+01.000000</CODE>
with flags <CODE>...ux</CODE>.
The trusted emulation result is the same but with different flags,
<CODE>....x</CODE>.
Items such as <CODE>-00.7FFF00</CODE> composed of a sign character
<NOBR>(<CODE>+</CODE>/<CODE>-</CODE>)</NOBR>, hexadecimal digits, and a single
period represent floating-point values (here <NOBR>32-bit</NOBR>
single-precision).
The two instances above were reported as errors because the exception flag
results differ.
</P>

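<P>
When post-processing a log of reported errors, it can help to split each line
into its parts.
The following Python sketch shows one way to do so.
It is not part of TestFloat: the regular expression, the function name
<CODE>parse_error_line</CODE>, and the assumption that any tab-indented
continuation lines have already been rejoined into a single line are all
choices made for this example.
</P>

```python
import re

# Assumed layout: inputs, "=>", observed value and flags, "expected:",
# expected value and flags. Each flags field is exactly five characters.
ERROR_LINE = re.compile(
    r"^(?P<inputs>.*?)\s*=>\s*"
    r"(?P<obs_value>\S+)\s+(?P<obs_flags>\S{5})\s+"
    r"expected:\s*(?P<exp_value>\S+)\s+(?P<exp_flags>\S{5})\s*$"
)

def parse_error_line(line: str) -> dict:
    """Split one already-rejoined error line into inputs and outputs."""
    m = ERROR_LINE.match(line)
    if m is None:
        raise ValueError(f"not a recognized error line: {line!r}")
    return {
        "inputs": m["inputs"].split(),
        "observed": (m["obs_value"], m["obs_flags"]),
        "expected": (m["exp_value"], m["exp_flags"]),
    }

report = parse_error_line(
    "-00.7FFF00 -7F.000100 => +01.000000 ...ux expected: +01.000000 ....x"
)
# Same result value, different flags: this is why the line was reported.
assert report["observed"][0] == report["expected"][0]
assert report["observed"][1] != report["expected"][1]
```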
<P>
Aside from the exception flags, there are ten data types that may be
represented.
Five are floating-point types: <NOBR>16-bit</NOBR> half-precision,
<NOBR>32-bit</NOBR> single-precision, <NOBR>64-bit</NOBR> double-precision,
<NOBR>80-bit</NOBR> double-extended-precision, and <NOBR>128-bit</NOBR>
quadruple-precision.
The remaining five types are <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
unsigned integers, <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
two’s-complement signed integers, and Boolean values (the results of
comparison operations).
Boolean values are represented as a single character, either a <CODE>0</CODE>
(false) or a <CODE>1</CODE> (true).
A <NOBR>32-bit</NOBR> integer is represented as 8 hexadecimal digits.
Thus, for a signed <NOBR>32-bit</NOBR> integer, <CODE>FFFFFFFF</CODE> is
−1, and <CODE>7FFFFFFF</CODE> is the largest positive value.
<NOBR>64-bit</NOBR> integers are the same except with 16 hexadecimal digits.
</P>

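<P>
The integer notation can be decoded mechanically.
The Python sketch below is an illustration rather than TestFloat code; the
function name <CODE>decode_int</CODE> is hypothetical, and the caller is
assumed to know from the operation being tested whether the integer is signed.
</P>

```python
def decode_int(hex_digits: str, signed: bool) -> int:
    """Decode an 8-digit (32-bit) or 16-digit (64-bit) hexadecimal integer."""
    if len(hex_digits) not in (8, 16):
        raise ValueError("expected 8 or 16 hexadecimal digits")
    bits = 4 * len(hex_digits)
    value = int(hex_digits, 16)
    # Two's complement: a set top bit means the value is negative.
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

assert decode_int("FFFFFFFF", signed=True) == -1
assert decode_int("7FFFFFFF", signed=True) == 2**31 - 1
```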
<P>
Floating-point values are written decomposed into their sign, encoded exponent,
and encoded significand.
First is the sign character <NOBR>(<CODE>+</CODE> or <CODE>-</CODE>),</NOBR>
followed by the encoded exponent in hexadecimal, then a period
(<CODE>.</CODE>), and lastly the encoded significand in hexadecimal.
</P>

<P>
For <NOBR>16-bit</NOBR> half-precision, notable values include:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR><TD><CODE>+00.000&nbsp;</CODE></TD><TD>+0</TD></TR>
<TR><TD><CODE>+0F.000</CODE></TD><TD>&nbsp;1</TD></TR>
<TR><TD><CODE>+10.000</CODE></TD><TD>&nbsp;2</TD></TR>
<TR><TD><CODE>+1E.3FF</CODE></TD><TD>maximum finite value</TD></TR>
<TR><TD><CODE>+1F.000</CODE></TD><TD>+infinity</TD></TR>
<TR><TD>&nbsp;</TD></TR>
<TR><TD><CODE>-00.000</CODE></TD><TD>−0</TD></TR>
<TR><TD><CODE>-0F.000</CODE></TD><TD>−1</TD></TR>
<TR><TD><CODE>-10.000</CODE></TD><TD>−2</TD></TR>
<TR>
<TD><CODE>-1E.3FF</CODE></TD>
<TD>minimum finite value (largest magnitude, but negative)</TD>
</TR>
<TR><TD><CODE>-1F.000</CODE></TD><TD>−infinity</TD></TR>
</TABLE>
</BLOCKQUOTE>
Certain categories are easily distinguished (assuming the <CODE>x</CODE>s are
not all 0):
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>+00.xxx&nbsp;</CODE></TD>
<TD>positive subnormal numbers</TD>
</TR>
<TR><TD><CODE>+1F.xxx</CODE></TD><TD>positive NaNs</TD></TR>
<TR><TD><CODE>-00.xxx</CODE></TD><TD>negative subnormal numbers</TD></TR>
<TR><TD><CODE>-1F.xxx</CODE></TD><TD>negative NaNs</TD></TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
Likewise for other formats:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR><TD>32-bit single</TD><TD>64-bit double</TD><TD>128-bit quadruple</TD></TR>
<TR><TD>&nbsp;</TD></TR>
<TR>
<TD><CODE>+00.000000&nbsp;</CODE></TD>
<TD><CODE>+000.0000000000000&nbsp;</CODE></TD>
<TD><CODE>+0000.0000000000000000000000000000&nbsp;</CODE></TD>
<TD>+0</TD>
</TR>
<TR>
<TD><CODE>+7F.000000</CODE></TD>
<TD><CODE>+3FF.0000000000000</CODE></TD>
<TD><CODE>+3FFF.0000000000000000000000000000</CODE></TD>
<TD>&nbsp;1</TD>
</TR>
<TR>
<TD><CODE>+80.000000</CODE></TD>
<TD><CODE>+400.0000000000000</CODE></TD>
<TD><CODE>+4000.0000000000000000000000000000</CODE></TD>
<TD>&nbsp;2</TD>
</TR>
<TR>
<TD><CODE>+FE.7FFFFF</CODE></TD>
<TD><CODE>+7FE.FFFFFFFFFFFFF</CODE></TD>
<TD><CODE>+7FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF</CODE></TD>
<TD>maximum finite value</TD>
</TR>
<TR>
<TD><CODE>+FF.000000</CODE></TD>
<TD><CODE>+7FF.0000000000000</CODE></TD>
<TD><CODE>+7FFF.0000000000000000000000000000</CODE></TD>
<TD>+infinity</TD>
</TR>
<TR><TD>&nbsp;</TD></TR>
<TR>
<TD><CODE>-00.000000&nbsp;</CODE></TD>
<TD><CODE>-000.0000000000000&nbsp;</CODE></TD>
<TD><CODE>-0000.0000000000000000000000000000&nbsp;</CODE></TD>
<TD>−0</TD>
</TR>
<TR>
<TD><CODE>-7F.000000</CODE></TD>
<TD><CODE>-3FF.0000000000000</CODE></TD>
<TD><CODE>-3FFF.0000000000000000000000000000</CODE></TD>
<TD>−1</TD>
</TR>
<TR>
<TD><CODE>-80.000000</CODE></TD>
<TD><CODE>-400.0000000000000</CODE></TD>
<TD><CODE>-4000.0000000000000000000000000000</CODE></TD>
<TD>−2</TD>
</TR>
<TR>
<TD><CODE>-FE.7FFFFF</CODE></TD>
<TD><CODE>-7FE.FFFFFFFFFFFFF</CODE></TD>
<TD><CODE>-7FFE.FFFFFFFFFFFFFFFFFFFFFFFFFFFF</CODE></TD>
<TD>minimum finite value</TD>
</TR>
<TR>
<TD><CODE>-FF.000000</CODE></TD>
<TD><CODE>-7FF.0000000000000</CODE></TD>
<TD><CODE>-7FFF.0000000000000000000000000000</CODE></TD>
<TD>−infinity</TD>
</TR>
<TR><TD>&nbsp;</TD></TR>
<TR>
<TD><CODE>+00.xxxxxx</CODE></TD>
<TD><CODE>+000.xxxxxxxxxxxxx</CODE></TD>
<TD><CODE>+0000.xxxxxxxxxxxxxxxxxxxxxxxxxxxx</CODE></TD>
<TD>positive subnormals</TD>
</TR>
<TR>
<TD><CODE>+FF.xxxxxx</CODE></TD>
<TD><CODE>+7FF.xxxxxxxxxxxxx</CODE></TD>
<TD><CODE>+7FFF.xxxxxxxxxxxxxxxxxxxxxxxxxxxx</CODE></TD>
<TD>positive NaNs</TD>
</TR>
<TR>
<TD><CODE>-00.xxxxxx</CODE></TD>
<TD><CODE>-000.xxxxxxxxxxxxx</CODE></TD>
<TD><CODE>-0000.xxxxxxxxxxxxxxxxxxxxxxxxxxxx</CODE></TD>
<TD>negative subnormals</TD>
</TR>
<TR>
<TD><CODE>-FF.xxxxxx</CODE></TD>
<TD><CODE>-7FF.xxxxxxxxxxxxx</CODE></TD>
<TD><CODE>-7FFF.xxxxxxxxxxxxxxxxxxxxxxxxxxxx</CODE></TD>
<TD>negative NaNs</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

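<P>
To check a reported value by hand, the sign, exponent, and significand fields
can be turned back into a number.
The Python sketch below does this for the four formats with a hidden leading
bit; it is an illustration, not TestFloat code.
The name <CODE>decode_float</CODE> is hypothetical, and the sketch assumes the
format can be recognized from the number of hexadecimal digits on each side of
the period.
Results are exact rationals, so the sign of a zero is not preserved.
</P>

```python
from fractions import Fraction

# (exponent hex digits, significand hex digits) -> (exponent bits, fraction bits)
FORMATS = {(2, 3): (5, 10), (2, 6): (8, 23), (3, 13): (11, 52), (4, 28): (15, 112)}

def decode_float(text: str):
    """Decode e.g. '+7F.000000' to a Fraction, or to an infinity/NaN label."""
    if text[0] not in "+-":
        raise ValueError("missing sign character")
    sign = -1 if text[0] == "-" else 1
    exp_hex, frac_hex = text[1:].split(".")
    exp_bits, frac_bits = FORMATS[(len(exp_hex), len(frac_hex))]
    exp, frac = int(exp_hex, 16), int(frac_hex, 16)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == (1 << exp_bits) - 1:          # all-ones exponent: infinity or NaN
        if frac:
            return "NaN"
        return "+infinity" if sign > 0 else "-infinity"
    fraction = Fraction(frac, 1 << frac_bits)
    if exp == 0:                            # zero or subnormal: no hidden bit
        return sign * fraction * Fraction(2) ** (1 - bias)
    return sign * (1 + fraction) * Fraction(2) ** (exp - bias)

assert decode_float("+7F.000000") == 1      # 32-bit single-precision
assert decode_float("-10.000") == -2        # 16-bit half-precision
```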
<P>
The <NOBR>80-bit</NOBR> double-extended-precision values are a little unusual
in that the leading bit of precision is not hidden as with other formats.
When canonically encoded, the leading significand bit of an <NOBR>80-bit</NOBR>
double-extended-precision value will be 0 if the value is zero or subnormal,
and will be 1 otherwise.
Hence, the same values listed above appear in <NOBR>80-bit</NOBR>
double-extended-precision as follows (note the leading <CODE>8</CODE> digit in
the significands):
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>+0000.0000000000000000&nbsp;</CODE></TD>
<TD>+0</TD>
</TR>
<TR><TD><CODE>+3FFF.8000000000000000</CODE></TD><TD>&nbsp;1</TD></TR>
<TR><TD><CODE>+4000.8000000000000000</CODE></TD><TD>&nbsp;2</TD></TR>
<TR>
<TD><CODE>+7FFE.FFFFFFFFFFFFFFFF</CODE></TD>
<TD>maximum finite value</TD>
</TR>
<TR><TD><CODE>+7FFF.8000000000000000</CODE></TD><TD>+infinity</TD></TR>
<TR><TD>&nbsp;</TD></TR>
<TR><TD><CODE>-0000.0000000000000000</CODE></TD><TD>−0</TD></TR>
<TR><TD><CODE>-3FFF.8000000000000000</CODE></TD><TD>−1</TD></TR>
<TR><TD><CODE>-4000.8000000000000000</CODE></TD><TD>−2</TD></TR>
<TR>
<TD><CODE>-7FFE.FFFFFFFFFFFFFFFF</CODE></TD>
<TD>minimum finite value</TD>
</TR>
<TR><TD><CODE>-7FFF.8000000000000000</CODE></TD><TD>−infinity</TD></TR>
</TABLE>
</BLOCKQUOTE>
</P>

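<P>
Because the leading bit is explicit, decoding an <NOBR>80-bit</NOBR>
double-extended-precision value is slightly different: the full
<NOBR>64-bit</NOBR> significand is scaled directly, with no hidden bit added.
The following Python sketch illustrates this under the assumption that the
value is canonically encoded; the name <CODE>decode_extF80</CODE> is
hypothetical and the code is not taken from TestFloat.
</P>

```python
from fractions import Fraction

def decode_extF80(text: str):
    """Decode e.g. '+3FFF.8000000000000000', assuming a canonical encoding."""
    sign = -1 if text[0] == "-" else 1
    exp_hex, sig_hex = text[1:].split(".")
    exp, sig = int(exp_hex, 16), int(sig_hex, 16)   # 15-bit exponent, 64-bit significand
    if exp == 0x7FFF:
        if sig == 1 << 63:                  # leading bit set, all other bits zero
            return "+infinity" if sign > 0 else "-infinity"
        return "NaN"
    # The significand has 1 integer bit and 63 fraction bits.
    # Exponent field 0 (zero or subnormal) uses the same scale as field 1.
    return sign * sig * Fraction(2) ** (max(exp, 1) - 16383 - 63)
```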
<P>
Lastly, exception flag values are represented by five characters, one character
per flag.
Each flag is written as either a letter or a period (<CODE>.</CODE>) according
to whether the flag was set or not by the operation.
A period indicates the flag was not set.
The letter used to indicate a set flag depends on the flag:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>v&nbsp;</CODE></TD>
<TD><I>invalid</I> exception</TD>
</TR>
<TR>
<TD><CODE>i</CODE></TD>
<TD><I>infinite</I> exception (“divide by zero”)</TD>
</TR>
<TR><TD><CODE>o</CODE></TD><TD><I>overflow</I> exception</TD></TR>
<TR><TD><CODE>u</CODE></TD><TD><I>underflow</I> exception</TD></TR>
<TR><TD><CODE>x</CODE></TD><TD><I>inexact</I> exception</TD></TR>
</TABLE>
</BLOCKQUOTE>
For example, the notation <CODE>...ux</CODE> indicates that the
<I>underflow</I> and <I>inexact</I> exception flags were set and that the other
three flags (<I>invalid</I>, <I>infinite</I>, and <I>overflow</I>) were not
set.
The exception flags are always written following the value returned as the
result of the operation.
</P>


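<P>
The five-character flag notation maps directly onto a set of flag names.
The Python sketch below is illustrative and not part of TestFloat; the name
<CODE>decode_flags</CODE> is hypothetical, and the sketch assumes the flags
always appear in the order listed in the table above
(<CODE>v</CODE>, <CODE>i</CODE>, <CODE>o</CODE>, <CODE>u</CODE>,
<CODE>x</CODE>), which is consistent with the <CODE>...ux</CODE> example.
</P>

```python
FLAG_ORDER = "vioux"   # assumed fixed position of each flag letter
FLAG_NAMES = {"v": "invalid", "i": "infinite", "o": "overflow",
              "u": "underflow", "x": "inexact"}

def decode_flags(text: str) -> set:
    """Return the names of the flags that are set in a string such as '...ux'."""
    if len(text) != len(FLAG_ORDER):
        raise ValueError("expected exactly five flag characters")
    raised = set()
    for letter, ch in zip(FLAG_ORDER, text):
        if ch == letter:
            raised.add(FLAG_NAMES[letter])
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} for flag {letter!r}")
    return raised

assert decode_flags("...ux") == {"underflow", "inexact"}
```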
<H2>8. Variations Allowed by the IEEE Floating-Point Standard</H2>

<P>
The IEEE Floating-Point Standard admits some variation among conforming
implementations.
Because TestFloat expects the two implementations being compared to deliver
bit-for-bit identical results under most circumstances, this leeway in the
standard can result in false errors being reported if the two implementations
do not make the same choices everywhere the standard provides an option.
</P>

<H3>8.1. Underflow</H3>

<P>
The standard specifies that the <I>underflow</I> exception flag is to be raised
when two conditions are met simultaneously:
<NOBR>(1) <I>tininess</I></NOBR> and <NOBR>(2) <I>loss of accuracy</I></NOBR>.
</P>

<P>
A result is tiny when its magnitude is nonzero yet smaller than any normalized
floating-point number.
The standard allows tininess to be determined either before or after a result
is rounded to the destination precision.
If tininess is detected before rounding, some borderline cases will be flagged
as underflows even though the result after rounding actually lies within the
normal floating-point range.
By detecting tininess after rounding, a system can avoid some unnecessary
signaling of underflow.
All the TestFloat programs support options <CODE>-tininessbefore</CODE> and
<CODE>-tininessafter</CODE> to control whether TestFloat expects tininess on
underflow to be detected before or after rounding.
One or the other is selected as the default when TestFloat is compiled, but
these command options allow the default to be overridden.
</P>

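<P>
A borderline case can be worked through in exact arithmetic.
The Python sketch below is an illustration only, not TestFloat or SoftFloat
code: it models <NOBR>32-bit</NOBR> single-precision with
round-to-nearest-even, and the names <CODE>round_unbounded</CODE>,
<CODE>tiny_before</CODE>, and <CODE>tiny_after</CODE> are hypothetical.
“After rounding” is taken here to mean rounding to 24 significant
bits as if the exponent range were unbounded.
</P>

```python
from fractions import Fraction

P = 24                                # significand bits of 32-bit single-precision
MIN_NORMAL = Fraction(1, 2 ** 126)    # smallest positive normal value

def round_unbounded(x: Fraction, p: int = P) -> Fraction:
    """Round positive x to p significant bits (nearest-even), exponent unbounded."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1                        # now 2**e <= x < 2**(e + 1)
    ulp = Fraction(2) ** (e - p + 1)
    return round(x / ulp) * ulp       # round() on a Fraction ties to even

def tiny_before(x: Fraction) -> bool:
    return 0 < x < MIN_NORMAL

def tiny_after(x: Fraction) -> bool:
    return 0 < round_unbounded(x) < MIN_NORMAL

# An exact result just below the smallest normal number.
borderline = MIN_NORMAL * (1 - Fraction(1, 2 ** 26))
assert tiny_before(borderline)        # tiny if checked before rounding
assert not tiny_after(borderline)     # rounds up to MIN_NORMAL, so not tiny after
```

<P>
Since this result is inexact either way, an implementation that detects
tininess before rounding would raise both <I>underflow</I> and <I>inexact</I>
here, while one that detects it after rounding would raise only
<I>inexact</I>.
</P>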
<P>
Loss of accuracy occurs when the subnormal format is not sufficient to
represent an underflowed result accurately.
The original 1985 version of the IEEE Standard allowed loss of accuracy to be
detected either as an <I>inexact result</I> or as a
<I>denormalization loss</I>;
however, few if any systems ever chose the latter.
The latest standard requires that loss of accuracy be detected as an inexact
result, and TestFloat can test only for this case.
</P>

<H3>8.2. NaNs</H3>

<P>
The IEEE Standard gives the floating-point formats a large number of NaN
encodings and specifies that NaNs are to be returned as results under certain
conditions.
However, the standard allows an implementation almost complete freedom over
<EM>which</EM> NaN to return in each situation.
</P>

<P>
By default, TestFloat does not check the bit patterns of NaN results.
When the result of an operation should be a NaN, any NaN is considered as good
as another.
This laxness can be overridden with the <CODE>-checkNaNs</CODE> option of
programs <CODE>testfloat_ver</CODE> and <CODE>testfloat</CODE>.
In order for this option to be sensible, TestFloat must have been compiled so
that its internal floating-point implementation (SoftFloat) generates the
proper NaN results for the system being tested.
</P>

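<P>
The default comparison rule for NaN results can be pictured as follows.
This Python sketch models only the behavior described above for
<NOBR>32-bit</NOBR> single-precision bit patterns; it is not TestFloat’s
actual code, and the names <CODE>is_nan_f32</CODE> and
<CODE>results_match</CODE> and the <CODE>check_nans</CODE> parameter are
assumptions made for the example.
</P>

```python
def is_nan_f32(bits: int) -> bool:
    """True if a 32-bit pattern has an all-ones exponent and a nonzero fraction."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0

def results_match(observed: int, expected: int, check_nans: bool = False) -> bool:
    """Compare two single-precision results given as 32-bit integers."""
    if not check_nans and is_nan_f32(observed) and is_nan_f32(expected):
        return True                   # by default, any NaN is as good as another
    return observed == expected       # otherwise require a bit-for-bit match

# Two different NaN encodings match unless NaN checking is requested.
assert results_match(0x7FC00000, 0xFFC00001)
assert not results_match(0x7FC00000, 0xFFC00001, check_nans=True)
```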
<H3>8.3. Conversions to Integer</H3>

<P>
Conversion of a floating-point value to an integer format will fail if the
source value is a NaN or if it is too large.
The IEEE Standard does not specify what value should be returned as the integer
result in these cases.
Moreover, according to the standard, the <I>invalid</I> exception can be raised
or an unspecified alternative mechanism may be used to signal such cases.
</P>

<P>
TestFloat assumes that conversions to integer will raise the <I>invalid</I>
exception if the source value cannot be rounded to a representable integer.
In such cases, TestFloat expects the result value to be the largest-magnitude
positive or negative integer or zero, as detailed earlier in
<NOBR>section 6.1</NOBR>, <I>Conversion Operations</I>.
If option <CODE>-checkInvInts</CODE> is selected with programs
<CODE>testfloat_ver</CODE> and <CODE>testfloat</CODE>, integer results of
invalid operations are checked for an exact match.
In order for this option to be sensible, TestFloat must have been compiled so
that its internal floating-point implementation (SoftFloat) generates the
proper integer results for the system being tested.
</P>


<H2>9. Contact Information</H2>

<P>
At the time of this writing, the most up-to-date information about TestFloat
and the latest release can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/TestFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/TestFloat.html</CODE></NOBR></A>.
</P>


</BODY>
