# 9.27 Float

FriCAS provides two kinds of floating point numbers. The domain Float (abbreviation FLOAT) implements a model of arbitrary precision floating point numbers. The domain DoubleFloat (abbreviation DFLOAT) is intended to make hardware floating point arithmetic available in FriCAS. The actual model of floating point that DoubleFloat provides is system-dependent. For example, on the IBM System/370, FriCAS uses IBM double precision, which has fourteen hexadecimal digits of precision, or roughly sixteen decimal digits. Arbitrary precision floats allow the user to specify the precision at which arithmetic operations are computed. Although this is an attractive facility, it comes at a cost: arbitrary precision floating point arithmetic typically takes twenty to two hundred times longer than hardware floating point arithmetic.

For more information about FriCAS’s numeric and graphic facilities, see ugGraphPage in Section ugGraphNumber, ugProblemNumeric, and DoubleFloatXmpPage.

## 9.27.1 Introduction to Float

Scientific notation is supported for input and output of floating point numbers. A floating point number is written as a string of digits containing a decimal point optionally followed by the letter E, and then the exponent.

We begin by doing some calculations using arbitrary precision floats. The default precision is twenty decimal digits.

1.234


 1.234

Type: Float

A decimal base for the exponent is assumed, so the number 1.234E2 denotes $1.234 \times 10^2$.

1.234E2


 123.4

Type: Float

The normal arithmetic operations are available for floating point numbers.

sqrt(1.2 + 2.3 / 3.4 ^ 4.5)


 1.0997

Type: Float

## 9.27.2 Conversion Functions

You can use conversion (ugTypesConvertPage in Section ugTypesConvertNumber) to go back and forth between Integer, Fraction Integer and Float, as appropriate.

i := 3 :: Float


 3

Type: Float

i :: Integer


 3

Type: Integer

i :: Fraction Integer


 3

Type: Fraction Integer

Since you are explicitly asking for a conversion, you must take responsibility for any loss of exactness.

r := 3/7 :: Float


 0.428571

Type: Float

r :: Fraction Integer


 3/7

Type: Fraction Integer
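
The round trip from 3/7 to a floating point value and back can be mimicked outside FriCAS. The following Python sketch is only an analogue: it uses the standard-library `fractions` module and hardware doubles, and the denominator bound of 1000 is an arbitrary assumption chosen for illustration, not a description of how FriCAS picks its rational approximation.

```python
from fractions import Fraction

exact = Fraction(3, 7)
approx = float(exact)            # nearest hardware double to 3/7

# The double is itself an exact rational with a power-of-two
# denominator, so it cannot be equal to 3/7.
as_stored = Fraction(approx)
print(as_stored == exact)

# Searching for the closest fraction with a small denominator
# (bound of 1000 is an assumption) recovers the starting value.
recovered = as_stored.limit_denominator(1000)
print(recovered)
```

The first comparison shows where the exactness is lost; the second step shows that a simple rational can still be recovered when the float is close enough to it.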

This conversion cannot be performed: use truncate or round if that is what you intend.

r :: Integer

Cannot convert from type Float to Integer for value
0.4285714285 7142857143


The operations truncate and round truncate ...

truncate 3.6


 3

Type: Float

and round to the nearest integral Float respectively.

round 3.6


 4

Type: Float

truncate(-3.6)


 -3

Type: Float

round(-3.6)


 -4

Type: Float

The operation fractionPart computes the fractional part of x, that is, x - truncate x.

fractionPart 3.6


 0.6

Type: Float
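
The behavior of the three operations on the sample values 3.6 and -3.6 can be reproduced with a short Python sketch based on the standard-library `decimal` module. The helper names `truncate`, `round_nearest` and `fraction_part` are hypothetical stand-ins, and the tie-breaking rule (halves rounded away from zero) is an assumption, since the examples above do not exercise a tie.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def truncate(x: Decimal) -> Decimal:
    """Drop the fractional digits, moving toward zero."""
    return x.to_integral_value(rounding=ROUND_DOWN)

def round_nearest(x: Decimal) -> Decimal:
    """Round to the nearest integral value (ties away from zero: an assumption)."""
    return x.to_integral_value(rounding=ROUND_HALF_UP)

def fraction_part(x: Decimal) -> Decimal:
    """x - truncate(x), following the definition in the text."""
    return x - truncate(x)

for text in ("3.6", "-3.6"):
    x = Decimal(text)
    print(text, truncate(x), round_nearest(x), fraction_part(x))
```

Note that with this definition the fractional part carries the sign of x, so `fraction_part(Decimal("-3.6"))` is -0.6 rather than 0.4.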

The operation digits allows the user to set the precision. It returns the previous value it was using.

digits 40


 20

Type: PositiveInteger

sqrt 0.2


 0.447214

Type: Float

pi()$Float


 3.14159

Type: Float

The precision is only limited by the computer memory available. Calculations at 500 or more digits of precision are not difficult.

digits 500


 40

Type: PositiveInteger

pi()$Float


 3.14159

Type: Float

Reset digits to its default value.

digits 20


 500

Type: PositiveInteger
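
The idea of a user-selectable working precision can also be tried in Python with the standard-library `decimal` module. This is a loose analogue only: `decimal` works in base ten, whereas Float uses a binary exponent, and the helper `sqrt_at` is a hypothetical name introduced for this sketch.

```python
from decimal import Decimal, localcontext

def sqrt_at(value: str, digits: int) -> Decimal:
    """Square root of a decimal literal using `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits          # similar in spirit to `digits n`
        return Decimal(value).sqrt()

low = sqrt_at("0.2", 20)
high = sqrt_at("0.2", 40)
print(low)
print(high)
```

Using a local context keeps the precision change scoped to one computation, which avoids the need to reset a global setting afterwards.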

Numbers of type Float are represented as a record of two integers, namely, the mantissa and the exponent, where the base of the exponent is binary. That is, the floating point number (m,e) represents the number $m \cdot 2^e$. A consequence of using a binary base is that decimal numbers cannot, in general, be represented exactly.
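
The consequence of a binary exponent can be made concrete with a Python sketch. It decomposes a hardware double, not a FriCAS Float, into integers (m, e) with value m times 2 to the power e; the helper `mantissa_exponent` and its choice to normalise m to an odd integer are assumptions made for this illustration.

```python
from fractions import Fraction

def mantissa_exponent(x: float) -> tuple[int, int]:
    """Return integers (m, e) with x == m * 2**e exactly and m odd (or zero)."""
    num, den = x.as_integer_ratio()    # den is always a power of two
    e = -(den.bit_length() - 1)        # den == 2**(-e)
    # Strip remaining factors of two from the mantissa.
    while num != 0 and num % 2 == 0:
        num //= 2
        e += 1
    return num, e

m, e = mantissa_exponent(0.1)
stored = Fraction(m) * Fraction(2) ** e
print(m, e)
print(stored == Fraction(1, 10))       # False: 1/10 has no finite binary expansion
print(mantissa_exponent(0.375))        # (3, -3), since 0.375 == 3 * 2**-3
```

A value such as 0.375 is a sum of negative powers of two and is stored exactly, while 0.1 is replaced by the nearest representable number.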

## 9.27.3 Output Functions

A number of operations exist for specifying how numbers of type Float are to be displayed. By default, spaces are inserted every ten digits in the output for readability. Note that you cannot include spaces in the input form of a floating point number, though you can use underscores.

Output spacing can be modified with the outputSpacing operation. This inserts no spaces and then displays the value of x.

outputSpacing 0; x := sqrt 0.2


 0.44721359549995793928

Type: Float

Issue this to have the spaces inserted every 5 digits.

outputSpacing 5; x


 0.44721 35954 99957 93928

Type: Float
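
The effect of digit grouping on a 20-digit value can be pictured with a small Python sketch. The helper `group_digits` is hypothetical and only groups the digits after the decimal point; it is not FriCAS's output routine, and the sample string is the 20-digit value of sqrt 0.2 that appears elsewhere in this section.

```python
def group_digits(number: str, every: int) -> str:
    """Insert a space after each block of `every` digits following the decimal point.

    A value of `every` that is zero or negative leaves the string unchanged.
    """
    if every <= 0 or "." not in number:
        return number
    whole, frac = number.split(".", 1)
    blocks = [frac[i:i + every] for i in range(0, len(frac), every)]
    return whole + "." + " ".join(blocks)

x = "0.44721359549995793928"
print(group_digits(x, 0))     # no spaces
print(group_digits(x, 5))     # blocks of five digits
print(group_digits(x, 10))    # blocks of ten digits, like the default
```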

By default, the system displays floats in either fixed format or scientific format, depending on the magnitude of the number.

y := x/10^10


 0.44721359549995793928E -10

Type: Float

A particular format may be requested with the operations outputFloating and outputFixed.

outputFloating(); x


 0.44721359549995793928E 0

Type: Float

outputFixed(); y


 0.00000 00000 44721 35954 99957 93928

Type: Float

Additionally, you can ask for n digits to be displayed after the decimal point.

outputFloating 2; y


 0.45E -10

Type: Float

outputFixed 2; x


 0.45

Type: Float

This resets the output printing to the default behavior.

outputGeneral()


Type: Void

## 9.27.4 An Example: Determinant of a Hilbert Matrix

Consider the problem of computing the determinant of a 10 by 10 Hilbert matrix. The (i,j)-th entry of a Hilbert matrix is given by 1/(i+j+1).

First do the computation using rational numbers to obtain the exact result.

a: Matrix Fraction Integer := matrix [ [1/(i+j+1) for j in 0..9] for i in 0..9]


 [11213141516171819110121314151617181911011113141516171819110111112141516171819110111112113151617181911011111211311416171819110111112113114115171819110111112113114115116181911011111211311411511611719110111112113114115116117118110111112113114115116117118119]

Type: Matrix Fraction Integer

This version of determinant uses Gaussian elimination.

d:= determinant a


 1/46206893947914691316295628839036278726983680000000000

Type: Fraction Integer

d :: Float


 0.21641792264314918691E -52

Type: Float

Now use hardware floats. Note that a semicolon (;) is used to prevent the display of the matrix.

b: Matrix DoubleFloat := matrix [ [1/(i+j+1$DoubleFloat) for j in 0..9] for i in 0..9];

Type: Matrix DoubleFloat

The result given by hardware floats is correct only to four significant digits of precision. In the jargon of numerical analysis, the Hilbert matrix is said to be ill-conditioned.

determinant b


 2.16437e-53

Type: DoubleFloat

Now repeat the computation at a higher precision using Float.

digits 40


 20

Type: PositiveInteger

c: Matrix Float := matrix [ [1/(i+j+1$Float) for j in 0..9] for i in 0..9];

Type: Matrix Float

determinant c


 0.2164179226431491869060594983622617436159E -52

Type: Float

Reset digits to its default value.

digits 20


 40

Type: PositiveInteger
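
The same experiment can be sketched in Python to see the ill-conditioning independently of FriCAS. The code below is an illustration under stated assumptions: `determinant` is a hand-written Gaussian elimination with partial pivoting (not the FriCAS implementation), `hilbert` is a hypothetical helper, exact arithmetic comes from the standard-library `fractions` module, and the hardware result may differ in its trailing digits from the DoubleFloat value shown above.

```python
from fractions import Fraction

def determinant(rows):
    """Determinant by Gaussian elimination with partial pivoting.

    Works for entries supporting +, -, *, / and abs (Fraction or float).
    """
    a = [list(r) for r in rows]        # work on a copy
    n = len(a)
    det = 1
    for k in range(n):
        # Choose the row with the largest entry in column k as pivot row.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0:
            return a[p][k] * 0         # singular matrix
        if p != k:
            a[k], a[p] = a[p], a[k]
            det = -det                 # a row swap flips the sign
        det = det * a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return det

def hilbert(n, one):
    """n-by-n Hilbert matrix with entries one/(i+j+1), indices starting at 0."""
    return [[one / (i + j + 1) for j in range(n)] for i in range(n)]

exact = determinant(hilbert(10, Fraction(1)))   # exact rational arithmetic
approx = determinant(hilbert(10, 1.0))          # hardware doubles

print(exact)
print(float(exact))
print(approx)
print(abs(approx - float(exact)) / float(exact))   # relative error
```

Passing the value of "one" as an argument lets the same elimination code run in either arithmetic, so any difference between the two results comes from rounding alone.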