API
Scalar Functions
- gfloat.decode_float(fi, i)
Given a FormatInfo and an integer code point, decode to a FloatValue.
- Parameters:
fi (FormatInfo) – Floating point format descriptor.
i (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) = fi.k
- Returns:
Decoded float value
- Raises:
ValueError – If i is outside the range of valid code points in fi.
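To illustrate what decoding a code point involves, here is a minimal, self-contained sketch for a signed IEEE-style layout (sign bit, exponent field, trailing significand). It is not gfloat's implementation: the helper name decode_ieee_like and its parameters are hypothetical, and it ignores Inf/NaN encodings and the other special-value rules that a FormatInfo can describe.

```python
def decode_ieee_like(k: int, precision: int, bias: int, i: int) -> float:
    """Decode code point i of a k-bit signed IEEE-style format (finite values only).

    Hypothetical helper for illustration; not part of gfloat.
    """
    if not 0 <= i < 2**k:
        raise ValueError(f"Code point {i} not in range [0, 2**{k})")
    t = precision - 1                 # trailing significand bits
    w = k - 1 - t                     # exponent bits (one bit is the sign)
    sign = -1.0 if (i >> (k - 1)) & 1 else 1.0
    exp = (i >> t) & ((1 << w) - 1)   # raw (biased) exponent field
    sig = i & ((1 << t) - 1)          # trailing significand field
    if exp == 0:
        # Subnormal: no implicit leading bit, exponent fixed at 1 - bias
        return sign * sig * 2.0 ** (1 - bias - t)
    # Normal: implicit leading 1
    return sign * (1 + sig * 2.0**-t) * 2.0 ** (exp - bias)


# An 8-bit layout with 5 exponent bits, precision 3 and bias 15 (assumed values).
print(decode_ieee_like(8, 3, 15, 0x3C))  # 0b0_01111_00 decodes to 1.0
```

The range check mirrors the documented ValueError for code points outside \(0 \le i < 2^{k}\).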
- gfloat.round_float(fi, v, rnd=RoundMode.TiesToEven, sat=False)
Round input to the given FormatInfo, given rounding mode and saturation flag.
An input NaN will convert to a NaN in the target. An input Infinity will convert to the largest float if sat, otherwise to an Inf, if present, otherwise to a NaN. Negative zero will be returned if the format has negative zero, otherwise zero.
- Parameters:
- Returns:
A float which is one of the values in the format.
- Raises:
ValueError – The target format cannot represent the input (e.g. converting a NaN, or an Inf when the target has no NaN or Inf, and sat is false)
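The precedence described above for an infinite input can be sketched as a small decision function. This is an illustration only: convert_infinity and its flag arguments are hypothetical names, standing in for properties that the real function reads from fi.

```python
import math


def convert_infinity(v: float, fmax: float, has_infs: bool, has_nans: bool, sat: bool) -> float:
    """Map +/-inf following the documented order: saturate, else Inf, else NaN.

    Hypothetical helper for illustration; not part of gfloat.
    """
    if not math.isinf(v):
        raise ValueError("This sketch only handles infinite inputs")
    if sat:
        # Saturating: clamp to the largest finite value, keeping the sign
        return math.copysign(fmax, v)
    if has_infs:
        return v
    if has_nans:
        return math.nan
    # Neither Inf nor NaN is available and saturation is off
    raise ValueError("Target format cannot represent infinity")


print(convert_infinity(-math.inf, 448.0, has_infs=False, has_nans=True, sat=True))
```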
- gfloat.encode_float(fi, v)
Encode input to the given FormatInfo.
Will round toward zero if v is not in the value set. Will saturate to Inf, NaN, fi.max in order of precedence. Encode -0 to 0 if not fi.has_nz.
For other roundings and saturations, call round_float() first.
- Parameters:
fi (FormatInfo) – Describes the target format
v (float) – The value to be encoded.
- Returns:
The integer code point
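As a rough picture of "round toward zero, then saturate", the sketch below encodes into a toy unsigned format given as a sorted table of values, where the code point is the table index. The table TOY_VALUES and the helper encode_toward_zero are invented for illustration. The toy format has neither Inf nor NaN, so out-of-range inputs fall through to the maximum value, the last option in the documented precedence.

```python
import bisect

# Hypothetical toy format: code point i encodes TOY_VALUES[i]
TOY_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode_toward_zero(values: list, v: float) -> int:
    """Return the code of the largest table value <= v (non-negative v only)."""
    if v != v or v < 0:
        raise ValueError("Toy format has no NaN and no negative values")
    # bisect_right finds the first entry strictly greater than v;
    # the entry before it is v rounded toward zero.
    return bisect.bisect_right(values, v) - 1


print(encode_toward_zero(TOY_VALUES, 2.9))    # code 4, which encodes 2.0
print(encode_toward_zero(TOY_VALUES, 100.0))  # code 7, saturated to the maximum 6.0
```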
Block format functions
- gfloat.decode_block(fi, block)
Decode a block of integer codepoints in Block Format fi.
The scale is encoded in the first value of block, with the remaining values encoding the block elements.
The size of the iterable is not checked against the format descriptor.
- Parameters:
fi (BlockFormatInfo) – Describes the block format
block (Iterable[int]) – Input block
- Returns:
A sequence of floats representing the encoded values.
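The block layout described above (scale first, elements after) can be sketched with a generator. This is an assumption-laden illustration, not the library's code: decode_block_sketch is a hypothetical name, the two decoder callables stand in for decoding with the scale and element formats, and it assumes each decoded element is multiplied by the decoded scale.

```python
from typing import Callable, Iterable, Iterator


def decode_block_sketch(
    decode_scale: Callable[[int], float],
    decode_elem: Callable[[int], float],
    block: Iterable[int],
) -> Iterator[float]:
    """Yield scale * element for each element code following the leading scale code."""
    it = iter(block)
    scale = decode_scale(next(it))  # first code point is the scale
    for code in it:                 # the rest are block elements
        yield scale * decode_elem(code)


# Toy decoders (assumptions): a power-of-two scale with bias 127, integer elements.
vals = list(decode_block_sketch(lambda c: 2.0 ** (c - 127), float, [128, 1, 2, 3]))
print(vals)  # [2.0, 4.0, 6.0]
```

As in the documented function, the sketch does not check the length of the iterable against any block size.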
- gfloat.encode_block(fi, scale, vals, round=RoundMode.TiesToEven)
Encode float vals into the block format described by fi.
The scale is explicitly passed, and the vals are assumed to already be multiplied by 1/scale. That is, this is pure encoding; scaling is computed and applied elsewhere (see e.g. quantize_block()).
The scale is checked for overflow in the target format, and an exception is raised if it overflows.
- Parameters:
- Returns:
A sequence of ints representing the encoded values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
- gfloat.quantize_block(fi, vals, compute_scale, round=RoundMode.TiesToEven)
Encode and decode a block of vals into the block format described by fi.
- Parameters:
fi (BlockFormatInfo) – Describes the target block format
vals (numpy.array) – Input block
compute_scale ((float, ArrayLike) -> float) – Callable to compute the scale, defaults to compute_scale_amax()
round (RoundMode) – Rounding mode to use, defaults to TiesToEven
- Returns:
An array of floats representing the quantized values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
- gfloat.compute_scale_amax(emax, vals)
Compute a scale factor such that vals can be scaled to the range [0, 2**emax]. That is, scale is computed such that the largest exponent in the array vals * scale will be emax.
The scale is clipped to the range 2**[-127, 127].
If all values are zero, any scale value smaller than emax would be accurate, but returning the smallest possible means that quick checks on the magnitude to identify near-zero blocks will also find the all-zero blocks.
- Parameters:
- Returns:
A float such that vals * scale has exponents less than or equal to emax.
Note
If all vals are zero, 1.0 is returned.
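A pure-Python sketch of an amax-based power-of-two scale follows. It takes the "vals * scale" wording above literally and returns 1.0 for an all-zero block as stated in the Note. Both choices are assumptions about the convention, and the library may instead return the reciprocal factor, as the encode_block description of multiplying by 1/scale suggests. The name compute_scale_amax_sketch is hypothetical.

```python
import math


def compute_scale_amax_sketch(emax: int, vals) -> float:
    """Power-of-two scale so that the largest exponent in vals * scale equals emax."""
    amax = max(abs(v) for v in vals)
    if amax == 0.0:
        return 1.0  # all-zero block, following the Note above
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1
    _, e = math.frexp(amax)
    log2_scale = emax - (e - 1)
    # Clip the scale to the range 2**[-127, 127]
    log2_scale = max(-127, min(127, log2_scale))
    return 2.0**log2_scale


scale = compute_scale_amax_sketch(8, [0.5, -3.0])
print(scale, 3.0 * scale)  # 128.0 384.0, and floor(log2(384)) == 8
```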
Classes
- class gfloat.FormatInfo
Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.
- name
Short name for the format, e.g. binary32, bfloat16
- k
Number of bits in the format
- precision
Number of significand bits (including implicit leading bit)
- emax
Largest exponent, emax, which shall equal floor(log_2(maxFinite))
- has_nz
Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0
- has_infs
Set if format includes +/- Infinity. If set, the non-nan value with the highest encoding for each sign (s) is replaced by (s)Inf.
- num_high_nans
Number of NaNs that are encoded in the highest encodings for each sign
- has_subnormals
Set if format encodes subnormals
- is_signed
Set if the format has a sign bit
- is_twos_complement
Set if the format uses two’s complement encoding for the significand
- property tSignificandBits
The number of trailing significand bits, t
- property expBits
The number of exponent bits, w
- property signBits
The number of sign bits, s
- property expBias
The exponent bias derived from (p,emax).
This is the bias that should be applied so that \(\lfloor \log_2(\text{maxFinite}) \rfloor = \text{emax}\)
- property bits
The number of bits occupied by the type.
- property eps
The difference between 1.0 and the smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.
- property epsneg
The difference between 1.0 and the largest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.
- property iexp
The number of bits in the exponent portion of the floating point representation.
- property machep
The exponent that yields eps.
- property max
The largest representable number.
- property maxexp
The smallest positive power of the base (2) that causes overflow.
- property min
The smallest representable number, typically -max.
- property num_nans
The number of code points which decode to NaN
- property code_of_nan
Return a codepoint for a NaN
- property code_of_posinf
Return a codepoint for positive infinity
- property code_of_neginf
Return a codepoint for negative infinity
- property code_of_zero
Return a codepoint for (non-negative) zero
- property has_zero
Does the format have zero?
This is false if the mantissa has zero width and the format has no subnormals, because the mantissa is then always decoded as 1. If such a format does have subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.
- property code_of_negzero
Return a codepoint for negative zero
- property code_of_max
Return a codepoint for fi.max
- property code_of_min
Return a codepoint for fi.min
- property smallest_normal
The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.
- property smallest_subnormal
The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.
- property smallest
The smallest positive floating point number.
- property is_all_subnormal
Are all encoded values subnormal?
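Several of the properties above follow arithmetically from the width and precision. The sketch below works them out for the signed IEEE-754 interchange formats only, where emax and the bias are both \(2^{w-1} - 1\). Formats with other special-value rules need the more general derivation implied by the expBias description. The function ieee_fields is a hypothetical helper, not part of gfloat.

```python
import sys


def ieee_fields(k: int, precision: int) -> dict:
    """Derive field widths and machine constants for a signed IEEE-754 binary format."""
    t = precision - 1          # trailing significand bits
    w = k - 1 - t              # exponent bits (one sign bit)
    emax = 2 ** (w - 1) - 1    # IEEE-754 interchange formats only
    return {
        "signBits": 1,
        "expBits": w,
        "tSignificandBits": t,
        "emax": emax,
        "expBias": emax,       # bias equals emax for these formats
        "eps": 2.0**-t,        # gap between 1.0 and the next larger float
        "epsneg": 2.0**-(t + 1),  # gap between 1.0 and the next smaller float
        "maxexp": emax + 1,    # smallest power of 2 that overflows
    }


b64 = ieee_fields(64, 53)
print(b64["eps"] == sys.float_info.epsilon)  # True for binary64
```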
- class gfloat.FloatClass
Enum for the classification of a FloatValue.
- NORMAL = 1
A positive or negative normalized non-zero value
- SUBNORMAL = 2
A positive or negative subnormal value
- ZERO = 3
A positive or negative zero value
- INFINITE = 4
A positive or negative infinity (+/-Inf)
- NAN = 5
Not a Number (NaN)
- class gfloat.RoundMode
Enum for IEEE-754 rounding modes.
Result r is obtained from input v depending on rounding mode as follows
- TowardZero = 1
\(\max \{ r ~ s.t. ~ |r| \le |v| \}\)
- TowardNegative = 2
\(\max \{ r ~ s.t. ~ r \le v \}\)
- TowardPositive = 3
\(\min \{ r ~ s.t. ~ r \ge v \}\)
- TiesToEven = 4
Round to nearest, ties to even
- TiesToAway = 5
Round to nearest, ties away from zero
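The five modes are easiest to compare on a simple grid. The sketch below rounds a real value to the nearest integers, using the integers as the set of representable values. The enum Mode is a local stand-in that only mirrors the member names above, and round_to_int is a hypothetical helper.

```python
import math
from enum import Enum


class Mode(Enum):  # local stand-in mirroring the RoundMode member names
    TowardZero = 1
    TowardNegative = 2
    TowardPositive = 3
    TiesToEven = 4
    TiesToAway = 5


def round_to_int(v: float, mode: Mode) -> int:
    """Round v to an integer under the given mode."""
    lo, hi = math.floor(v), math.ceil(v)
    if mode is Mode.TowardZero:
        return math.trunc(v)
    if mode is Mode.TowardNegative:
        return lo
    if mode is Mode.TowardPositive:
        return hi
    # Round to nearest: pick the closer neighbour when there is one
    if v - lo < hi - v:
        return lo
    if v - lo > hi - v:
        return hi
    # Exact tie (or v already an integer, in which case lo == hi)
    if mode is Mode.TiesToEven:
        return lo if lo % 2 == 0 else hi
    return hi if v > 0 else lo  # TiesToAway


for m in Mode:
    print(m.name, round_to_int(2.5, m), round_to_int(-2.5, m))
```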
- class gfloat.FloatValue
A floating-point value decoded in great detail.
- code
Integer code point
- fval
Value. Assumed to be exactly round-trippable to python float. This is true for all <64bit formats known in 2023.
- exp
Raw exponent without bias
- expval
Exponent, bias subtracted
- significand
Significand as an integer
- fsignificand
Significand as a float in the range [0,2)
- signbit
Sign bit: 1 => negative, 0 => positive
- fclass
See FloatClass
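To make the fields concrete, the sketch below splits an ordinary Python float (IEEE-754 binary64) into quantities with the same names as the attributes above. It is an illustration using the standard library only: decompose_binary64 is a hypothetical helper, it handles finite values only, and the treatment of subnormals (exponent fixed at 1 - bias, no implicit leading bit) follows IEEE-754 rather than any statement in this page.

```python
import struct


def decompose_binary64(x: float) -> dict:
    """Split a finite binary64 value into code, exponent and significand fields."""
    (code,) = struct.unpack("<Q", struct.pack("<d", x))
    signbit = code >> 63
    exp = (code >> 52) & 0x7FF             # raw exponent, bias not subtracted
    significand = code & ((1 << 52) - 1)   # trailing significand as an integer
    if exp == 0x7FF:
        raise ValueError("Inf and NaN are not handled by this sketch")
    if exp == 0:
        expval = 1 - 1023                  # zero and subnormals
        fsignificand = significand * 2.0**-52
    else:
        expval = exp - 1023                # normal values
        fsignificand = 1.0 + significand * 2.0**-52
    return {
        "code": code, "fval": x, "exp": exp, "expval": expval,
        "significand": significand, "fsignificand": fsignificand,
        "signbit": signbit,
    }


d = decompose_binary64(-1.5)
print(d["signbit"], d["exp"], d["expval"], d["fsignificand"])  # 1 1023 0 1.5
```

The value can be reconstructed as (-1)**signbit * fsignificand * 2**expval.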
- class gfloat.BlockFormatInfo
BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)
- name
Short name for the format, e.g. BlockFP8
- etype
Element data type
- k
Scaling block size
- stype
Scale datatype
- property element_bits
The number of bits in each element, d
- property scale_bits
The number of bits in the scale, w
- property block_size_bytes
The number of bytes in a block
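The block size can be estimated from the fields above, assuming a block stores one scale followed by k tightly packed elements. That layout is an assumption made for this sketch, and block_size_bytes_sketch is a hypothetical helper rather than the library's property.

```python
def block_size_bytes_sketch(element_bits: int, k: int, scale_bits: int) -> int:
    """Bytes needed for one scale plus k elements, assuming tight packing."""
    bits = scale_bits + k * element_bits
    if bits % 8 != 0:
        raise ValueError("Block does not occupy a whole number of bytes")
    return bits // 8


# 32 elements of 8 bits with an 8-bit scale (assumed parameters)
print(block_size_bytes_sketch(element_bits=8, k=32, scale_bits=8))  # 33
```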
Pretty printers
- gfloat.float_pow2str(v, min_exponent=-inf)
Render floating point values as exact fractions times a power of two.
Example: float_pow2str(127.0) is “127/64*2^6”, that is, (a significand between 1 and 2) times (a power of two).
If min_exponent is supplied, then values with exponent below min_exponent are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.
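The rendering can be approximated with the standard library, as in the sketch below. It reproduces the documented example for 127.0, but the formatting of other cases (zero, integers, non-finite values) is a guess and may differ from the library's output. The name pow2str_sketch is hypothetical.

```python
import math
from fractions import Fraction


def pow2str_sketch(v: float, min_exponent: float = -math.inf) -> str:
    """Render v as an exact fraction times a power of two."""
    if v == 0 or not math.isfinite(v):
        return str(v)
    sign = "-" if v < 0 else ""
    # frexp gives abs(v) = m * 2**e with 0.5 <= m < 1, so the significand
    # in [1, 2) pairs with exponent e - 1
    _, e = math.frexp(abs(v))
    exponent = e - 1
    if exponent < min_exponent:
        # Clamp the exponent; the significand then becomes a fraction below 1
        exponent = int(min_exponent)
    sig = Fraction(abs(v)) / Fraction(2) ** exponent
    return f"{sign}{sig.numerator}/{sig.denominator}*2^{exponent}"


print(pow2str_sketch(127.0))                  # 127/64*2^6
print(pow2str_sketch(0.25, min_exponent=0))   # 1/4*2^0
```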