Python3 implementation #3

Merged
sezieru merged 3 commits from feat/kb20-py3 into main 2025-10-31 12:35:26 +00:00
Owner

Initial implementation of base-20 in Python3.

  • Supports UTF-8.
  • Supports piping.
  • Supports custom code point mapping.
  • Heavily documented.
sezieru self-assigned this 2025-10-30 10:39:16 +00:00
- Text mode now emits and accepts UTF-8 (supports Kaktovik and other Unicode digit alphabets)
- Enforced alphabet hygiene: --alphabet must be exactly 20 DISTINCT Unicode characters
- Made digit parsing strict: invalid/non-digit characters now raise errors (parity with C/C++)
- Added safety checks:
  - bigint_to_base20_digits(): guard negative values
  - write_bigint_to_bytes_be(): reject out-of-range values (no silent truncation)
- Cleaned CLI help text and removed unused imports

Documentation & teachability:
- Added module header describing wire formats, block rules, and modes
- Added docstrings for:
  - bytes_to_bigint_be(), bigint_to_bytes_be()
  - bigint_to_base20_digits(), base20_digits_to_bigint()
  - digits_to_ascii_offset()/..._strict() and ..._map()/..._map_strict()
  - _require(), kb20_encode_ascii(), kb20_decode_ascii()
- Standardized terminology to 'MSB-first (big-endian)' for order
- Noted Horner folding (v = v * B + d)
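
The Horner folding and the negative-value guard mentioned above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names come from the commit message, but the signatures, the exception type, and the choice to represent zero as `[0]` are assumptions.

```python
from __future__ import annotations

BASE = 20  # radix of the KB20 digit alphabet


def bigint_to_base20_digits(value: int) -> list[int]:
    """Return the base-20 digits of value, MSB-first (big-endian).

    Assumption: zero is represented as the single digit [0].
    """
    if value < 0:
        # Guard: negative values have no digit representation here.
        raise ValueError("value must be non-negative")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, d = divmod(value, BASE)  # peel off the least significant digit
        digits.append(d)
    digits.reverse()  # collected LSB-first, so flip to MSB-first
    return digits


def base20_digits_to_bigint(digits: list[int]) -> int:
    """Fold MSB-first digits back into an integer with Horner's rule."""
    v = 0
    for d in digits:
        if not 0 <= d < BASE:
            raise ValueError(f"digit out of range: {d!r}")
        v = v * BASE + d  # Horner step: v = v * B + d
    return v
```

For example, `bigint_to_base20_digits(399)` yields `[19, 19]`, since 399 = 19 * 20 + 19, and folding `[19, 19]` returns 399.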
Author
Owner

Finalize Python KB20 reference: UTF-8 text mode, strict parsing, and full docs

Summary

This PR brings the Python reference implementation to parity with the C90/C++20 versions and makes it a
solid teaching resource. Text mode now emits/accepts UTF-8 for Unicode digit alphabets (e.g. Kaktovik),
parsing is strict and well-documented, and helpers are fully annotated.

What/Why

  • Support Unicode digits in text mode (UTF-8 on the wire) to match Kaktovik use cases.
  • Enforce alphabet hygiene (exactly 20 distinct symbols) to avoid ambiguous streams.
  • Make digit parsing strict (errors on unknown chars) for spec clarity and cross-impl parity.
  • Add docstrings and consistent terminology (MSB-first / big-endian, Horner folding) to aid learners.

Changes

  • Text mode: switch to UTF-8 encode/decode.
  • --alphabet: require 20 distinct Unicode scalars.
  • Strict mappers: ascii_to_digits_*_strict raise on invalid inputs.
  • Safety checks:
    • bigint_to_base20_digits() guards negative input.
    • write_bigint_to_bytes_be() rejects out-of-range values (no truncation).
  • Docstrings across all helpers; CLI help polished; unused imports removed.
  • Naming: _be suffix on helpers whose contract is big-endian.
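
The alphabet check and strict digit parsing listed above might look roughly like the sketch below. `validate_alphabet` and `text_to_digits_strict` are hypothetical names chosen for illustration; the actual helpers in `kb20.py` may be named and structured differently. The sketch assumes the alphabet arrives as a decoded Python `str`, so each element is one Unicode code point.

```python
from __future__ import annotations

BASE = 20


def validate_alphabet(alphabet: str) -> dict[str, int]:
    """Check alphabet hygiene and return a symbol -> digit lookup table.

    Hypothetical helper: requires exactly 20 distinct code points.
    """
    if len(alphabet) != BASE:
        raise ValueError(f"alphabet must have {BASE} symbols, got {len(alphabet)}")
    if len(set(alphabet)) != BASE:
        # A repeated symbol would make decoding ambiguous.
        raise ValueError("alphabet symbols must be distinct")
    return {ch: i for i, ch in enumerate(alphabet)}


def text_to_digits_strict(text: str, lookup: dict[str, int]) -> list[int]:
    """Map each character to its digit value; raise on anything unknown."""
    digits = []
    for pos, ch in enumerate(text):
        if ch not in lookup:
            raise ValueError(f"invalid digit {ch!r} at position {pos}")
        digits.append(lookup[ch])
    return digits
```

With `lookup = validate_alphabet("0123456789ABCDEFGHIJ")`, the call `text_to_digits_strict("1J", lookup)` returns `[1, 19]`, while a stray character such as `"z"` raises instead of being skipped.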

Compatibility

  • No format changes vs C90/C++20: same header/payload, same separators ('-' for text, 0xFF for binary-digits), same block rules (5 -> 10 digits padded; tail minimal).
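
To illustrate the "5 -> 10 digits padded" block rule: ten digits is the smallest count that can hold every 5-byte value, because 20^9 < 256^5 <= 20^10. The sketch below covers only full blocks. `encode_full_block` and `decode_full_block` are hypothetical names, and the tail rule ("tail minimal") is deliberately left out because its exact definition is not spelled out here.

```python
from __future__ import annotations

BASE = 20
BLOCK_BYTES = 5    # input bytes per full block
BLOCK_DIGITS = 10  # base-20 digits emitted per full block

# Ten digits is the minimum width that covers all 2**40 block values.
assert BASE ** (BLOCK_DIGITS - 1) < 256 ** BLOCK_BYTES <= BASE ** BLOCK_DIGITS


def encode_full_block(block: bytes) -> list[int]:
    """Encode exactly 5 bytes as 10 MSB-first digits, zero-padded on the left."""
    if len(block) != BLOCK_BYTES:
        raise ValueError(f"expected {BLOCK_BYTES} bytes, got {len(block)}")
    value = int.from_bytes(block, "big")
    digits = [0] * BLOCK_DIGITS
    for i in range(BLOCK_DIGITS - 1, -1, -1):  # fill from the LSB end
        value, digits[i] = divmod(value, BASE)
    return digits


def decode_full_block(digits: list[int]) -> bytes:
    """Decode 10 MSB-first digits back into 5 bytes."""
    if len(digits) != BLOCK_DIGITS:
        raise ValueError(f"expected {BLOCK_DIGITS} digits, got {len(digits)}")
    value = 0
    for d in digits:
        if not 0 <= d < BASE:
            raise ValueError(f"digit out of range: {d!r}")
        value = value * BASE + d
    # int.to_bytes raises OverflowError if the value does not fit in 5 bytes.
    return value.to_bytes(BLOCK_BYTES, "big")
```

Fixed-width padding is what lets a decoder split a digit stream into blocks without extra delimiters; a block of five zero bytes encodes to ten zero digits rather than a single `0`.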

Tests / Verification

# ASCII alphabet
echo -n 'this is some test text' \
    | python3 kb20.py encode --zero A \
    | python3 kb20.py decode --zero A \
    | diff -u <(echo -n 'this is some test text') -

# Kaktovik digits (UTF-8)
Z=$'\U0001D2C0'
echo -n 'abcXYZ' \
    | python3 kb20.py encode --zero "$Z" \
    | python3 kb20.py decode --zero "$Z" \
    | diff -u <(echo -n 'abcXYZ') -

# Explicit alphabet (20 distinct chars)
ALPHA='0123456789ABCDEFGHIJ'
echo -n 'EdgeCase42!' \
    | python3 kb20.py encode --alphabet "$ALPHA" \
    | python3 kb20.py decode --alphabet "$ALPHA" \
    | diff -u <(echo -n 'EdgeCase42!') -

# Binary-digits mode (raw 0..19)
head -c 257 /dev/urandom >/tmp/r
python3 kb20.py encode --binary-digits -i /tmp/r -o /tmp/e
python3 kb20.py decode --binary-digits -i /tmp/e -o /tmp/d
cmp /tmp/r /tmp/d

Notes for reviewers

  • Terminology standardized to MSB-first to avoid confusion about "left-to-right".
  • Python keeps arbitrary-precision ints; behavior matches the "fixed-width big-endian header -> minimal base-20 digits" model used in C/C++.
  • _be suffix is intentional to make byte order explicit in reusable helpers.
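
As a rough illustration of how arbitrary-precision ints can be held to a fixed-width big-endian contract, the sketch below pairs a reader with a range-checked writer. The helper names appear in this PR, but the `width` parameter and the use of `ValueError` are assumptions made for the example.

```python
def bytes_to_bigint_be(data: bytes) -> int:
    """Interpret data as an unsigned big-endian (MSB-first) integer."""
    return int.from_bytes(data, "big")


def write_bigint_to_bytes_be(value: int, width: int) -> bytes:
    """Serialize value into exactly width big-endian bytes.

    Values that do not fit are rejected rather than silently truncated.
    The explicit width argument is an assumption of this sketch.
    """
    if value < 0 or value >= 1 << (8 * width):
        raise ValueError(f"{value} does not fit in {width} byte(s)")
    return value.to_bytes(width, "big")
```

Here `write_bigint_to_bytes_be(256, 2)` gives `b"\x01\x00"`, whereas asking for a width of 1 raises, which mirrors the overflow a fixed-width C integer would otherwise hide.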
sezieru changed title from WIP: Python3 implementation to Python3 implementation 2025-10-31 12:35:05 +00:00
Reference: fosster/libb20#3