Python3 implementation #3

Merged

sezieru merged 3 commits from feat/kb20-py3 into main

2025-10-31 12:35:26 +00:00

sezieru commented

2025-10-30 10:39:16 +00:00

Owner

Initial implementation of base-20 in Python3.

Supports UTF-8.
Supports piping.
Supports custom code point mapping.
Heavily documented.

sezieru self-assigned this

2025-10-30 10:39:16 +00:00

sezieru added 1 commit

2025-10-30 10:39:16 +00:00

feat(python): introduce kb20.py as initial script for Python library implementation 5872af1af8

sezieru added 2 commits

2025-10-31 12:15:59 +00:00

refactor(py3+style): style adjustments for logical operation separation + minor [ironic] Unicode->ASCII changes 49900dc7c6

feat(py): finalize KB20.py reference with UTF-8 text mode, strict parsing, and complete docs 356f6c6e7f

- Text mode now emits and accepts UTF-8 (support Kaktovik and other Unicode digit alphabets)
- Enforced alphabet hygiene: --alphabet must be exactly 20 DISTINCT Unicode characters
- Made digit parsing strict: invalid/non-digit characters now raise errors (parity with C/C++)
- Added safety checks:
  - bigint_to_base20_digits(): guard negative values
  - write_bigint_to_bytes_be(): reject out-of-range values (no silent truncation)
- Cleaned CLI help text and removed unused imports

Documentation & teachability:
- Added module header describing wire formats, block rules, and modes
- Added docstrings for:
  - bytes_to_bigint_be(), bigint_to_bytes_be()
  - bigint_to_base20_digits(), base20_digits_to_bigint()
  - digits_to_ascii_offset()/..._strict() and ..._map()/..._map_strict()
  - _require(), kb20_encode_ascii(), kb20_decode_ascii()
- Standardized terminology to 'MSB-first (big-endian)' for order
- Noted Horner folding (v = v * B + d)
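
The Horner folding noted above, together with the negative-value guard, can be sketched as follows. This is a minimal illustration, not the code in kb20.py: the names `digits_to_int` and `int_to_digits` are hypothetical stand-ins for the documented helpers, and digits are assumed to be plain integers 0..19 in MSB-first order.

```python
BASE = 20  # B in the Horner step v = v * B + d


def digits_to_int(digits):
    """Fold MSB-first base-20 digits into an int using Horner's rule."""
    value = 0
    for d in digits:
        if not 0 <= d < BASE:
            raise ValueError(f"digit out of range: {d!r}")
        value = value * BASE + d  # Horner step
    return value


def int_to_digits(value):
    """Return the minimal MSB-first base-20 digits of a non-negative int."""
    if value < 0:
        # Hypothetical guard, mirroring the negative-value check described above.
        raise ValueError("negative values cannot be converted to digits")
    if value == 0:
        return [0]
    out = []
    while value:
        value, d = divmod(value, BASE)  # peel off the least significant digit
        out.append(d)
    out.reverse()  # digits were collected LSB-first
    return out
```

For example, `digits_to_int([1, 0, 0])` gives 400 (20 squared), and `int_to_digits(400)` returns `[1, 0, 0]`.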

sezieru commented

2025-10-31 12:34:17 +00:00

Author

Owner

Finalize Python KB20 reference: UTF-8 text mode, strict parsing, and full docs

Summary

This PR brings the Python reference implementation to parity with the C90/C++20 versions and makes it a
solid teaching resource. Text mode now emits/accepts UTF-8 for Unicode digit alphabets (e.g. Kaktovik),
parsing is strict and well-documented, and helpers are fully annotated.

What/Why

Support Unicode digits in text mode (UTF-8 on the wire) to match Kaktovik use cases.
Enforce alphabet hygiene (exactly 20 distinct symbols) to avoid ambiguous streams.
Make digit parsing strict (errors on unknown chars) for spec clarity and cross-impl parity.
Add docstrings and consistent terminology (MSB-first / big-endian, Horner folding) to aid learners.
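
As a rough sketch of what alphabet hygiene and strict parsing mean in practice (assuming an alphabet is passed as a Python `str`, so length is counted in code points), the checks could look like the following. `validate_alphabet` and `text_to_digits_strict` are hypothetical names for illustration and are not taken from kb20.py.

```python
def validate_alphabet(alphabet):
    """Return a char -> digit table, or raise if the alphabet is unusable."""
    if len(alphabet) != 20:
        raise ValueError(f"alphabet must have 20 characters, got {len(alphabet)}")
    if len(set(alphabet)) != 20:
        # A repeated symbol would make two digit values indistinguishable.
        raise ValueError("alphabet characters must be distinct")
    return {ch: value for value, ch in enumerate(alphabet)}


def text_to_digits_strict(text, table):
    """Map each character to its digit value; unknown characters are errors."""
    digits = []
    for pos, ch in enumerate(text):
        if ch not in table:
            raise ValueError(f"invalid digit {ch!r} at position {pos}")
        digits.append(table[ch])
    return digits


# The 20 Kaktovik numerals occupy U+1D2C0..U+1D2D3.
KAKTOVIK = "".join(chr(0x1D2C0 + i) for i in range(20))
```

With `table = validate_alphabet('0123456789ABCDEFGHIJ')`, the call `text_to_digits_strict('J0A', table)` returns `[19, 0, 10]`, while any character outside the alphabet raises `ValueError` instead of being skipped.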

Changes

Text mode: switch to UTF-8 encode/decode.
--alphabet: require 20 distinct Unicode scalars.
Strict mappers: ascii_to_digits_*_strict raise on invalid inputs.
Safety checks:
- bigint_to_base20_digits() guards negative input.
- write_bigint_to_bytes_be() rejects out-of-range values (no truncation).
Docstrings across all helpers; CLI help polished; unused imports removed.
Naming: _be suffix on helpers whose contract is big-endian.
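
To illustrate the "no truncation" safety check and the `_be` naming, here is a small sketch of range-checked big-endian conversion. The function names are hypothetical and the fixed `width` parameter is an assumption for the example; the actual helpers in kb20.py may differ in signature.

```python
def int_to_bytes_be_checked(value, width):
    """Serialize a non-negative int into exactly `width` big-endian bytes."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value >= 1 << (8 * width):
        # Refuse instead of silently dropping high-order bits.
        raise ValueError(f"value does not fit in {width} byte(s)")
    return value.to_bytes(width, "big")


def bytes_to_int_be(data):
    """Interpret bytes as an unsigned big-endian (MSB-first) integer."""
    return int.from_bytes(data, "big")
```

Here `int_to_bytes_be_checked(0x0102, 2)` yields `b'\x01\x02'`, and `int_to_bytes_be_checked(256, 1)` raises `ValueError` rather than returning a truncated byte.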

Compatibility

No format changes vs C90/C++20: same header/payload, same separators ('-' for text, 0xFF for binary-digits), same block rules (5 -> 10 digits padded; tail minimal).
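
The full-block rule (5 bytes -> 10 padded digits) works because 20**10 exceeds 256**5, so ten base-20 digits always hold a 40-bit value. The sketch below shows only that rule as I read it from the line above; it deliberately leaves out the header, separators, and tail handling, and the function names are hypothetical.

```python
BASE = 20
BLOCK_BYTES = 5
BLOCK_DIGITS = 10  # 20**10 > 256**5, so ten digits are always enough


def encode_full_block(block):
    """Convert exactly 5 bytes into 10 MSB-first base-20 digits, zero-padded."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("a full block is exactly 5 bytes")
    value = int.from_bytes(block, "big")
    digits = [0] * BLOCK_DIGITS
    # Fill from the least significant position; untouched leading slots stay 0.
    for i in range(BLOCK_DIGITS - 1, -1, -1):
        value, digits[i] = divmod(value, BASE)
    return digits


def decode_full_block(digits):
    """Inverse of encode_full_block; rejects digit groups that exceed 40 bits."""
    if len(digits) != BLOCK_DIGITS:
        raise ValueError("a full block is exactly 10 digits")
    value = 0
    for d in digits:
        if not 0 <= d < BASE:
            raise ValueError(f"digit out of range: {d!r}")
        value = value * BASE + d
    if value >= 1 << (8 * BLOCK_BYTES):
        raise ValueError("digits encode a value wider than 5 bytes")
    return value.to_bytes(BLOCK_BYTES, "big")
```

A block such as `b'\x00\x00\x00\x00\x01'` encodes to nine zero digits followed by a 1, which shows the padding: the digit count stays fixed at ten regardless of the value.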

Tests / Verification

# ASCII alphabet
echo -n 'this is some test text' \
    | python3 kb20.py encode --zero A \
    | python3 kb20.py decode --zero A \
    | diff -u <(echo -n 'this is some test text') -

# Kaktovik digits (UTF-8)
Z=$'\U0001D2C0'
echo -n 'abcXYZ' \
    | python3 kb20.py encode --zero "$Z" \
    | python3 kb20.py decode --zero "$Z" \
    | diff -u <(echo -n 'abcXYZ') -

# Explicit alphabet (20 distinct chars)
ALPHA='0123456789ABCDEFGHIJ'
echo -n 'EdgeCase42!' \
    | python3 kb20.py encode --alphabet "$ALPHA" \
    | python3 kb20.py decode --alphabet "$ALPHA" \
    | diff -u <(echo -n 'EdgeCase42!') -

# Binary-digits mode (raw 0..19)
head -c 257 /dev/urandom >/tmp/r
python3 kb20.py encode --binary-digits -i /tmp/r -o /tmp/e
python3 kb20.py decode --binary-digits -i /tmp/e -o /tmp/d
cmp /tmp/r /tmp/d

Notes for reviewers

Terminology standardized to MSB-first to avoid confusion about "left-to-right".
Python keeps arbitrary-precision ints; behavior matches the "fixed-width big-endian header -> minimal base-20 digits" model used in C/C++.
_be suffix is intentional to make byte order explicit in reusable helpers.

sezieru changed title from ~~WIP: Python3 implementation~~ to Python3 implementation

2025-10-31 12:35:05 +00:00

sezieru merged commit 01ff4af7bf into main

2025-10-31 12:35:26 +00:00

sezieru referenced this pull request from a commit

2025-10-31 12:35:27 +00:00

Merge pull request 'Python3 implementation' (#3) from feat/kb20-py3 into main