avinal.github.io/public/talks/devconf-2026/slides.md at main

avinal/avinal.github.io

mirror of https://github.com/avinal/avinal.github.io.git synced 2026-07-03 23:30:09 +05:30

avinal 96ea6019ae Add Reveal.js slides for DevConf.CZ 2026 talk

"Lost in Transliteration: Why strlen("Dvořák") Returns 8"
Scheduled June 18, 2026 at 10:15 in room E104.

- Self-contained Reveal.js 5.1.0 deck loaded from CDN
- Markdown-based slides (slides.md) with HTML shell (index.html)
- IBM Carbon Design System theme with custom syntax highlighting
- Mermaid diagrams for gconv pipeline and iconv flow
- Speaker notes with full forms, translations, and delivery instructions
- Served at /talks/devconf-2026/ as a static page

Signed-off-by: Avinal Kumar <avinal.xlvii@gmail.com>

2026-06-16 13:37:28 +05:30

$ printf 'Dvořák' | wc -c

Note: Do: Walk on stage, put terminal on screen, no output yet. Pause 3-4 seconds. Ask: "What do you think this prints?"

wc = word count; -c = count bytes (not characters)
Dvořák = Czech composer surname, pronounced "DVOR-zhahk"

$ printf 'Dvořák' | wc -c
8

Note: Do: Reveal the 8. Pause. Dvořák has 6 visible letters — why 8? Don't explain yet.

wc -c counts bytes, not characters — this is POSIX behavior, not a bug

$ printf 'Dvořák' | wc -c
8

How many people think this is wrong?

Note: Do: Ask the question. Wait 5 seconds. Let hands go up. Do NOT answer yet.

$ printf 'Dvořák' | wc -c
8

$ python3 -c "print(len('Dvořák'))"
6

Note: Two different answers for the same string. Let the confusion build.

Python 3 len() counts Unicode code points, not bytes
Exception: Python 2 len() on a byte str counted bytes; Python 3 made str a sequence of code points

$ printf '😀' | wc -c
4

$ python3 -c "print(len('😀'))"
1

Note: An emoji: 4 bytes vs 1 character.

😀 = U+1F600 "Grinning Face." Needs 4 bytes in UTF-8 (F0 9F 98 80) because it's above the BMP (Basic Multilingual Plane, U+0000–U+FFFF)
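Exception: echo appends a trailing newline on every platform, so wc -c would report one byte more; use printf to avoid the off-by-one (echo -n is not portable across shells, macOS /bin/sh being a common example)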

Which one is correct?

All of them.

Understanding why is basically the entire talk.

Note: Do: Pause before "All of them." Then: "They're counting different things. wc counts bytes. Python counts code points. Both correct."

Key thesis: bytes ≠ characters ≠ code points
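
The three counts can be checked side by side. This is a minimal sketch, assuming the string is typed in composed (NFC) form; the decomposed (NFD) variant is added here only for illustration and is not part of the slides:

```python
import unicodedata

def counts(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

# Composed form: ř and á are single code points.
composed = unicodedata.normalize("NFC", "Dvořák")
# Decomposed form: r + combining caron, a + combining acute.
decomposed = unicodedata.normalize("NFD", "Dvořák")

print(counts(composed))    # (6, 8)
print(counts(decomposed))  # (8, 10)
```

Both strings render as the same six letters, which is why "character" needs its own definition, separate from code points and from bytes.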

Lost in Transliteration

Why strlen("Dvořák") Returns 8

Avinal Kumar · glibc contributor

DevConf.CZ 2026

Note: Do: Brief intro, under 30 seconds: "I'm Avinal. I contribute to glibc — the GNU C Library. I got into character encodings through an iconv bug at the glibc workshop here at DevConf. Today I'll take you through that journey."

glibc = GNU C Library — the standard C library on most Linux distros
iconv = POSIX API for converting text between character encodings

Today we'll answer

Why does strlen("Dvořák") return 8?
Why does Unicode exist?
How does the C library handle text?
How does iconv convert between encodings?
Does any of this still matter in 2026?

Note: Do: Read out loud. Give the audience a roadmap. Don't linger.

strlen = "string length" — counts bytes before the null terminator, NOT characters

There is no such thing as plain text.

If you remember one thing from this talk, remember that sentence.

Note: Do: Say this slowly. Pause. "If you remember one thing, remember that sentence."

"Plain text" implies no encoding — but every byte sequence has an encoding. If you don't know it, you're guessing. Wrong guess = mojibake (文字化け, Japanese for garbled text, pronounced "mo-ji-ba-keh")

How we ended up with this mess

ASCII: The 7-bit world

128 characters (0–127)
7 bits per character
English letters, digits, punctuation
Bit 8 was "spare"

0x41 = A
0x61 = a
0x30 = 0
0x20 = (space)
0x0A = (newline)

"And all was good — if you spoke English."

Note:

ASCII = American Standard Code for Information Interchange (1963)
7 bits = 128 values. The 8th bit was commonly used as a parity bit on noisy serial and teleprinter links
Only covers English — no accented chars, no Cyrillic, no CJK, no Arabic

How we ended up with this mess

Code Pages: Everyone fills bit 8 differently

If I send byte 0xE9 from Paris to Moscow, what character arrives?

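| Byte | CP-1252 (Western) | CP-866 (Russian) | CP-862 (Hebrew) |
|------|-------------------|------------------|-----------------|
| `0xE9` | é | щ | Θ |
| `0xC4` | Ä | ─ | ─ |
| `0xF1` | ñ | ё | ± |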

CJK needed thousands — multi-byte encodings (Shift-JIS, EUC-KR, GB2312) where you can't even move backward in a string.

Note: Do: Ask "If I send byte 0xE9 from Paris to Moscow, what character arrives?" before revealing the table.

CP = Code Page. CP-1252 = Windows Western. CP-866 = DOS Russian. CP-862 = DOS Hebrew (Hebrew letters sit at 0x80–0x9A; the upper range follows CP-437)
Same byte, different characters — the bytes are correct, the interpretation is wrong
CJK = Chinese, Japanese, Korean
Shift-JIS = Shift Japanese Industrial Standards. EUC-KR = Extended Unix Code for Korean. GB2312 = Chinese National Standard
Exception: Multi-byte encodings have a "forward-only" problem — you can't tell if a byte is byte 1 or byte 2 of a character
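
The Paris-to-Moscow question can be tried directly. This is a small sketch using Python's built-in `cp1252` and `cp866` codecs; the variable names are illustrative only:

```python
raw = bytes([0xE9])  # one byte, sent without any encoding label

paris = raw.decode("cp1252")   # Windows Western European
moscow = raw.decode("cp866")   # DOS Cyrillic
print(paris, moscow)           # é щ

# Classic mojibake: the UTF-8 bytes for "é" read as CP-1252.
garbled = "é".encode("utf-8").decode("cp1252")
print(garbled)                 # Ã©
```

Nothing is corrupted in transit; only the receiver's assumption about the encoding differs.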

How we ended up with this mess

Unicode: One number per character

U+0041 = A      U+00E9 = é      U+010D = č
U+0639 = ع      U+4E16 = 世      U+1F600 = 😀

Code points are abstract numbers, not bytes
Not "16-bit characters" — that's the myth
154,998 characters across 168 scripts (Unicode 16.0)

Unicode separated the idea of a character from how it's stored.

Note:

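Unicode = the Unicode Standard (1991, Unicode Consortium). "Universal Coded Character Set" (UCS) is the name of the parallel ISO/IEC 10646 standard, which defines the same code points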
Code points are abstract numbers — how you store them is a separate question (that's what encodings answer)
Exception: "Unicode is 16-bit" myth comes from Unicode 1.0 (1991), which only planned 65,536 chars. Unicode 2.0 (1996) expanded beyond 16 bits. Java and Windows adopted 16-bit characters (UCS-2) before that expansion, later widened them to UTF-16 with surrogate pairs, and are now stuck with it
BMP = Basic Multilingual Plane (U+0000–U+FFFF). Characters above it (emoji, rare scripts) are in supplementary planes

How we ended up with this mess

Encodings: Serialization formats

UTF-8

1–4 bytes
ASCII-compatible
98% of the web

UTF-16

2 or 4 bytes
Needs a BOM or a declared byte order
Windows, Java

UTF-32

Fixed 4 bytes
Simple but wasteful
glibc internal

Note:

UTF = Unicode Transformation Format
UTF-8: Designed 1992 by Ken Thompson & Rob Pike. ASCII bytes are identical — this is why it won. 98.2% of websites (W3Techs, 2024)
UTF-16: Uses surrogate pairs above U+FFFF. BOM = Byte Order Mark (U+FEFF) — indicates endianness
UTF-32: Also called UCS-4 (Universal Coded Character Set, 4-byte). "hello" = 20 bytes instead of 5
Exception: UTF-32 and UCS-4 are technically from different standards (ISO 10646 vs Unicode), but identical in practice

How we ended up with this mess

Encodings: Serialization formats

"Dvořák" in UTF-8: 44 76 6F C5 99 C3 A1 6B → 8 bytes
"Dvořák" in UTF-32: 00000044 00000076 0000006F 00000159 000000E1 0000006B → 24 bytes

There is no such thing as plain text.

Note: UTF-8 breakdown:

D, v, o, k = 1 byte each (ASCII range)
ř = C5 99 (2 bytes, U+0159)
á = C3 A1 (2 bytes, U+00E1)
Total: 4×1 + 2×2 = 8 bytes for 6 characters

UTF-32: every char = 4 bytes → 6×4 = 24 bytes. Same string, 3× the size.
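
The byte breakdown above follows from UTF-8's bit layout. A hand-rolled sketch, assuming valid code points as input (the helper `encode_utf8` is hypothetical and does no surrogate or range checking):

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if code_point < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

encoded = b"".join(encode_utf8(ord(ch)) for ch in "Dvořák")
print(encoded.hex(" "))               # 44 76 6f c5 99 c3 a1 6b
print(encode_utf8(0x1F600).hex(" "))  # f0 9f 98 80
```

ASCII code points pass through unchanged in the first branch, which is the property that makes UTF-8 ASCII-compatible.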

Part 2

Text in C: What actually happens

Note: Do: "Now we understand WHY bytes and characters differ. Let's see how C deals with it."

Text in C

C has two ways to see a string

`char` — bytes

1 byte per element, no encoding info
strlen("Dvořák") → 8
strlen("😀") → 4
Indexing gives you bytes, not characters

`wchar_t` — code points

4 bytes on Linux, 2 on Windows
wcslen(L"Dvořák") → 6
wcslen(L"😀") → 1
Indexing gives you characters

mbrtowc() bridges between them. setlocale() tells it which encoding to expect.

Note:

wchar_t = "wide character type." Linux: 4 bytes (UCS-4). Windows: 2 bytes (UTF-16)
wcslen = "wide character string length"
L"..." prefix = wide string literal
mbrtowc = "multibyte restartable to wide character" — converts one multibyte char to one wchar_t
setlocale with LC_CTYPE tells mbrtowc the encoding. Without it → "C" locale = ASCII only
Exception: On Windows, wcslen(L"😀") returns 2 (surrogate pair), not 1
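
The three lengths can be approximated without a C compiler. This sketch only emulates what `strlen` and `wcslen` would count by measuring encoded sizes; the function names are made up for illustration and are not C APIs:

```python
def strlen_utf8(text: str) -> int:
    """Bytes a UTF-8 char string holds before the terminator."""
    return len(text.encode("utf-8"))

def wcslen_linux(text: str) -> int:
    """Elements in a 4-byte wchar_t string (one per code point)."""
    return len(text.encode("utf-32-le")) // 4

def wcslen_windows(text: str) -> int:
    """Elements in a 2-byte wchar_t string (UTF-16 code units)."""
    return len(text.encode("utf-16-le")) // 2

for s in ("Dvořák", "😀"):
    print(s, strlen_utf8(s), wcslen_linux(s), wcslen_windows(s))
# Dvořák 8 6 6
# 😀 4 1 2
```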

Text in C

What does "Dvořák" look like in memory?

 Character:   D     v     o     ř        á        k
 UTF-8 hex:  44    76    6F   C5 99    C3 A1     6B
 Bytes:       1     1     1     2        2        1   =  8 bytes
 Code points: 1     1     1     1        1        1   =  6 characters

strlen counts the Bytes row. wcslen counts the Code points row.

Now you know why strlen("Dvořák") returns 8.

Note: Do: Point at the diagram: "strlen counts bytes: 1+1+1+2+2+1 = 8. wcslen counts characters: always 1 each = 6. Both correct."

This is the answer to the opening mystery.

Text in C

`iconv` — converting between encodings

$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII
iconv: illegal input sequence at position 3

Note:

iconv = both a C API (iconv_open/iconv/iconv_close in <iconv.h>) and a CLI tool
-f = from, -t = to
Position 3 = 4th byte (0-indexed) = where ř starts. ASCII only has 0–127; C5 = 197 → fails
EILSEQ = "illegal sequence" errno value

Text in C

`iconv` — converting between encodings

$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII
iconv: illegal input sequence at position 3

$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//TRANSLIT
Dvorak

$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//IGNORE
Dvok

//TRANSLIT — approximate: ř→r, á→a
//IGNORE — drop what doesn't fit

Note:

//TRANSLIT = transliteration. Appended to target encoding. Finds closest match: ř→r, á→a, ö→o, ñ→n
//IGNORE = silently drop unconvertible chars. Notice "Dvok" — both ř AND á dropped
Exception: //TRANSLIT is glibc-specific, not POSIX. musl libc (Alpine Linux) doesn't support it
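
The three behaviours have rough counterparts in Python's codec error handlers. This is an analogy, not how glibc implements them: glibc transliterates from locale tables, while the hypothetical `strip_marks` helper below simply decomposes and drops combining marks, which only works for letters that have a canonical decomposition:

```python
import unicodedata

text = "Dvořák"

# Strict conversion: fails at the first character ASCII cannot hold.
failed_at = None
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    failed_at = err.start  # index of ř, counted in code points
print(failed_at)  # 3

# Rough analog of //IGNORE: drop whatever does not fit.
ignored = text.encode("ascii", errors="ignore").decode("ascii")
print(ignored)  # Dvok

# Rough analog of //TRANSLIT for accented Latin letters.
def strip_marks(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks(text))  # Dvorak
```

Python reports the failure position in code points and iconv in bytes; the two agree here only because everything before ř is ASCII.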

Text in C

Real encoding pairs from across the world

$ echo '東京' | iconv -f UTF-8 -t SHIFT_JIS | hexdump -C
00000000  93 8c 8b 9e 0a                          |.....|

$ echo 'こんにちは世界' | iconv -f UTF-8 -t EUC-JP | hexdump -C
00000000  a4 b3 a4 f3 a4 cb a4 c1  a4 cf c0 a4 b3 a6 0a  |...............|

$ echo 'Ελληνικά κείμενο' | iconv -f UTF-8 -t ISO-8859-7 | hexdump -C
00000000  c5 eb eb e7 ed e9 ea dc  20 ea e5 df ec e5 ed ef  |........ .......|

Same characters, completely different bytes — depending on the encoding.

Note:

東京 = Tōkyō (Tokyo)
こんにちは世界 = "Konnichiwa Sekai" = "Hello World"
Ελληνικά κείμενο = "Elliniká keímeno" = "Greek text"
hexdump -C = canonical hex+ASCII dump. Non-ASCII shows as dots
Same text in Shift-JIS vs EUC-JP → completely different bytes. Without knowing the encoding, unreadable
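
The same comparison can be reproduced without the command line. A short sketch using Python's built-in codecs (`shift_jis`, `euc_jp`, `iso8859_7`); unlike the `echo` pipelines above, no trailing newline byte is added:

```python
samples = [
    ("東京", "shift_jis"),
    ("こんにちは世界", "euc_jp"),
    ("Ελληνικά κείμενο", "iso8859_7"),
]

for text, encoding in samples:
    target = text.encode(encoding)   # legacy encoding
    utf8 = text.encode("utf-8")      # same text, UTF-8
    print(f"{encoding:10} {target.hex(' ')}  "
          f"({len(target)} bytes, {len(utf8)} in UTF-8)")
```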

Text in C

When conversion fails

$ echo 'مرحبا' | iconv -f UTF-8 -t ISO-8859-1
iconv: illegal input sequence at position 0

$ echo 'Résumé' | iconv -f UTF-8 -t CP866
iconv: illegal input sequence at position 1

$ echo -ne '\xEF\xBB\xBFhello' | hexdump -C
00000000  ef bb bf 68 65 6c 6c 6f                  |...hello|

$ echo -ne '\xEF\xBB\xBFhello' | iconv -f UTF-8 -t ASCII//TRANSLIT
hello

Arabic → Latin-1: impossible — the encoding can't hold it
French Résumé → Russian CP866: é doesn't exist in that code page
BOM: 3 invisible bytes at the start — your first "character" is garbage

Note:

مرحبا = "marhaba" = "hello" in Arabic
ISO-8859-1 = Latin-1. Zero Arabic chars → fails at position 0
CP866 = DOS Cyrillic. é doesn't map → fails at position 1 (R is fine, é isn't)
BOM = Byte Order Mark (U+FEFF, encoded EF BB BF in UTF-8). Windows Notepad adds it. Breaks JSON parsers, shell shebangs, and string comparisons
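
The BOM problem is easy to reproduce. A minimal sketch, using Python's `utf-8` and `utf-8-sig` codecs to show the difference between keeping and stripping the mark:

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by ASCII text

plain = data.decode("utf-8")         # BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM removed if present

print(len(plain), repr(plain[0]))    # 6 '\ufeff'
print(len(stripped), stripped)       # 5 hello
print(plain == "hello")              # False
```

The two strings look identical when printed, which is why BOM-related comparison failures are hard to spot.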

Text in C

Longer text, bigger difference

$ printf 'Příliš žluťoučký kůň úpěl ďábelské ódy' | wc -c
53

$ python3 -c "print(len('Příliš žluťoučký kůň úpěl ďábelské ódy'))"
38

$ echo 'Příliš žluťoučký kůň úpěl ďábelské ódy' \
    | iconv -f UTF-8 -t ASCII//TRANSLIT
Prilis zlutoucky kun upel dabelske ody

A Czech pangram: 38 characters, 53 bytes, so about 40% more bytes than characters.

//TRANSLIT strips all diacritics and produces valid ASCII.

Note:

Translation: "Too yellow a horse groaned devilish odes" — a Czech pangram (like "The quick brown fox" but for testing diacritics)
15 extra bytes from accented characters: each adds 1 byte in UTF-8
Czech diacritics: háček (ˇ) = caron (ř, š, č, ž, ň, ď, ť, ě), čárka (´) = acute (á, é, í, ó, ú, ý), kroužek (°) = ring (ů)
Do: DevConf is in Brno — the audience will recognize this pangram

Text in C

How many encodings?

$ iconv -l | wc -l
1180

$ find /usr/lib64/gconv -name '*.so' | wc -l
253

1180 encoding names served by 253 shared libraries.

How does glibc manage this without writing thousands of converters?

Note: Do: LIVE DEMO if possible.

iconv -l = list all encodings. 1180 includes aliases (SHIFT-JIS, SJIS, MS_KANJI = same encoding)
/usr/lib64/gconv/ = where glibc stores converter .so files (Fedora/RHEL). Debian: /usr/lib/x86_64-linux-gnu/gconv/
.so = shared object (dynamically loaded library)
1180 names, 253 plugins: far fewer than the 39,800 converters that direct N×N conversion would need for just 200 encodings

Part 3

Inside glibc's iconv

Note: Do: "We've seen what iconv does from the outside. Now let's look under the hood."

gconv = glibc's internal conversion framework ("g" = GNU, "conv" = conversion)

Inside glibc

The naive approach: N×N converters

Suppose I support 200 encodings. How many converters do I need?

Shift-JIS → UTF-8       UTF-8 → Shift-JIS
Shift-JIS → EUC-KR      EUC-KR → Shift-JIS
UTF-8 → EUC-KR          EUC-KR → UTF-8
...

5 encodings = 20 converters. 200 encodings?

200 × 199 = 39,800 converters. That's not going to work.

Note: Do: Ask "How many converters do I need?" before revealing. Let them guess.

Formula: N × (N-1) for directed pairs
Nobody will write 39,800 converters

Inside glibc

The smart approach: one universal pivot

What if every encoding just learned to convert to one common format?

Shift-JIS  →  ???  →  UTF-8

Note: Hub-and-spoke architecture — same principle as airline routing through hub airports.

Inside glibc

The smart approach: one universal pivot

glibc's gconv framework uses an internal UCS-4 based representation as the pivot.

Shift-JIS  →  UCS-4  →  UTF-8

Now you need just 2 converters per encoding (to UCS-4 and from UCS-4).

200 encodings × 2 = 400 converters instead of 39,800.

Note:

UCS-4 = Universal Coded Character Set, 4-byte form (ISO 10646). Essentially UTF-32
glibc calls it INTERNAL in gconv-modules config
2 converters per encoding → 400 total. 99% reduction
Exception: glibc says "UCS-4 based" — the internal representation has nuances around stateful encodings
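
The arithmetic and the two-step route can both be sketched in a few lines. Here UTF-32 stands in for glibc's INTERNAL pivot, and Python codecs stand in for the gconv modules; the function names are hypothetical:

```python
def direct_converters(n: int) -> int:
    """One converter per ordered pair of distinct encodings."""
    return n * (n - 1)

def pivot_converters(n: int) -> int:
    """One converter to the pivot and one from it, per encoding."""
    return 2 * n

def convert(data: bytes, source: str, target: str) -> bytes:
    """Two-step conversion through a UTF-32 pivot (stand-in for INTERNAL)."""
    pivot = data.decode(source).encode("utf-32-le")   # step 0: source -> pivot
    return pivot.decode("utf-32-le").encode(target)   # step 1: pivot -> target

print(direct_converters(200), pivot_converters(200))  # 39800 400

sjis = bytes.fromhex("938c8b9e")                      # 東京 in Shift-JIS
print(convert(sjis, "shift_jis", "utf-8").decode("utf-8"))  # 東京
```

Neither `source` nor `target` needs to know about the other; each only has to speak the pivot.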

Inside glibc

The lookup table: `gconv-modules`

# iconvdata/gconv-modules
#   from             to              module     cost
module  ISO-8859-1// INTERNAL        ISO8859-1   1
module  INTERNAL     ISO-8859-1//    ISO8859-1   1

# iconvdata/gconv-modules-extra.conf
module  SJIS//       INTERNAL        SJIS        1
module  INTERNAL     SJIS//          SJIS        1

INTERNAL = the UCS-4 pivot

Each line maps an encoding to a .so plugin. iconv_open reads this file, loads the right plugins, and chains them.

Note: These are actual files from the glibc source tree.

Format: module FROM// TO MODULE_NAME COST
INTERNAL = glibc's name for UCS-4
Cost = routing weight when multiple paths exist (lower = preferred)
Each encoding shown here has 2 lines, one per direction. Hub-and-spoke in practice
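
To make the lookup concrete, here is a toy parser for the `module` lines shown on the slide. It is a sketch, not glibc's parser: it ignores `alias` lines and cost-based route selection, and treating a missing cost as 1 is an assumption:

```python
CONFIG = """\
module  ISO-8859-1// INTERNAL        ISO8859-1   1
module  INTERNAL     ISO-8859-1//    ISO8859-1   1
module  SJIS//       INTERNAL        SJIS        1
module  INTERNAL     SJIS//          SJIS        1
"""

def parse_modules(text: str) -> dict[tuple[str, str], tuple[str, int]]:
    """Map (from, to) -> (module name, cost) for each `module` line."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 4 and fields[0] == "module":
            source, target, name = fields[1], fields[2], fields[3]
            cost = int(fields[4]) if len(fields) > 4 else 1  # assumed default
            table[(source.rstrip("/"), target.rstrip("/"))] = (name, cost)
    return table

def route(table, source: str, target: str) -> list[str]:
    """Chain source -> INTERNAL -> target and return the module names used."""
    first, _ = table[(source, "INTERNAL")]
    second, _ = table[("INTERNAL", target)]
    return [first, second]

modules = parse_modules(CONFIG)
print(route(modules, "SJIS", "ISO-8859-1"))  # ['SJIS', 'ISO8859-1']
```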

Inside glibc

The conversion pipeline

flowchart TB
    A["Shift-JIS bytes"] --> B["SJIS.so\n(gconv module)"]
    B --> C["UCS-4\n(internal pivot)"]
    C --> D["UTF-8 converter\n(built-in)"]
    D --> E["UTF-8 bytes"]
    style C fill:#0f62fe,stroke:#78a9ff,color:#fff
    style B fill:#393939,stroke:#78a9ff,color:#c6c6c6
    style D fill:#393939,stroke:#78a9ff,color:#c6c6c6
    style A fill:#262626,stroke:#525252,color:#f1c21b
    style E fill:#262626,stroke:#525252,color:#42be65

Adding a new encoding = writing one .so plugin.

Note: Do: THIS IS THE MONEY SLIDE. Spend time here. Point at each box:

"Shift-JIS bytes come in"
"SJIS.so converts to UCS-4"
"UTF-8 converter turns UCS-4 into UTF-8"
"UTF-8 bytes come out"

Adding a new encoding = one .so that converts to/from UCS-4. People will photograph this.

Inside glibc

The iconv flow

sequenceDiagram
    participant App as Your Code
    participant glibc as glibc internals
    App->>glibc: iconv_open("UTF-8", "SJIS")
    Note right of glibc: look up gconv-modules
    Note right of glibc: load SJIS.so + UTF-8
    Note right of glibc: build step chain
    glibc-->>App: return descriptor
    App->>glibc: iconv(cd, &in, ...)
    Note right of glibc: step[0]: SJIS → UCS-4
    Note right of glibc: step[1]: UCS-4 → UTF-8
    glibc-->>App: advance pointers
    App->>glibc: iconv_close(cd)
    Note right of glibc: free chain, unload modules

Three calls. That's the entire API.

Note: The API in three calls:

iconv_open → returns descriptor (pointer to gconv_info struct with step chain)
iconv → walks the chain. Both in/out pointers advance. Errors: EILSEQ (illegal sequence), E2BIG (output buffer full — flush and retry, not a real error), EINVAL (incomplete sequence)
iconv_close → free chain, unload modules

Highlight: E2BIG is the #1 mistake — people call iconv once and assume it's done
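
The "call it once and assume it's done" mistake has a close relative on the input side. Python has no fixed output buffer, so E2BIG has no direct counterpart, but the EINVAL case (input that ends mid-character) can be sketched with an incremental decoder. This is an analogy to the iconv loop, not the C API itself:

```python
import codecs

data = "Dvořák".encode("utf-8")
chunks = [data[:4], data[4:]]  # split in the middle of ř (C5 | 99)

# The decoder keeps the dangling C5 byte until the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
# final=True raises UnicodeDecodeError if the input ended mid-character.
pieces.append(decoder.decode(b"", final=True))

print(pieces)           # ['Dvo', 'řák', '']
print("".join(pieces))  # Dvořák
```

As with `iconv`, the caller has to keep feeding input and keep the conversion state alive between calls; decoding each chunk independently would fail on the first one.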

Part 4

Does this still matter?

Note: Do: "Modern languages have Unicode strings by default. So why should anyone care about iconv in 2026?"

Relevance today

How modern languages handle encoding

| Language | Strings are... | Encoding conversion |
|----------|----------------|---------------------|
| Python 3 | Unicode internally | Built-in codecs |
| Go | UTF-8 by definition | `golang.org/x/text` |
| Rust | Always valid UTF-8 | `encoding_rs` crate |
| Java | UTF-16 internally | `java.nio.charset` |
| C/C++ | Just bytes, no encoding | `iconv` |

Modern languages solved this by making strings Unicode-native. C didn't — and can't, because it would break 50 years of code.

Note:

C can't change because char = 1 byte is baked into the language spec and ABI (Application Binary Interface)
Even modern languages need encoding conversion at I/O boundaries — files, sockets, C library calls via FFI (Foreign Function Interface)
Python's codecs, Go's x/text, Rust's encoding_rs all exist because the outside world isn't always UTF-8

Relevance today

Encoding bugs are alive and well

The Turkish İ problem

| Locale | `toupper('i')` |
|--------|----------------|
| en_US | I |
| tr_TR | İ (dotted!) |

Tests pass in English, break in Turkish.

`//IGNORE` inconsistency

$ echo 'héllo' | iconv \
  -f UTF-8 -t ASCII//IGNORE

Some modules skip the bad byte. Some stop with an error. Same flag, different behavior.

Every time a language reads a file, parses a socket, or calls a C library — encoding conversion still happens. These bugs still bite.

Note: Turkish İ:

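Turkish has 4 i's: i, İ, ı, I. In tr_TR, uppercase i is İ (U+0130), not I. In a UTF-8 locale this shows up through towupper(L'i'), since İ does not fit in the single byte that toupper returns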
Any case-insensitive comparison using toupper/tolower is locale-dependent

//IGNORE:

Behavior depends on which gconv module runs — inconsistent across encodings
This is a real unfixed glibc bug. This is what got me into the codebase
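
Locale-independent case mapping does not make the Turkish problem go away; it moves it. A short sketch: Python's `str.upper()` and `str.lower()` apply the default Unicode mappings regardless of locale, so Turkish text breaks in the opposite direction from C's locale-aware functions:

```python
print("i".upper())       # I  (Turkish expects İ, U+0130)
print("ı".upper())       # I  (dotless ı also maps to I)
print(len("İ".lower()))  # 2  (i followed by U+0307, combining dot above)

word = "ısı"             # Turkish for "heat"
round_trip = word.upper().lower()
print(round_trip == word)  # False: the round trip yields "isi"
```

Either way, a case-insensitive comparison built on upper/lower round trips cannot be correct for every language at once.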

glibc Development Workshop — Third Edition

Led by Arjun Shankar (Red Hat, glibc developer)

Tomorrow, Friday June 19 · 10:15 AM · Room A218

Pick a bug, get a cheat sheet, ship a patch.

6 patches in 2024 · 15+ in 2025 · yours in 2026?

Note: Do: Tell the personal story: "Two years ago I walked into this workshop at DevConf. Arjun gave me a small iconv task. I got curious, fell down the rabbit hole, and that became this talk. That one task turned into 14 patches in glibc."

Arjun Shankar = Red Hat engineer, glibc developer. Runs this workshop yearly at DevConf.CZ
Format: show up, get a cheat sheet with a small bug + pointers, experienced contributors help you submit
Room A218, capacity 20. First come, first served
"If anything in this talk made you curious, room A218 tomorrow morning."

Questions? · Resources

Joel Spolsky — "The Absolute Minimum Every Software Developer Must Know About Unicode"
GNU C Library Manual — "Character Set Handling" chapter
unicode.org — the specification

avinal.space · @avinal

Attendance at DevConf.CZ 2026 was supported by the GNU Toolchain Fund, a part of the FSF's Working Together for Free Software Fund.

Note: Do: Leave this up during Q&A.

Joel Spolsky's article (2003) — the classic intro, entertaining
glibc manual — authoritative API reference (sourceware.org/glibc/manual)
GNU Toolchain Fund = part of the FSF's (Free Software Foundation) "Working Together for Free Software" fund
