From 96ea6019ae29fe79e3f51253a39f291173977704 Mon Sep 17 00:00:00 2001
From: Avinal Kumar
Date: Tue, 16 Jun 2026 13:28:38 +0530
Subject: [PATCH] Add Reveal.js slides for DevConf.CZ 2026 talk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

"Lost in Transliteration: Why strlen("Dvořák") Returns 8"
Scheduled June 18, 2026 at 10:15 in room E104.

- Self-contained Reveal.js 5.1.0 deck loaded from CDN
- Markdown-based slides (slides.md) with HTML shell (index.html)
- IBM Carbon Design System theme with custom syntax highlighting
- Mermaid diagrams for gconv pipeline and iconv flow
- Speaker notes with full forms, translations, and delivery instructions
- Served at /talks/devconf-2026/ as a static page

Signed-off-by: Avinal Kumar
---
 public/talks/devconf-2026/index.html | 342 ++++++++++
 public/talks/devconf-2026/slides.md  | 923 +++++++++++++++++++++++++++
 2 files changed, 1265 insertions(+)
 create mode 100644 public/talks/devconf-2026/index.html
 create mode 100644 public/talks/devconf-2026/slides.md

diff --git a/public/talks/devconf-2026/index.html b/public/talks/devconf-2026/index.html
new file mode 100644
index 0000000..8a08f70
--- /dev/null
+++ b/public/talks/devconf-2026/index.html
@@ -0,0 +1,342 @@
+
+
+
+
+
+ Lost in Transliteration: Why strlen("Dvořák") Returns 8
+
+
+
+
+
+
+
+
+
+
+
+
+
+ + + + + + + + + + diff --git a/public/talks/devconf-2026/slides.md b/public/talks/devconf-2026/slides.md new file mode 100644 index 0000000..e7dc7fb --- /dev/null +++ b/public/talks/devconf-2026/slides.md @@ -0,0 +1,923 @@ + + + + + + +```bash +$ printf 'Dvořák' | wc -c +``` + + +Note: +**Do:** Walk on stage, put terminal on screen, no output yet. Pause 3-4 seconds. Ask: "What do you think this prints?" + +- **wc** = word count; **-c** = count bytes (not characters) +- Dvořák = Czech composer surname, pronounced "DVOR-zhahk" + +-- + + + +```bash +$ printf 'Dvořák' | wc -c +8 +``` + + +Note: +**Do:** Reveal the 8. Pause. Dvořák has 6 visible letters — why 8? Don't explain yet. + +- wc -c counts bytes, not characters — this is POSIX behavior, not a bug + +-- + + + +```bash +$ printf 'Dvořák' | wc -c +8 +``` + + +
+ +How many people think this is **wrong**? + + +Note: +**Do:** Ask the question. Wait 5 seconds. Let hands go up. Do NOT answer yet. + +-- + + + +```bash +$ printf 'Dvořák' | wc -c +8 + +$ python3 -c "print(len('Dvořák'))" +6 +``` + + +Note: +Two different answers for the same string. Let the confusion build. + +- Python 3 len() counts Unicode code points, not bytes +- *Exception:* Python 2 len() counted bytes — this changed in 2→3 + +-- + + + +```bash +$ printf '😀' | wc -c +4 + +$ python3 -c "print(len('😀'))" +1 +``` + + +Note: +An emoji: 4 bytes vs 1 character. + +- 😀 = U+1F600 "Grinning Face." Needs 4 bytes in UTF-8 (F0 9F 98 80) because it's above the **BMP** (Basic Multilingual Plane, U+0000–U+FFFF) +- *Exception:* On macOS, `echo` appends a newline — use `printf` to avoid off-by-one + +-- + + + +Which one is **correct**? + + +All of them. + + +Understanding why is basically the entire talk. + + +Note: +**Do:** Pause before "All of them." Then: *"They're counting different things. wc counts bytes. Python counts code points. Both correct."* + +**Key thesis:** bytes ≠ characters ≠ code points + +--- + + + + + + + +## Lost in Transliteration + +Why `strlen("Dvořák")` Returns **8** + + +
+ +Avinal Kumar · glibc contributor + + +DevConf.CZ 2026 + + +Note: +**Do:** Brief intro, under 30 seconds: +*"I'm Avinal. I contribute to glibc — the GNU C Library. I got into character encodings through an iconv bug at the glibc workshop here at DevConf. Today I'll take you through that journey."* + +- **glibc** = GNU C Library — the standard C library on most Linux distros +- **iconv** = POSIX API for converting text between character encodings + +--- + + + +### Today we'll answer + +1. Why does `strlen("Dvořák")` return 8? +2. Why does Unicode exist? +3. How does the C library handle text? +4. How does `iconv` convert between encodings? +5. Does any of this still matter in 2026? + +Note: +**Do:** Read out loud. Give the audience a roadmap. Don't linger. + +- **strlen** = "string length" — counts bytes before the null terminator, NOT characters + +--- + + + +There is no such thing as plain text. + +
+ +If you remember one thing from this talk, remember that sentence. + + +Note: +**Do:** Say this slowly. Pause. *"If you remember one thing, remember that sentence."* + +- "Plain text" implies no encoding — but every byte sequence *has* an encoding. If you don't know it, you're guessing. Wrong guess = **mojibake** (文字化け, Japanese for garbled text, pronounced "mo-ji-ba-keh") + +--- + + + + + + + +

How we ended up with this mess

+ +### ASCII: The 7-bit world + +
+
+
+- 128 characters (0–127)
+- 7 bits per character
+- English letters, digits, punctuation
+- Bit 8 was "spare"
+
+
+
+
+```text
+0x41 = A
+0x61 = a
+0x30 = 0
+0x20 = (space)
+0x0A = (newline)
+```
+
+
+
+ +*"And all was good — if you spoke English."* + + +Note: +- **ASCII** = American Standard Code for Information Interchange (1963) +- 7 bits = 128 values. The 8th bit was for parity checking on noisy telegraph lines +- Only covers English — no accented chars, no Cyrillic, no CJK, no Arabic + +--- + + + +

How we ended up with this mess

+
+### Code Pages: Everyone fills bit 8 differently
+
+If I send byte `0xE9` from Paris to Moscow, what character arrives?
+
+
+| Byte | CP-1252 (Western) | CP-866 (Russian) | CP-862 (Hebrew) |
+|------|-------------------|------------------|-----------------|
+| `0xE9` | é | щ | Θ |
+| `0xC4` | Ä | ─ | ─ |
+| `0xF1` | ñ | ё | ± |
+
+
+
+CJK needed **thousands** — multi-byte encodings (Shift-JIS, EUC-KR, GB2312) where you can't even move backward in a string.
+
+
+Note:
+**Do:** Ask *"If I send byte 0xE9 from Paris to Moscow, what character arrives?"* before revealing the table.
+
+- **CP** = Code Page. CP-1252 = Windows Western. CP-866 = DOS Russian. CP-862 = DOS Hebrew
+- CP-862 keeps its Hebrew letters at `0x80`–`0x9A`; the rest of its upper half matches CP-437, which is why `0xE9` arrives there as a Greek theta
+- Same byte, different characters — the bytes are correct, the *interpretation* is wrong
+- **CJK** = Chinese, Japanese, Korean
+- **Shift-JIS** = Shift Japanese Industrial Standards. **EUC-KR** = Extended Unix Code for Korean. **GB2312** = Chinese National Standard
+- *Exception:* Multi-byte encodings have a "forward-only" problem — you can't tell if a byte is byte 1 or byte 2 of a character
+
+---
+
+
+
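The "same byte, different character" effect from the code page table can be reproduced with Python's built-in codecs. This is a minimal sketch: `interpret` is a hypothetical helper, and `cp1252`, `cp866`, `cp862` are Python's stdlib spellings of the code pages named on the slide.

```python
# Decode the same byte under three legacy code pages.
# Codec names are Python's stdlib spellings of CP-1252, CP-866 and CP-862.
CODE_PAGES = ["cp1252", "cp866", "cp862"]


def interpret(byte_value: int) -> dict:
    """Return the character each code page assigns to one byte value."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CODE_PAGES}


if __name__ == "__main__":
    for value in (0xE9, 0xC4, 0xF1):
        row = interpret(value)
        cells = "  ".join(f"{name}={char}" for name, char in row.items())
        print(f"0x{value:02X}  {cells}")
```

The bytes never change between the three columns; only the codec passed to `decode` does.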

How we ended up with this mess

+
+### Unicode: One number per character
+
+```text
+U+0041 = A    U+00E9 = é    U+010D = č
+U+0639 = ع    U+4E16 = 世    U+1F600 = 😀
+```
+
+- Code points are **abstract numbers**, not bytes
+- Not "16-bit characters" — that's the myth
+- 154,998 characters across 168 scripts (Unicode 16.0)
+
+Unicode separated the *idea* of a character from how it's stored.
+
+
+Note:
+- **Unicode** = the Unicode Standard (first published 1991 by the Unicode Consortium). **UCS**, the Universal Coded Character Set, is the matching ISO/IEC 10646 standard
+- Code points are abstract numbers — how you *store* them is a separate question (that's what encodings answer)
+- *Exception:* "Unicode is 16-bit" myth comes from Unicode 1.0 (1991) which only planned 65,536 chars. Unicode 2.0 (1996) expanded beyond 16 bits. Java and Windows adopted 16-bit characters (UCS-2) before that expansion, and are now stuck with UTF-16
+- **BMP** = Basic Multilingual Plane (U+0000–U+FFFF). Characters above it (emoji, rare scripts) are in supplementary planes
+
+---
+
+
+
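To make "code points are abstract numbers" concrete, the sketch below looks up the number behind each character from the slide and reports which plane it falls in. `describe` is a hypothetical helper name; `ord` is the built-in that returns a character's code point.

```python
def describe(char: str) -> str:
    """Format one character as 'U+XXXX plane N'."""
    code_point = ord(char)      # the abstract number, independent of storage
    plane = code_point >> 16    # plane 0 is the BMP (U+0000..U+FFFF)
    return f"U+{code_point:04X} plane {plane}"


if __name__ == "__main__":
    for char in ["A", "é", "č", "ع", "世", "😀"]:
        print(char, describe(char))
```

Nothing here says how many bytes any of these characters occupy; that only becomes a question once an encoding is chosen.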

How we ended up with this mess

+ +### Encodings: Serialization formats + +
+
+

UTF-8

+
+- 1–4 bytes
+- ASCII-compatible
+- 98% of the web
+
+

UTF-16

+
+- 2 or 4 bytes
+- Byte order matters (BOM)
+- Windows, Java
+
+

UTF-32

+
+- Fixed 4 bytes
+- Simple but wasteful
+- glibc internal
+
+ +Note: +- **UTF** = Unicode Transformation Format +- **UTF-8:** Designed 1992 by Ken Thompson & Rob Pike. ASCII bytes are identical — this is why it won. 98.2% of websites (W3Techs, 2024) +- **UTF-16:** Uses surrogate pairs above U+FFFF. **BOM** = Byte Order Mark (U+FEFF) — indicates endianness +- **UTF-32:** Also called **UCS-4** (Universal Coded Character Set, 4-byte). "hello" = 20 bytes instead of 5 +- *Exception:* UTF-32 and UCS-4 are technically from different standards (ISO 10646 vs Unicode), but identical in practice + +-- + + + +
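The surrogate-pair rule mentioned for UTF-16 is short enough to work through by hand. A minimal sketch, with `to_surrogates` as a hypothetical helper name; the constants `0xD800`, `0xDC00` and the 10-bit split are the standard UTF-16 rule.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(ord("😀"))
    print(f"{high:04X} {low:04X}")
    # Cross-check against Python's own big-endian UTF-16 encoder.
    expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
    print("😀".encode("utf-16-be") == expected)
```

This is why a single emoji is two 16-bit units in UTF-16, and why `wcslen` reports 2 for it on platforms with a 2-byte `wchar_t`.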

How we ended up with this mess

+ +### Encodings: Serialization formats + +
+"Dvořák" in UTF-8:  44 76 6F C5 99 C3 A1 6B    8 bytes
+"Dvořák" in UTF-32: 00000044 00000076 0000006F 00000159 000000E1 0000006B  24 bytes +
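The two hex dumps above can be regenerated character by character. This is a sketch under one assumption: Python's `utf-32-be` codec stands in for the UTF-32 row, because it emits no BOM. `byte_breakdown` is a hypothetical helper.

```python
def byte_breakdown(text: str, encoding: str) -> list:
    """Pair each character with the hex of its encoded bytes."""
    return [(char, char.encode(encoding).hex(" ").upper()) for char in text]


if __name__ == "__main__":
    word = "Dvořák"
    for encoding in ("utf-8", "utf-32-be"):
        rows = byte_breakdown(word, encoding)
        total = len(word.encode(encoding))
        cells = " | ".join(f"{char}={hexed}" for char, hexed in rows)
        print(f"{encoding}: {total} bytes: {cells}")
```

The per-character view shows where the extra bytes come from: only ř and á grow in UTF-8, while every character costs four bytes in UTF-32.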
+ +There is no such thing as plain text. + + +Note: +UTF-8 breakdown: +- D, v, o, k = 1 byte each (ASCII range) +- ř = C5 99 (2 bytes, U+0159) +- á = C3 A1 (2 bytes, U+00E1) +- Total: 4×1 + 2×2 = **8 bytes** for 6 characters + +UTF-32: every char = 4 bytes → 6×4 = **24 bytes**. Same string, 3× the size. + +--- + + + + + + + +Part 2 + +## Text in C: What actually happens + +Note: +**Do:** *"Now we understand WHY bytes and characters differ. Let's see how C deals with it."* + +--- + + + +

Text in C

+ +### C has two ways to see a string + +
+
+
+#### `char` — bytes
+- 1 byte per element, no encoding info
+- `strlen("Dvořák")` → **8**
+- `strlen("😀")` → **4**
+- Indexing gives you bytes, not characters
+
+
+
+
+#### `wchar_t` — code points
+- 4 bytes on Linux, 2 on Windows
+- `wcslen(L"Dvořák")` → **6**
+- `wcslen(L"😀")` → **1**
+- Indexing gives you characters
+
+
+
+ +
+ +`mbrtowc()` bridges between them. `setlocale()` tells it which encoding to expect. + + +Note: +- **wchar_t** = "wide character type." Linux: 4 bytes (UCS-4). Windows: 2 bytes (UTF-16) +- **wcslen** = "wide character string length" +- **L"..."** prefix = wide string literal +- **mbrtowc** = "multibyte restartable to wide character" — converts one multibyte char to one wchar_t +- **setlocale** with LC_CTYPE tells mbrtowc the encoding. Without it → "C" locale = ASCII only +- *Exception:* On Windows, wcslen(L"😀") returns **2** (surrogate pair), not 1 + +--- + + + +

Text in C

+
+### What does "Dvořák" look like in memory?
+
+```text
+  Character:    D    v    o    ř      á      k
+  UTF-8 hex:    44   76   6F   C5 99  C3 A1  6B
+  Bytes:        1    1    1    2      2      1    = 8 bytes
+  Code points:  1    1    1    1      1      1    = 6 characters
+```
+
+`strlen` counts the **Bytes** row. `wcslen` counts the **Code points** row.
+
+
+Now you know why `strlen("Dvořák")` returns 8.
+
+
+Note:
+**Do:** Point at the diagram: *"strlen counts bytes: 1+1+1+2+2+1 = 8. wcslen counts characters: always 1 each = 6. Both correct."*
+
+This is the answer to the opening mystery.
+
+---
+
+
+
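The two counts in the memory diagram can be derived from the raw bytes alone, the way a C loop over a `char` buffer might do it. A minimal sketch that assumes the input is valid UTF-8; both function names are hypothetical, and this is not how glibc's `mbrtowc` is implemented.

```python
def count_bytes(data: bytes) -> int:
    """What strlen sees: every byte before the terminator."""
    return len(data)


def count_code_points(data: bytes) -> int:
    """Count UTF-8 lead bytes, skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


if __name__ == "__main__":
    raw = "Dvořák".encode("utf-8")
    print(count_bytes(raw), count_code_points(raw))
```

Continuation bytes always start with the bits `10`, so skipping them leaves exactly one count per code point.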

Text in C

+
+### `iconv` — converting between encodings
+
+```bash
+$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII
+iconv: illegal input sequence at position 3
+```
+
+
+Note:
+- **iconv** = both a C API (iconv_open/iconv/iconv_close in `<iconv.h>`) and a CLI tool
+- **-f** = from, **-t** = to
+- Position 3 = 4th byte (0-indexed) = where ř starts. ASCII only has 0–127; C5 = 197 → fails
+- **EILSEQ** = "illegal sequence" errno value
+
+--
+
+
+

Text in C

+ +### `iconv` — converting between encodings + +```bash +$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII +iconv: illegal input sequence at position 3 + +$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//TRANSLIT +Dvorak + +$ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//IGNORE +Dvok +``` + + +- **`//TRANSLIT`** — approximate: ř→r, á→a +- **`//IGNORE`** — drop what doesn't fit + +Note: +- **//TRANSLIT** = transliteration. Appended to target encoding. Finds closest match: ř→r, á→a, ö→o, ñ→n +- **//IGNORE** = silently drop unconvertible chars. Notice "Dvok" — both ř AND á dropped +- *Exception:* //TRANSLIT is glibc-specific, not POSIX. musl libc (Alpine Linux) doesn't support it + +--- + + + +
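The three behaviours on this slide (fail, approximate, drop) have rough counterparts in Python's codec error handlers. This is an analogy and not glibc's algorithm: the approximation below only decomposes accented letters and strips the combining marks, whereas `//TRANSLIT` uses locale transliteration tables. Both helper names are hypothetical.

```python
import unicodedata


def to_ascii_ignore(text: str) -> str:
    """Drop every character ASCII cannot hold (roughly what //IGNORE does)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def to_ascii_approx(text: str) -> str:
    """Approximate accented Latin letters: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    word = "Dvořák"
    try:
        word.encode("ascii")    # strict mode, like plain -t ASCII
    except UnicodeEncodeError as err:
        # err.start is a character index, not a byte offset
        print("strict failed at character index", err.start)
    print(to_ascii_ignore(word))
    print(to_ascii_approx(word))
```

Characters without a decomposition (for example ø or ß) are simply dropped by this sketch, which is one reason a real transliteration table is needed.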

Text in C

+ +### Real encoding pairs from across the world + +```bash +$ echo '東京' | iconv -f UTF-8 -t SHIFT_JIS | hexdump -C +00000000 93 8c 8b 9e 0a |.....| + +$ echo 'こんにちは世界' | iconv -f UTF-8 -t EUC-JP | hexdump -C +00000000 a4 b3 a4 f3 a4 cb a4 c1 a4 cf c0 a4 b3 a6 0a |...............| + +$ echo 'Ελληνικά κείμενο' | iconv -f UTF-8 -t ISO-8859-7 | hexdump -C +00000000 c5 eb eb e7 ed e9 ea dc 20 ea e5 df ec e5 ed ef |........ .......| +``` + +Same characters, completely different bytes — depending on the encoding. + + +Note: +- 東京 = Tōkyō (Tokyo) +- こんにちは世界 = "Konnichiwa Sekai" = "Hello World" +- Ελληνικά κείμενο = "Elliniká keímeno" = "Greek text" +- **hexdump -C** = canonical hex+ASCII dump. Non-ASCII shows as dots +- Same text in Shift-JIS vs EUC-JP → completely different bytes. Without knowing the encoding, unreadable + +--- + + + +

Text in C

+ +### When conversion fails + +```bash +$ echo 'مرحبا' | iconv -f UTF-8 -t ISO-8859-1 +iconv: illegal input sequence at position 0 + +$ echo 'Résumé' | iconv -f UTF-8 -t CP866 +iconv: illegal input sequence at position 1 + +$ echo -ne '\xEF\xBB\xBFhello' | hexdump -C +00000000 ef bb bf 68 65 6c 6c 6f |...hello| + +$ echo -ne '\xEF\xBB\xBFhello' | iconv -f UTF-8 -t ASCII//TRANSLIT +hello +``` + +- Arabic → Latin-1: impossible — the encoding can't hold it +- French Résumé → Russian CP866: `é` doesn't exist in that code page +- BOM: 3 invisible bytes at the start — your first "character" is garbage + +Note: +- مرحبا = "marhaba" = "hello" in Arabic +- **ISO-8859-1** = Latin-1. Zero Arabic chars → fails at position 0 +- **CP866** = DOS Cyrillic. é doesn't map → fails at position 1 (R is fine, é isn't) +- **BOM** = Byte Order Mark (U+FEFF, encoded EF BB BF in UTF-8). Windows Notepad adds it. Breaks JSON parsers, shell shebangs, and string comparisons + +--- + + + +
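The BOM case is easy to reproduce outside the shell. A small sketch using Python's stdlib: `codecs.BOM_UTF8` is the three bytes `EF BB BF`, and the `utf-8-sig` codec is the stdlib variant that drops one leading BOM while decoding.

```python
import codecs

data = codecs.BOM_UTF8 + b"hello"    # EF BB BF 68 65 6C 6C 6F

plain = data.decode("utf-8")         # the BOM survives as U+FEFF
clean = data.decode("utf-8-sig")     # this codec drops one leading BOM

if __name__ == "__main__":
    print(len(plain), repr(plain[0]))
    print(len(clean), clean)
```

A comparison such as `plain == "hello"` is false even though both strings look identical on screen, which is the kind of bug the slide warns about.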

Text in C

+ +### Longer text, bigger difference + +```bash +$ printf 'Příliš žluťoučký kůň úpěl ďábelské ódy' | wc -c +53 + +$ python3 -c "print(len('Příliš žluťoučký kůň úpěl ďábelské ódy'))" +38 + +$ echo 'Příliš žluťoučký kůň úpěl ďábelské ódy' \ + | iconv -f UTF-8 -t ASCII//TRANSLIT +Prilis zlutoucky kun upel dabelske ody +``` + +A Czech pangram: **38 characters**, **53 bytes** — a 40% difference. + + +`//TRANSLIT` strips all diacritics and produces valid ASCII. + + +Note: +- **Translation:** "Too yellow a horse groaned devilish odes" — a Czech pangram (like "The quick brown fox" but for testing diacritics) +- 15 extra bytes from accented characters: each adds 1 byte in UTF-8 +- Czech diacritics: **háček** (ˇ) = caron (ř, š, č, ž, ň, ď, ť, ě), **čárka** (´) = acute (á, é, í, ó, ú), **kroužek** (°) = ring (ů) +- **Do:** DevConf is in Brno — the audience will recognize this pangram + +--- + + + +

Text in C

+ +### How many encodings? + +```bash +$ iconv -l | wc -l +1180 + +$ find /usr/lib64/gconv -name '*.so' | wc -l +253 +``` + +**1180** encoding names served by **253** shared libraries. + + +How does glibc manage this without writing thousands of converters? + + +Note: +**Do:** LIVE DEMO if possible. + +- **iconv -l** = list all encodings. 1180 includes aliases (SHIFT-JIS, SJIS, MS_KANJI = same encoding) +- **/usr/lib64/gconv/** = where glibc stores converter .so files (Fedora/RHEL). Debian: /usr/lib/x86_64-linux-gnu/gconv/ +- **.so** = shared object (dynamically loaded library) +- 1180 names, 253 plugins — far fewer than the 39,800 needed for N×N + +--- + + + + + + + +Part 3 + +## Inside glibc's iconv + +Note: +**Do:** *"We've seen what iconv does from the outside. Now let's look under the hood."* + +- **gconv** = glibc's internal conversion framework ("g" = GNU, "conv" = conversion) + +--- + + + +

Inside glibc

+ +### The naive approach: N×N converters + +Suppose I support 200 encodings. How many converters do I need? + + +```text +Shift-JIS → UTF-8 UTF-8 → Shift-JIS +Shift-JIS → EUC-KR EUC-KR → Shift-JIS +UTF-8 → EUC-KR EUC-KR → UTF-8 +... +``` + + +5 encodings = 20 converters. 200 encodings? + + +200 × 199 = 39,800 converters. That's not going to work. + + +Note: +**Do:** Ask *"How many converters do I need?"* before revealing. Let them guess. + +- Formula: N × (N-1) for directed pairs +- Nobody will write 39,800 converters + +--- + + + +

Inside glibc

+ +### The smart approach: one universal pivot + +What if every encoding just learned to convert to **one common format**? + + +```text +Shift-JIS → ??? → UTF-8 +``` + + +Note: +Hub-and-spoke architecture — same principle as airline routing through hub airports. + +-- + + + +

Inside glibc

+ +### The smart approach: one universal pivot + +glibc's gconv framework uses an internal **UCS-4 based representation** as the pivot. + +```text +Shift-JIS → UCS-4 → UTF-8 +``` + + +Now you need just **2 converters per encoding** (to UCS-4 and from UCS-4). + + +200 encodings × 2 = 400 converters instead of 39,800. + + +Note: +- **UCS-4** = Universal Coded Character Set, 4-byte form (ISO 10646). Essentially UTF-32 +- glibc calls it **INTERNAL** in gconv-modules config +- 2 converters per encoding → 400 total. 99% reduction +- *Exception:* glibc says "UCS-4 *based*" — the internal representation has nuances around stateful encodings + +--- + + + +
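The converter counts for both designs follow from two one-line formulas. The sketch below simply evaluates them for the slide's numbers; the function names are illustrative.

```python
def pairwise_converters(n: int) -> int:
    """One converter per ordered pair of distinct encodings."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """One converter to the pivot and one from it, per encoding."""
    return 2 * n


if __name__ == "__main__":
    for n in (5, 200):
        direct, hub = pairwise_converters(n), pivot_converters(n)
        print(f"n={n}: direct={direct} pivot={hub} saved={1 - hub / direct:.1%}")
```

The direct design grows quadratically and the pivot design linearly, so the gap widens with every encoding added.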

Inside glibc

+ +### The lookup table: `gconv-modules` + +
# iconvdata/gconv-modules
+#   from             to              module     cost
+module  ISO-8859-1// INTERNAL        ISO8859-1   1
+module  INTERNAL     ISO-8859-1//    ISO8859-1   1
+ +
# iconvdata/gconv-modules-extra.conf
+module  SJIS//       INTERNAL        SJIS        1
+module  INTERNAL     SJIS//          SJIS        1
+ +`INTERNAL` = the UCS-4 pivot + + +Each line maps an encoding to a `.so` plugin. `iconv_open` reads this file, loads the right plugins, and chains them. + + +Note: +These are actual files from the glibc source tree. + +- Format: `module FROM// TO MODULE_NAME COST` +- **INTERNAL** = glibc's name for UCS-4 +- **Cost** = routing weight when multiple paths exist (lower = preferred) +- Each encoding has exactly 2 lines — one each direction. Hub-and-spoke in practice + +--- + + + +
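To show how the `module FROM// TO MODULE_NAME COST` lines turn into a conversion chain, here is a toy parser and planner. It is a sketch under stated assumptions, not glibc's loader: `parse_modules` and `plan` are hypothetical names, the cost is assumed to default to 1 when omitted, and aliases, cost-based path search and built-in converters are all left out.

```python
SAMPLE = """\
# from             to              module     cost
module  ISO-8859-1// INTERNAL        ISO8859-1   1
module  INTERNAL     ISO-8859-1//    ISO8859-1   1
module  SJIS//       INTERNAL        SJIS        1
module  INTERNAL     SJIS//          SJIS        1
"""


def parse_modules(text: str) -> dict:
    """Map (from, to) -> (module name, cost), skipping comments and blanks."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "module":
            continue
        source, target, module = fields[1], fields[2], fields[3]
        cost = int(fields[4]) if len(fields) > 4 else 1  # assumed default
        table[(source.rstrip("/"), target.rstrip("/"))] = (module, cost)
    return table


def plan(table: dict, source: str, target: str) -> list:
    """Two-step chain through the pivot: source -> INTERNAL -> target."""
    first = table[(source, "INTERNAL")][0]
    second = table[("INTERNAL", target)][0]
    return [
        f"{source} -> INTERNAL via {first}",
        f"INTERNAL -> {target} via {second}",
    ]


if __name__ == "__main__":
    for step in plan(parse_modules(SAMPLE), "SJIS", "ISO-8859-1"):
        print(step)
```

Every route in this toy table passes through `INTERNAL`, which is the hub-and-spoke shape the two lines per encoding imply.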

Inside glibc

+ +### The conversion pipeline + +
+
+flowchart TB
+    A["Shift-JIS bytes"] --> B["SJIS.so\n(gconv module)"]
+    B --> C["UCS-4\n(internal pivot)"]
+    C --> D["UTF-8 converter\n(built-in)"]
+    D --> E["UTF-8 bytes"]
+    style C fill:#0f62fe,stroke:#78a9ff,color:#fff
+    style B fill:#393939,stroke:#78a9ff,color:#c6c6c6
+    style D fill:#393939,stroke:#78a9ff,color:#c6c6c6
+    style A fill:#262626,stroke:#525252,color:#f1c21b
+    style E fill:#262626,stroke:#525252,color:#42be65
+
+
+ +Adding a new encoding = writing **one** `.so` plugin. + + +Note: +**Do:** THIS IS THE MONEY SLIDE. Spend time here. Point at each box: +1. *"Shift-JIS bytes come in"* +2. *"SJIS.so converts to UCS-4"* +3. *"UTF-8 converter turns UCS-4 into UTF-8"* +4. *"UTF-8 bytes come out"* + +Adding a new encoding = one .so that converts to/from UCS-4. People will photograph this. + +--- + + + +
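The four boxes of the pipeline can be imitated in a few lines, with Python's `utf-32-be` codec standing in for the UCS-4 pivot. This is an illustration of the idea only: `convert` is a hypothetical function, and Python's codecs are not gconv modules.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Two-step conversion with UTF-32 standing in for the UCS-4 pivot."""
    code_points = data.decode(source)            # step 0: source bytes -> code points
    pivot = code_points.encode("utf-32-be")      # explicit 4-bytes-per-character form
    return pivot.decode("utf-32-be").encode(target)  # step 1: pivot -> target bytes


if __name__ == "__main__":
    sjis = bytes.fromhex("938c8b9e")             # Shift-JIS bytes for 東京
    utf8 = convert(sjis, "shift_jis", "utf-8")
    print(utf8.hex(" "), utf8.decode("utf-8"))
```

Neither step knows about the other encoding; each one only speaks to the pivot, which is what lets a new encoding be added as a single module.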

Inside glibc

+ +### The iconv flow + +
+
+sequenceDiagram
+    participant App as Your Code
+    participant glibc as glibc internals
+    App->>glibc: iconv_open("UTF-8", "SJIS")
+    Note right of glibc: look up gconv-modules
+    Note right of glibc: load SJIS.so + UTF-8
+    Note right of glibc: build step chain
+    glibc-->>App: return descriptor
+    App->>glibc: iconv(cd, &in, ...)
+    Note right of glibc: step[0]: SJIS → UCS-4
+    Note right of glibc: step[1]: UCS-4 → UTF-8
+    glibc-->>App: advance pointers
+    App->>glibc: iconv_close(cd)
+    Note right of glibc: free chain, unload modules
+
+
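The open, convert, close sequence in the diagram has a loose analogue in Python's incremental decoders, which is a convenient way to see why a single conversion call is not always enough. This is an analogy, not the iconv API: `convert_stream` is a hypothetical function, and the incomplete-sequence handling below mirrors the EINVAL case only in spirit.

```python
import codecs


def convert_stream(chunks, source: str, target: str) -> bytes:
    """Convert chunk by chunk, carrying incomplete sequences across calls."""
    decoder = codecs.getincrementaldecoder(source)()   # plays the role of iconv_open
    out = bytearray()
    for chunk in chunks:
        # One call per chunk; a trailing partial character stays buffered.
        out += decoder.decode(chunk).encode(target)
    # Flushing with final=True raises if the input ended mid-character.
    out += decoder.decode(b"", final=True).encode(target)
    return bytes(out)


if __name__ == "__main__":
    sjis = bytes.fromhex("938c8b9e")
    # Split inside the first two-byte character on purpose.
    result = convert_stream([sjis[:1], sjis[1:]], "shift_jis", "utf-8")
    print(result.decode("utf-8"))
```

With the C API the caller does this bookkeeping itself: after each `iconv` call it checks `errno`, keeps any unconsumed input, and calls again.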
+ +Three calls. That's the entire API. + + +Note: +The API in three calls: +1. **iconv_open** → returns descriptor (pointer to gconv_info struct with step chain) +2. **iconv** → walks the chain. Both in/out pointers advance. Errors: **EILSEQ** (illegal sequence), **E2BIG** (output buffer full — flush and retry, not a real error), **EINVAL** (incomplete sequence) +3. **iconv_close** → free chain, unload modules + +- *Highlight:* E2BIG is the #1 mistake — people call iconv once and assume it's done + +--- + + + + + + + +Part 4 + +## Does this still matter? + +Note: +**Do:** *"Modern languages have Unicode strings by default. So why should anyone care about iconv in 2026?"* + +--- + + + +

Relevance today

+ +### How modern languages handle encoding + +| Language | Strings are... | Encoding conversion | +|----------|----------------|---------------------| +| **Python 3** | Unicode internally | Built-in codecs | +| **Go** | UTF-8 by definition | `golang.org/x/text` | +| **Rust** | Always valid UTF-8 | `encoding_rs` crate | +| **Java** | UTF-16 internally | `java.nio.charset` | +| **C/C++** | Just bytes — no encoding | **`iconv`** | + +Modern languages solved this by making strings Unicode-native. C didn't — and can't, because it would break 50 years of code. + + +Note: +- C can't change because `char = 1 byte` is baked into the language spec and **ABI** (Application Binary Interface) +- Even modern languages need encoding conversion at **I/O boundaries** — files, sockets, C library calls via **FFI** (Foreign Function Interface) +- Python's codecs, Go's x/text, Rust's encoding_rs all exist because the outside world isn't always UTF-8 + +--- + + + +

Relevance today

+ +### Encoding bugs are alive and well + +
+
+
+#### The Turkish İ problem
+
+| Locale | `toupper('i')` |
+|--------|----------------|
+| en_US | I |
+| tr_TR | İ (dotted!) |
+
+Tests pass in English, break in Turkish.
+
+
+
+
+#### `//IGNORE` inconsistency
+
+```bash
+$ echo 'héllo' | iconv \
+    -f UTF-8 -t ASCII//IGNORE
+```
+
+Some modules skip the bad byte. Some stop with an error.
+**Same flag, different behavior.**
+
+
+
+ +
+ +Every time a language reads a file, parses a socket, or calls a C library — encoding conversion still happens. These bugs still bite. + + +Note: +**Turkish İ:** +- Turkish has 4 i's: i, İ, ı, I. toupper('i') → İ (U+0130), not I +- Any case-insensitive comparison using toupper/tolower is locale-dependent + +**//IGNORE:** +- Behavior depends on *which* gconv module runs — inconsistent across encodings +- This is a real unfixed glibc bug. This is what got me into the codebase + +--- + + + + + + + +### glibc Development Workshop — Third Edition + +Led by **Arjun Shankar** (Red Hat, glibc developer) + +Tomorrow, Friday June 19 · 10:15 AM · Room A218 + +Pick a bug, get a cheat sheet, ship a patch. + +6 patches in 2024 · 15+ in 2025 · **yours in 2026?** + +Note: +**Do:** Tell the personal story: +*"Two years ago I walked into this workshop at DevConf. Arjun gave me a small iconv task. I got curious, fell down the rabbit hole, and that became this talk. That one task turned into 14 patches in glibc."* + +- **Arjun Shankar** = Red Hat engineer, glibc developer. Runs this workshop yearly at DevConf.CZ +- Format: show up, get a cheat sheet with a small bug + pointers, experienced contributors help you submit +- Room A218, capacity 20. First come, first served +- *"If anything in this talk made you curious, room A218 tomorrow morning."* + +--- + + + + + + + +### Questions? · Resources + +- **Joel Spolsky** — "The Absolute Minimum Every Software Developer Must Know About Unicode" +- **GNU C Library Manual** — "Character Set Handling" chapter +- **unicode.org** — the specification + +avinal.space · @avinal + +Attendance at DevConf.CZ 2026 was supported by the **[GNU Toolchain Fund](https://my.fsf.org/civicrm/contribute/transact?reset=1&id=57)**, a part of the FSF's Working Together for Free Software Fund. + + +Note: +**Do:** Leave this up during Q&A. 
+ +- Joel Spolsky's article (2003) — the classic intro, entertaining +- glibc manual — authoritative API reference (sourceware.org/glibc/manual) +- **GNU Toolchain Fund** = part of the **FSF's** (Free Software Foundation) "Working Together for Free Software" fund