Add Reveal.js slides for DevConf.CZ 2026 talk

"Lost in Transliteration: Why strlen("Dvořák") Returns 8"
Scheduled June 18, 2026 at 10:15 in room E104.

- Self-contained Reveal.js 5.1.0 deck loaded from CDN
- Markdown-based slides (slides.md) with HTML shell (index.html)
- IBM Carbon Design System theme with custom syntax highlighting
- Mermaid diagrams for gconv pipeline and iconv flow
- Speaker notes with full forms, translations, and delivery instructions
- Served at /talks/devconf-2026/ as a static page

Signed-off-by: Avinal Kumar <avinal.xlvii@gmail.com>
2026-07-03 23:30:09 +05:30 · 2026-06-16 13:28:38 +05:30
parent 63ab0e12b2
commit 96ea6019ae
2 changed files with 1265 additions and 0 deletions
@@ -0,0 +1,342 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>Lost in Transliteration: Why strlen("Dvořák") Returns 8</title>
  <meta name="description" content="DevConf.CZ 2026 talk by Avinal Kumar — character encoding, Unicode, and glibc's iconv internals" />
  <meta name="author" content="Avinal Kumar" />
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/reveal.min.css" />
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/theme/black.min.css" id="theme" />
  <style>
    @import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Sans:wght@300;400;500;600;700&family=IBM+Plex+Mono:wght@400;500;600;700&display=swap');
    /* ============================================
       IBM Carbon Design System — Color Tokens
       ============================================ */
    :root {
      /* Custom dark gray background (#2f2f2f is not a Carbon token; it sits between Gray 80 and Gray 90) */
      --r-background-color: #2f2f2f;
      /* Carbon typography */
      --r-main-font: 'IBM Plex Sans', system-ui, sans-serif;
      --r-main-font-size: 34px;
      --r-heading-font: 'IBM Plex Sans', system-ui, sans-serif;
      --r-heading-color: #f4f4f4;
      --r-heading-font-weight: 600;
      --r-main-color: #c6c6c6;
      --r-link-color: #78a9ff;
      --r-link-color-hover: #a6c8ff;
      --r-code-font: 'IBM Plex Mono', monospace;
      --r-heading-text-transform: none;
      --r-heading-letter-spacing: -0.01em;
      /* Carbon color palette */
      --carbon-blue-40: #78a9ff;
      --carbon-blue-60: #0f62fe;
      --carbon-purple-40: #be95ff;
      --carbon-teal-20: #9ef0f0;
      --carbon-teal-40: #08bdba;
      --carbon-magenta-40: #ff7eb6;
      --carbon-red-40: #ff8389;
      --carbon-green-40: #42be65;
      --carbon-yellow-30: #f1c21b;
      --carbon-gray-10: #f4f4f4;
      --carbon-gray-30: #c6c6c6;
      --carbon-gray-50: #8d8d8d;
      --carbon-gray-60: #6f6f6f;
      --carbon-gray-70: #525252;
      --carbon-gray-80: #393939;
      --carbon-gray-90: #262626;
      --carbon-gray-100: #161616;
    }
    .reveal {
      font-weight: 400;
      letter-spacing: 0;
    }
    .reveal h1, .reveal h2, .reveal h3 {
      line-height: 1.2;
      margin-bottom: 0.6em;
    }
    .reveal h2 { font-size: 2em; font-weight: 600; }
    .reveal h3 { font-size: 1.4em; font-weight: 600; }
    /* ---- Code blocks: Carbon snippet style ---- */
    .reveal pre {
      width: 100%;
      font-size: 0.52em;
      box-shadow: none;
      border-radius: 0;
      border: none;
      background: var(--carbon-gray-100);
    }
    .reveal pre code {
      padding: 1.2em 1.4em;
      border-radius: 0;
      max-height: 480px;
      line-height: 1.65;
      background: var(--carbon-gray-100);
      color: #fff;
      font-weight: 400;
    }
    .reveal code {
      font-family: var(--r-code-font);
      font-weight: 500;
    }
    .reveal p code, .reveal li code {
      background: var(--carbon-gray-80);
      border: none;
      padding: 0.15em 0.45em;
      border-radius: 0;
      font-size: 0.88em;
      color: var(--carbon-magenta-40);
    }
    /* ---- Carbon syntax highlighting (overrides highlight.js) ---- */
    .reveal pre code .hljs-keyword,
    .reveal pre code .hljs-type,
    .reveal pre code .hljs-built_in { color: var(--carbon-purple-40); }
    .reveal pre code .hljs-string,
    .reveal pre code .hljs-doctag { color: var(--carbon-magenta-40); }
    .reveal pre code .hljs-number,
    .reveal pre code .hljs-literal { color: var(--carbon-blue-40); }
    .reveal pre code .hljs-comment { color: var(--carbon-gray-60); font-style: normal; }
    .reveal pre code .hljs-function,
    .reveal pre code .hljs-title { color: var(--carbon-teal-20); }
    .reveal pre code .hljs-variable,
    .reveal pre code .hljs-attr { color: #fff; }
    .reveal pre code .hljs-params { color: var(--carbon-gray-30); }
    .reveal pre code .hljs-meta,
    .reveal pre code .hljs-preprocessor { color: #569CD6; }
    .reveal pre code .hljs-regexp { color: #D16969; }
    .reveal pre code .hljs-symbol,
    .reveal pre code .hljs-template-variable { color: var(--carbon-red-40); }
    .hljs { background: var(--carbon-gray-100); color: #fff; }
    /* ---- Utility classes: Carbon palette ---- */
    .reveal .dim { opacity: 0.45; }
    .reveal .accent { color: var(--carbon-blue-40); }
    .reveal .green { color: var(--carbon-green-40); }
    .reveal .yellow { color: var(--carbon-yellow-30); }
    .reveal .orange { color: #f0883e; }
    .reveal .red { color: var(--carbon-red-40); }
    .reveal .purple { color: var(--carbon-purple-40); }
    .reveal .teal { color: var(--carbon-teal-20); }
    .reveal .magenta { color: var(--carbon-magenta-40); }
    .reveal .big { font-size: 1.6em; font-weight: 600; letter-spacing: -0.02em; }
    .reveal .medium { font-size: 1.15em; font-weight: 500; }
    .reveal .small { font-size: 0.7em; }
    .reveal .tiny {
      font-size: 0.45em;
      color: var(--carbon-gray-60);
      font-family: var(--r-code-font);
      letter-spacing: 0.02em;
    }
    /* ---- Tables ---- */
    .reveal table { font-size: 0.72em; border-collapse: collapse; border-spacing: 0; }
    .reveal table th {
      color: var(--carbon-gray-10);
      font-weight: 600;
      background: var(--carbon-gray-80);
      padding: 0.6em 1em;
      border-bottom: 2px solid var(--carbon-gray-70);
      text-align: left;
    }
    .reveal table td {
      padding: 0.5em 1em;
      border-bottom: 1px solid var(--carbon-gray-70);
    }
    .reveal table tr:hover td { background: rgba(255,255,255,0.04); }
    /* ---- Custom blocks: Carbon surface style ---- */
    .reveal .hex-display {
      font-family: var(--r-code-font);
      font-size: 0.62em;
      background: var(--carbon-gray-100);
      padding: 1em 1.4em;
      border-radius: 0;
      border: none;
      display: inline-block;
      line-height: 1.9;
      color: #fff;
    }
    .reveal .diagram {
      background: var(--carbon-gray-100);
      border: none;
      border-radius: 0;
      padding: 1.2em 1.4em;
      font-family: var(--r-code-font);
      font-size: 0.58em;
      line-height: 1.7;
      color: #fff;
    }
    /* ---- Section label ---- */
    .reveal .slide-title {
      font-size: 0.45em;
      color: var(--carbon-blue-40);
      text-transform: uppercase;
      letter-spacing: 0.2em;
      font-weight: 600;
      margin-bottom: 0.3em;
      opacity: 0.8;
    }
    /* ---- Blockquotes ---- */
    .reveal blockquote {
      background: var(--carbon-gray-80);
      border-left: 4px solid var(--carbon-blue-60);
      padding: 0.8em 1.2em;
      font-style: italic;
      width: 85%;
      border-radius: 0;
    }
    /* ---- Grid layouts ---- */
    .reveal .two-col {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 2em;
      text-align: left;
    }
    .reveal .three-col {
      display: grid;
      grid-template-columns: 1fr 1fr 1fr;
      gap: 1.5em;
      text-align: left;
      font-size: 0.8em;
    }
    /* ---- Cards: Carbon tile style ---- */
    .reveal .card {
      background: var(--carbon-gray-80);
      border: none;
      border-radius: 0;
      padding: 1.2em;
    }
    .reveal .card h4 {
      margin-bottom: 0.5em;
    }
    /* ---- Lists ---- */
    .reveal ul, .reveal ol { display: block; }
    .reveal li {
      margin-bottom: 0.5em;
      line-height: 1.5;
    }
    .reveal ul li::marker { color: var(--carbon-blue-40); }
    .reveal ol li::marker { color: var(--carbon-blue-40); font-weight: 600; }
    /* ---- Glow effects for emphasis ---- */
    .reveal .glow-blue {
      text-shadow: 0 0 40px rgba(120,169,255,0.4), 0 0 80px rgba(120,169,255,0.15);
      color: var(--carbon-blue-40);
    }
    .reveal .glow-red {
      text-shadow: 0 0 40px rgba(255,131,137,0.4), 0 0 80px rgba(255,131,137,0.15);
      color: var(--carbon-red-40);
    }
    .reveal .glow-green {
      text-shadow: 0 0 40px rgba(66,190,101,0.4), 0 0 80px rgba(66,190,101,0.15);
      color: var(--carbon-green-40);
    }
    /* ---- Badges: Carbon tag style ---- */
    .reveal .badge {
      display: inline-block;
      font-size: 0.55em;
      font-weight: 500;
      padding: 0.15em 0.7em;
      border-radius: 0;
      text-transform: uppercase;
      letter-spacing: 0.06em;
    }
    .reveal .badge-blue { background: rgba(120,169,255,0.2); color: var(--carbon-blue-40); }
    .reveal .badge-red { background: rgba(255,131,137,0.2); color: var(--carbon-red-40); }
    .reveal .badge-green { background: rgba(66,190,101,0.2); color: var(--carbon-green-40); }
    .reveal .badge-yellow { background: rgba(241,194,27,0.2); color: var(--carbon-yellow-30); }
    .reveal .badge-purple { background: rgba(190,149,255,0.2); color: var(--carbon-purple-40); }
    /* ---- HR ---- */
    .reveal hr {
      border: none;
      height: 1px;
      background: var(--carbon-gray-70);
      margin: 1em 0;
    }
    /* ---- Progress bar ---- */
    .reveal .progress span { background: var(--carbon-blue-60); }
    /* ---- Auto-animate transitions ---- */
    .reveal [data-auto-animate] .hex-display,
    .reveal [data-auto-animate] .diagram,
    .reveal [data-auto-animate] pre {
      transition: all 0.6s ease;
    }
    /* ---- Slide number ---- */
    .reveal .slide-number {
      font-family: var(--r-code-font);
      font-size: 0.5em;
      color: var(--carbon-gray-60);
    }
  </style>
 </head>
 <body>
  <div class="reveal">
    <div class="slides">
      <section
        data-markdown="slides.md"
        data-separator="^---$"
        data-separator-vertical="^--$"
        data-separator-notes="^Note:"
        data-charset="utf-8">
      </section>
    </div>
  </div>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/reveal.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/plugin/markdown/markdown.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/plugin/notes/notes.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/plugin/highlight/highlight.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/5.1.0/plugin/zoom/zoom.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js-mermaid-plugin@11.15.0/plugin/mermaid/mermaid.js"></script>
  <script>
    Reveal.initialize({
      mermaid: {
        theme: 'dark',
        themeVariables: {
          darkMode: true,
          background: '#2f2f2f',
          primaryColor: '#393939',
          primaryTextColor: '#c6c6c6',
          primaryBorderColor: '#525252',
          lineColor: '#78a9ff',
          secondaryColor: '#262626',
          tertiaryColor: '#161616',
          fontFamily: "'IBM Plex Sans', system-ui, sans-serif",
          fontSize: '18px',
        },
      },
      hash: true,
      slideNumber: 'c/t',
      showSlideNumber: 'speaker',
      transition: 'fade',
      transitionSpeed: 'default',
      backgroundTransition: 'fade',
      center: true,
      width: 1280,
      height: 720,
      margin: 0.08,
      autoAnimateEasing: 'ease-in-out',
      autoAnimateDuration: 0.8,
      autoAnimateUnmatched: true,
      zoomKey: 'alt',
      plugins: [RevealMarkdown, RevealHighlight, RevealNotes, RevealZoom, RevealMermaid],
    });
  </script>
 </body>
 </html>
@@ -0,0 +1,923 @@
 <!-- ===================================================== -->
 <!-- SECTION 1: THE PROBLEM (mystery opening)              -->
 <!-- ===================================================== -->
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 ```bash
 $ printf 'Dvořák' | wc -c
 ```
 <!-- .element: data-id="mystery-code" -->
 Note:
 **Do:** Walk on stage, put terminal on screen, no output yet. Pause 3-4 seconds. Ask: "What do you think this prints?"
 - **wc** = word count; **-c** = count bytes (not characters)
 - Dvořák = Czech composer surname, pronounced "DVOR-zhahk"
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 ```bash
 $ printf 'Dvořák' | wc -c
 8
 ```
 <!-- .element: data-id="mystery-code" -->
 Note:
 **Do:** Reveal the 8. Pause. Dvořák has 6 visible letters — why 8? Don't explain yet.
 - wc -c counts bytes, not characters — this is POSIX behavior, not a bug
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 ```bash
 $ printf 'Dvořák' | wc -c
 8
 ```
 <!-- .element: data-id="mystery-code" -->
 <br />
 How many people think this is **wrong**?
 <!-- .element: class="medium" -->
 Note:
 **Do:** Ask the question. Wait 5 seconds. Let hands go up. Do NOT answer yet.
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 ```bash
 $ printf 'Dvořák' | wc -c
 8
 $ python3 -c "print(len('Dvořák'))"
 6
 ```
 <!-- .element: data-id="mystery-code" -->
 Note:
 Two different answers for the same string. Let the confusion build.
 - Python 3 len() counts Unicode code points, not bytes
 - *Exception:* Python 2 len() counted bytes — this changed in 2→3
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 ```bash
 $ printf '😀' | wc -c
 4
 $ python3 -c "print(len('😀'))"
 1
 ```
 <!-- .element: data-id="mystery-code" -->
 Note:
 An emoji: 4 bytes vs 1 character.
 - 😀 = U+1F600 "Grinning Face." Needs 4 bytes in UTF-8 (F0 9F 98 80) because it's above the **BMP** (Basic Multilingual Plane, U+0000–U+FFFF)
 - *Exception:* `echo` appends a newline, which `wc -c` counts as one extra byte, and `echo -n` is not portable across shells. Use `printf` to avoid the off-by-one
 --
 <!-- .slide: data-background-color="#2f2f2f" -->
 Which one is **correct**?
 <!-- .element: class="big" -->
 All of them.
 <!-- .element: class="fragment zoom-in glow-blue big" -->
 Understanding why is basically the entire talk.
 <!-- .element: class="fragment fade-up small dim" -->
 Note:
 **Do:** Pause before "All of them." Then: *"They're counting different things. wc counts bytes. Python counts code points. Both correct."*
 **Key thesis:** bytes ≠ characters ≠ code points
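Backup for Q&A: the thesis can be checked in a few lines of Python. This is a minimal sketch with illustrative variable names. The NFD case is an extra that is not on the slides: it stores the same visible text as base letters plus combining marks, which is why "characters" and "code points" are not the same count either.

```python
import unicodedata

# "Dvořák" written with escapes so the source file's encoding cannot interfere.
name = "Dvo\u0159\u00e1k"

utf8_bytes = len(name.encode("utf-8"))   # what wc -c and strlen count
code_points = len(name)                  # what Python 3 len() counts

# NFD splits each accented letter into a base letter plus a combining mark,
# so the same six visible letters now take more code points.
decomposed = unicodedata.normalize("NFD", name)
decomposed_code_points = len(decomposed)

print(utf8_bytes, code_points, decomposed_code_points)
```

Three different numbers can all describe the same six visible letters, depending on what is being counted.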
 ---
 <!-- ===================================================== -->
 <!-- SECTION 2: INTRODUCTION                               -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="zoom" -->
 ## Lost in Transliteration
 Why `strlen("Dvořák")` Returns **8**
 <!-- .element: class="medium" style="opacity: 0.9" -->
 <br />
 Avinal Kumar · glibc contributor
 <!-- .element: style="font-weight: 500" -->
 <span class="badge badge-blue">DevConf.CZ 2026</span>
 <!-- .element: class="small dim" -->
 Note:
 **Do:** Brief intro, under 30 seconds:
 *"I'm Avinal. I contribute to glibc — the GNU C Library. I got into character encodings through an iconv bug at the glibc workshop here at DevConf. Today I'll take you through that journey."*
 - **glibc** = GNU C Library — the standard C library on most Linux distros
 - **iconv** = POSIX API for converting text between character encodings
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 ### Today we'll answer
 1. Why does `strlen("Dvořák")` return 8?
 2. Why does Unicode exist?
 3. How does the C library handle text?
 4. How does `iconv` convert between encodings?
 5. Does any of this still matter in 2026?
 Note:
 **Do:** Read out loud. Give the audience a roadmap. Don't linger.
 - **strlen** = "string length" — counts bytes before the null terminator, NOT characters
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="fade" -->
 <span class="glow-blue big">There is no such thing as plain text.</span>
 <br />
 If you remember one thing from this talk, remember that sentence.
 <!-- .element: class="fragment fade-up small dim" -->
 Note:
 **Do:** Say this slowly. Pause. *"If you remember one thing, remember that sentence."*
 - "Plain text" implies no encoding — but every byte sequence *has* an encoding. If you don't know it, you're guessing. Wrong guess = **mojibake** (文字化け, Japanese for garbled text, pronounced "mo-ji-ba-keh")
 ---
 <!-- ===================================================== -->
 <!-- SECTION 3: HISTORY                                    -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">How we ended up with this mess</p>
 ### ASCII: The 7-bit world
 <div class="two-col">
 <div>
 - 128 characters (0–127)
 - 7 bits per character
 - English letters, digits, punctuation
 - Bit 8 was "spare"
 </div>
 <div>
 ```text
 0x41 = A
 0x61 = a
 0x30 = 0
 0x20 = (space)
 0x0A = (newline)
 ```
 </div>
 </div>
 *"And all was good — if you spoke English."*
 <!-- .element: class="fragment fade-up" -->
 Note:
 - **ASCII** = American Standard Code for Information Interchange (1963)
 - 7 bits = 128 values. The 8th bit was for parity checking on noisy telegraph lines
 - Only covers English — no accented chars, no Cyrillic, no CJK, no Arabic
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">How we ended up with this mess</p>
 ### Code Pages: Everyone fills bit 8 differently
 If I send byte `0xE9` from Paris to Moscow, what character arrives?
 <!-- .element: class="medium" -->
 | Byte | CP-1252 (Western) | CP-866 (Russian) | CP-862 (Hebrew) |
 |------|-------------------|-------------------|------------------|
 | `0xE9` | é | щ | Θ |
 | `0xC4` | Ä | ─ | ─ |
 | `0xF1` | ñ | ё | ± |
 <!-- .element: class="fragment fade-in" -->
 CJK needed **thousands** — multi-byte encodings (Shift-JIS, EUC-KR, GB2312) where you can't even move backward in a string.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 **Do:** Ask *"If I send byte 0xE9 from Paris to Moscow, what character arrives?"* before revealing the table.
 - **CP** = Code Page. CP-1252 = Windows Western. CP-866 = DOS Russian. CP-862 = DOS Hebrew
 - Same byte, different characters — the bytes are correct, the *interpretation* is wrong
 - **CJK** = Chinese, Japanese, Korean
 - **Shift-JIS** = Shift Japanese Industrial Standards. **EUC-KR** = Extended Unix Code for Korean. **GB2312** = Chinese National Standard
 - *Exception:* Multi-byte encodings have a "forward-only" problem — you can't tell if a byte is byte 1 or byte 2 of a character
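If someone asks how to verify the table, Python ships decoders for these code pages. A minimal sketch, assuming the built-in codec names `cp1252`, `cp866`, and `cp862`:

```python
# One byte, three legacy code pages. The codec names are Python's built-in aliases.
raw = bytes([0xE9])

decoded = {codec: raw.decode(codec) for codec in ("cp1252", "cp866", "cp862")}

for codec, char in decoded.items():
    # Print the code point next to the glyph, since some glyphs look alike.
    print(f"{codec}: U+{ord(char):04X} {char}")
```

The byte never changes; only the table used to interpret it does.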
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">How we ended up with this mess</p>
 ### Unicode: One number per character
 ```text
 U+0041 = A      U+00E9 = é      U+010D = č
 U+0639 = ع      U+4E16 = 世      U+1F600 = 😀
 ```
 - Code points are **abstract numbers**, not bytes <!-- .element: class="fragment fade-up" -->
 - <span class="red">Not</span> "16-bit characters" — that's the myth <!-- .element: class="fragment fade-up" -->
 - 154,998 characters across 168 scripts (Unicode 16.0) <!-- .element: class="fragment fade-up" -->
 Unicode separated the *idea* of a character from how it's stored.
 <!-- .element: class="fragment zoom-in accent" -->
 Note:
 - **Unicode** = the Unicode Standard, first published in 1991 by the Unicode Consortium and kept in sync with ISO/IEC 10646, the Universal Coded Character Set (UCS)
 - Code points are abstract numbers — how you *store* them is a separate question (that's what encodings answer)
 - *Exception:* The "Unicode is 16-bit" myth comes from Unicode 1.0 (1991), which planned for only 65,536 characters. Unicode 2.0 (1996) expanded beyond 16 bits. Java and Windows had already adopted the 16-bit UCS-2 form by then; they later moved to UTF-16 with surrogate pairs and are now stuck with it
 - **BMP** = Basic Multilingual Plane (U+0000–U+FFFF). Characters above it (emoji, rare scripts) are in supplementary planes
 ---
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">How we ended up with this mess</p>
 ### Encodings: Serialization formats
 <div class="three-col">
 <div class="card fragment fade-up" data-fragment-index="1">
 <h4 class="glow-blue">UTF-8</h4>
 - 1–4 bytes
 - ASCII-compatible
 - <span class="badge badge-blue">98% of the web</span>
 </div>
 <div class="card fragment fade-up" data-fragment-index="2">
 <h4 class="yellow">UTF-16</h4>
 - 2 or 4 bytes
 - Byte order matters (BOM or explicit LE/BE)
 - <span class="badge badge-yellow">Windows, Java</span>
 </div>
 <div class="card fragment fade-up" data-fragment-index="3">
 <h4 class="green">UTF-32</h4>
 - Fixed 4 bytes
 - Simple but wasteful
 - <span class="badge badge-green">glibc internal</span>
 </div>
 </div>
 Note:
 - **UTF** = Unicode Transformation Format
 - **UTF-8:** Designed 1992 by Ken Thompson & Rob Pike. ASCII bytes are identical — this is why it won. 98.2% of websites (W3Techs, 2024)
 - **UTF-16:** Uses surrogate pairs above U+FFFF. **BOM** = Byte Order Mark (U+FEFF) — indicates endianness
 - **UTF-32:** Also called **UCS-4** (Universal Coded Character Set, 4-byte). "hello" = 20 bytes instead of 5
 - *Exception:* UTF-32 and UCS-4 are technically from different standards (ISO 10646 vs Unicode), but identical in practice
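For anyone who wants to see where "1–4 bytes" comes from, the UTF-8 bit layout fits in one small function. This is an illustrative sketch (the name `utf8_encode` is made up for these notes); real code should call the built-in `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0x159, 0x4E16, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

ASCII code points take the first branch unchanged, which is the ASCII compatibility mentioned on the UTF-8 card.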
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 <p class="slide-title">How we ended up with this mess</p>
 ### Encodings: Serialization formats
 <div class="hex-display" data-id="encoding-hex">
 "Dvořák" in UTF-8: &nbsp;44 76 6F <span class="red" style="font-weight:700;">C5 99</span> C3 A1 6B &nbsp;&nbsp;&nbsp;<span class="badge badge-blue">8 bytes</span><br />
 "Dvořák" in UTF-32: 00000044 00000076 0000006F <span class="red" style="font-weight:700;">00000159</span> 000000E1 0000006B &nbsp;<span class="badge badge-green">24 bytes</span>
 </div>
 <span class="glow-blue big">There is no such thing as plain text.</span>
 <!-- .element: class="fragment zoom-in" -->
 Note:
 UTF-8 breakdown:
 - D, v, o, k = 1 byte each (ASCII range)
 - ř = C5 99 (2 bytes, U+0159)
 - á = C3 A1 (2 bytes, U+00E1)
 - Total: 4×1 + 2×2 = **8 bytes** for 6 characters
 UTF-32: every char = 4 bytes → 6×4 = **24 bytes**. Same string, 3× the size.
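The hex rows on this slide can be reproduced with Python's built-in codecs. A small sketch; `utf-32-be` is chosen here only so the output has no BOM and matches the big-endian layout shown on the slide.

```python
name = "Dvo\u0159\u00e1k"            # "Dvořák" with escapes for ř and á

utf8 = name.encode("utf-8")
utf32 = name.encode("utf-32-be")     # big-endian, no BOM

print(len(utf8), utf8.hex(" "))      # one group per byte
print(len(utf32), utf32.hex(" ", 4)) # one group per 4-byte code unit
```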
 ---
 <!-- ===================================================== -->
 <!-- SECTION 4: INTO C — real examples, not code           -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="zoom" -->
 <span class="badge badge-blue" style="font-size: 0.6em;">Part 2</span>
 ## Text in C: What actually happens
 Note:
 **Do:** *"Now we understand WHY bytes and characters differ. Let's see how C deals with it."*
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### C has two ways to see a string
 <div class="two-col">
 <div class="card">
 #### `char` — bytes
 - 1 byte per element, no encoding info
 - `strlen("Dvořák")` → **8**
 - `strlen("😀")` → **4**
 - Indexing gives you bytes, not characters
 </div>
 <div class="card">
 #### `wchar_t` — code points
 - 4 bytes on Linux, <span class="red">2 on Windows</span>
 - `wcslen(L"Dvořák")` → **6**
 - `wcslen(L"😀")` → **1**
 - Indexing gives you code points
 </div>
 </div>
 <br />
 `mbrtowc()` bridges between them. `setlocale()` tells it which encoding to expect.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 - **wchar_t** = "wide character type." Linux: 4 bytes (UCS-4). Windows: 2 bytes (UTF-16)
 - **wcslen** = "wide character string length"
 - **L"..."** prefix = wide string literal
 - **mbrtowc** = "multibyte restartable to wide character" — converts one multibyte char to one wchar_t
 - **setlocale** with LC_CTYPE tells mbrtowc the encoding. Without it → "C" locale = ASCII only
 - *Exception:* On Windows, wcslen(L"😀") returns **2** (surrogate pair), not 1
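The Linux versus Windows difference can be shown without a C compiler by counting code units. A hedged sketch: Python's `utf-32-le` and `utf-16-le` codecs stand in for the two `wchar_t` layouts, which is an approximation of what `wcslen` would see, not a call into the C library.

```python
grin = "\U0001F600"   # the grinning face emoji

# 4-byte wchar_t (Linux): one 32-bit unit per code point.
units_32 = len(grin.encode("utf-32-le")) // 4

# 2-byte wchar_t (Windows): code points above U+FFFF need a surrogate pair.
utf16 = grin.encode("utf-16-le")
units_16 = len(utf16) // 2
high, low = (int.from_bytes(utf16[i:i + 2], "little") for i in (0, 2))

print(units_32, units_16, hex(high), hex(low))
```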
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### What does "Dvořák" look like in memory?
 ```text
 Character:   D     v     o     ř        á        k
 UTF-8 hex:  44    76    6F   C5 99    C3 A1     6B
 Bytes:       1     1     1     2        2        1   =  8 bytes
 Code points: 1     1     1     1        1        1   =  6 characters
 ```
 `strlen` counts the `Bytes` row. `wcslen` counts the `Code points` row.
 <!-- .element: class="fragment fade-up small" -->
 Now you know why `strlen("Dvořák")` returns 8.
 <!-- .element: class="fragment fade-up accent" -->
 Note:
 **Do:** Point at the diagram: *"strlen counts bytes: 1+1+1+2+2+1 = 8. wcslen counts characters: always 1 each = 6. Both correct."*
 This is the answer to the opening mystery.
 ---
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 <p class="slide-title">Text in C</p>
 ### `iconv` — converting between encodings
 ```bash
 $ echo 'Dvořák' | iconv -f UTF-8 -t ASCII
 iconv: illegal input sequence at position 3
 ```
 <!-- .element: data-id="iconv-demo" -->
 Note:
 - **iconv** = both a C API (iconv_open/iconv/iconv_close in `<iconv.h>`) and a CLI tool
 - **-f** = from, **-t** = to
 - Position 3 = 4th byte (0-indexed) = where ř starts. ASCII only has 0–127; C5 = 197 → fails
 - **EILSEQ** = "illegal sequence" errno value
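Why position 3? The reported offset is simply the first byte that falls outside ASCII. A minimal sketch that finds it by hand (this mirrors the reasoning above, not glibc's actual code path):

```python
data = "Dvo\u0159\u00e1k\n".encode("utf-8")   # the bytes echo sends down the pipe

# ASCII only covers 0..127, so the first byte above that is where conversion stops.
first_bad = next(i for i, b in enumerate(data) if b > 0x7F)

print(first_bad, hex(data[first_bad]))
```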
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 <p class="slide-title">Text in C</p>
 ### `iconv` — converting between encodings
 ```bash
 $ echo 'Dvořák' | iconv -f UTF-8 -t ASCII
 iconv: illegal input sequence at position 3
 $ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//TRANSLIT
 Dvorak
 $ echo 'Dvořák' | iconv -f UTF-8 -t ASCII//IGNORE
 Dvok
 ```
 <!-- .element: data-id="iconv-demo" -->
 - **`//TRANSLIT`** — approximate: ř→r, á→a
 - **`//IGNORE`** — drop what doesn't fit
 Note:
 - **//TRANSLIT** = transliteration. Appended to target encoding. Finds closest match: ř→r, á→a, ö→o, ñ→n
 - **//IGNORE** = silently drop unconvertible chars. Notice "Dvok" — both ř AND á dropped
 - *Exception:* //TRANSLIT originated as a GNU extension (glibc and GNU libiconv), so don't rely on it everywhere. musl libc (Alpine Linux) doesn't support it
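For audience members without glibc at hand, both behaviours can be approximated in Python. This is a rough stand-in, not how glibc does it: glibc consults locale transliteration tables, while the sketch below only decomposes letters and drops the combining marks, so it misses characters that do not decompose (ø, ß, and so on).

```python
import unicodedata

name = "Dvo\u0159\u00e1k"

# Rough stand-in for //IGNORE: drop anything ASCII cannot represent.
ignored = name.encode("ascii", errors="ignore").decode("ascii")

# Rough stand-in for //TRANSLIT: decompose, then drop the combining marks.
decomposed = unicodedata.normalize("NFD", name)
translit = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ignored, translit)
```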
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### Real encoding pairs from across the world
 ```bash
 $ echo '東京' | iconv -f UTF-8 -t SHIFT_JIS | hexdump -C
 00000000  93 8c 8b 9e 0a                          |.....|
 $ echo 'こんにちは世界' | iconv -f UTF-8 -t EUC-JP | hexdump -C
 00000000  a4 b3 a4 f3 a4 cb a4 c1  a4 cf c0 a4 b3 a6 0a  |...............|
 $ echo 'Ελληνικά κείμενο' | iconv -f UTF-8 -t ISO-8859-7 | hexdump -C
 00000000  c5 eb eb e7 ed e9 ea dc  20 ea e5 df ec e5 ed ef  |........ .......|
 ```
 Same characters, completely different bytes — depending on the encoding.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 - 東京 = Tōkyō (Tokyo)
 - こんにちは世界 = "Konnichiwa Sekai" = "Hello World"
 - Ελληνικά κείμενο = "Elliniká keímeno" = "Greek text"
 - **hexdump -C** = canonical hex+ASCII dump. Non-ASCII shows as dots
 - Same text in Shift-JIS vs EUC-JP → completely different bytes. Without knowing the encoding, unreadable
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### When conversion fails
 ```bash
 $ echo 'مرحبا' | iconv -f UTF-8 -t ISO-8859-1
 iconv: illegal input sequence at position 0
 $ echo 'Résumé' | iconv -f UTF-8 -t CP866
 iconv: illegal input sequence at position 1
 $ echo -ne '\xEF\xBB\xBFhello' | hexdump -C
 00000000  ef bb bf 68 65 6c 6c 6f                  |...hello|
 $ echo -ne '\xEF\xBB\xBFhello' | iconv -f UTF-8 -t ASCII//TRANSLIT
 hello
 ```
 - Arabic → Latin-1: impossible — the encoding can't hold it
 - French Résumé → Russian CP866: `é` doesn't exist in that code page
 - BOM: 3 invisible bytes at the start — your first "character" is garbage
 Note:
 - مرحبا = "marhaba" = "hello" in Arabic
 - **ISO-8859-1** = Latin-1. Zero Arabic chars → fails at position 0
 - **CP866** = DOS Cyrillic. é doesn't map → fails at position 1 (R is fine, é isn't)
 - **BOM** = Byte Order Mark (U+FEFF, encoded EF BB BF in UTF-8). Windows Notepad adds it. Breaks JSON parsers, shell shebangs, and string comparisons
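The BOM problem is easy to reproduce. A short sketch using Python's two UTF-8 codecs: plain `utf-8` keeps U+FEFF as the first code point, while `utf-8-sig` drops one leading BOM if it is present.

```python
raw = b"\xef\xbb\xbfhello"

plain = raw.decode("utf-8")          # keeps U+FEFF as the first code point
stripped = raw.decode("utf-8-sig")   # drops one leading BOM if present

print(repr(plain[0]), len(plain), len(stripped))
print(plain == "hello", stripped == "hello")
```

The comparison on the last line is the string-comparison breakage mentioned above: the text looks identical on screen but is one code point longer.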
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### Longer text, bigger difference
 ```bash
 $ printf 'Příliš žluťoučký kůň úpěl ďábelské ódy' | wc -c
 53
 $ python3 -c "print(len('Příliš žluťoučký kůň úpěl ďábelské ódy'))"
 38
 $ echo 'Příliš žluťoučký kůň úpěl ďábelské ódy' \
    | iconv -f UTF-8 -t ASCII//TRANSLIT
 Prilis zlutoucky kun upel dabelske ody
 ```
 A Czech pangram: **38 characters**, **53 bytes**, so about 40% more bytes than characters.
 <!-- .element: class="fragment fade-up" -->
 `//TRANSLIT` strips all diacritics and produces valid ASCII.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 - **Translation:** "Too yellow a horse groaned devilish odes" — a Czech pangram (like "The quick brown fox" but for testing diacritics)
 - 15 extra bytes from accented characters: each adds 1 byte in UTF-8
 - Czech diacritics: **háček** (ˇ) = caron (ř, š, č, ž, ň, ď, ť, ě), **čárka** (´) = acute (á, é, í, ó, ú, ý), **kroužek** (°) = ring (ů)
 - **Do:** DevConf is in Brno — the audience will recognize this pangram
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Text in C</p>
 ### How many encodings?
 ```bash
 $ iconv -l | wc -l
 1180
 $ find /usr/lib64/gconv -name '*.so' | wc -l
 253
 ```
 **1180** encoding names served by **253** shared libraries.
 <!-- .element: class="fragment fade-up" -->
 How does glibc manage this without writing thousands of converters?
 <!-- .element: class="fragment fade-up accent" -->
 Note:
 **Do:** LIVE DEMO if possible.
 - **iconv -l** = list all encodings. 1180 includes aliases (SHIFT-JIS, SJIS, MS_KANJI = same encoding)
 - **/usr/lib64/gconv/** = where glibc stores converter .so files (Fedora/RHEL). Debian: /usr/lib/x86_64-linux-gnu/gconv/
 - **.so** = shared object (dynamically loaded library)
 - 1180 names served by 253 plugins: far fewer than the 39,800 pairwise converters that even 200 encodings would need (N×(N-1), covered in Part 3)
 ---
 <!-- ===================================================== -->
 <!-- SECTION 5: HOW IT WORKS INSIDE                        -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="zoom" -->
 <span class="badge badge-blue" style="font-size: 0.6em;">Part 3</span>
 ## Inside glibc's iconv
 Note:
 **Do:** *"We've seen what iconv does from the outside. Now let's look under the hood."*
 - **gconv** = glibc's internal conversion framework ("g" = GNU, "conv" = conversion)
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Inside glibc</p>
 ### The naive approach: N×N converters
 Suppose I support 200 encodings. How many converters do I need?
 <!-- .element: class="medium" -->
 ```text
 Shift-JIS → UTF-8       UTF-8 → Shift-JIS
 Shift-JIS → EUC-KR      EUC-KR → Shift-JIS
 UTF-8 → EUC-KR          EUC-KR → UTF-8
 ...
 ```
 <!-- .element: class="fragment fade-in" -->
 5 encodings = 20 converters. 200 encodings?
 <!-- .element: class="fragment fade-up" -->
 200 × 199 = <span class="red">39,800 converters</span>. That's not going to work.
 <!-- .element: class="fragment zoom-in" -->
 Note:
 **Do:** Ask *"How many converters do I need?"* before revealing. Let them guess.
 - Formula: N × (N-1) for directed pairs
 - Nobody will write 39,800 converters
 ---
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 <p class="slide-title">Inside glibc</p>
 ### The smart approach: one universal pivot
 What if every encoding just learned to convert to **one common format**?
 <!-- .element: class="medium" -->
 ```text
 Shift-JIS  →  ???  →  UTF-8
 ```
 <!-- .element: data-id="hub-text" class="fragment fade-in" -->
 Note:
 Hub-and-spoke architecture — same principle as airline routing through hub airports.
 --
 <!-- .slide: data-auto-animate data-background-color="#2f2f2f" -->
 <p class="slide-title">Inside glibc</p>
 ### The smart approach: one universal pivot
 glibc's gconv framework uses an internal **UCS-4 based representation** as the pivot.
 ```text
 Shift-JIS  →  UCS-4  →  UTF-8
 ```
 <!-- .element: data-id="hub-text" -->
 Now you need just **2 converters per encoding** (to UCS-4 and from UCS-4).
 <!-- .element: class="fragment fade-up" -->
 200 encodings × 2 = <span class="green">400 converters</span> instead of 39,800.
 <!-- .element: class="fragment zoom-in" -->
 Note:
 - **UCS-4** = Universal Coded Character Set, 4-byte form (ISO 10646). Essentially UTF-32
 - glibc calls it **INTERNAL** in gconv-modules config
 - 2 converters per encoding → 400 total. 99% reduction
 - *Caveat:* the slide says "UCS-4 *based*" because INTERNAL is not identical to the `UCS-4` charset name: it is stored in host byte order, and stateful encodings keep extra shift state alongside it
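
A rough illustration of the pivot idea, using Python's built-in codecs rather than glibc itself. The `utf-32-be` codec stands in for UCS-4 as an assumption of this sketch, and `via_pivot` is a hypothetical helper name. It shows that every code point occupies four bytes in the pivot form, however the source or target encodes it:

```python
def via_pivot(data, src, dst):
    """Convert src -> UCS-4 (modelled here as UTF-32BE) -> dst."""
    ucs4 = data.decode(src).encode("utf-32-be")   # step 1: into the pivot
    out = ucs4.decode("utf-32-be").encode(dst)    # step 2: out of the pivot
    return ucs4, out

sjis = "日本語".encode("shift_jis")
ucs4, utf8 = via_pivot(sjis, "shift_jis", "utf-8")
print(len(sjis), len(ucs4), len(utf8))   # 6 12 9

n = 200
print(n * (n - 1), 2 * n)                # 39800 400
```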
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Inside glibc</p>
 ### The lookup table: `gconv-modules`
 <pre><code class="language-text" data-line-numbers data-ln-start-from="47"># iconvdata/gconv-modules
 #   from             to              module     cost
 module  ISO-8859-1// INTERNAL        ISO8859-1   1
 module  INTERNAL     ISO-8859-1//    ISO8859-1   1</code></pre>
 <pre><code class="language-text" data-line-numbers data-ln-start-from="415"># iconvdata/gconv-modules-extra.conf
 module  SJIS//       INTERNAL        SJIS        1
 module  INTERNAL     SJIS//          SJIS        1</code></pre>
 `INTERNAL` = the UCS-4 pivot
 <!-- .element: class="fragment fade-up accent" -->
 Each line maps an encoding to a `.so` plugin. `iconv_open` reads this file, loads the right plugins, and chains them.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 These are actual files from the glibc source tree.
 - Format: `module FROM// TO MODULE_NAME COST`
 - **INTERNAL** = glibc's name for UCS-4
 - **Cost** = routing weight when multiple paths exist (lower = preferred)
 - Each encoding has two `module` lines, one per direction (alternative names get separate `alias` lines). Hub-and-spoke in practice
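
To make the `module FROM// TO MODULE_NAME COST` format concrete, here is a minimal parsing sketch in Python. It is not how glibc reads the file; `parse_gconv_modules` is a hypothetical helper, and treating a missing cost as 1 is an assumption of this sketch:

```python
def parse_gconv_modules(text):
    """Return {(from, to): (module, cost)} for every `module` line."""
    routes = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, tokenize
        if len(fields) < 4 or fields[0] != "module":
            continue                             # skip blanks and alias lines
        src, dst, module = fields[1:4]
        cost = int(fields[4]) if len(fields) > 4 else 1   # assumed default
        routes[(src, dst)] = (module, cost)
    return routes

sample = """\
#   from             to              module     cost
module  ISO-8859-1// INTERNAL        ISO8859-1   1
module  INTERNAL     ISO-8859-1//    ISO8859-1   1
module  SJIS//       INTERNAL        SJIS        1
module  INTERNAL     SJIS//          SJIS        1
"""
routes = parse_gconv_modules(sample)
print(routes[("SJIS//", "INTERNAL")])   # ('SJIS', 1)
```

Every key has `INTERNAL` on one side, which is the hub-and-spoke shape in table form.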
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Inside glibc</p>
 ### The conversion pipeline
 <div class="mermaid">
 <pre>
 flowchart TB
    A["Shift-JIS bytes"] --> B["SJIS.so\n(gconv module)"]
    B --> C["UCS-4\n(internal pivot)"]
    C --> D["UTF-8 converter\n(built-in)"]
    D --> E["UTF-8 bytes"]
    style C fill:#0f62fe,stroke:#78a9ff,color:#fff
    style B fill:#393939,stroke:#78a9ff,color:#c6c6c6
    style D fill:#393939,stroke:#78a9ff,color:#c6c6c6
    style A fill:#262626,stroke:#525252,color:#f1c21b
    style E fill:#262626,stroke:#525252,color:#42be65
 </pre>
 </div>
 Adding a new encoding = writing **one** `.so` plugin.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 **Do:** THIS IS THE MONEY SLIDE. Spend time here. Point at each box:
 1. *"Shift-JIS bytes come in"*
 2. *"SJIS.so converts to UCS-4"*
 3. *"UTF-8 converter turns UCS-4 into UTF-8"*
 4. *"UTF-8 bytes come out"*
 Adding a new encoding = one .so that converts to/from UCS-4. People will photograph this.
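
The two-step chain can be modelled in a few lines of Python. This is a sketch under stated assumptions, not glibc's implementation: `MODULES`, `make_module`, and `convert` are hypothetical names, Python codecs stand in for the `.so` plugins, and a list of code points stands in for the UCS-4 based pivot:

```python
def make_module(codec):
    """Build the (to_pivot, from_pivot) pair for one encoding."""
    to_pivot = lambda data: [ord(ch) for ch in data.decode(codec)]
    from_pivot = lambda cps: "".join(map(chr, cps)).encode(codec)
    return to_pivot, from_pivot

# One registry entry per encoding, like one plugin per encoding.
MODULES = {name: make_module(name) for name in ("shift_jis", "utf-8")}

def convert(data, src, dst):
    steps = [MODULES[src][0], MODULES[dst][1]]   # src -> pivot -> dst
    for step in steps:
        data = step(data)
    return data

# Adding an encoding is one new entry, not one converter per existing peer.
MODULES["euc_kr"] = make_module("euc_kr")

sjis = "日本語".encode("shift_jis")
print(convert(sjis, "shift_jis", "utf-8") == "日本語".encode("utf-8"))   # True
```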
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Inside glibc</p>
 ### The iconv flow
 <div class="mermaid">
 <pre>
 sequenceDiagram
    participant App as Your Code
    participant glibc as glibc internals
    App->>glibc: iconv_open("UTF-8", "SJIS")
    Note right of glibc: look up gconv-modules
    Note right of glibc: load SJIS.so + UTF-8
    Note right of glibc: build step chain
    glibc-->>App: return descriptor
    App->>glibc: iconv(cd, &in, ...)
    Note right of glibc: step[0]: SJIS → UCS-4
    Note right of glibc: step[1]: UCS-4 → UTF-8
    glibc-->>App: advance pointers
    App->>glibc: iconv_close(cd)
    Note right of glibc: free chain, unload modules
 </pre>
 </div>
 Three calls. That's the entire API.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 The API in three calls:
 1. **iconv_open** → returns descriptor (pointer to gconv_info struct with step chain)
 2. **iconv** → walks the chain. Both in/out pointers advance. Errors: **EILSEQ** (illegal sequence), **E2BIG** (output buffer full — flush and retry, not a real error), **EINVAL** (incomplete sequence)
 3. **iconv_close** → free chain, unload modules
 - *Highlight:* E2BIG is the #1 mistake — people call iconv once and assume it's done
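
The "call it repeatedly until the input is consumed" pattern has a loose analogue in Python's incremental codecs. This sketch is not iconv: `convert_stream` is a hypothetical helper, an incomplete trailing sequence plays the role of **EINVAL**, an invalid byte plays the role of **EILSEQ**, and **E2BIG** has no counterpart because Python grows the output buffer itself:

```python
import codecs

def convert_stream(chunks, src="utf-8", dst="utf-16-le"):
    decoder = codecs.getincrementaldecoder(src)()   # roughly iconv_open
    encoder = codecs.getincrementalencoder(dst)()
    out = bytearray()
    for chunk in chunks:                            # roughly repeated iconv calls
        # A sequence cut off at the chunk end stays buffered in the decoder.
        out += encoder.encode(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if input ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    out += encoder.encode(tail, final=True)
    return bytes(out)

data = "Dvořák".encode("utf-8")
split = [data[:4], data[4:]]                        # cuts ř (C5 99) in half
print(convert_stream(split) == "Dvořák".encode("utf-16-le"))   # True
```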
 ---
 <!-- ===================================================== -->
 <!-- SECTION 6: RELEVANCE TODAY                            -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="zoom" -->
 <span class="badge badge-red" style="font-size: 0.6em;">Part 4</span>
 ## Does this still matter?
 Note:
 **Do:** *"Modern languages have Unicode strings by default. So why should anyone care about iconv in 2026?"*
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Relevance today</p>
 ### How modern languages handle encoding
 | Language | Strings are... | Encoding conversion |
 |----------|----------------|---------------------|
 | **Python 3** | Unicode internally | Built-in codecs |
 | **Go** | UTF-8 by convention | `golang.org/x/text` |
 | **Rust** | Always valid UTF-8 | `encoding_rs` crate |
 | **Java** | UTF-16 internally | `java.nio.charset` |
 | **C/C++** | Just bytes — no encoding | **`iconv`** |
 Modern languages solved this by making strings Unicode-native. C didn't — and can't, because it would break 50 years of code.
 <!-- .element: class="fragment fade-up small" -->
 Note:
 - C can't change because `char = 1 byte` is baked into the language spec and **ABI** (Application Binary Interface)
 - Even modern languages need encoding conversion at **I/O boundaries** — files, sockets, C library calls via **FFI** (Foreign Function Interface)
 - Python's codecs, Go's x/text, Rust's encoding_rs all exist because the outside world isn't always UTF-8
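
A small Python sketch of the I/O boundary point, using the name from the talk title. ISO-8859-2 is picked here only as an example legacy encoding that contains ř and á:

```python
name = "Dvořák"
latin2 = name.encode("iso-8859-2")    # legacy single-byte encoding
utf8 = name.encode("utf-8")           # ř and á take two bytes each

text = latin2.decode("iso-8859-2")    # decode at the input boundary
print(len(text), len(latin2), len(utf8))   # 6 6 8

# Decoding with the wrong encoding does not fail, it silently garbles.
mojibake = utf8.decode("iso-8859-2")
print(mojibake == name)               # False
```

Inside the program the string has six characters; only the byte form at the boundary is six or eight long, depending on the encoding.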
 ---
 <!-- .slide: data-background-color="#2f2f2f" data-transition="slide" -->
 <p class="slide-title">Relevance today</p>
 ### Encoding bugs are alive and well
 <div class="two-col">
 <div class="card">
 #### The Turkish İ problem
 | Locale | `toupper('i')` |
 |--------|----------------|
 | en_US | I |
 | tr_TR | <span class="red">İ</span> (dotted!) |
 Tests pass in English, break in Turkish.
 </div>
 <div class="card">
 #### `//IGNORE` inconsistency
 ```bash
 $ echo 'héllo' | iconv \
  -f UTF-8 -t ASCII//IGNORE
 ```
 Some modules skip the bad byte. Some stop with an error.
 **Same flag, different behavior.**
 </div>
 </div>
 <br />
 Every time a program reads a file, reads from a socket, or calls a C library, encoding conversion still happens. These bugs still bite.
 <!-- .element: class="fragment fade-up accent small" -->
 Note:
 **Turkish İ:**
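 - Turkish has 4 i's: i, İ, ı, I. Under Turkish rules, uppercase i → İ (U+0130), not I
 - In C, single-byte `toupper` can only return İ in a single-byte locale such as ISO-8859-9; under tr_TR.UTF-8 it takes `towupper(L'i')`, because İ does not fit in one byte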
 - Any case-insensitive comparison using toupper/tolower is locale-dependent
 **//IGNORE:**
 - Behavior depends on *which* gconv module runs — inconsistent across encodings
 - This is a real unfixed glibc bug. This is what got me into the codebase
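
For contrast, a sketch of how one language makes both behaviours explicit. This is Python, not glibc: the `errors=` policies are Python's codec API and are shown only as a point of comparison with `//IGNORE`, and Python's `str` casing is locale-independent:

```python
text = "héllo"
for policy in ("strict", "ignore", "replace"):
    try:
        print(policy, text.encode("ascii", errors=policy))
    except UnicodeEncodeError as exc:
        print(policy, "failed at index", exc.start)
# strict failed at index 1
# ignore b'hllo'
# replace b'h?llo'

# Casing ignores the locale, so Turkish rules are never applied:
print("i".upper())         # I
print(len("İ".lower()))    # 2: 'i' followed by U+0307 combining dot above
print("ı".upper())         # I, so dotless and dotted i collide after upper()
```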
 ---
 <!-- ===================================================== -->
 <!-- SECTION 7: GLIBC WORKSHOP                             -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="fade" -->
 ### glibc Development Workshop — Third Edition
 Led by **Arjun Shankar** (Red Hat, glibc developer)
 <span class="accent medium">Tomorrow, Friday June 19 · 10:15 AM · Room A218</span>
 Pick a bug, get a cheat sheet, ship a patch.
 6 patches in 2024 · 15+ in 2025 · **yours in 2026?**
 Note:
 **Do:** Tell the personal story:
 *"Two years ago I walked into this workshop at DevConf. Arjun gave me a small iconv task. I got curious, fell down the rabbit hole, and that became this talk. That one task turned into 14 patches in glibc."*
 - **Arjun Shankar** = Red Hat engineer, glibc developer. Runs this workshop yearly at DevConf.CZ
 - Format: show up, get a cheat sheet with a small bug + pointers, experienced contributors help you submit
 - Room A218, capacity 20. First come, first served
 - *"If anything in this talk made you curious, room A218 tomorrow morning."*
 ---
 <!-- ===================================================== -->
 <!-- SECTION 8+9: REFERENCES + QUESTIONS                   -->
 <!-- ===================================================== -->
 <!-- .slide: data-background-color="#2f2f2f" data-transition="fade" -->
 ### Questions? · Resources
 - **Joel Spolsky** — "The Absolute Minimum Every Software Developer Must Know About Unicode" <!-- .element: class="small" -->
 - **GNU C Library Manual** — "Character Set Handling" chapter <!-- .element: class="small" -->
 - **unicode.org** — the specification <!-- .element: class="small" -->
 <span class="badge badge-blue">avinal.space</span> · <span class="badge badge-purple">@avinal</span>
 Attendance at DevConf.CZ 2026 was supported by the **[GNU Toolchain Fund](https://my.fsf.org/civicrm/contribute/transact?reset=1&id=57)**, a part of the FSF's Working Together for Free Software Fund.
 <!-- .element: class="small" -->
 Note:
 **Do:** Leave this up during Q&A.
 - Joel Spolsky's article (2003) — the classic intro, entertaining
 - glibc manual — authoritative API reference (sourceware.org/glibc/manual)
 - **GNU Toolchain Fund** = part of the **FSF's** (Free Software Foundation) "Working Together for Free Software" fund