Text Encoding Explained

Discover how computers represent text as numbers. From ASCII to UTF-8, learn why encoding matters for international text.

Why Encoding Matters

Computers don't understand letters or symbols—they only work with numbers (binary: 0s and 1s). Text encoding is the system that maps characters to numbers so computers can store and display text.

When you see the letter "A" on screen, the computer actually stores the number 65. When displaying text, it looks up 65 in an encoding table and shows "A". Different encoding systems use different number-to-character mappings, which is why encoding problems produce gibberish like "Ã©" instead of "é".
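This character-to-number mapping is easy to inspect in Python, whose built-in `ord` and `chr` functions convert between a character and its numeric value:

```python
# ord() gives the number a character maps to; chr() reverses it.
print(ord("A"))  # 65
print(chr(65))   # A

# The number is what actually gets stored: encoding "A" produces
# a single byte with the value 65 in ASCII-compatible encodings.
print("A".encode("utf-8"))  # b'A' (one byte, value 65)
```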

Real-World Impact

Encoding mistakes cause website errors, corrupted emails, and broken database records. Understanding encoding prevents data loss and ensures text displays correctly worldwide.

ASCII: The Beginning

ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters:

  • 0-31: Control characters (newline, tab, etc.)
  • 32-126: Printable characters (letters, numbers, punctuation)
  • 127: Delete character

ASCII Character Examples
Character → ASCII Number (Decimal)
A → 65
B → 66
a → 97
0 → 48
Space → 32
! → 33

ASCII's Limitation

ASCII only covers English characters. It can't represent:

  • Accented letters (é, ñ, ü)
  • Non-Latin scripts (中文, العربية, Русский)
  • Emoji (😀, 🚀)
  • Special symbols (€, ©, ™)

Different countries created extended ASCII variants (like Windows-1252, ISO-8859-1) that used the 8th bit for 128 additional characters, but these were incompatible with each other. A file encoded in one extended ASCII couldn't be read correctly in another.
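This incompatibility is easy to reproduce: the same byte value decodes to entirely different characters under different code pages. A small Python demonstration:

```python
raw = b"\xe9"  # a single byte, value 233

# Western European (Windows-1252) reads it as a French accented letter...
print(raw.decode("cp1252"))  # é

# ...while the Cyrillic code page (Windows-1251) reads it as a Russian letter.
print(raw.decode("cp1251"))  # й
```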

Unicode: Universal Characters

Unicode solved the incompatibility problem by creating a single character set for all languages. Unicode assigns a unique number (called a code point) to every character from every writing system.

Unicode Code Points

Code points are written as U+ followed by hexadecimal digits:

Character → Unicode Code Point
A → U+0041
€ → U+20AC
中 → U+4E2D
😀 → U+1F600
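In Python, `ord` returns a character's Unicode code point directly, which can be formatted in the U+ notation shown above:

```python
def code_point(ch):
    # Format a character's code point in U+XXXX notation,
    # zero-padded to at least four hex digits.
    return f"U+{ord(ch):04X}"

print(code_point("A"))   # U+0041
print(code_point("€"))   # U+20AC
print(code_point("中"))  # U+4E2D
print(code_point("😀"))  # U+1F600
```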

Unicode contains over 140,000 characters covering 150+ scripts. It's constantly expanding to include historical scripts, emoji, and symbols.

Unicode vs Encoding

Unicode defines which characters exist and their code points. Encodings like UTF-8 define how to store these code points as bytes in files and memory.

UTF-8: The Modern Standard

UTF-8 (8-bit Unicode Transformation Format) is the most popular Unicode encoding. It's used by about 98% of websites and is the default for most programming languages.

Why UTF-8 Won

  • Backward compatible with ASCII: ASCII characters use the same bytes in UTF-8
  • Variable length: Common characters (English) use 1 byte, others use 2-4 bytes
  • Self-synchronizing: Can detect character boundaries even if you jump into the middle of a file
  • No byte order issues: Unlike UTF-16, no BOM (Byte Order Mark) needed
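The ASCII-compatibility point is easy to verify: pure-ASCII text encodes to byte-for-byte identical output under both encodings, which means every ASCII file is already a valid UTF-8 file.

```python
text = "Hello, World!"

# ASCII and UTF-8 produce identical bytes for ASCII characters.
print(text.encode("ascii") == text.encode("utf-8"))  # True
```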

How UTF-8 Encodes Characters

UTF-8 Byte Lengths
1 byte (U+0000 to U+007F): ASCII characters
  Example: "A" → 01000001 (1 byte)

2 bytes (U+0080 to U+07FF): Latin extended, Greek, Cyrillic
  Example: "é" → 11000011 10101001 (2 bytes)

3 bytes (U+0800 to U+FFFF): Rest of the Basic Multilingual Plane, including most CJK characters
  Example: "中" → 11100100 10111000 10101101 (3 bytes)

4 bytes (U+10000 to U+10FFFF): Emoji, rare characters
  Example: "😀" → 11110000 10011111 10011000 10000000 (4 bytes)
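These byte lengths can be checked directly by encoding each example character and inspecting the result:

```python
# Each character's UTF-8 byte length matches its code point range.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))
# A  1 byte:  41
# é  2 bytes: c3 a9
# 中 3 bytes: e4 b8 ad
# 😀 4 bytes: f0 9f 98 80
```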

UTF-8 vs Other Encodings

  • ASCII: 1 byte (fixed). Pros: simple, compact for English. Cons: only 128 characters.
  • UTF-8: 1-4 bytes (variable). Pros: universal, ASCII compatible. Cons: variable length makes indexing slower.
  • UTF-16: 2-4 bytes (variable). Pros: common in Windows and Java internals. Cons: wastes space for ASCII text.
  • UTF-32: 4 bytes (fixed). Pros: fixed width, fast indexing. Cons: 4× larger than UTF-8 for English.
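The size trade-offs are easy to measure by encoding the same string three ways (note that Python's utf-16 and utf-32 codecs prepend a BOM, adding 2 and 4 bytes respectively):

```python
text = "Hello"  # 5 ASCII characters

for enc in ["utf-8", "utf-16", "utf-32"]:
    print(enc, len(text.encode(enc)))
# utf-8   5 bytes
# utf-16 12 bytes (2-byte BOM + 2 bytes per character)
# utf-32 24 bytes (4-byte BOM + 4 bytes per character)
```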

Common Encoding Problems

1. Mojibake (文字化け) - Garbled Text

Problem: "café" displays as "cafÃ©"

Cause: UTF-8 text interpreted as Windows-1252 or ISO-8859-1.

Solution: Set correct encoding when opening files. Most text editors have "Reopen with Encoding" options.
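Mojibake can be reproduced, and often repaired, in a couple of lines, since the underlying bytes are usually intact and only the interpretation is wrong:

```python
correct = "café"

# Reproduce mojibake: UTF-8 bytes misread as Windows-1252.
garbled = correct.encode("utf-8").decode("cp1252")
print(garbled)  # cafÃ©

# Repair by reversing the mistake: re-encode as cp1252,
# then decode as the UTF-8 the bytes really were.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # café
```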

2. BOM (Byte Order Mark) Issues

Problem: Files start with stray characters like "ï»¿"

Cause: UTF-8 BOM (EF BB BF) visible in editors that don't expect it.

Solution: Save as "UTF-8 without BOM" for web files. Only use BOM for Windows text files.
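Python's `utf-8-sig` codec handles the BOM transparently: it strips one on reading if present, and writes one when saving. A minimal sketch:

```python
data_with_bom = b"\xef\xbb\xbfhello"

# Plain utf-8 keeps the BOM as a visible U+FEFF character...
print(repr(data_with_bom.decode("utf-8")))      # '\ufeffhello'

# ...while utf-8-sig strips it if present, and is harmless if not.
print(data_with_bom.decode("utf-8-sig"))        # hello
print(b"hello".decode("utf-8-sig"))             # hello
```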

3. Database Encoding Mismatches

Problem: Text looks fine in app but garbled in database.

Cause: Connection charset doesn't match database charset.

Solution:

-- MySQL example
SET NAMES utf8mb4;
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

4. Email Encoding

Problem: Subject lines with non-ASCII characters show as "=?UTF-8?Q?..."

Cause: Email headers must be MIME-encoded for international characters.

Solution: Use email libraries that handle encoding automatically (e.g., PHPMailer, Python's email module).
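Python's standard `email.header` module shows what that MIME "encoded word" format (RFC 2047) looks like and how libraries round-trip it:

```python
from email.header import Header, decode_header

# Non-ASCII header values must be wrapped in MIME encoded words
# before they can appear in an email header.
encoded = Header("café", "utf-8").encode()
print(encoded)  # e.g. =?utf-8?b?Y2Fmw6k=?=

# Receiving libraries reverse the encoding automatically.
decoded_bytes, charset = decode_header(encoded)[0]
print(decoded_bytes.decode(charset))  # café
```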

Never Mix Encodings

The #1 rule: Be consistent. If you read a file as UTF-8, write it as UTF-8. Mixing encodings in a single file guarantees corruption.

Best Practices

  • Always use UTF-8 unless you have a specific reason not to
  • Declare encoding in HTML: <meta charset="UTF-8">
  • Set database charset to UTF-8: Use utf8mb4 in MySQL for full Unicode support
  • Specify encoding when opening files: Python: open('file.txt', encoding='utf-8')
  • Test with international characters: Use é, 中文, 😀 in your tests
  • Save source code as UTF-8: Configure your IDE to default to UTF-8
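Several of these practices come together in a simple round-trip, sketched here with Python's standard library: write with an explicit encoding, read with the same one, and test with international characters.

```python
import os
import tempfile

# International test characters, as recommended above.
text = "é 中文 😀"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Always specify the encoding explicitly on both write and read.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    print(f.read() == text)  # True
```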

Quick Reference

For Web: UTF-8, always
For JSON: UTF-8 (required by spec)
For CSV: UTF-8 with BOM if opening in Excel
For databases: utf8mb4 (MySQL) or UTF-8 (PostgreSQL)
For code files: UTF-8 without BOM

Detecting Encoding

If you receive a file with unknown encoding, use detection tools:

  • Command line: file -i filename.txt (Linux; use file -I on macOS)
  • Python: chardet library
  • Online tools: Search "encoding detector"
  • Text editors: Notepad++ shows encoding in status bar
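Without a detection library, a common standard-library-only fallback is to try likely encodings in order. This hypothetical `guess_decode` helper is a sketch of that idea, not real statistical detection:

```python
def guess_decode(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try candidate encodings in order; return (text, encoding).

    latin-1 never fails (every byte value is valid), so it acts as
    a last-resort fallback rather than a real detection result.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

# Valid UTF-8 decodes on the first try...
print(guess_decode("café".encode("utf-8")))  # ('café', 'utf-8')
# ...while a lone cp1252 euro byte fails UTF-8 and falls through.
print(guess_decode(b"\x80"))                 # ('€', 'cp1252')
```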

Remember: UTF-8 is not just a technical choice—it's about inclusivity. Using UTF-8 ensures your software works for users worldwide, regardless of their language.