Why Your CSV Turns to Gibberish When Converted to JSON
A CSV file with perfectly readable Korean, Japanese, or accented text opens up as rows of "□□□" boxes or seemingly random symbols the moment it's converted. The data isn't corrupted — it's being read with the wrong encoding, and the fix is usually simpler than the symptom suggests.
What's actually happening
Most text files carry no visible label for their encoding, so a program has to guess, or be told, how to translate the raw bytes on disk back into characters. If a file was saved as EUC-KR (a legacy Korean encoding, common in Excel exports on Korean-locale Windows machines) or Shift-JIS (Japanese), and a tool reads it assuming UTF-8, every multi-byte character gets misinterpreted. The English text and numbers in the same file usually still look fine, because ASCII characters are encoded identically in UTF-8 and in most of these legacy encodings, which is exactly what makes the bug so confusing. Only the non-English columns look broken, and it's easy to blame the file itself rather than the assumption being made about it.
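To make the mismatch concrete, here is a minimal Python sketch. The sample row is made up for illustration; the codec names (euc_kr, utf-8) are Python's standard ones. It encodes one row the way a legacy Korean export would, then decodes the same bytes two ways.

```python
# Korean text plus plain ASCII, as it might appear in one CSV file.
# (The sample data is hypothetical.)
row = "id,name\n1,홍길동\n"

# Bytes as a legacy Korean export would write them.
raw = row.encode("euc_kr")

# Reading those bytes as UTF-8: byte sequences that are not valid UTF-8
# become U+FFFD, and some happen to decode to unrelated characters.
misread = raw.decode("utf-8", errors="replace")

# Reading them with the matching codec recovers the original text.
correct = raw.decode("euc_kr")

print(repr(misread))
print(repr(correct))
```

The ASCII part of the row ("id,name" and the digit) comes through unchanged in both decodes; only the Korean name is damaged in the first one.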
Where this actually comes from
The most common source, by far, is Excel. When you "Save As" a CSV on a Korean or Japanese Windows locale, the default encoding is often the legacy one (EUC-KR/CP949, or Shift-JIS) rather than UTF-8, unless you specifically choose "CSV UTF-8" as the save format. A less common but real source: Excel's "Save As → Unicode Text," which produces a UTF-16 encoded file with a .txt extension that people sometimes rename to .csv — this isn't even comma-delimited by default (it's tab-delimited), and it's a completely different byte format from either UTF-8 or EUC-KR.
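If you do end up with one of those renamed "Unicode Text" files, it can still be parsed once both differences are accounted for. The sketch below simulates such a file in memory (the rows are invented) and reads it with Python's standard csv module, assuming the file is UTF-16 with a BOM and tab-separated as described above.

```python
import csv
import io

# Simulated bytes of a "Unicode Text" export: UTF-16 with a BOM,
# tab-separated. The sample rows are made up for illustration.
raw = "name\tcity\n김민수\t서울\n".encode("utf-16")

# The "utf-16" codec reads the BOM to pick the byte order, then drops it.
text = raw.decode("utf-16")

# The delimiter is a tab here, not a comma.
rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
print(rows)
```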
How to actually diagnose it
For Korean-language files, three encodings account for nearly all real-world cases: UTF-8 (the modern default, and hopefully what you're already dealing with), EUC-KR/CP949 (legacy Korean), and UTF-16 (from certain Excel export paths). A reliable way to tell them apart without guessing:
- Check the first few bytes. A byte-order mark (BOM) at the very start of the file (EF BB BF for UTF-8, FF FE or FE FF for UTF-16) tells you definitively, when present. Not every file has one, but when it's there, trust it.
- No BOM, but lots of null bytes? UTF-16 without a BOM shows a very distinctive pattern: a null byte (0x00) next to almost every ASCII character. If you open the file in a hex viewer and see that pattern, it's UTF-16.
- No BOM, no null-byte pattern, but garbled non-English text? It's very likely EUC-KR/CP949 (for Korean) or an equivalent legacy encoding for other languages, being misread as UTF-8.
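The three checks above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive detector: the function name sniff_encoding, the 4096-byte sample size, and the 10% null-byte threshold are all arbitrary choices made here, and the final fallback assumes the file is Korean.

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Guess a Python codec name for CSV bytes (hypothetical helper)."""
    # 1. A BOM, when present, settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # codec picks the byte order from the BOM

    # 2. No BOM: look for the null-byte pattern of BOM-less UTF-16.
    sample = raw[:4096]
    pairs = len(sample) // 2
    if pairs:
        even_nulls = sample[0::2].count(0)  # nulls first: big-endian
        odd_nulls = sample[1::2].count(0)   # nulls second: little-endian
        # The 10% threshold is an assumption, not a standard value.
        if max(even_nulls, odd_nulls) > 0.1 * pairs:
            return "utf-16-le" if odd_nulls > even_nulls else "utf-16-be"

    # 3. Neither: accept UTF-8 only if every byte decodes strictly.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: the data is Korean. Use "shift_jis" for Japanese.
        return "cp949"

print(sniff_encoding("이름,도시\n".encode("cp949")))
```

The strict UTF-8 attempt goes before the legacy fallback because legacy Korean bytes rarely form valid UTF-8, while nearly any byte sequence decodes under a legacy codec without raising an error.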
The fix depends on the tool
Some tools let you pick an input encoding explicitly. Use that if it's available, and set it to match what you diagnosed above. Tools that offer no such choice and always assume UTF-8 will keep producing garbled output no matter how many times you re-export the file the same way; the encoding has to be handled when the file is read, not patched up afterward.
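If you are scripting the conversion yourself, handling the encoding at read time can look like the following Python sketch. The function name csv_bytes_to_json and the sample data are hypothetical; the encoding argument is whatever you diagnosed, passed in explicitly rather than guessed.

```python
import csv
import io
import json

def csv_bytes_to_json(raw: bytes, encoding: str) -> str:
    """Decode CSV bytes with a known encoding and return a JSON string."""
    # Decode first; everything after this point works on real characters.
    text = raw.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=""))
    # ensure_ascii=False keeps Korean text readable instead of \uXXXX escapes.
    return json.dumps(list(reader), ensure_ascii=False, indent=2)

# Hypothetical sample: a small CP949-encoded export.
raw = "name,city\n김민수,서울\n".encode("cp949")
print(csv_bytes_to_json(raw, "cp949"))
```

The JSON output is ordinary Unicode text, so it can be written out as UTF-8 regardless of how the input was encoded.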
Try it
FreeToolDev's CSV to JSON converter auto-detects encoding on every file — it checks for a BOM first, then sniffs for UTF-16's null-byte pattern, then falls back through UTF-8 and EUC-KR automatically. This came directly out of hitting exactly this bug while testing the tool against a real Korean-language CSV export, which is why it's handled automatically rather than left as a setting you have to know to look for.