It’s a problem that has been around for a while. Fortunately, the text file landscape has gotten simpler over time, with UTF-8 winning out over other character encodings. More than 95% of the Internet is now delivered using UTF-8. It’s impressive how quickly that number has changed: it was less than 10% as recently as 2006. UTF-8 hasn’t taken over the world just yet, though. When writing a text file from Python, the default encoding is platform-dependent; on my Windows PC, it’s Windows-1252. The Windows Registry editor, for example, still saves text files as UTF-16. In other words, the ambiguity problem still exists today.
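You can check Python’s platform-dependent default directly: `locale.getpreferredencoding(False)` reports the encoding that `open()` uses when none is specified. This is a minimal sketch; the file name `out.txt` is just for illustration.

```python
import locale

# Report the encoding open() uses when no encoding= argument is given.
# On many Windows systems this is a legacy code page such as 'cp1252';
# on modern Linux and macOS it is usually 'UTF-8'.
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly sidesteps the ambiguity entirely:
# the file is written and read back as UTF-8 regardless of platform.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")

with open("out.txt", "r", encoding="utf-8") as f:
    assert f.read() == "caf\u00e9\n"
```

Passing `encoding=` on every `open()` call is the simplest way to keep a program’s output identical across platforms.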
This text file can take on a surprising number of different formats. The text could be encoded as ASCII, UTF-8, UTF-16 (little or big-endian), Windows-1252, Shift JIS, or any of dozens of other encodings. The file may or may not begin with a byte order mark (BOM). Lines of text could be terminated with a linefeed character \n (typical on UNIX), a CRLF sequence \r\n (typical on Windows) or, if the file was created on an older system, some other character sequence.

Sometimes it’s impossible to determine the encoding used by a particular text file. For example, suppose a file contains the bytes C2 A2 C2 A2 C2 A2. That could be:

- a UTF-8 file containing “¢¢¢”,
- a little-endian UTF-16 (or UCS-2) file containing “ꋂꋂꋂ”, or
- a big-endian UTF-16 file containing “슢슢슢”.

That’s obviously an artificial example, but the point is that text files are inherently ambiguous. This poses a challenge to software that loads text.
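The ambiguity is easy to reproduce: Python’s `bytes.decode` interprets the same byte string under each assumed encoding. The `sniff_bom` helper below is a hypothetical illustration of the usual BOM-sniffing heuristic, not a complete detector.

```python
# The same six bytes decode to three different strings depending on
# which encoding the reader assumes.
data = b"\xC2\xA2" * 3

print(data.decode("utf-8"))      # "¢¢¢"   (each C2 A2 pair is U+00A2)
print(data.decode("utf-16-le"))  # "ꋂꋂꋂ"  (each pair is code unit 0xA2C2)
print(data.decode("utf-16-be"))  # "슢슢슢"  (each pair is code unit 0xC2A2)

# A byte order mark, when present, resolves the ambiguity. This helper
# (sniff_bom is a hypothetical name) applies the standard BOM prefixes;
# a complete detector would also check the UTF-32 BOMs, which share a
# prefix with the UTF-16 ones.
def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    if data.startswith(b"\xEF\xBB\xBF"):
        return "utf-8", 3
    if data.startswith(b"\xFF\xFE"):
        return "utf-16-le", 2
    if data.startswith(b"\xFE\xFF"):
        return "utf-16-be", 2
    return None, 0

print(sniff_bom(b"\xFF\xFE\xC2\xA2"))  # ('utf-16-le', 2)
```

Without a BOM, a loader can only guess, which is why heuristics (and wrong guesses) are still part of reading text files.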