Pythons (2) and Unicode

Repeat after me: each Unicode encoding (UTF-8, UTF-7, UTF-16, UTF-32, etc.) maps Unicode code points to different sequences of bytes, so the same sequence of bytes can decode to different code points depending on which encoding you assume. A code point is a number that identifies a particular abstract character (roughly a grapheme, although some graphemes are built from several code points).
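To make that concrete, here is a minimal sketch of one code point encoded three ways, and of one byte sequence decoded two ways. The sample character and the variable names are my own choices for illustration; the snippet is written to run unchanged on Python 2.6+ and Python 3.

```python
# One code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
text = u'\u00e9'

# Different encodings produce different byte sequences for it.
utf8 = text.encode('utf-8')         # two bytes: 0xC3 0xA9
utf16 = text.encode('utf-16-le')    # two bytes: 0xE9 0x00
latin1 = text.encode('latin-1')     # one byte:  0xE9

# The same bytes mean different things under different encodings.
raw = b'\xc3\xa9'
as_utf8 = raw.decode('utf-8')       # one code point: U+00E9
as_latin1 = raw.decode('latin-1')   # two code points: U+00C3, U+00A9
```

This is why bytes with no stated encoding cannot be turned back into text reliably.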

Types:

The unicode type stores an abstract sequence of code points.
str is for strings of bytes. Byte strings are handled much like strings in C.

uni.encode(encoding): convert a Unicode string to a Python byte string
s.decode(encoding): convert a byte string to a Unicode string
unicode(s, encoding): convert a byte string to a Unicode string
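A short round trip through these three calls might look like the sketch below. The sample string is made up, and the try/except shim is only there so the snippet also runs on Python 3, where the unicode built-in no longer exists and str takes its place.

```python
try:
    unicode                # Python 2: the built-in Unicode text type
except NameError:
    unicode = str          # Python 3 shim (assumption for portability)

uni = u'caf\u00e9'         # hypothetical sample text

s = uni.encode('utf-8')    # Unicode string -> byte string
back1 = s.decode('utf-8')  # byte string -> Unicode string
back2 = unicode(s, 'utf-8')  # same conversion via the constructor

# Both decoding spellings give back the original text.
round_trip_ok = (back1 == uni and back2 == uni)
```

The two decoding forms are interchangeable here; s.decode(encoding) is the one that carries over to Python 3 without a shim.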

Important: know how Python 3 differs. This matters not only for guidelines but also for clarity, because many “solutions” floating around the web are written for Python 3 (where open() accepts both an encoding and a newline parameter). Speaking of Python 2:

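import os

# Binary mode writes the byte exactly as given.
with open("file.txt", 'wb') as out:
    out.write('\n')
# Text mode on Windows translates '\n' to '\r\n' on write.
with open("file2.txt", 'w') as out:
    out.write('\n')
print(os.stat('file.txt').st_size)   # 1
print(os.stat('file2.txt').st_size)  # 2 on Windows, 1 on Unix-like systems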
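For Python 2 code that wants the Python 3 behaviour, io.open (available since Python 2.6) accepts the same encoding and newline parameters. The following is a sketch under that assumption; the file name and the temporary directory are hypothetical choices for the example.

```python
import io
import os
import tempfile

# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'file3.txt')

# newline='\n' turns off translation of '\n' to os.linesep on write,
# and encoding='utf-8' means we pass Unicode text, not bytes.
with io.open(path, 'w', encoding='utf-8', newline='\n') as out:
    out.write(u'caf\u00e9\n')

# Same size on every platform: 'caf' is 3 bytes, the accented e is
# 2 bytes in UTF-8, and the untranslated newline is 1 byte.
size = os.stat(path).st_size

with io.open(path, 'rb') as f:
    data = f.read()
```

With the default newline=None, the write would again expand '\n' to os.linesep, which is the platform-dependent behaviour shown above.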

Refs:

Unicode In Python, Completely Demystified - best slides I’ve seen, I wish I had the template.
Strip newlines
os.linesep
raise MyException(u'Cannot do this while at a café')
the standard library remains ASCII-only with the exception of contributor names in comments.
Ref
Encode and decode as needed

Udun's Labs
