Repeat after me: each Unicode encoding (UTF-8, UTF-7, UTF-16, UTF-32, etc.) maps sequences of bytes to Unicode code points in its own way, so the same sequence of bytes can decode to different code points under different encodings. A code point is a number that identifies a particular abstract character (loosely, a grapheme).
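A small sketch of that point, assuming nothing beyond the standard codecs: the same two bytes are decoded under three encodings, and the same code point is encoded to three different byte sequences. The variable names are only illustrative.

```python
# The same two bytes, interpreted under three different encodings.
raw = b'\xc3\xa9'

as_utf8 = [ord(c) for c in raw.decode('utf-8')]       # one code point: U+00E9
as_latin1 = [ord(c) for c in raw.decode('latin-1')]   # two code points: U+00C3, U+00A9
as_utf16 = [ord(c) for c in raw.decode('utf-16-le')]  # one code point: U+A9C3

# The other direction: one code point (U+00E9), three byte sequences.
e_acute = u'\xe9'
utf8_bytes = e_acute.encode('utf-8')       # b'\xc3\xa9'
utf16_bytes = e_acute.encode('utf-16-le')  # b'\xe9\x00'
utf32_bytes = e_acute.encode('utf-32-le')  # b'\xe9\x00\x00\x00'
```

The bytes never change; only the encoding chosen for decoding decides which code points come out.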
Types:
- The unicode type stores an abstract sequence of code points. str is for strings of bytes, which are very similar in nature to how strings are handled in C.
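To make the distinction concrete, here is a hedged sketch using u'' and b'' literals so that it runs on both Python 2 and Python 3. On Python 2 the two values are of type unicode and str; on Python 3 the same two types are called str and bytes. The sample word is only an illustration.

```python
# A sequence of code points (unicode on Python 2, str on Python 3).
text = u'caf\xe9'

# A sequence of bytes (str on Python 2, bytes on Python 3).
data = text.encode('utf-8')

text_length = len(text)  # counts code points
data_length = len(data)  # counts bytes; U+00E9 takes two bytes in UTF-8

# Decoding with the right encoding gives the code points back.
round_trip = data.decode('utf-8')
```

The length mismatch is the quickest way to tell which of the two kinds of string you are holding.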
Important: Python 3 is different. Keeping that in mind matters not only for guidelines but also for clarity. Moreover, many “solutions” floating around the web are written for Python 3 (there, open() has both an encoding and a newline parameter). Speaking of Python 2:
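import os

# Binary mode writes the byte exactly as given.
with open("file.txt", 'wb') as out:
    out.write('\n')
# Text mode translates '\n' to the platform's line ending ('\r\n' on Windows).
with open("file2.txt", 'w') as out:
    out.write('\n')
print(os.stat('file.txt').st_size)   # 1
print(os.stat('file2.txt').st_size)  # 2 on Windows, 1 on Linux/macOS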
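For comparison, a minimal sketch of the Python 3 style open() with explicit encoding and newline arguments. It uses io.open, which is available from Python 2.6 on and is the same function as the built-in open() in Python 3. The temporary directory and file names are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical scratch files, created in a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
unix_path = os.path.join(tmp_dir, 'unix.txt')
dos_path = os.path.join(tmp_dir, 'dos.txt')

# newline='\n' writes line endings untranslated, on every platform.
with io.open(unix_path, 'w', encoding='utf-8', newline='\n') as out:
    out.write(u'caf\xe9\n')

# newline='\r\n' translates each '\n' to '\r\n', on every platform.
with io.open(dos_path, 'w', encoding='utf-8', newline='\r\n') as out:
    out.write(u'caf\xe9\n')

unix_size = os.stat(unix_path).st_size  # 3 ASCII bytes + 2 for U+00E9 + 1 for '\n'
dos_size = os.stat(dos_path).st_size    # one byte more, for the added '\r'
```

Spelling out both arguments removes the dependence on the platform default, which is what makes the plain 'w' example above give different sizes on different systems.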
Refs:
- Unicode In Python, Completely Demystified - best slides I’ve seen, I wish I had the template.
- Strip newlines
- os.linesep
raise MyException(u'Cannot do this while at a café')
- the standard library remains ASCII-only with the exception of contributor names in comments.
- Ref
- Encode and decode as needed