3. Know the difference between bytes and str¶

An instance of bytes holds raw, unsigned 8-bit values.
It has no text encoding directly associated with it.

In [4]:

a = b'h\x65llo'

In [5]:

print(list(a))

[104, 101, 108, 108, 111]

In [6]:

print(a)

b'hello'
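Because a bytes instance is a sequence of integers in the range 0 to 255, indexing and slicing behave differently from str. The following is a minimal sketch that reuses the value from the cell above; it is only meant to illustrate the point and is not part of the original notebook.

```python
a = b'h\x65llo'

# Indexing a bytes object yields an int, not a one-byte bytes object.
first = a[0]
print(first)        # 104

# Slicing keeps the bytes type.
head = a[0:1]
print(head)         # b'h'

# A bytes object can be rebuilt from the list of integers.
rebuilt = bytes([104, 101, 108, 108, 111])
print(rebuilt)      # b'hello'
```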

An instance of str holds Unicode code points that represent characters from human languages.
It has no binary encoding directly associated with it.

In [7]:

a = 'a\u0300 propos'

In [8]:

print(list(a))

['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']

In [9]:

print(a)

à propos

str (Unicode) to bytes -> encode()
bytes to str (Unicode) -> decode()
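As a rough illustration of the two conversions, the sketch below round-trips the string from the earlier cell through UTF-8. The choice of UTF-8 is an assumption made here; any encoding that can represent the text would work the same way.

```python
text = 'a\u0300 propos'

# str -> bytes: choose an encoding explicitly.
data = text.encode('utf-8')
print(data)                   # b'a\xcc\x80 propos'

# bytes -> str: decode with the same encoding to get the original back.
restored = data.decode('utf-8')
print(restored == text)       # True

# The combining accent is one code point but two bytes in UTF-8.
print(len(text), len(data))   # 9 10
```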

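When writing Python programs, do the encoding and decoding of Unicode data at the outermost boundary of your interfaces.

This approach is called the Unicode sandwich.

The core of the program should work with str.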
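One way to picture the Unicode sandwich is a function that decodes at its input, works only on str in the middle, and encodes again at its output. The function name `shout` and its `encoding` parameter are hypothetical and exist only for this sketch.

```python
def shout(raw: bytes, encoding: str = 'utf-8') -> bytes:
    """Hypothetical example: uppercase a payload received as bytes."""
    text = raw.decode(encoding)       # outer boundary: bytes -> str
    result = text.upper()             # core logic only ever sees str
    return result.encode(encoding)    # outer boundary: str -> bytes


print(shout(b'hello'))                # b'HELLO'
```

Keeping the conversions at the edges means the core logic never has to ask which type it is holding.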

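Two situations that arise from the split between character types¶

You want to operate on raw 8-bit sequences that contain UTF-8-encoded data (or data in some other encoding).
You want to operate on Unicode strings that have no specific encoding.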

A function that takes a bytes or str instance and always returns a str¶

In [10]:

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

In [11]:

print(repr(to_str(b'foo')))

'foo'

In [12]:

print(repr(to_str('bar')))

'bar'

In [13]:

print(repr(to_str(b'\xed\x95\x9c')))

'한'

A function that takes a bytes or str instance and always returns a bytes¶

In [14]:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes

In [16]:

print(repr(to_bytes(b'foo')))

b'foo'

In [17]:

print(repr(to_bytes('bar')))

b'bar'

In [18]:

print(repr(to_bytes('한글')))

b'\xed\x95\x9c\xea\xb8\x80'

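Two gotchas when dealing with raw 8-bit values and Unicode strings¶

bytes and str are not compatible, so you must know which type you are handling¶

The '+' operator works between bytes and bytes, or between str and str, but not across the two types.

The same restriction applies to other operators.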

In [19]:

print(b'one' + b'two')

b'onetwo'

In [20]:

print('one' + 'two')

onetwo

In [21]:

print(b'one' + 'two')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-7e91d857e3d1> in <module>
----> 1 print(b'one' + 'two')

TypeError: can't concat str to bytes

Comparing a bytes instance and a str instance for equality always evaluates to False, even when they contain the same characters.

In [22]:

print(b'foo' == 'foo')

False
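To compare or concatenate values of different types, convert one side explicitly first. This is a minimal sketch that assumes the bytes hold UTF-8 text; the variable names are chosen here for illustration.

```python
raw = b'foo'
text = 'foo'

# Direct comparison across types is always False.
mixed_equal = (raw == text)
print(mixed_equal)                 # False

# Decode first, then compare str with str.
decoded_equal = (raw.decode('utf-8') == text)
print(decoded_equal)               # True

# Encode first, then concatenate bytes with bytes.
joined = raw + text.encode('utf-8')
print(joined)                      # b'foofoo'
```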

Applying the % operator¶

In [23]:

print(b'red %s' % b'blue')

b'red blue'

In [24]:

print('red %s' % 'blue')

red blue

However, you cannot pass a str instance to a bytes format string, because Python does not know which binary text encoding to use.

In [25]:

print(b'red %s' % 'blue')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-996799a8aced> in <module>
----> 1 print(b'red %s' % 'blue')

TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

The reverse runs without an error, but the result is probably not what you want.

In [26]:

print('red %s' % b'blue')

red b'blue'

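Here the bytes instance's __repr__ method is called and its result is substituted for %s, which is why the b'' wrapper stays in the output.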
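To get the intended text in either direction, convert the argument to the type of the format string before formatting. The sketch below assumes UTF-8 and uses illustrative variable names.

```python
color_bytes = b'blue'
color_str = 'blue'

# str format string: decode the bytes argument first.
as_text = 'red %s' % color_bytes.decode('utf-8')
print(as_text)       # red blue

# bytes format string: encode the str argument first.
as_bytes = b'red %s' % color_str.encode('utf-8')
print(as_bytes)      # b'red blue'
```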

Operations on file handles default to requiring Unicode strings, not raw bytes¶

In [27]:

with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-e42a59df12fe> in <module>
      1 with open('data.bin', 'w') as f:
----> 2     f.write(b'\xf1\xf2\xf3\xf4\xf5')

TypeError: write() argument must be str, not bytes

This fails because the file was opened in text mode ('w') rather than binary write mode ('wb').

In [28]:

with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

A similar error occurs in the other direction, when reading.

In [29]:

with open('data.bin', 'r') as f:
    data = f.read()

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-29-e61b03babdec> in <module>
      1 with open('data.bin', 'r') as f:
----> 2     data = f.read()

/usr/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte

In [30]:

with open('data.bin', 'rb') as f:
    data = f.read()

In [31]:

data

Out[31]:

b'\xf1\xf2\xf3\xf4\xf5'

In [32]:

with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()

In [33]:

data

Out[33]:

'ñòóôõ'
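The same behaviour can be reproduced without a file, which may make it clearer that the difference lies only in the codec used to interpret the bytes. This is a sketch using the same five bytes as the cells above.

```python
data = b'\xf1\xf2\xf3\xf4\xf5'

# These bytes are not valid UTF-8, so decoding raises an error.
try:
    data.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('utf-8 failed:', exc.reason)

# cp1252 maps each of these single bytes to one character.
text = data.decode('cp1252')
print(text)          # ñòóôõ
```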

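When a handle is in text mode, it uses the system's default text encoding to interpret binary data, via str.encode (for writing) and bytes.decode (for reading).¶

-> Naming these as bytes.encode and str.decode appears to be a Python 2 leftover: in Python 3, only str has encode and only bytes has decode.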

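On most systems the default encoding is UTF-8, which cannot decode the binary data above.

Because the default can differ from one system to another, it is better to pass the encoding explicitly.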

In [35]:

import locale
print(locale.getpreferredencoding())

UTF-8
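A small sketch of passing the encoding explicitly on both the write and the read side, so the result does not depend on the locale. The temporary directory and the file name data.txt are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# Text mode with an explicit encoding: str goes in, UTF-8 bytes land on disk.
with open(path, 'w', encoding='utf-8') as f:
    f.write('한글')

# Binary mode shows the bytes that were actually written.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)           # b'\xed\x95\x9c\xea\xb8\x80'

# Reading back with the same explicit encoding restores the str.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
print(text)          # 한글
```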