Note: the Vietnamese version of this article is at the link below.

https://duongnt.com/codecs-register-error-vie

Custom codecs error handling with register_error

Some people might believe that character encoding and decoding are easy or unimportant. That is, until they are hit by one of the many pitfalls. Fortunately, Python has extensive support for Unicode. And thanks to the codecs.register_error function, we can freely customize how encoding and decoding errors are handled.

Character encoding primer in Python

The basic concepts

Python 3 cleanly separates text (str, a sequence of Unicode code points) from binary data (bytes). To work with both, we need to understand these main concepts.

  • Character: the abstract concept of a symbol in a writing system.
  • Code point: a number mapped to a specific character in a standard. Usually, we use the Unicode standard.
  • Byte representation: the actual bytes to represent a character in a specific encoding.
  • Encoding: the process of converting a code point to its byte representation.
  • Decoding: the process of converting a byte representation back to the corresponding code point.
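As a quick illustration of the concepts above, here is a minimal sketch that walks a single character through them using only built-in functions. The variable names are chosen just for this example.

```python
# Walk one character through the concepts listed above.
char = '\u03bc'  # the character 'μ'

code_point = ord(char)                    # character -> code point (an int)
code_point_label = f'U+{code_point:04X}'  # conventional U+XXXX notation

mu_bytes = char.encode('utf-8')           # encoding: code point -> bytes
round_trip = mu_bytes.decode('utf-8')     # decoding: bytes -> code point

print(code_point_label)    # U+03BC
print(mu_bytes)            # b'\xce\xbc'
print(round_trip == char)  # True
```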

Below are some concrete examples.

Character | Code point | Byte representation (ASCII) | Byte representation (UTF-8)
1         | U+0031     | \x31                        | \x31
a         | U+0061     | \x61                        | \x61
μ         | U+03BC     | Unavailable                 | \xce\xbc
ụ         | U+1EE5     | Unavailable                 | \xe1\xbb\xa5

From the examples above, we can see that some code points can be encoded in UTF-8 but are not available in ASCII. This is because ASCII uses only 7 bits to encode a character, while UTF-8 can use up to 4 bytes per character.
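The size difference can be checked directly. The sketch below measures how many bytes UTF-8 needs for a few sample characters and whether each one fits in ASCII's 7-bit range. The emoji is an extra sample added here to show the 4-byte case; it is not part of the table above.

```python
# Sample characters: '1', 'a', 'μ', 'ụ', and an emoji (added for the 4-byte case).
samples = ['1', 'a', '\u03bc', '\u1ee5', '\U0001F600']

# Number of bytes UTF-8 needs for each character.
byte_counts = [len(ch.encode('utf-8')) for ch in samples]

# A code point fits in ASCII only if it is below 128 (7 bits).
fits_in_ascii = [ord(ch) < 128 for ch in samples]

print(byte_counts)    # [1, 1, 2, 3, 4]
print(fits_in_ascii)  # [True, True, False, False, False]
```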

We can’t encode μ with ASCII. And we also can’t decode μ’s UTF-8 byte representation with ASCII.

'μ'.encode('ASCII')

# Raises: UnicodeEncodeError: 'ascii' codec can't encode character '\u03bc' in position 0: ordinal not in range(128)

data = 'μ'.encode('UTF-8')
data.decode('ASCII')

# Raises: UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

Built-in error handlers

Both str.encode and bytes.decode accept an errors argument. If we pass the name of an error handler to this argument, the codec will call that handler whenever it encounters data it cannot convert. The default is strict, which raises an exception. Here are some of the built-in handlers in action.

utf8_bytes = 'Hà Nội'.encode(encoding='utf-8')

utf8_bytes.decode(encoding='ascii', errors='strict')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

utf8_bytes.decode(encoding='ascii', errors='ignore')
# 'H Ni'

utf8_bytes.decode(encoding='ascii', errors='replace')
# 'H�� N���i'

utf8_bytes.decode(encoding='ascii', errors='backslashreplace')
# 'H\\xc3\\xa0 N\\xe1\\xbb\\x99i'

utf8_bytes.decode(encoding='ascii', errors='surrogateescape')
# 'H\udcc3\udca0 N\udce1\udcbb\udc99i'
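The examples above all decode. For completeness, here is a small sketch of the encoding direction, using two built-in handlers that only make sense when encoding (xmlcharrefreplace and backslashreplace), plus a round trip through surrogateescape. The text is written with escape sequences only to make the exact code points explicit.

```python
text = 'H\u00e0 N\u1ed9i'  # 'Hà Nội'

# Replace unencodable characters with XML numeric character references.
as_xml = text.encode('ascii', errors='xmlcharrefreplace')
print(as_xml)        # b'H&#224; N&#7897;i'

# Replace unencodable characters with backslash escape sequences.
as_escapes = text.encode('ascii', errors='backslashreplace')
print(as_escapes)    # b'H\\xe0 N\\u1ed9i'

# surrogateescape keeps undecodable bytes as lone surrogates,
# so encoding with the same handler restores the original bytes.
raw = text.encode('utf-8')
smuggled = raw.decode('ascii', errors='surrogateescape')
restored = smuggled.encode('ascii', errors='surrogateescape')
print(restored == raw)  # True
```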

Custom codecs error handlers

In most cases, the built-in handlers are sufficient. But if we want to implement our own logic to handle an encoding or decoding error, we can write our own handler function.

Use the register_error function to register a new handler

First, we need to register our handler under a name with the codecs.register_error function.

import codecs

def our_handler(e):
    # our error handling logic goes here
    ...

# codecs.register_error('<a name>', <handler function>)
codecs.register_error('our_handler_name', our_handler)

Then we can pass our_handler_name to the errors argument just like a built-in handler.

utf8_bytes.decode(encoding='ascii', errors='our_handler_name')
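To confirm that a registration took effect, the standard library also offers codecs.lookup_error, which returns the function registered under a name and raises LookupError for an unknown name. Below is a minimal sketch; the handler star_handler and the names star_example and no_such_handler are made up for this illustration.

```python
import codecs

def star_handler(e):
    # Hypothetical handler: replace the offending data with '*'.
    return '*', e.end

codecs.register_error('star_example', star_handler)

# lookup_error returns the exact function we registered.
found = codecs.lookup_error('star_example')
print(found is star_handler)  # True

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error('no_such_handler')
    unknown_name_rejected = False
except LookupError:
    unknown_name_rejected = True
print(unknown_name_rejected)  # True
```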

How to create a handler

A handler must satisfy these two conditions.

  • Receive one positional argument, the exception describing the error. When handling an encoding error, that argument is a UnicodeEncodeError. When handling a decoding error, it is a UnicodeDecodeError. Below are their fields.
    • encoding: the name of the encoding being used.
    • object: the string being encoded, or the bytes being decoded.
    • start: the index of the first invalid item in object.
    • end: the index right after the last invalid item in object (exclusive, as in a slice).
    • reason: the error message.
  • Return a tuple with two elements.
    • The first one is the replacement for the characters we can’t encode or the bytes we can’t decode.
    • The second one is the index in object at which to resume encoding/decoding.
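To make these fields concrete, here is a small sketch of a decoding handler that reads the offending bytes from e.object[e.start:e.end] and renders each one as a hex tag. The handler name hex_tag_handler, the registered name hex_tag_example, and the tag format are all invented for this example.

```python
import codecs

def hex_tag_handler(e):
    # This sketch only deals with decoding errors; re-raise anything else.
    if not isinstance(e, UnicodeDecodeError):
        raise e
    # start is inclusive and end is exclusive, so a slice gives the bad bytes.
    bad_bytes = e.object[e.start:e.end]
    replacement = ''.join(f'<{b:02X}>' for b in bad_bytes)
    # Resume right after the bytes we have just replaced.
    return replacement, e.end

codecs.register_error('hex_tag_example', hex_tag_handler)

tagged = b'H\xc3\xa0 N\xe1\xbb\x99i'.decode('ascii', errors='hex_tag_example')
print(tagged)  # H<C3><A0> N<E1><BB><99>i
```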

An example

As a demonstration, we will create a handler to use * as a replacement when encountering an error. We will also log the error details to the console.

import codecs

def our_handler(e):
    if isinstance(e, UnicodeEncodeError):
        print('Encounter an error while encoding a character')
    else: # e is a UnicodeDecodeError
        print('Encounter an error while decoding a byte')

    print(f'encoding: {e.encoding}')
    print(f'object: {e.object}')
    print(f'start: {e.start}')
    print(f'reason: {e.reason}')

    return "*", e.end # resume encoding/decoding after the error location

# register again so that the name points to this new function
codecs.register_error('our_handler_name', our_handler)

Handling an encoding error

We will encode the text Hà Nội using ASCII and handle errors with our_handler.

print('Hà Nội'.encode('ascii', errors='our_handler_name'))

Below is the result.

Encounter an error while encoding a character
encoding: ascii
object: Hà Nội
start: 1
reason: ordinal not in range(128)
Encounter an error while encoding a character
encoding: ascii
object: Hà Nội
start: 4
reason: ordinal not in range(128)
b'H* N*i'

We can see that our handler was invoked for the two characters that ASCII cannot encode, at index 1 (à) and index 4 (ộ). We replaced both of them with *.
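One detail worth guarding against: when several unencodable characters sit next to each other, the codec can report them to the handler as a single error covering the whole run, in which case e.end - e.start is greater than 1. The sketch below, which is an illustration and not part of the handler above, emits one * per character in the reported range so the output length stays the same either way. The names star_per_char and star_per_char_example are hypothetical.

```python
import codecs

def star_per_char(e):
    # This sketch only deals with encoding errors; re-raise anything else.
    if not isinstance(e, UnicodeEncodeError):
        raise e
    # Emit one '*' for every character in the reported range.
    return '*' * (e.end - e.start), e.end

codecs.register_error('star_per_char_example', star_per_char)

# 'Đà Nẵng': the first two characters are both outside ASCII and adjacent.
city = '\u0110\u00e0 N\u1eb5ng'
masked = city.encode('ascii', errors='star_per_char_example')
print(masked)  # b'** N*ng'
```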

Handling a decoding error

Similarly, we will decode the utf8_bytes object created in a previous section with ASCII and our_handler.

print(utf8_bytes.decode(encoding='ascii', errors='our_handler_name'))

Below is the result.

Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 1
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 2
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 5
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 6
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 7
reason: ordinal not in range(128)
H** N***i

A curious thing is that the character à is replaced with ** and ộ is replaced with ***. This is because UTF-8 is a variable-width encoding. The character à was encoded using 2 bytes (\xc3\xa0), and ộ was encoded using 3 bytes (\xe1\xbb\x99). Each time the ASCII codec tries to decode one of those bytes, it encounters an error and our_handler is invoked.
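If we would rather get one * per original character, the handler can return a resume index past e.end and skip the rest of the multi-byte sequence. The sketch below is one possible approach, and it assumes the input bytes are valid UTF-8, since it infers the sequence length from the lead byte alone. The helper utf8_sequence_length and the name star_per_sequence_example are made up for this example.

```python
import codecs

def utf8_sequence_length(lead_byte):
    # Length of a UTF-8 sequence as indicated by its first byte.
    if lead_byte >= 0xF0:
        return 4
    if lead_byte >= 0xE0:
        return 3
    if lead_byte >= 0xC0:
        return 2
    return 1  # ASCII byte or a stray continuation byte

def star_per_sequence(e):
    # This sketch only deals with decoding errors; re-raise anything else.
    if not isinstance(e, UnicodeDecodeError):
        raise e
    length = utf8_sequence_length(e.object[e.start])
    # Never resume past the end of the input.
    resume_at = min(e.start + length, len(e.object))
    return '*', resume_at

codecs.register_error('star_per_sequence_example', star_per_sequence)

data = b'H\xc3\xa0 N\xe1\xbb\x99i'
masked_text = data.decode('ascii', errors='star_per_sequence_example')
print(masked_text)  # H* N*i
```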

Conclusion

If we only work with ASCII text, then the encoding problem in Python is trivial. But most of the time, we need to work with Unicode. In that case, knowing how to handle encoding and decoding errors would be beneficial.

A software developer from Vietnam, currently living in Japan.
