Note: the Vietnamese version of this article is available at the link below.
https://duongnt.com/codecs-register-error-vie
Some people might believe that character encoding/decoding is easy or unimportant. That is, until they are hit by one of its many caveats. Fortunately, Python has extensive support for Unicode. And thanks to the codecs.register_error function, we can freely customize the error handling process.
Character encoding primer in Python
The basic concepts
In Python 3, the idea of string and byte representation has been overhauled. We need to understand these main concepts.
- Character: the abstract concept of a symbol in a writing system.
- Code point: a number mapped to a specific character in a standard. Usually, we use the Unicode standard.
- Byte representation: the actual bytes to represent a character in a specific encoding.
- Encoding: the process of converting a code point to its byte representation.
- Decoding: the process of converting a byte representation back to the corresponding code point.
Below are some concrete examples.
Character | Code point | Byte representation (ASCII) | Byte representation (UTF-8) |
---|---|---|---|
1 | U+0031 | \x31 | \x31 |
a | U+0061 | \x61 | \x61 |
μ | U+03BC | Unavailable | \xce\xbc |
ụ | U+1EE5 | Unavailable | \xe1\xbb\xa5 |
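The rows of the table can be reproduced with the built-in ord function and str.encode. This is only a small sketch; the helper name describe is ours and is not part of any library.

```python
def describe(ch):
    """Return the code point label and UTF-8 bytes of a single character."""
    code_point = f'U+{ord(ch):04X}'    # ord gives the code point as an integer
    utf8 = ch.encode('utf-8')          # encoding: code point -> bytes
    assert utf8.decode('utf-8') == ch  # decoding: bytes -> code point
    return code_point, utf8

for ch in '1aμụ':
    print(ch, *describe(ch))
```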
From the examples above, we can see that some code points can be encoded in UTF-8 but are not available in ASCII. This is because ASCII uses only 7 bits to encode a character, while UTF-8 can use up to 4 bytes per character.
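To make the size difference concrete, the sketch below measures how many bytes UTF-8 needs for a few characters. The 4-byte sample (U+1F600) is our own addition and does not appear in the table above.

```python
# 'a', 'μ', 'ụ', and an emoji (U+1F600), written as escapes to avoid ambiguity
samples = ['a', '\u03bc', '\u1ee5', '\U0001F600']

for ch in samples:
    utf8 = ch.encode('utf-8')
    # Only code points below 128 fit into the 7 bits that ASCII uses.
    fits_ascii = ord(ch) < 128
    print(f'U+{ord(ch):04X}: {len(utf8)} byte(s) in UTF-8, fits in ASCII: {fits_ascii}')

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```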
We can’t encode μ with ASCII. And we also can’t decode the UTF-8 byte representation of μ using ASCII.
'μ'.encode('ASCII')
# Throws: UnicodeEncodeError: 'ascii' codec can't encode character '\u03bc' in position 0: ordinal not in range(128)
data = 'μ'.encode('UTF-8')
data.decode('ASCII')
# Throws: UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
Built-in error handlers
Both str.encode and bytes.decode support an errors argument. If we pass the name of an error handler to this argument, the runtime will call that handler when it encounters an error. Here are some of the built-in handlers in action.
utf8_bytes = 'Hà Nội'.encode(encoding='utf-8')
utf8_bytes.decode(encoding='ascii', errors='strict')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
utf8_bytes.decode(encoding='ascii', errors='ignore')
# 'H Ni'
utf8_bytes.decode(encoding='ascii', errors='replace')
# 'H�� N���i'
utf8_bytes.decode(encoding='ascii', errors='backslashreplace')
# 'H\\xc3\\xa0 N\\xe1\\xbb\\x99i'
utf8_bytes.decode(encoding='ascii', errors='surrogateescape')
# 'H\udcc3\udca0 N\udce1\udcbb\udc99i'
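The same built-in handlers also work in the encoding direction. The sketch below is our own illustration rather than part of the original example. It also uses xmlcharrefreplace, which is only available for encoding, and shows that surrogateescape lets us restore the original bytes.

```python
text = 'H\u00e0 N\u1ed9i'  # 'Hà Nội', written with escapes to avoid ambiguity

ignored = text.encode('ascii', errors='ignore')
replaced = text.encode('ascii', errors='replace')
escaped = text.encode('ascii', errors='backslashreplace')
xml_refs = text.encode('ascii', errors='xmlcharrefreplace')

print(ignored)   # b'H Ni'
print(replaced)  # b'H? N?i'
print(escaped)   # b'H\\xe0 N\\u1ed9i'
print(xml_refs)  # b'H&#224; N&#7897;i'

# surrogateescape keeps undecodable bytes as lone surrogates, so encoding
# with the same handler gives back the original bytes.
utf8_bytes = text.encode('utf-8')
smuggled = utf8_bytes.decode('ascii', errors='surrogateescape')
restored = smuggled.encode('ascii', errors='surrogateescape')
print(restored == utf8_bytes)  # True
```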
Custom codecs error handlers
In most cases, the built-in handlers are sufficient. But if we want to implement our own logic to handle an encoding or decoding error, we can write our own handler function.
Use the register_error method to register a new handler
First, we need to add our handler to the codec registry with the register_error function.
import codecs

def our_handler(e):
    # our error handling logic goes here
    ...

# codecs.register_error('<a name>', <handler function>)
codecs.register_error('our_handler_name', our_handler)
Then we can pass our_handler_name to the errors argument just like a built-in handler.
utf8_bytes.decode(encoding='ascii', errors='our_handler_name')
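If we want to check what is registered under a name, the codecs module also provides lookup_error. The sketch below assumes a hypothetical handler called question_mark_handler, registered under a name of our choosing.

```python
import codecs

def question_mark_handler(e):
    # Hypothetical handler: replace the offending data with '?' and resume after it.
    return '?', e.end

codecs.register_error('question_mark', question_mark_handler)

# lookup_error returns the function registered under the given name.
found = codecs.lookup_error('question_mark')
print(found is question_mark_handler)

# An unknown name raises LookupError.
try:
    codecs.lookup_error('no_such_handler')
    unknown_is_rejected = False
except LookupError:
    unknown_is_rejected = True
print(unknown_is_rejected)

decoded = b'H\xc3\xa0'.decode('ascii', errors='question_mark')
print(decoded)
```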
How to create a handler
A handler must satisfy these two conditions.
- Receive one positional argument. When used to handle an encoding error, that argument is a UnicodeEncodeError. And when used to handle a decoding error, that argument is a UnicodeDecodeError. Below are their fields.
  - encoding: the name of the encoding currently being used.
  - object: the string being encoded, or the bytes being decoded.
  - start: the first index where we encounter an error.
  - end: the index right after the last invalid character or byte (exclusive).
  - reason: the error message.
- Return a tuple with two elements.
  - The first one is the replacement for the character we can’t encode or the byte we can’t decode.
  - The second one is the index to resume encoding/decoding.
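The fields above are enough to build a replacement from the offending data itself. As a minimal sketch, the hypothetical handler below (code_point_handler is our own name) slices object with start and end, and replaces each character that cannot be encoded with its code point label.

```python
import codecs

def code_point_handler(e):
    # Sketch: only encoding errors are handled; anything else is re-raised.
    if not isinstance(e, UnicodeEncodeError):
        raise e
    bad = e.object[e.start:e.end]  # end is exclusive, so this is the offending slice
    replacement = ''.join(f'<U+{ord(ch):04X}>' for ch in bad)
    return replacement, e.end      # resume right after the offending slice

codecs.register_error('code_point', code_point_handler)

encoded = 'H\u00e0 N\u1ed9i'.encode('ascii', errors='code_point')  # 'Hà Nội'
print(encoded)  # b'H<U+00E0> N<U+1ED9>i'
```

Because the replacement is returned as a str, the codec still has to encode it, so it should only contain characters that the target encoding supports.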
An example
As a demonstration, we will create a handler that uses * as a replacement when encountering an error. We will also log the error details to the console.
import codecs

def our_handler(e):
    if isinstance(e, UnicodeEncodeError):
        print('Encounter an error while encoding a character')
    else:  # e is a UnicodeDecodeError
        print('Encounter an error while decoding a byte')
    print(f'encoding: {e.encoding}')
    print(f'object: {e.object}')
    print(f'start: {e.start}')
    print(f'reason: {e.reason}')
    return "*", e.end  # resume encoding/decoding after the error location

codecs.register_error('our_handler_name', our_handler)
Handling encoding errors
We will encode the text Hà Nội using ASCII and handle errors with our_handler.
print('Hà Nội'.encode('ascii', errors='our_handler_name'))
Below is the result.
Encounter an error while encoding a character
encoding: ascii
object: Hà Nội
start: 1
reason: ordinal not in range(128)
Encounter an error while encoding a character
encoding: ascii
object: Hà Nội
start: 4
reason: ordinal not in range(128)
b'H* N*i'
We can see that our handler correctly detected the two characters that ASCII cannot encode. These are index 1 (à) and index 4 (ộ). We replaced both of them with *.
Handling decoding errors
Similarly, we will decode the utf8_bytes object created in a previous section with ASCII and our_handler.
print(utf8_bytes.decode(encoding='ascii', errors='our_handler_name'))
Below is the result.
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 1
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 2
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 5
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 6
reason: ordinal not in range(128)
Encounter an error while decoding a byte
encoding: ascii
object: b'H\xc3\xa0 N\xe1\xbb\x99i'
start: 7
reason: ordinal not in range(128)
H** N***i
A curious thing is that the character à is replaced with ** and ộ is replaced with ***. This is because UTF-8 is a variable-width encoding. The character à was encoded using 2 bytes (\xc3\xa0), and ộ was encoded using 3 bytes (\xe1\xbb\x99). Each time we try to decode one of those bytes using ASCII, we encounter an error and our_handler is invoked.
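The difference between the two directions can be checked by counting how often a handler runs. The sketch below uses a hypothetical counting_handler of our own; the numbers in the comments are what CPython's ASCII codec reports for this input.

```python
import codecs

calls = []

def counting_handler(e):
    # Record which slice of the input triggered this call.
    calls.append((type(e).__name__, e.start, e.end))
    return '*', e.end

codecs.register_error('counting', counting_handler)

text = 'H\u00e0 N\u1ed9i'  # 'Hà Nội'
utf8_bytes = text.encode('utf-8')

text.encode('ascii', errors='counting')
encode_calls = len(calls)

calls.clear()
utf8_bytes.decode('ascii', errors='counting')
decode_calls = len(calls)

print(len(text), len(utf8_bytes))  # 6 9
print(encode_calls, decode_calls)  # 2 5
```

Encoding works on 6 characters, 2 of which are outside ASCII. Decoding works on 9 bytes, 5 of which are outside ASCII.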
Conclusion
If we only work with ASCII text, then the encoding problem in Python is trivial. But most of the time, we need to work with Unicode. In that case, knowing how to handle encoding and decoding errors would be beneficial.