Invalid byte sequence for encoding
Posted by Darryll Sulymka on 02/16/2010 02:50:33

I have been working on a project that indexes a large amount of data. I ran into a wall with MySQL. Within an hour of running, it went from around 2-3 seconds per 1000 rows inserted to 40 seconds, and the time continued to rise. At that rate of decline I would be dead by the time it finished indexing the data I needed it to. I did a search and found PostgreSQL. With a few tweaks to my code I was able to get PostgreSQL to process 10,000 records in about 6-15 seconds, and it seems to float back and forth between those times. With the switch I ran into a new problem, one I had not had with MySQL, possibly because I had not processed enough data there, though I think it was a PostgreSQL problem. I would get the error:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x8f
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
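As the hint suggests, one option is to tell the server what encoding the client is actually sending rather than stripping bytes out. Here is a minimal sketch using psycopg2's set_client_encoding; the dbname=indexdb connection string is just a placeholder, and this only helps if the source data really is Latin-1:

Python:
      import psycopg2

      conn = psycopg2.connect("dbname=indexdb")  # placeholder connection string
      # tell the server the client sends Latin-1, so bytes like 0x8f
      # get converted to UTF-8 instead of rejected
      conn.set_client_encoding("LATIN1")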

After some reading I found that the problem came from an attempt to convert a special character to UTF-8. Since I didn't need the special characters, I proceeded to strip them out of the data.

Python:
      subject2 = ""
      for c in subject:
        # replace control characters and non-ASCII bytes with underscores
        if ord(c) > 127 or ord(c) < 9:
          c = "_"
        subject2 = subject2 + c
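The same filter can be written more compactly as a single join; this is just an equivalent sketch, not what the indexer actually ran:

Python:
      subject2 = "".join(c if 9 <= ord(c) <= 127 else "_" for c in subject)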

This lowered the frequency of the error, but every once in a while I would get the message again. For the longest time I wasn't able to figure out why. It turns out I had failed to account for the escape character '\'. If only the error message had come back as invalid byte sequence for encoding "UTF8": 0x8f … \8934 … I might have figured this one out sooner. Aside from this one problem, working with PostgreSQL has been a pleasant experience.
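Knowing that now, a safer approach is to decode the raw bytes up front and let psycopg2 handle the escaping through a parameterized query. This is only a sketch: the sample row, table, and column names are made up, and it assumes the rows arrive as raw byte strings:

Python:
      import psycopg2

      conn = psycopg2.connect("dbname=indexdb")  # placeholder connection string
      cur = conn.cursor()

      raw_subject = b"Re: caf\x8f invoice \\8934"  # hypothetical raw row bytes

      # decode the raw bytes, replacing anything that is not valid UTF-8
      clean_subject = raw_subject.decode("utf-8", "replace")

      # parameterized query: psycopg2 escapes backslashes and quotes itself
      cur.execute("INSERT INTO messages (subject) VALUES (%s)", (clean_subject,))
      conn.commit()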
