[#118] problem with utf8 systems trying to see non utf8 table

Date:
2006-05-15 10:06
Priority:
3
State:
Open
Submitted by:
Didrik Pinte (dpinte)
Assigned to:
Nobody (None)
Category:
none
Version:
none
Resolution:
none
Summary:
problem with utf8 systems trying to see non utf8 table

Detailed description
Here is the error I get :

An unhandled exception occurred:
'utf8' codec can't decode bytes in position 5-6: invalid data
(please report to http://thuban.intevation.org/bugtracker.html)

Traceback (most recent call last):
  File "/home/did/projets/python/thuban/thuban/Thuban/UI/tableview.py", line 74, in GetValue
    record = self.table.ReadRowAsDict(row, row_is_ordinal = 1)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-6: invalid data
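
For reference, the same kind of failure can be reproduced outside Thuban. This is a minimal sketch in Python 3 (the report itself concerns Python 2.3), assuming the DBF field holds ISO-8859-1 bytes; the sample value and the helper name decode_field are made up for illustration:

```python
# Hypothetical field value: "Liège" stored as ISO-8859-1 bytes in the DBF file.
raw = "Li\u00e8ge".encode("iso-8859-1")   # b'Li\xe8ge'


def decode_field(data, encoding):
    """Decode raw bytes, returning None when they do not fit the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE8 announces a multi-byte UTF-8 sequence, but the next byte is plain 'g',
# so the UTF-8 codec rejects the data.
as_utf8 = decode_field(raw, "utf-8")

# The same bytes decode cleanly with the encoding they were written in.
as_latin1 = decode_field(raw, "iso-8859-1")
```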
Messages:
Date: 2007-01-08 15:13
Sender: Bernhard Reiter

I wonder if we should make this a new feature request,
as the problem has probably been there in all versions
and there will never be a perfect solution.

Date: 2007-01-05 23:45
Sender: Didrik Pinte

Patch applied in revision 2718. I am leaving the bug open because it is still a basic workaround and it can introduce bugs for files that are not iso-8859-1 encoded.
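
To illustrate the risk with files in other encodings: ISO-8859-1 assigns a character to every byte value, so decoding with it never raises, even when the data was written in another code page. A small Python 3 sketch with a made-up cp1251 value (not taken from any attached file):

```python
# Hypothetical field value stored in a Cyrillic code page (cp1251).
original = "\u041c\u043e\u0441\u043a\u0432\u0430"   # "Москва"
raw = original.encode("cp1251")

# Decoding as ISO-8859-1 always succeeds, whatever the bytes are...
shown = raw.decode("iso-8859-1")

# ...but the text shown to the user is not the text that was stored.
silently_wrong = (shown != original)

# The bytes themselves survive the round trip; only the interpretation is off.
round_trips = (shown.encode("iso-8859-1") == raw)
```

So with the forced fallback a wrongly encoded file no longer crashes the table view, but it can display incorrect characters without any warning.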

Date: 2007-01-01 21:31
Sender: Bernhard Reiter

Ping Didrik?

Date: 2006-11-06 10:25
Sender: Bernhard Reiter

Hi Didrik,
I think the patch is an improvement, so please commit.

Actually, using the DBF encoding is a good idea, but I am unsure
whether it will help in all cases, and it does not seem to be
a really safe bet.

Still, having a patch that does this would be cool.
Another alternative would be to have a setting that
the user can use in cases where the encoding is unknown
and guessing it from the DBF encoding does not work.
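
One way such a setting could be combined with the DBF encoding is a simple priority order: user setting first, then the encoding from the DBF header, then the current default. A minimal Python 3 sketch; the function name and all three parameters are hypothetical and not part of Thuban:

```python
import codecs


def pick_encoding(user_setting=None, dbf_encoding=None, default="iso-8859-1"):
    """Return the first candidate that names a codec Python knows about.

    user_setting would come from a preferences dialog, dbf_encoding from
    the DBF header, and default is the value the current patch hard-codes.
    """
    for candidate in (user_setting, dbf_encoding, default):
        if not candidate:
            continue
        try:
            # codecs.lookup() raises LookupError for unknown codec names.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    raise ValueError("no usable encoding")
```

With this order, a wrong guess from the DBF header can always be overridden by the user, and a mistyped setting falls through to the next candidate instead of failing.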

Date: 2006-10-03 12:53
Sender: Didrik Pinte

Here is a full patch against all the files. The patch currently forces iso-8859-1.

There are two main options for really fixing the bug:

[1] change the patch to support other default encodings for all opened files.

[2] BETTER:
- update shapelib to support the DBF encoding (see byte 29 of the DBF header: http://www.clicketyclick.dk/databases/xbase/format/dbf.html#DBF_NOTE_5_TARGET).
This seems to be used more and more by current GIS software (ArcGIS since 8.2).
- update pyshapelib so that it converts the output of shapelib to the locale encoding.
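
As a rough idea of what reading that header byte involves, here is a Python 3 sketch that works on the raw file rather than through shapelib. The function names are hypothetical, and the mapping table is partial and illustrative: the values are the ones commonly listed for the language driver ID and should be checked against the format description linked above.

```python
# Partial, illustrative mapping from language driver ID to a Python codec.
LDID_TO_CODEC = {
    0x01: "cp437",    # US MS-DOS
    0x02: "cp850",    # International MS-DOS
    0x03: "cp1252",   # Windows ANSI
    0x57: "cp1252",   # ANSI
    0xC8: "cp1250",   # Eastern European Windows
    0xC9: "cp1251",   # Russian Windows
}


def dbf_codec_from_header(header, default=None):
    """Return a codec name for a DBF header (first 32 bytes), or default."""
    if len(header) < 32:
        raise ValueError("DBF header must be at least 32 bytes")
    ldid = header[29]   # indexing bytes yields an int in Python 3
    return LDID_TO_CODEC.get(ldid, default)


def dbf_codec_from_file(path, default=None):
    """Read the first 32 bytes of a .dbf file and look up its codec."""
    with open(path, "rb") as f:
        return dbf_codec_from_header(f.read(32), default)
```

A value of 0 in byte 29 means no encoding was recorded, in which case the caller still needs a default or a user setting.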

Date: 2006-09-07 13:36
Sender: Didrik Pinte

The bug also affects the "Add or remove labels" tool:

line 57 of Thuban/UI/controls.py :

self.SetStringItem(i, 1, str(value))


To summarize, the following files must be patched in order to process non-utf8 files correctly in a utf8 environment:
- Thuban/UI/controls.py
- Thuban/UI/classgen.py
- Thuban/UI/tableview.py

A solution could be to define a preferred encoding for files that does not depend on the locale.
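
A sketch of what a locale-independent conversion could look like, written for Python 3 (where raw file data is bytes). The helper name to_display_text and its file_encoding parameter are hypothetical; calls such as str(value) in the files listed above would go through something like it instead:

```python
def to_display_text(value, file_encoding="iso-8859-1"):
    """Convert a table value to text for the UI without consulting the locale."""
    if isinstance(value, bytes):
        # Undecodable bytes become U+FFFD instead of raising an exception.
        return value.decode(file_encoding, errors="replace")
    # Numbers and other non-string values keep their usual representation.
    return str(value)
```

The design choice here is that a bad byte shows up as a replacement character in the list or table rather than as an unhandled exception.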

Date: 2006-08-07 14:28
Sender: Didrik Pinte

The bug also affects other parts of the code. For example, when trying to display municipality names in the Properties dialog of a layer, you get the following error if the string is not correctly encoded:
-----------------------------------------------------------
An unhandled exception occurred:
'utf8' codec can't decode bytes in position 2-3: invalid data
(please report to http://thuban.intevation.org/bugtracker.html)

Traceback (most recent call last):
  File "/home/did/projets/python/thuban/trunk/thuban/Thuban/UI/classgen.py", line 761, in _OnRetrieve
    i = self.list_avail.InsertStringItem(index, str(v))
  File "/usr/lib/python2.3/site-packages/wx-2.6-gtk2-unicode/wx/_controls.py", line 4809, in InsertStringItem
    return _controls_.ListCtrl_InsertStringItem(*args, **kwargs)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2-3: invalid data
-----------------------------------------------------------

Date: 2006-08-07 08:40
Sender: Didrik Pinte

Here is the patch.

Date: 2006-08-07 08:34
Sender: Didrik Pinte

I've implemented a first workaround, but it does not cover all possible encodings. It works only for iso-8859-1.

The only change needed is to catch UnicodeError when reading values. If the exception is raised, read the values again and decode them from iso-8859-1.

Thuban/UI/tableview.py : line 74
------------------------------------------------------------
try:
    record = self.table.ReadRowAsDict(row, row_is_ordinal = 1)
except UnicodeError:
    record = dict()
    for (key, val) in self.table.ReadRowAsDict(row, \
                                    row_is_ordinal = 1).items():
        if isinstance(val, str):
            record[key] = val.decode('iso-8859-1')
        else:
            record[key] = val
return record[self.columns[col][0]]
------------------------------------------------------------

This lets you open iso-8859-1 files correctly on a UTF-8 system.

Didrik

Date: 2006-05-15 14:39
Sender: Didrik Pinte

I've attached a shapefile with a polygon dataset.

With the locale set to UTF-8, open the table in Thuban and you will reproduce the bug.

Date: 2006-05-15 12:37
Sender: Bernhard Reiter

Can you point to or attach a simple test case?

Date: 2006-05-15 10:44
Sender: Didrik Pinte

After switching my Gnome session to ISO-8859-1, the table is displayed correctly.

This can be an important bug for Ubuntu users. If I'm not wrong, the default encoding there is UTF-8.

The bug seems to be related to pyshapelib. I'm investigating a way of solving the problem.

Attachments:
Size       Name              Date              By            Download
103 KiB    polygones.tar.gz  2006-05-15 14:39  Didrik Pinte  polygones.tar.gz
879 bytes  utf8.patch        2006-08-07 08:40  Didrik Pinte  utf8.patch
3 KiB      utf8_2709.patch   2006-10-03 12:53  Didrik Pinte  utf8_2709.patch

Field       Old Value             Date              By
File Added  33: utf8_2709.patch   2006-10-03 12:53  Didrik Pinte
File Added  27: utf8.patch        2006-08-07 08:40  Didrik Pinte
File Added  18: polygones.tar.gz  2006-05-15 14:39  Didrik Pinte