[Shapelib] Re: shapelib improvements

Frank Warmerdam warmerdam at pobox.com
Wed Dec 5 14:15:59 EST 2007


Bram de Greve wrote:
> I'm not really a fan either, but I'm thinking of filenames with Japanese
> characters, or symbols like pi, phi, ...  Granted, it's not something
> *I* would be using (and I assume you neither), but some people might beg
> to differ.  So, the issue at hand is filenames with unicode symbols. 
> The first thing I tried was to encode them into streams of narrow
> characters (MBCS), but that failed to "cover all the unicode characters"
> somehow.  So, then I decided to use the wide character functions instead. 
> 
> This was however quite a long time ago, and now I wonder.  See, I tried
> that encoding using Python routines (we're talking about pyshapelib
> here), since I was doing that successfully on Linux.  See, Python
> somehow "knows" what encoding to use for different filesystems, like
> UTF-8 for Apple, and whatever CODESET the user has set on Linux.  For
> Windows, Python says to use the MBCS codec, which is described as
> "Windows only: Encode operand according to the ANSI codepage (CP_ACP)". 
> That means that depending on the code page the machine is using, it may
> or may not support the unicode characters in question.  So, to really
> access those files with unicode filenames, you need to use _wfopen.

Bram,

Well, I guess what confuses me is that if I use commandline expansion
on the Win32 command shell, it will expand oddly named files (other
character sets) into something that I can pass to fopen() within my
program and open them successfully.  I don't know what magic is involved,
but this has led me to doubt that wide characters are really needed.

> Anyway, if SHPOpenLL would be implemented to allow the 2GB+ files, we
> can easily provide SHPOpenW and shield it off with "windows only"
> macros.  However, I'm not sure what SHPOpenLL would look like.  I assume
> you're talking about things like CreateFile and CreateFileMapping?  If
> so, you can no longer rely on fread and fseek and things like that.  So,
> I'm not sure what a shared implementation would look like.  Helper
> functions that call fread or ReadFile depending on some magical parameter?

I was actually thinking of providing a means to pass hook functions for I/O,
and providing a default implementation using stdio, which likely would
not support large files unless some sort of preprocessor macro
(like _LARGEFILE_SOURCE?) on unix would make it work.  I have a preliminary
implementation in CVS now that uses a hook function structure that
looks like:

typedef void *SAFile;
typedef unsigned long SAOffset;

typedef struct {
     SAFile     (*FOpen) ( const char *filename, const char *access);
     SAOffset   (*FRead) ( void *p, SAOffset size, SAOffset nmemb, SAFile file);
     SAOffset   (*FWrite)( void *p, SAOffset size, SAOffset nmemb, SAFile file);
     SAOffset   (*FSeek) ( SAFile file, SAOffset offset, int whence );
     SAOffset   (*FTell) ( SAFile file );
     int        (*FFlush)( SAFile file );
     int        (*FClose)( SAFile file );

     void       (*Error) ( const char *message );
} SAHooks;

The new function SASetupDefaultHooks will fill this structure
with pointers to functions that just use stdio.

void SHPAPI_CALL SASetupDefaultHooks( SAHooks *psHooks );
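
As a rough illustration only (not the code that is in CVS), a stdio-backed
default could look something like the sketch below.  The Std* helper names
are made up for this example, the SHPAPI_CALL decoration is left out to keep
it self-contained, and the second FOpen argument is assumed to be the fopen()
access mode.

```c
#include <stdio.h>

typedef void *SAFile;
typedef unsigned long SAOffset;

typedef struct {
    SAFile   (*FOpen) (const char *filename, const char *access);
    SAOffset (*FRead) (void *p, SAOffset size, SAOffset nmemb, SAFile file);
    SAOffset (*FWrite)(void *p, SAOffset size, SAOffset nmemb, SAFile file);
    SAOffset (*FSeek) (SAFile file, SAOffset offset, int whence);
    SAOffset (*FTell) (SAFile file);
    int      (*FFlush)(SAFile file);
    int      (*FClose)(SAFile file);
    void     (*Error) (const char *message);
} SAHooks;

/* Hypothetical helpers: each one simply forwards to the stdio call. */
static SAFile StdFOpen(const char *filename, const char *access)
{
    return (SAFile) fopen(filename, access);
}

static SAOffset StdFRead(void *p, SAOffset size, SAOffset nmemb, SAFile file)
{
    return (SAOffset) fread(p, (size_t) size, (size_t) nmemb, (FILE *) file);
}

static SAOffset StdFWrite(void *p, SAOffset size, SAOffset nmemb, SAFile file)
{
    return (SAOffset) fwrite(p, (size_t) size, (size_t) nmemb, (FILE *) file);
}

static SAOffset StdFSeek(SAFile file, SAOffset offset, int whence)
{
    /* fseek() takes a long, so offsets past LONG_MAX are not reachable here. */
    return (SAOffset) fseek((FILE *) file, (long) offset, whence);
}

static SAOffset StdFTell(SAFile file)
{
    return (SAOffset) ftell((FILE *) file);
}

static int StdFFlush(SAFile file)
{
    return fflush((FILE *) file);
}

static int StdFClose(SAFile file)
{
    return fclose((FILE *) file);
}

static void StdError(const char *message)
{
    fprintf(stderr, "%s\n", message);
}

/* Fill the structure with the stdio-backed helpers above. */
void SASetupDefaultHooks(SAHooks *psHooks)
{
    psHooks->FOpen  = StdFOpen;
    psHooks->FRead  = StdFRead;
    psHooks->FWrite = StdFWrite;
    psHooks->FSeek  = StdFSeek;
    psHooks->FTell  = StdFTell;
    psHooks->FFlush = StdFFlush;
    psHooks->FClose = StdFClose;
    psHooks->Error  = StdError;
}
```

An application wanting large file support would fill in the same structure
with its own functions instead of calling SASetupDefaultHooks().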

SHPOpen() works as normal:

SHPHandle SHPAPI_CALL
       SHPOpen( const char * pszShapeFile, const char * pszAccess );

But it now basically just calls SASetupDefaultHooks() and then
calls a low level function:

SHPHandle SHPAPI_CALL
       SHPOpenLL( const char *pszShapeFile, const char *pszAccess,
                  SAHooks *psHooks );

The downside of this approach is that it still assumes that filenames
can be internally manipulated to prepare different extensions, etc.
An alternative might be to have the low level function also
take the filenames of the various component files (.shp, .shx) as
arguments so they don't need to be manipulated internally.  That way
they could actually be wide character strings.
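
To make the "internal manipulation" concrete, here is a minimal sketch of
the kind of narrow-character string handling involved in getting from the
name passed to SHPOpen() to the .shp and .shx names.  The function name
DeriveComponentName is hypothetical and this is not the actual shapelib
code; it only shows why the approach is tied to char strings.

```c
#include <string.h>

/* Hypothetical helper: copy pszShapeFile into pszOut with any existing
 * extension replaced by pszExt (for example ".shp" or ".shx").
 * Returns 1 on success, 0 if the result does not fit in nOutSize bytes. */
int DeriveComponentName(const char *pszShapeFile, const char *pszExt,
                        char *pszOut, size_t nOutSize)
{
    size_t nBase = strlen(pszShapeFile);
    size_t i;

    /* Scan backwards for an extension dot, stopping at a path separator
     * so that a dot in a directory name is not mistaken for one. */
    for (i = nBase; i > 0; i--) {
        char c = pszShapeFile[i - 1];
        if (c == '.') {
            nBase = i - 1;
            break;
        }
        if (c == '/' || c == '\\')
            break;
    }

    if (nBase + strlen(pszExt) + 1 > nOutSize)
        return 0;

    memcpy(pszOut, pszShapeFile, nBase);
    strcpy(pszOut + nBase, pszExt);
    return 1;
}
```

If the caller passed the component names in directly, none of this would be
needed inside the library.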

For GDAL, I'd actually implement my own I/O functions based on my VSI*L
stuff and have shapelib go through those.  They support large files.
And I'd do something to ensure that SAOffset is a 64-bit integer.  Some
mechanism would need to be provided to override this.
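
One possible override mechanism, sketched with a hypothetical
SA_OFFSET_TYPE macro (this name is an assumption for illustration and is
not something shapelib defines), would be to let the build pick the type
and fall back to unsigned long otherwise:

```c
/* SA_OFFSET_TYPE is a hypothetical build-time override, for example
 *   cc -DSA_OFFSET_TYPE="unsigned long long" ...
 * When it is not defined, fall back to the plain unsigned long typedef. */
#ifndef SA_OFFSET_TYPE
#define SA_OFFSET_TYPE unsigned long
#endif

typedef SA_OFFSET_TYPE SAOffset;

/* Compile-time check: the array size is -1 (a compile error) if the chosen
 * type is narrower than the unsigned long default. */
typedef char SAOffsetIsWideEnough[
    (sizeof(SAOffset) >= sizeof(unsigned long)) ? 1 : -1];
```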

>>> (2) Language drivers and code pages DBF:
>>>
>>> (2.a) DBF Files have a language driver id (LDID) indicating the codec
>>> used to store the content.  This is easy, we just need to read the field
>>> (an integer) and add it to the struct, possibly also adding an accessor
>>> function?
>> Sounds reasonable.
>>
>>> (2.b) Shapefiles created by ESRI ArcGIS are sometimes accompanied by a
>>> .CPG (codepage) file, indicating the codec when the LDID is incapable
>>> of identifying it.  This is the only way to support unicode through the
>>> UTF-8 codec.  This requires trying to read an additional .CPG file when
>>> opening a DBF.  The code page string would also be added to the
>>> struct, possibly with an accessor as well.
>> Sounds reasonable.
>>
>> Is there any way to unify the LDID and CPG handling as far as the
>> application is aware?  Instead of application developers having to
>> be aware of the distinction?
> 
> OK, bear in mind however that supporting the .CPG files requires similar
> filename manipulation as for the shapefiles (.SHP and .SHX), so again we
> might have some code duplication here ;)

:-)

> Any unified system would have to rely on strings I guess, as the CPG is
> string based.  So we might embed a table that converts the LDID integers
> to names, and always return a string: if the CPG is set, return that
> one, otherwise return the name of the LDID.

Perhaps if the string is "LDID/<somenumber>" it would be treated as
LDID otherwise as CPG?

It doesn't completely isolate the application from knowing about
LDID, but it keeps the API simpler, and in theory the user could be passing
in the correctly formatted string.
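
A minimal sketch of how that string could be told apart, assuming a
hypothetical helper named ParseLDID (not an existing shapelib function) and
relying on the LDID being a single byte in the DBF header, so only 0-255
is accepted:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the number in *pnLDID when
 * pszCodePage has the form "LDID/<number>" with <number> in 0..255.
 * Returns 0 otherwise, in which case the caller would treat the string
 * as a CPG style code page name. */
int ParseLDID(const char *pszCodePage, int *pnLDID)
{
    static const char szPrefix[] = "LDID/";
    const size_t nPrefix = sizeof(szPrefix) - 1;
    const char *pszNumber;
    char *pszEnd;
    long nValue;

    if (strncmp(pszCodePage, szPrefix, nPrefix) != 0)
        return 0;

    /* Require a digit right after the prefix so that strtol() cannot
     * skip leading whitespace or accept a sign. */
    pszNumber = pszCodePage + nPrefix;
    if (!isdigit((unsigned char) *pszNumber))
        return 0;

    nValue = strtol(pszNumber, &pszEnd, 10);
    if (*pszEnd != '\0' || nValue < 0 || nValue > 255)
        return 0;

    *pnLDID = (int) nValue;
    return 1;
}
```

Anything that does not match, such as "UTF-8", would fall through to the
CPG handling.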

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGeo, http://osgeo.org


