[Shapelib] Re: shapelib improvements

Bram de Greve bram.degreve at bramz.net
Thu Dec 6 09:04:08 EST 2007


Frank Warmerdam wrote:
> Bram,
>
> Well, I guess what confuses me is that if I use commandline expansion
> on the Win32 command shell, it will expand oddly named files (other
> character sets) into something that I can pass to fopen() within my
> program and open them successfully.  I don't know what magic is involved,
> but this has led me to doubt that wide characters are really needed.

Fair enough.  As I said, I'm not a fan either, so if we can open them
without needing _wfopen and the lot, I certainly won't complain.  What
OS are you using?

Still, I don't know what encoding to use ...  UTF-8 would be a good
encoding, but it certainly doesn't work with fopen.  If I want to create
a file named with the symbol pi (\u03C0) and I pass the name as a UTF-8
encoded string (pi = \xcf\x80), fopen _will_ accept it, but it creates a
file with the name Ï€ (I hope that comes through, that's \xcf\u20ac).
If I read the documentation right, on Windows fopen takes the filename
in an ANSI code page (CP_ACP) that depends on the user's regional
settings.  In my case I guess that's ANSI 1252, but I'm not sure where I
can check that!  That encoding can't represent the symbol pi (well, the
conversion just turned it into the letter p).  Of course, I might be
missing something here.
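
To make the byte-level effect concrete, here is a small sketch in
portable C.  It does not call any Windows API; the function names are
made up for illustration, and the 0x80-0x9F table is the usual Unicode
mapping for code page 1252 (undefined slots are returned as U+FFFD,
which is my assumption, Windows itself may map them differently).

```c
#include <stddef.h>

/* Encode a code point up to U+FFFF as UTF-8; returns the byte count. */
size_t utf8_encode(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Interpret a single byte as code page 1252 and return its code point. */
unsigned cp1252_to_unicode(unsigned char b)
{
    /* Only 0x80..0x9F differ from Latin-1; 0xFFFD marks undefined slots. */
    static const unsigned high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };
    if (b >= 0x80 && b <= 0x9F)
        return high[b - 0x80];
    return b;
}
```

Encoding U+03C0 gives the two bytes 0xCF 0x80, and reading those two
bytes back as code page 1252 gives U+00CF and U+20AC, which is the
two-character file name described above.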

So, yeah, I think there still might be some issues here ...


> I was actually thinking of providing a means to pass hook functions for io,
> and providing a default implementation using stdio which likely would
> not support large files unless some sort of preprocessor macro
> (like _LARGE_SOURCE?) on unix would make it work.  I have a preliminary
> implementation in CVS now that is using a hook function structure that
> looks like:
>
> typedef void *SAFile;
> typedef unsigned long SAOffset;
>
> typedef struct {
>     SAFile     (*FOpen) ( const char *filename, const char *path);
>     SAOffset   (*FRead) ( void *p, SAOffset size, SAOffset nmemb,
>                           SAFile file);
>     SAOffset   (*FWrite)( void *p, SAOffset size, SAOffset nmemb,
>                           SAFile file);
>     SAOffset   (*FSeek) ( SAFile file, SAOffset offset, int whence );
>     SAOffset   (*FTell) ( SAFile file );
>     int        (*FFlush)( SAFile file );
>     int        (*FClose)( SAFile file );
>
>     void       (*Error) ( const char *message );
> } SAHooks;
>
> The new function SASetupDefaultHooks will fill this structure
> with pointers to functions that just use stdio.
>
> void SHPAPI_CALL SASetupDefaultHooks( SAHooks *psHooks );
>
> The SHPOpen() works as normal:
>
> SHPHandle SHPAPI_CALL
>       SHPOpen( const char * pszShapeFile, const char * pszAccess );
>
> But it now basically just calls SASetupDefaultHooks() and
> calls a low level function:
>
> SHPHandle SHPAPI_CALL
>       SHPOpenLL( const char *pszShapeFile, const char *pszAccess,
>                  SAHooks *psHooks );
>
> The downside of this approach is that it still assumes that filenames
> can be internally manipulated to prepare different extensions, etc.
> An alternative might include having the low level function also
> take the filenames of the various component files (.shp, .shx) as
> arguments so they don't need to be manipulated internally.  That way
> they could actually be wide character strings.
>
> For GDAL, I'd actually implement my own io functions based on my VSI*L
> stuff and have shapelib go through those.  They support large files.
> And I'd do something to ensure that SAOffset is a 64bit integer.   Some
> mechanism would need to be provided to override this.
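
For reference, a rough sketch of how the quoted hook structure could be
filled with plain stdio calls.  This is not the code in CVS: the
typedefs are copied from the message above, while
ExampleSetupStdioHooks and the Stdio* helpers are hypothetical names,
and the second FOpen argument is assumed to be the fopen mode string.

```c
#include <stdio.h>

typedef void *SAFile;
typedef unsigned long SAOffset;

typedef struct {
    SAFile   (*FOpen) (const char *filename, const char *path);
    SAOffset (*FRead) (void *p, SAOffset size, SAOffset nmemb, SAFile file);
    SAOffset (*FWrite)(void *p, SAOffset size, SAOffset nmemb, SAFile file);
    SAOffset (*FSeek) (SAFile file, SAOffset offset, int whence);
    SAOffset (*FTell) (SAFile file);
    int      (*FFlush)(SAFile file);
    int      (*FClose)(SAFile file);
    void     (*Error) (const char *message);
} SAHooks;

/* Each helper forwards to the matching stdio call on a FILE pointer. */
static SAFile StdioOpen(const char *filename, const char *access)
{
    return (SAFile)fopen(filename, access);
}

static SAOffset StdioRead(void *p, SAOffset size, SAOffset nmemb, SAFile f)
{
    return (SAOffset)fread(p, (size_t)size, (size_t)nmemb, (FILE *)f);
}

static SAOffset StdioWrite(void *p, SAOffset size, SAOffset nmemb, SAFile f)
{
    return (SAOffset)fwrite(p, (size_t)size, (size_t)nmemb, (FILE *)f);
}

/* fseek takes a long, so this default is limited to what long can hold. */
static SAOffset StdioSeek(SAFile f, SAOffset offset, int whence)
{
    return (SAOffset)fseek((FILE *)f, (long)offset, whence);
}

static SAOffset StdioTell(SAFile f)
{
    return (SAOffset)ftell((FILE *)f);
}

static int StdioFlush(SAFile f) { return fflush((FILE *)f); }
static int StdioClose(SAFile f) { return fclose((FILE *)f); }

static void StdioError(const char *message)
{
    fprintf(stderr, "%s\n", message);
}

/* Hypothetical stand-in for a default-hooks setup function. */
void ExampleSetupStdioHooks(SAHooks *psHooks)
{
    psHooks->FOpen  = StdioOpen;
    psHooks->FRead  = StdioRead;
    psHooks->FWrite = StdioWrite;
    psHooks->FSeek  = StdioSeek;
    psHooks->FTell  = StdioTell;
    psHooks->FFlush = StdioFlush;
    psHooks->FClose = StdioClose;
    psHooks->Error  = StdioError;
}
```

A caller that needs large files or other filename handling would fill
the same structure with its own functions and pass it to the low level
open instead.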

I like this idea very much:

- If filenames are in multibyte encoding (narrow character), then the
same code can easily handle the filenames in UTF-8 encoding.
- When using CreateFile and company for the 2GB+ support, please use the
Unicode version =)  Yes, I know, that's the wide character version.  But
you can easily convert from narrow characters to wide characters using
MultiByteToWideChar, you're in Windows land after all.  Then you can
specify an encoding like UTF-8 and all Unicode characters will be
supported, and I'll be happy =)
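
A minimal sketch of that conversion, assuming the caller passes the
filename as UTF-8.  To keep it short it goes through _wfopen rather
than CreateFileW, so it shows only the name conversion and not the
2GB+ part; example_fopen_utf8 and utf8_to_wide are hypothetical names,
and outside Windows the sketch simply falls back to fopen.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Convert a NUL-terminated UTF-8 string to a malloc'ed wide string. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* First call asks for the required length, terminator included. */
    int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = (wchar_t *)malloc((size_t)n * sizeof(wchar_t));
    if (w != NULL && MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical helper: open a file whose name is given in UTF-8. */
FILE *example_fopen_utf8(const char *name, const char *mode)
{
#ifdef _WIN32
    wchar_t *wname = utf8_to_wide(name);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *fp = NULL;

    if (wname != NULL && wmode != NULL)
        fp = _wfopen(wname, wmode);
    free(wname);
    free(wmode);
    return fp;
#else
    /* Elsewhere the name bytes are passed to fopen unchanged. */
    return fopen(name, mode);
#endif
}
```

A function with this shape could be plugged in as the FOpen hook, so
the rest of the library keeps working with narrow strings.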

>
>> Any unified system would have to rely on strings I guess, as the CPG is
>> string based.  So we might embed a table that converts the LDID integers
>> to names, and always return as string: if the CPG is set, return that
>> one, otherwise return name of LDID. 
>
> Perhaps if the string is "LDID/<somenumber>" it would be treated as
> LDID otherwise as CPG?
>
> It doesn't really completely isolate the application from knowing about
> LDID but it keeps the API simpler, and in theory the user could be
> passing in the correctly formatted string.
>

Yes, I think this might be a good idea.  Though, it would be nice to
include a bit more information in case of an LDID, but then we would
need an official list of what each LDID stands for, and I haven't really
found one.  Maybe you know of a resource?  But "LDID/<somenumber>" is
definitely something that we could use if nothing better comes up.
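
Telling the two forms apart could look roughly like the sketch below.
example_parse_ldid is a hypothetical name, not an existing shapelib
function, and the 0..255 range check assumes the LDID is stored as a
single byte in the DBF header.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 and stores the number in *ldid when the string has the
 * form "LDID/<number>"; returns 0 otherwise, in which case the caller
 * would treat the string as a CPG name.
 */
int example_parse_ldid(const char *codepage, int *ldid)
{
    static const char prefix[] = "LDID/";
    const size_t n = sizeof prefix - 1;
    char *end;
    long v;

    if (codepage == NULL || strncmp(codepage, prefix, n) != 0)
        return 0;
    /* Require a digit right after the prefix: no sign, no whitespace. */
    if (codepage[n] < '0' || codepage[n] > '9')
        return 0;
    v = strtol(codepage + n, &end, 10);
    /* Reject trailing junk and values that do not fit in one byte. */
    if (*end != '\0' || v > 255)
        return 0;
    *ldid = (int)v;
    return 1;
}
```

With this, "LDID/87" yields the number 87, while a string such as
"UTF-8" is left for the CPG path.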

Bram

