Subject: detecting non-convertibility of characters
Date: Tue, 29 Jan 2002 21:41:08 +0100
On Tue, Jan 29, 2002, at 05:20:41PM +0100, Cyrille Chepelov wrote:
> Hence my intent to test how to detect that \xc2\xab doesn't translate into
> anything in the current locale encoding, and use the ASCII fallback in that
> case. However, for the locales where \xc2\xab is displayable and if we can
> reliably detect it is indeed displayable, IMO we should use it rather than
> ASCII simulacra.
OK, here are the results:
- test.c is basically a stripped-down, hardcoded-to-latin1 version
of charconv.c. It's encoded in UTF-8. (I hope the test files you sent me
weren't swear words <grin/>. They definitely looked Japanese in my emacs21.)
There are four strings: one latin1 (expected to convert), and three which
are not expected to convert into latin1 (for various but obvious reasons).
- test.log is the result of the test, with 2>&1.
As you can see, unicode_iconv() just bails out (and sets errno) when
the string is not convertible.
I'm thinking about adding a try_charconv_utf8_to_local8() function, taking
all the code from charconv_utf8_to_local8() up to the test on the result
of unicode_iconv(), and letting it return NULL (but silently!) if the input
string can't be converted to the local charset. This should let us detect
whether the « and » characters are convertible in the current encoding.
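
A minimal sketch of such a silent "try" conversion, for illustration only. It
assumes POSIX iconv() rather than libunicode's unicode_iconv(), and the name
try_utf8_to_charset() and its charset parameter are hypothetical; nothing here
is taken from charconv.c.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a UTF-8 string to `charset`.
 * Returns a malloc()ed, NUL-terminated string on success, or NULL
 * (without printing anything) if the string cannot be converted. */
char *try_utf8_to_charset(const char *utf8, const char *charset)
{
    assert(utf8 != NULL && charset != NULL);

    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;                      /* unknown charset */

    size_t inleft = strlen(utf8);
    size_t outleft = 4 * inleft + 8;      /* generous for 8-bit targets */
    char *out = malloc(outleft + 1);      /* +1 for the terminating NUL */
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)utf8;              /* iconv() does not modify the input */
    char *o = out;

    /* (size_t)-1 signals an error; a positive count means some characters
     * were replaced by substitutes. Both count as "not convertible" here. */
    size_t res = iconv(cd, &in, &inleft, &o, &outleft);
    int ok = (res == 0 && inleft == 0);

    /* Flush any pending shift sequence for stateful encodings. */
    if (ok && iconv(cd, NULL, NULL, &o, &outleft) == (size_t)-1)
        ok = 0;

    iconv_close(cd);
    if (!ok) {
        free(out);
        return NULL;                      /* silent failure, no warning */
    }
    *o = '\0';
    return out;
}
```

With a helper like this, a NULL result for "\xc2\xab" would be the cue to fall
back to the ASCII replacement, while a non-NULL result means the character can
be used as is. The caller is responsible for free()ing the returned string.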
Problem: I see there's an alternate implementation of
charconv_utf8_to_local8(), which basically delegates to glib1.3. Is this
function silent when presented with "bad" input? Or is it safe to assume
we're going to have either HAVE_ICONV or HAVE_UNICODE even in the glib1.3 case,
and use code derived from the older implementation of charconv_utf8_to_local8()?
Now that people are talking about C++0x, I'll probably write to Mr. Sutter so
that the Powers That Be (and Who Talk To The C Committee) seriously plan on adding
#mess, #beware, #horrible and #hell pre-processor directives.
-- Cyrille
--
Grumpf.
#include <glib.h>
#include <stdio.h>   /* printf() */
#include <string.h>
#include <unicode.h>
#include <errno.h>

typedef char utfchar;

/* Test stub: pretend the local charset is always latin1 (and never UTF-8). */
int get_local_charset(char **charset)
{
  char *cst = "ISO-8859-1";
  *charset = cst;
  return 0;
}

static unicode_iconv_t conv_u2l = (unicode_iconv_t)0;
static int local_is_utf8 = 0;

/* Lazily open the UTF-8 -> local charset converter. */
static void
check_conv_u2l(void){
  char *charset = NULL;

  if (local_is_utf8 || (conv_u2l!=(unicode_iconv_t)0)) return;
  local_is_utf8 = get_local_charset(&charset);
  if (local_is_utf8) return;

  conv_u2l = unicode_iconv_open(charset,"UTF-8");
}

extern gchar *
charconv_utf8_to_local8(const utfchar *utf)
{
  const utfchar *u = utf;
  int uleft;
  gchar *local,*l,*lres;
  int lleft;
  int lost = 0;

  g_assert(utf);
  if (!utf) return NULL; /* GIGO */
  uleft = strlen(utf); /* measured only once utf is known to be non-NULL */

  check_conv_u2l();
  if (local_is_utf8) return g_strdup(utf);

  lleft = uleft +2;
  l = local = g_malloc(lleft+2);
  *l = 0;
  unicode_iconv(conv_u2l,NULL,NULL,NULL,NULL); /* reset the state machines */
  while ((uleft) && (lleft)) {
    ssize_t res = unicode_iconv(conv_u2l,
                                &u,&uleft,
                                &l,&lleft);
    *l = 0; /* keep the partial output NUL-terminated */
    if (res==(ssize_t)-1) {
      g_warning("unicode_iconv(u2l,...) failed, because '%s'",
                strerror(errno));
      break;
    } else {
      lost += (int)res; /* lost chars in the process. */
    }
  }
  lres = g_strdup(local); /* get the actual size. */
  g_free(local);
  return lres;
}

utfchar *test0_data = "pépère"; // latin1
utfchar *test_data = "これは日本語のテストです";
utfchar *test2_data = "コレハハンカクカタカナノテストデス";
utfchar *test3_data = "Mélange hétérogène d'€ et de £.";

int main(int argc, char **argv) {
  gchar *charset = NULL;

  get_local_charset(&charset);
  printf("Current charset: %s\n",charset);
  printf("test0=%s\ntest=%s\ntest2=%s\ntest3=%s\n",
         charconv_utf8_to_local8(test0_data),
         charconv_utf8_to_local8(test_data),
         charconv_utf8_to_local8(test2_data),
         charconv_utf8_to_local8(test3_data));
  return 0;
}

** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'
** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'
** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'
Current charset: ISO-8859-1
test0=pépère
test=
test2=
test3=Mélange hétérogène d'