
detecting non-convertibility of characters



On Tue, Jan 29, 2002, at 05:20:41PM +0100, Cyrille Chepelov wrote:

> Hence my intent to test how to detect that \xc2\xab doesn't translate into
> anything in the current locale encoding, and use the ASCII fallback in that
> case. However, for the locales where \xc2\xab is displayable and if we can
> reliably detect it is indeed displayable, IMO we should use it rather than
> ASCII simulacra.

OK, here are the results:
	- test.c is basically a stripped-down, hardcoded-to-latin1 version
of charconv.c. It is encoded in UTF-8. (I hope the test files you sent me
weren't swear words <grin/>. They definitely looked Japanese in my emacs21.)
There are four strings: one latin1 string (expected to convert), and three
that are not expected to convert into latin1 (for various but obvious reasons).
	- test.log is the result of the test, with 2>&1.

	As you can see, unicode_iconv() just bails out (and sets errno) when
the string is not convertible.

I'm thinking about adding a try_charconv_utf8_to_local8() function: it would
take all the code from charconv_utf8_to_local8() up to the test on the result
of unicode_iconv(), and return NULL (but silently!) if the input string can't
be converted to the local charset. This should let us detect whether the « and
» characters are convertible in the current encoding.
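
To make the idea concrete, here is a minimal sketch of such a "silent try" conversion. It is only an illustration under stated assumptions: it uses the POSIX iconv() API rather than libunicode's unicode_iconv(), plain malloc() rather than glib, and the names try_utf8_to_charset() and pick_open_quote() are hypothetical, not functions from charconv.c. It also assumes the target is an 8-bit local charset, so a single NUL byte terminates the result.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from charconv.c): convert a UTF-8 string to
 * `charset`. Returns a malloc()ed string, or NULL without printing anything
 * if the string cannot be converted cleanly. */
char *try_utf8_to_charset(const char *utf8, const char *charset)
{
    assert(utf8 != NULL && charset != NULL);

    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;                        /* charset unknown to iconv */

    size_t inleft = strlen(utf8);
    size_t outleft = inleft * 4 + 16;       /* generous for 8-bit targets */
    char *out = malloc(outleft + 1);        /* +1 for the terminating NUL */
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)utf8;                /* iconv() wants char **; input is not modified */
    char *o = out;

    /* A non-zero result means either an error ((size_t)-1, e.g. EILSEQ) or
     * that some characters were replaced by substitutes: both count as
     * "not convertible" here. */
    size_t res = iconv(cd, &in, &inleft, &o, &outleft);
    if (res == 0)
        res = iconv(cd, NULL, NULL, &o, &outleft);  /* flush any shift state */
    iconv_close(cd);

    if (res != 0 || inleft != 0) {
        free(out);
        return NULL;                        /* silent failure, no warning */
    }
    *o = '\0';
    return out;
}

/* Hypothetical helper: return the UTF-8 opening guillemet if `charset`
 * can represent it, else an ASCII stand-in. */
const char *pick_open_quote(const char *charset)
{
    char *probe = try_utf8_to_charset("\xc2\xab", charset);
    if (probe == NULL)
        return "<<";
    free(probe);
    return "\xc2\xab";
}
```

The caller keeps working in UTF-8 internally and only uses the probe to decide which string to display; the probe result itself is freed right away.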

Problem: I see there's an alternate implementation of
charconv_utf8_to_local8(), which basically delegates to glib1.3. Is this
function silent when presented with "bad" input? Or is it safe to assume
we'll have either HAVE_ICONV or HAVE_UNICODE even in the glib1.3 case,
and use code derived from the older implementation of charconv_utf8_to_local8()?

Now that people are talking about C++0x, I'll probably write to Mr. Sutter so that
the Powers That Be (and Who Talk To The C Committee) seriously plan on adding
#mess, #beware, #horrible and #hell pre-processor directives.

	-- Cyrille

-- 
Grumpf.


#include <glib.h>
#include <stdio.h>   /* printf() in main() */
#include <string.h>
#include <unicode.h>
#include <errno.h>

typedef char utfchar;


int get_local_charset(char **charset) 
{

    char *cst = "ISO-8859-1";
    *charset = cst;
    
    return 0;
}


static unicode_iconv_t conv_u2l = (unicode_iconv_t)0;
static int local_is_utf8 = 0;

static void 
check_conv_u2l(void){
  char *charset = NULL;
  
  if (local_is_utf8 || (conv_u2l!=(unicode_iconv_t)0)) return;
  local_is_utf8 = get_local_charset(&charset);
  if (local_is_utf8) return;

  conv_u2l = unicode_iconv_open(charset,"UTF-8");
}
  
extern gchar *
charconv_utf8_to_local8(const utfchar *utf)
{
  const utfchar *u = utf;
  int uleft;
  gchar *local,*l,*lres;
  int lleft;
  int lost = 0;

  g_assert(utf);
  if (!utf) return NULL; /* GIGO */
  uleft = strlen(utf); /* measured only after the NULL check above */

  check_conv_u2l();
  if (local_is_utf8) return g_strdup(utf);

  lleft = uleft +2;
  l = local = g_malloc(lleft+2);
  *l = 0;
  unicode_iconv(conv_u2l,NULL,NULL,NULL,NULL); /* reset the state machines */
  while ((uleft) && (lleft)) {
    ssize_t res = unicode_iconv(conv_u2l,
                                &u,&uleft,
                                &l,&lleft);
    *l = 0;
    if (res==(ssize_t)-1) {
      g_warning("unicode_iconv(u2l,...) failed, because '%s'",
                strerror(errno));
      break;
    } else {
      lost += (int)res; /* lost chars in the process. */
    }
  }
  lres = g_strdup(local); /* get the actual size. */
  g_free(local); 
  return lres;
}

utfchar *test0_data = "pépère"; // latin1
utfchar *test_data = "これは日本語のテストです";
utfchar *test2_data = "コレハハﾝｶｸｶﾀｶﾅノテストデス";
utfchar *test3_data = "Mélange hétérogène d'€ et de £.";

int main(int argc, char **argv) {
    gchar *charset = NULL;
    
    get_local_charset(&charset);
    printf("Current charset: %s\n",charset);
    
    printf("test0=%s\ntest=%s\ntest2=%s\ntest3=%s\n",
           charconv_utf8_to_local8(test0_data),
           charconv_utf8_to_local8(test_data),
           charconv_utf8_to_local8(test2_data),
           charconv_utf8_to_local8(test3_data));
    return 0;
}

** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'

** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'

** WARNING **: unicode_iconv(u2l,...) failed, because 'Invalid or incomplete multibyte or wide character'
Current charset: ISO-8859-1
test0=pépère
test=
test2=
test3=Mélange hétérogène d'

