Reading from a file containing UTF-8 (Hindi) text and writing to another file

https://stackoverflow.com/questions/11427103

20-06-2021

Question

I am trying to read characters from a file, remove the punctuation, store the words in an array, and finally write them to another file. The contents of the file are:

"यौ ता बाबू उदयभाहू उपेक्षा औंर अपमान्नकीपीड््ा ढोये जैसेतैस्ये वहबाबाके आश्रम म्पें पहैच गया । बाबा मान्नो उसी की प्रतीक्षा म्पें वैठे थे । वह ज्योही दण्डवत की मुदा म्पें हुभ्रा त्योंही बाबा का गभ्रीर स्वर उसके कानों म्पे टकराया ' आभ्रो, ञैं तुम्हारे लिए ही बैठा हूें । ' अमित न्ने मस्तक ऊैंचा उठाया औंर एकाम्र भाव न्से बाबा को देखता रहा । बाबा के पास वह अनेकों बार आ चुका था परन्तु. आज जैसी व्यथा, थकान्न औंर प्तानता इससे दूर्व नहीं थी आदमी कभ्रीकभी इतना टूट ञाता ड़ँ कि ठसे अपने अस्तिल्द के प्रति भ्री शंका होन्ने लगती न्है वह अनेक विचारों म्पें खो गया उसके नेत्र बाबा कौ देख रहे थे परन्तु उस्यका मन कहीं औंर भ्रटक रद्दा था ।"........

I first tried to read these characters (Hindi, UTF-8) using old Turbo C++ with the plain char data type.

The program compiled, but the contents were not written to the file properly. I then compiled the same code in Visual C++ and got this error:

"Debug assertion failed ... unsigned(c+1) <=256"

Next I tried the wide character data type wchar_t, with the <wchar.h> and <cwchar> headers and the wide character functions, but the output is still not correct: "��त �ྤ��௤ྤ�"

Is there any alternative method to solve this problem?

Please answer with a complete code segment, and also tell me what the alternative to the getline function is for wchar_t. This is what I have tried:

#include <sstream>
#include <iostream>
#include <fstream>
#include <istream>
#include <string>
#include <vector>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#include <wchar.h>
#include <cwchar>
#include <locale.h>
using namespace std;
unsigned char line[1000],storech[2000],storech1[20000];
wchar_t word[50];
std::vector< wchar_t* > storewrd;

void main()
{ 
    FILE * file3 = fopen("H:\\myfile.txt" , "w");
    cout << "check" << endl;
    FILE *stream;
    stream = fopen( "H:\\ocr.txt", "r" );
    setlocale(LC_ALL,"");
    int ch;
    int  test;
    wchar_t temp1;
    wchar_t buffer[500];
    wchar_t temp[500];

    int x=0,j=0;
    do
    {
        int loop = 0;
        ch = fgetwc(stream);

        //read word
        while( (ch != '\n') && (ch != WEOF) )
        {
            buffer[loop] = ch;
            loop++;

            test = fgetwc(stream);
            temp1 = (wchar_t) test;
            if(!iswpunct(test))
                fputwc( test , file3);
            wcout << temp1 << "  ";
        }

        int t;
        if (ch!= WEOF)
        {
            for(t=0;t<loop;t++)
            {
                temp[t] = buffer[t];
            }
            temp[loop++] = '\0';

            j++;
            //cout << buffer[loop] << "  ";
        }
    }while(ch != WEOF);

    cout << "check";

    _getch();
}

Solution

It's not really clear to me what you're trying to do: where did the assertion failure occur? How are you trying to determine whether the characters are punctuation or not?

UTF-8 is a multibyte encoding, which means that the single byte functions like ispunct don't work on it. It is a variable length encoding, however, and all of the characters in the original ASCII code set have single byte encodings. If the only punctuation you are concerned with are characters in the original ASCII, you can “cheat” a bit, and use something like:

if ( (ch & 0x80) == 0 && ispunct( ch ) ) {
    //  is ASCII punctuation
} else {
    //  is something else
}

I put “cheat” in quotes, because one of the goals of Unicode and UTF-8 is that code that looks for things like ASCII punctuation should work unchanged.
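As a minimal sketch of that byte-wise approach, the helper below (stripAsciiPunct is a hypothetical name, not something from the question) copies a UTF-8 string while dropping ASCII punctuation. It converts each byte to unsigned char before calling ispunct; passing a negative char value is a likely source of the "unsigned(c+1) <= 256" assertion seen in Visual C++ debug builds.

```cpp
#include <cctype>
#include <string>

// Removes ASCII punctuation from a UTF-8 encoded string, byte by byte.
// Bytes with the high bit set belong to multibyte sequences (for example
// Devanagari letters) and are copied through unchanged.
std::string stripAsciiPunct(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (char c : utf8) {
        // Work on the unsigned value so ispunct never sees a negative int.
        unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80 && std::ispunct(b)) {
            continue;   // ASCII punctuation: drop it
        }
        out.push_back(c);
    }
    return out;
}
```

Note that this only removes ASCII punctuation such as commas, periods and quotes. The danda (।) that appears in the sample text is not ASCII, so it passes through untouched.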

If you need to recognize more than just ASCII punctuation (e.g. things like «, ¿ or the em dash), and you want to use wchar_t (which is usually, but not always, UTF-16 or UTF-32), and the file is UTF-8, you'll need to use an appropriate locale which does the code translation. In this case, you should definitely use iostream, and not C style IO; iostream will allow you to imbue the stream with the appropriate locale, and C++ locales will allow you to create locales on the fly, by changing a single facet (codecvt, in this case) from another locale (probably the global one). (Under Linux, the user's default locale, particularly in non-English speaking areas, is often a UTF-8 locale, which can be used directly. Under Windows, the default locale is typically tied to a legacy ANSI code page rather than UTF-8, so it will not translate UTF-8 correctly.)

If you don't want to get involved with locales, read your UTF-8 directly into a char buffer, and use the iconv library or something similar to translate it within your program. Be aware, however, that there might be some rare punctuation outside of the basic plane, which will be encoded using two surrogate code units in UTF-16; iswpunct will not work for these if your wchar_t uses UTF-16 (Windows and AIX). (Most of the characters outside the basic plane are CJK or from historic scripts not used today, so this might not be an issue for you.)
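For the route that avoids locales, here is a rough sketch that uses a small hand-written decoder in place of iconv (a swapped-in technique, not what iconv itself does internally). It turns a UTF-8 byte string into UTF-32 code points, so characters outside the basic plane need no surrogates. Both decodeUtf8 and isPunctCodePoint are hypothetical names, and the list of non-ASCII punctuation is only an illustrative assumption; a real program would consult a Unicode library such as ICU.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Malformed or truncated sequences
// yield U+FFFD. Simplified: overlong encodings are not rejected.
std::u32string decodeUtf8(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;              // number of continuation bytes
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < in.size());  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(replacement); ++i; }
    }
    return out;
}

// Illustrative punctuation test: ASCII punctuation plus a few hand-picked
// code points (assumed list, far from complete).
bool isPunctCodePoint(char32_t cp)
{
    if (cp < 0x80) return std::ispunct(static_cast<int>(cp)) != 0;
    switch (cp) {
        case 0x00AB: case 0x00BB:   // guillemets
        case 0x00BF:                // inverted question mark
        case 0x0964: case 0x0965:   // Devanagari danda and double danda
        case 0x2014:                // em dash
            return true;
        default:
            return false;
    }
}
```

Working on code points rather than wchar_t keeps the behaviour the same on platforms where wchar_t is 16 bits and on those where it is 32 bits.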

Other answers

You could try using ICU for this.

Stdio file functions like fwprintf or fputwc convert their output internally to the multibyte encoding of the current locale, even when you use the wide-character variants, so characters that encoding cannot represent are not written correctly. I've had this problem too.

But since your encoding is UTF-8, why don't you read it as plain bytes (as if it were ASCII) and write it back the same way? UTF-8 is designed so that it works with programs that aren't aware they're handling UTF-8 instead of ASCII.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow