Question

Let's say I have this code:

use strict;
use LWP qw ( get );

my $content = get ( "http://www.msn.co.il" );

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?

The website declares its encoding with

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

So why do these characters appear instead of the windows-1255 characters?

Another weird thing is that I have two servers:

The first server returns CP1255 characters, which I can simply convert to UTF-8. The current server gives me these characters instead, and I can't do anything with them.

Is there a configuration file in Apache, Perl, or a module that is messing up the encoding, or forcing something?

On the second server, the Perl file and the headers are all UTF-8. As a result, the content from the example above displays fine on my website (even though it consists of those weird UTF characters), but my own static text that isn't English looks like "×ס'××ר××:"

One more thing that I tested:

Through Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;    

I get UTF-8 encoding.

Through Bash:

curl "http://www.anglo-saxon.co.il"

and here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

# Convert the UTF-8 content to CP1255 ...
my $to_cp1255 = Text::Iconv->new("utf8", "CP1255");
$content = $to_cp1255->convert($content);

# ... and then from CP1255 back to UTF-8.
my $to_utf8 = Text::Iconv->new("CP1255", "utf8");
$content = $to_utf8->convert($content);

Solution

The string of hex values that you gave appears to be UTF-8 encoded. You are getting this because Perl 'likes to' use UTF-8 when it deals with strings. The get() function from LWP::Simple automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting to UTF-8.

You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like

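use Encode qw(encode decode);
...;
# Decode the UTF-8 bytes into characters, then encode those characters as CP1255.
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));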
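
The decoded_content route mentioned above can be tried without a network call. What follows is only a sketch: the hand-built HTTP::Response is a hypothetical stand-in for a real server reply, and the Hebrew word is the one from the question. It assumes HTTP::Message (which provides HTTP::Response) is installed, and it relies on the documented charset => 'none' option of decoded_content.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical stand-in for a server reply: the word from the question,
# encoded as Windows-1255 and labelled as such in the Content-Type header.
my $word        = "\x{05DC}\x{05D4}\x{05D3}\x{05E4}\x{05E1}\x{05D4}";
my $cp1255_body = encode('cp1255', $word);

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1255' ],
    $cp1255_body,
);

# Bytes exactly as they arrived.
my $raw = $res->content;

# Content-Encoding undone and the charset decoded into Perl characters.
my $chars = $res->decoded_content;

# Content-Encoding undone, but the charset left alone (still CP1255 bytes).
my $untouched = $res->decoded_content(charset => 'none');

printf "raw bytes:       %s\n", unpack('H*', $raw);
printf "charset => none: %s\n", unpack('H*', $untouched);
printf "decoded chars:   %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
```

With a real request, the same two calls would be made on the object returned by LWP::UserAgent's get.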

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
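
To make the effect of mislabelled bytes concrete, here is a small sketch using only the core Encode module. The Hebrew word is an arbitrary example chosen for illustration, and ISO-8859-1 stands in for whatever single-byte Western encoding the reader of the stream happens to assume.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Arbitrary four-letter Hebrew word, chosen only for illustration.
my $text = "\x{05E9}\x{05DC}\x{05D5}\x{05DD}";

# The same characters as two different byte sequences.
my $utf8_bytes   = encode('UTF-8',  $text);
my $cp1255_bytes = encode('cp1255', $text);

printf "UTF-8:  %d bytes (%s)\n", length($utf8_bytes),   unpack('H*', $utf8_bytes);
printf "CP1255: %d bytes (%s)\n", length($cp1255_bytes), unpack('H*', $cp1255_bytes);

# Misreading the UTF-8 bytes as a single-byte Western encoding turns every
# 0xD7 lead byte into U+00D7, the multiplication sign seen in the garbage.
my $misread = decode('ISO-8859-1', $utf8_bytes);
my $times   = () = $misread =~ /\x{D7}/g;
printf "multiplication signs after misreading: %d\n", $times;

# Reading the bytes with the encoding they were written in restores the text.
my $round_trip = decode('UTF-8', $utf8_bytes);
print "round trip intact\n" if $round_trip eq $text;
```

The point of the sketch is that neither byte sequence is wrong on its own; the garbage only appears when the label and the bytes disagree.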

Other tips

http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
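
The claim that those bytes are valid UTF-8 can be checked directly. This is a minimal sketch with the core Encode module; the variable names are illustrative, and FB_CROAK is used so that decode dies on malformed input instead of silently substituting characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The byte string quoted in the question.
my $bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Strict decode: dies if the bytes are not well-formed UTF-8.
# LEAVE_SRC keeps decode from modifying $bytes in place.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "%d bytes decode to %d characters\n", length($bytes), length($chars);
printf "U+%04X\n", ord($_) for split //, $chars;

# The same characters expressed as Windows-1255 bytes, for comparison.
my $cp1255_bytes = encode('cp1255', $chars);
printf "as CP1255: %s\n", unpack('H*', $cp1255_bytes);
```

Every code point printed falls in the Hebrew block (U+05D0 to U+05EA), which is consistent with the bytes being UTF-8 rather than UTF-16 or Windows-1255.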

I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.

First, note that you should import get from LWP::Simple. Second, everything works fine with:

#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';

which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
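
One way to see the role of the filehandle is to write the same characters through two different encoding layers. The sketch below uses in-memory filehandles so the resulting bytes can be inspected; bytes_written is a hypothetical helper introduced for this illustration, not part of any module.

```perl
use strict;
use warnings;

# The word from the question, as Perl characters (not bytes).
my $chars = "\x{05DC}\x{05D4}\x{05D3}\x{05E4}\x{05E1}\x{05D4}";

# Hypothetical helper: print $text to an in-memory handle opened with the
# given encoding layer and return the bytes that were actually written.
sub bytes_written {
    my ($encoding, $text) = @_;
    my $buffer = '';
    open(my $fh, ">:encoding($encoding)", \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

my $as_utf8   = bytes_written('UTF-8',  $chars);
my $as_cp1255 = bytes_written('cp1255', $chars);

printf "UTF-8 handle wrote:  %s\n", unpack('H*', $as_utf8);
printf "CP1255 handle wrote: %s\n", unpack('H*', $as_cp1255);
```

For an already open handle such as STDERR or STDOUT, the equivalent step is a call like binmode(STDERR, ':encoding(UTF-8)') before printing, so that the layer matches what the terminal or the HTTP header declares.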

License: CC-BY-SA with attribution
Not affiliated with StackOverflow