Question

With PHP, I'm trying to determine the length (number of characters) of strings such as these:

1
1.1
1.1.1
1.1.2
1.1.3
1.1.3.1
1.1.3.2
1.1.4
1.1.5
1.1.6
1.1.7

etc.

When the length of these strings is measured with mb_strlen() or strlen(), the results are:

--------------------------------
value   | mb_strlen() | strlen()
--------------------------------
1       | 1           | 1
1.1     | 5           | 5
1.1.1   | 9           | 9
1.1.1.1 | 13          | 13
1.1.1.2 | 13          | 13
1.1.1.3 | 13          | 13
--------------------------------

It appears that "." is being counted as 3 characters. I could write a small function to compensate for the predictable "miscount", but I would like to know why "." is counted as 3 characters in the first place.

I have already looked in several places, including this SO article, and read the article it mentions, adding the suggested conversions to the page:

mb_language('uni');
mb_internal_encoding('UTF-8');
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

What gives?

EDIT: The strings are imported as part of a CSV import.

Here is the code:

<?php
    $f = fopen("s2db.csv", "r");
    while (($line = fgetcsv($f)) !== false) {

            $colcount = 0;
            foreach ($line as $cell) {
                // Get the cells into variables first.
                // There are only six columns, so just count.
                switch ($colcount) {
                    case '0':
                        $item = $cell;
                        break;
                    case '1':
                        $itemtitle = htmlspecialchars($cell);
                        break;
                    case '2':
                        $itemsubject = htmlspecialchars($cell);
                        break;
                    case '3':
                        $itemnumber = htmlspecialchars($cell);
                        break;
                    case '4':
                        $itemqty = htmlspecialchars($cell);
                        break;
                    case '5':
                        $itemfilename = htmlspecialchars($cell);
                        break;                    
                }
                $colcount++;
            }
            $itemlen = strlen($item);
            echo "Value = " . $item . " | strlen() Length = " . $itemlen .  "|  mb_strlen() = " . mb_strlen($item) . "<br>";
    }
?>

Here are the results:

Value = 1 | strlen() Length = 3| mb_strlen() = 3
Value = 1.1 | strlen() Length = 7| mb_strlen() = 7
Value = 1.1.1 | strlen() Length = 11| mb_strlen() = 11
Value = 1.1.1.1 | strlen() Length = 15| mb_strlen() = 15
Value = 1.1.1.2 | strlen() Length = 15| mb_strlen() = 15
Value = 1.1.1.3 | strlen() Length = 15| mb_strlen() = 15
Value = 1.1.1.3.1 | strlen() Length = 19| mb_strlen() = 19
Value = 1.1.1.3.2 | strlen() Length = 19| mb_strlen() = 19
Value = 1.1.1.3.3 | strlen() Length = 19| mb_strlen() = 19
Value = 1.1.1.4 | strlen() Length = 15| mb_strlen() = 15

SOLUTION:

I gave @hek2mgl the vote because his hexdump helped me determine that I wasn't crazy and it really was counting the "." as 3, as shown here.

Nothing I can do about the import format, so I'm just going to add code to compensate:

Thanks everyone for the help!
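
The compensating code itself is not shown above. Purely as an illustration of the idea, and not the asker's actual code, here is a minimal sketch that rebuilds the item number from its digits before measuring it. The function name normalizeItemNumber() is hypothetical, and the sketch assumes the digits are plain ASCII 0-9 and that anything between them stands for a single dot.

```php
<?php

// Hypothetical helper (not from the original post): rebuild a dotted item
// number from its ASCII digits, ignoring whatever bytes sit between them.
// Assumption: the digits themselves are plain ASCII 0-9.
function normalizeItemNumber($raw)
{
    // Collapse every run of non-digit bytes into a single ASCII dot...
    $dotted = preg_replace('/[^0-9]+/', '.', $raw);

    // ...then drop any dot left at the start or end of the string.
    return trim($dotted, '.');
}

// "1", a 3-byte lookalike dot (U+FF0E encoded as UTF-8), "1"
$lookalike = "1\xEF\xBC\x8E1";
// "1.1" with a stray NUL byte between the characters
$nulPadded = implode("\x00", str_split('1.1'));

echo strlen($lookalike), ' -> ', strlen(normalizeItemNumber($lookalike)), "\n"; // 5 -> 3
echo strlen($nulPadded), ' -> ', strlen(normalizeItemNumber($nulPadded)), "\n"; // 5 -> 3
```

If the digits themselves turn out to be non-ASCII, this approach would not work, so it is worth looking at the raw bytes first.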


Solution

I got:

<?php

$str = '1.1.1';
var_dump(mb_strlen($str, 'utf-8'));  // 5
var_dump(strlen($str));              // 5

as expected. It seems the "." in your case isn't a regular dot but a special Unicode character. Please show a hexdump of your input data. You can use Hexdump (I wrote the package for such cases):

Installation:

sudo pear channel-discover www.metashock.de/pear
sudo pear install metashock/Hexdump

Usage:

<?php

require_once 'Hexdump.php';
hexdump('1.1.1');

It would be interesting to see what the real characters behind the scenes are.
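
If installing a PEAR package is not an option, a rough byte dump can also be produced with built-in functions only. The sketch below is an assumption-laden illustration: hexBytes() is a hypothetical helper, and the "lookalike" string uses U+FF0E (a full-width full stop) only as an example of a character that looks like a dot but takes 3 bytes in UTF-8. The real input may contain something else entirely.

```php
<?php

// Hypothetical helper: show each byte of a string as two hex digits,
// separated by spaces, using only built-in functions.
function hexBytes($value)
{
    return implode(' ', str_split(bin2hex($value), 2));
}

$plain     = '1.1';              // ASCII full stop, byte 0x2e
$lookalike = "1\xEF\xBC\x8E1";   // assumed example: U+FF0E encoded as UTF-8

echo hexBytes($plain), "\n";      // 31 2e 31
echo hexBytes($lookalike), "\n";  // 31 ef bc 8e 31
```

Calling hexBytes($item) inside the CSV loop, right next to the strlen() output, would show exactly which bytes each imported cell contains.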

Other tips

Where do your variables come from? Could you please show us the real code (instead of pseudocode)?

I tried to reproduce the described behavior and couldn't. Here is the test I ran:

$strArray = array(
    '.',
    '1',
    '1.1',
    '1.1.1',
    1,
    1.1,
);

for ($i = 0; $i<count($strArray); ++$i) {
    print "{$strArray[$i]} -> strlen: ".strlen($strArray[$i])." <br/>";
    print "{$strArray[$i]} -> mb_strlen: ".mb_strlen($strArray[$i])." <br/>";
    print '<br>';  
}

This outputs:

. -> strlen: 1 
. -> mb_strlen: 1 

1 -> strlen: 1 
1 -> mb_strlen: 1 

1.1 -> strlen: 3 
1.1 -> mb_strlen: 3 

1.1.1 -> strlen: 5 
1.1.1 -> mb_strlen: 5 

1 -> strlen: 1 
1 -> mb_strlen: 1 

1.1 -> strlen: 3 
1.1 -> mb_strlen: 3

as expected

I know this isn't an answer, but I'm posting it here for code formatting reasons.

The following, saved in a UTF-8 file, on my setup...

<?php

echo 'mbstring.internal_encoding: '    . ini_get( 'mbstring.internal_encoding' ) . "\r\n";
echo 'mbstring.func_overload: '        . ini_get( 'mbstring.func_overload' ) . "\r\n";
echo 'mbstring.language: '             . ini_get( 'mbstring.language' ) . "\r\n";
echo 'mbstring.strict_detection: '     . ini_get( 'mbstring.strict_detection' ) . "\r\n";
echo 'mbstring.substitute_character: ' . ini_get( 'mbstring.substitute_character' ) . "\r\n";
echo 'mbstring.detect_order: '         . ini_get( 'mbstring.detect_order' ) . "\r\n";
echo 'mbstring.encoding_translation: ' . ini_get( 'mbstring.encoding_translation' ) . "\r\n";
echo "\r\n";

function outputLengths( $sString )  {
    echo( "mb_strlen('$sString', 'utf-8') = " . mb_strlen($sString, 'utf-8')  ."\r\n" );
    echo( "strlen('$sString') = " . strlen($sString)  ."\r\n\r\n" );
}

outputLengths( '1' );
outputLengths( '1.1' );
outputLengths( '1.1.1' );
outputLengths( '1.1.3.1' );

Outputs:

mbstring.internal_encoding: UTF-8
mbstring.func_overload: 0
mbstring.language: neutral
mbstring.strict_detection: 0
mbstring.substitute_character:
mbstring.detect_order:
mbstring.encoding_translation: 0

mb_strlen('1', 'utf-8') = 1
strlen('1') = 1

mb_strlen('1.1', 'utf-8') = 3
strlen('1.1') = 3

mb_strlen('1.1.1', 'utf-8') = 5
strlen('1.1.1') = 5

mb_strlen('1.1.3.1', 'utf-8') = 7
strlen('1.1.3.1') = 7

What do you get?
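
The settings matter because mb_strlen() falls back to the internal encoding when no encoding argument is passed, so the same bytes can give different counts. A small sketch of that effect follows; the sample string is an assumption chosen for illustration (the UTF-8 bytes of U+FF0E), not data from the question.

```php
<?php

// Assumed sample input: the UTF-8 bytes of U+FF0E, a full-width full stop.
$dot = "\xEF\xBC\x8E";

var_dump(strlen($dot));                   // int(3): strlen() always counts bytes
var_dump(mb_strlen($dot, 'UTF-8'));       // int(1): one code point
var_dump(mb_strlen($dot, 'ISO-8859-1'));  // int(3): each byte treated as one character
```

When mb_strlen() and strlen() report the same inflated number, one thing to check is whether the string is really being interpreted in the encoding it was written in.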

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow