regex - PHP preg_replace() fails when a non UTF8 Character is detected

My PHP regular expression fails when a non-UTF-8 character is found!

I need to go through 40,000 database records and grab a width and height value from a custom_size MySQL table field.

The field comes in all sorts of different random formats.

The most reliable way is to grab the numeric values on the left and right side of the x, and strip all non-numeric characters from them.

The code below works pretty well 99% of the time, until it found a few records with non-UTF-8 characters.

31*32 and 35”x21” are 2 examples.

When these are run, I get these PHP errors and the script halts:

Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 1683977065 on line 21
Warning: preg_match(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 24

Demo:

<?php

$strings = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    '1"x1"',
    '31*32',
    '35”x21”'
);

foreach ($strings as $string) {
    if ($string != '') {
        $string = str_replace('”', '"', $string);

        // Strip out all characters except numbers, the letter x, and decimal points
        $string = preg_replace( '/([^0-9x\.])/ui', '', strtolower( $string ) );

        // Find anything that fits the number x number format
        preg_match( '/([0-9]+(\.[0-9]+)?)x([0-9]+(\.[0-9]+)?)/ui', $string, $values );

        echo 'original value: ' .$string.'<br>';
        echo 'width: ' .$values[1].'<br>';
        echo 'height: ' .$values[3].'<br><hr><br>';
    }
}

Any ideas on how to get around this? I cannot rebuild the server software to add support.

I just found this answer with a PHP library that converts to UTF-8, and it seems to be helping a lot: https://stackoverflow.com/a/3521396/143030

By default, the PCRE regex engine reads a string one byte at a time. So, by default, it ignores that byte sequences may compose a single character when a multibyte encoding like UTF-8 is in use, and it sees them as separate bytes (one byte, one character).

For example, the character U+201D: RIGHT DOUBLE QUOTATION MARK uses 3 bytes in UTF-8:

$a = '”';

for ($i = 0; $i < strlen($a); $i++) {
    echo dechex(ord($a[$i])), ' ';
}

Result:

e2 80 9d

To enable multibyte reading in the PCRE regex engine, you can either use one of these directives at the beginning of the pattern: (*UTF), (*UTF8), (*UTF16), (*UTF32), or the u modifier (which switches on the available multibyte mode and also extends the meaning of shorthand character classes like \s, \d, \w... to Unicode. In other words, the u modifier is a shortcut for (*UTFx) plus (*UCP), which changes the character classes.)

But these features are only available if the PCRE module has been compiled with support for these encodings. (This is the case for most default PHP installations, but it isn't absolutely systematic or mandatory.)
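One way to check for this at runtime, rather than discovering it from a warning, is to try compiling a trivial pattern with the u modifier. This is only a sketch: the helper name pcreHasUtf8Support() is made up for illustration, and it relies on preg_match() returning false when a pattern fails to compile.

```php
<?php
// Hypothetical helper (not part of PHP or of the original answer): reports
// whether the PCRE library behind preg_* accepts the u modifier.
function pcreHasUtf8Support()
{
    // Without UTF-8 support the pattern fails to compile: preg_match()
    // emits a warning (silenced here with @) and returns false.
    // "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9, written as raw
    // bytes so the result does not depend on how this file is saved.
    return @preg_match('/^.$/u', "\xC3\xA9") === 1;
}

// Decide once which modifier can be used when building patterns.
$modifier = pcreHasUtf8Support() ? 'u' : '';
echo 'UTF-8 mode available: ', ($modifier === 'u' ? 'yes' : 'no'), "\n";
```

On a build like the one in the question, this would be expected to print "no" instead of halting the script.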

It seems that this isn't the case for you, since when you use the u modifier, you obtain an explicit message:

this version of PCRE is not compiled with PCRE_UTF8 support

You can't do anything about that, unless you decide to change your PHP installation for one with a PCRE module compiled with UTF-8 support.

However, it isn't a problem in your case, because in your patterns the u modifier is totally useless, even if your input is UTF-8 encoded.

The reason is that your two patterns only use ASCII literal characters (characters in the 00-7F range), and characters beyond the ASCII range in UTF-8 encoding never use bytes from this range:

Unicode  Char  UTF-8   Name
--------------------------------------------------------
U+007D    }    7D      RIGHT CURLY BRACKET
U+007E    ~    7E      TILDE
U+007F         7F      <control>
U+0080         C2 80   <control>
U+0081         C2 81   <control>
...
U+00BE    ¾    C2 BE   VULGAR FRACTION THREE QUARTERS
U+00BF    ¿    C2 BF   INVERTED QUESTION MARK
U+00C0    À    C3 80   LATIN CAPITAL LETTER A WITH GRAVE
U+00C1    Á    C3 81   LATIN CAPITAL LETTER A WITH ACUTE
...
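The property shown in the table can be checked byte by byte. The sketch below uses a made-up helper, hasNoAsciiBytes(), and writes the sample characters as raw UTF-8 bytes so that the file encoding does not matter.

```php
<?php
// Hypothetical helper: returns true when no byte of $char lies in the
// ASCII range 00-7F.
function hasNoAsciiBytes($char)
{
    for ($i = 0; $i < strlen($char); $i++) {
        if (ord($char[$i]) < 0x80) {
            return false;
        }
    }
    return true;
}

$samples = array(
    "\xC2\xBE",     // U+00BE vulgar fraction three quarters
    "\xC3\x80",     // U+00C0 latin capital letter A with grave
    "\xE2\x80\x9D", // U+201D right double quotation mark
);

foreach ($samples as $char) {
    echo bin2hex($char), ': ',
        (hasNoAsciiBytes($char) ? 'no ASCII bytes' : 'contains ASCII bytes'), "\n";
}
```

Because none of these bytes can be mistaken for a digit, a dot or the letter x, a byte-mode character class such as [^0-9x.] treats every one of them as something to remove.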

So you can write:

$string = preg_replace( '/[^0-9x.]+/', '', strtolower( $string ) );

(There is no need to use the i modifier since the string is already lowercase. There is no need to escape the dot in a character class, nor to use a capture group. Adding the + quantifier speeds up the replacement since several consecutive characters are removed in one replacement, instead of one by one.)

And:

if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $string, $values)) {
    echo 'original value: ', $string, '<br>';
    echo 'width: ', $values[1], '<br>';
    echo 'height: ', $values[2], '<br><hr><br>';
}
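Putting the two byte-mode patterns together, the whole extraction could look roughly like the sketch below. The function name parseDimensions() and the returned array keys are assumptions made for illustration; only the two patterns come from the answer.

```php
<?php
// Hypothetical helper combining the two patterns above, without the u modifier.
// Returns array('width' => ..., 'height' => ...) as strings, or null when no
// "number x number" pair is found.
function parseDimensions($raw)
{
    // Keep only digits, the letter x and dots. Every other byte is dropped,
    // including each byte of a multibyte character such as the curly quote.
    $clean = preg_replace('/[^0-9x.]+/', '', strtolower($raw));

    if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $clean, $m)) {
        return array('width' => $m[1], 'height' => $m[2]);
    }
    return null;
}

// The curly quotes are written as raw UTF-8 bytes (E2 80 9D).
$samples = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    "35\xE2\x80\x9Dx21\xE2\x80\x9D",
    '31*32',
);

foreach ($samples as $s) {
    $d = parseDimensions($s);
    echo $s, ' => ', ($d === null ? 'no match' : $d['width'] . ' x ' . $d['height']), "\n";
}
```

Returning null for values such as 31*32, which contain no x, leaves it to the caller to decide how to handle records that do not fit the format, instead of reading undefined array offsets.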

However, this can be dangerous with other patterns. For example, the following does not remove the first character as expected if it is encoded on several bytes; it removes only the first byte of the character:

$a = preg_replace('/^./', '', '”abc');

for ($i = 0; $i < strlen($a); $i++) {
    echo ' ', dechex(ord($a[$i]));
}

This returns:

 80 9d 61 62 63 # �  �  a  b  c
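If a pattern like this is needed on a build without UTF-8 support, one possible workaround (not from the answer above) is to describe a whole UTF-8 sequence with byte ranges, so that the dot is no longer used. The sketch below assumes the input is valid UTF-8, and the helper name removeFirstUtf8Char() is made up for illustration.

```php
<?php
// Hypothetical helper: removes the first character of a string that is
// assumed to be valid UTF-8, without using the u modifier. The alternation
// spells out the byte layout of 1-, 2-, 3- and 4-byte sequences.
function removeFirstUtf8Char($s)
{
    $firstChar = '[\x00-\x7F]'                 // 1 byte (ASCII)
               . '|[\xC2-\xDF][\x80-\xBF]'     // 2 bytes
               . '|[\xE0-\xEF][\x80-\xBF]{2}'  // 3 bytes
               . '|[\xF0-\xF4][\x80-\xBF]{3}'; // 4 bytes

    return preg_replace('/^(?:' . $firstChar . ')/', '', $s);
}

// The curly quote is written as raw UTF-8 bytes (E2 80 9D).
echo removeFirstUtf8Char("\xE2\x80\x9Dabc"), "\n"; // prints "abc"
```

The byte ranges are deliberately loose: they do not reject every malformed sequence, so this is a way to avoid splitting characters, not a UTF-8 validator.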
