regex - PHP preg_replace() fails when a non-UTF-8 character is detected
My PHP regular expression fails when a non-UTF-8 character is found!
I need to go through 40,000 database records and grab the width and height values from a custom_size MySQL table field.
The field is in all sorts of different random formats.
The most reliable way is to grab the numeric values on the left and right side of the x, and strip any non-numeric values from them.
The code below works pretty well 99% of the time, until it found a few records with non-UTF-8 characters.
31*32 and 35”x21” are two examples.
When these are run, I get these PHP errors and the script halts:
Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 1683977065 on line 21
Warning: preg_match(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 24
Demo:
<?php
$strings = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    '1"x1"',
    '31*32',
    '35”x21”'
);
foreach ($strings as $string) {
    if ($string != '') {
        $string = str_replace('”', '"', $string);
        // Strip out all characters except numbers, the letter x, and decimal points
        $string = preg_replace( '/([^0-9x\.])/ui', '', strtolower( $string ) );
        // Find anything that fits the "number x number" format
        preg_match( '/([0-9]+(\.[0-9]+)?)x([0-9]+(\.[0-9]+)?)/ui', $string, $values );
        echo 'original value: ' . $string . '<br>';
        echo 'width: ' . $values[1] . '<br>';
        echo 'height: ' . $values[3] . '<br><hr><br>';
    }
}
Any ideas on how to get around this? I cannot rebuild the server software to add support.
I just found this answer with a PHP library to convert to UTF-8, which seems to be helping a lot: https://stackoverflow.com/a/3521396/143030
By default, the PCRE regex engine reads a string one byte at a time. So, by default, it ignores the byte sequences that may compose a single character when a multibyte encoding like UTF-8 is in use, and sees them as separate bytes (one byte, one character).
For example, the character U+201D: RIGHT DOUBLE QUOTATION MARK uses three bytes in UTF-8:
$a = '”';
for ($i = 0; $i < strlen($a); $i++) {
    echo dechex(ord($a[$i])), ' ';
}
Result:
e2 80 9d
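The same byte dump can be produced in one call with bin2hex(). This is only a minimal sketch: the helper name utf8_bytes_hex is made up for illustration, and the character is written as byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: returns the bytes of $s as space-separated hex pairs.
function utf8_bytes_hex(string $s): string
{
    if ($s === '') {
        return '';
    }
    // bin2hex() yields e.g. "e2809d"; str_split() cuts it into 2-character chunks.
    return implode(' ', str_split(bin2hex($s), 2));
}

echo utf8_bytes_hex("\xE2\x80\x9D"), "\n"; // bytes of U+201D
echo strlen("\xE2\x80\x9D"), "\n";         // strlen() counts bytes, not characters
```

Here strlen() reports 3 for what a reader sees as a single character, which is the same byte-oriented view the regex engine has by default.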
To enable multibyte reading in the PCRE regex engine, you can either use one of these directives at the beginning of the pattern: (*UTF), (*UTF8), (*UTF16), (*UTF32), or the u modifier (which switches on the available multibyte mode and also extends the meaning of the shorthand character classes \s, \d, \w, ... to Unicode. In other words, the u modifier is a shortcut for (*UTFx) plus (*UCP), which changes the character classes.)
But these features are only available if the PCRE module has been compiled with support for these encodings. (This is the case for default PHP installations, but it isn't absolutely systematic or mandatory.)
It seems that this isn't your case, since when you use the u modifier you obtain this explicit message:
this version of PCRE is not compiled with PCRE_UTF8 support
You can't do anything against that, except if you decide to change your PHP installation for one with a PCRE module compiled with UTF-8 support.
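If you want to detect this situation from code, one possible probe is to try a trivial pattern with the u modifier and see whether it compiles. This is only a sketch: pcre_has_utf8 is a hypothetical helper name, and it relies on preg_match() returning false when the pattern fails to compile.

```php
<?php
// Hypothetical probe: true if this PHP build accepts the u modifier.
function pcre_has_utf8(): bool
{
    // "@" silences the "Compilation failed" warning on builds without UTF-8 support.
    // With UTF-8 support, "." matches the whole 3-byte character U+201D.
    return @preg_match('/^.$/u', "\xE2\x80\x9D") === 1;
}

echo pcre_has_utf8() ? "u modifier available\n" : "u modifier not available\n";
```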
However, this isn't a problem in your case, because in your patterns the u modifier is totally useless, even if your input is UTF-8 encoded.
The reason is that your two patterns only use ASCII literal characters (characters in the 00-7F range), and characters beyond the ASCII range in UTF-8 encoding never use bytes from this range:
Unicode  char  UTF-8   name
--------------------------------------------------------
U+007D   }     7d      RIGHT CURLY BRACKET
U+007E   ~     7e      TILDE
U+007F         7f      <control>
U+0080         c2 80   <control>
U+0081         c2 81   <control>
...
U+00BE   ¾     c2 be   VULGAR FRACTION THREE QUARTERS
U+00BF   ¿     c2 bf   INVERTED QUESTION MARK
U+00C0   À     c3 80   LATIN CAPITAL LETTER A WITH GRAVE
U+00C1   Á     c3 81   LATIN CAPITAL LETTER A WITH ACUTE
...
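This property can be checked byte by byte. In the small sketch below (the helper name has_ascii_byte is hypothetical), every byte of a non-ASCII character encoded in UTF-8 is 0x80 or higher, so a pattern made only of ASCII literals cannot match in the middle of such a character.

```php
<?php
// Hypothetical helper: true if $s contains at least one byte in the ASCII range.
function has_ascii_byte(string $s): bool
{
    for ($i = 0; $i < strlen($s); $i++) {
        if (ord($s[$i]) < 0x80) {
            return true;
        }
    }
    return false;
}

// U+201D, U+00BE and U+00C0 written as their UTF-8 bytes: no ASCII byte inside.
var_dump(has_ascii_byte("\xE2\x80\x9D\xC2\xBE\xC3\x80"));
// The same quote surrounded by ASCII digits and an x.
var_dump(has_ascii_byte("35\xE2\x80\x9Dx21"));
```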
So you can write:
$string = preg_replace( '/[^0-9x.]+/', '', strtolower( $string ) );
(No need to use the i modifier, since the string is already lowercase. No need to escape the dot in a character class, or to use a capture group. Adding the + quantifier speeds up the replacement, since several consecutive characters are removed in one replacement instead of one by one.)
And:
if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $string, $values)) {
    echo 'original value: ', $string, '<br>';
    echo 'width: ', $values[1], '<br>';
    echo 'height: ', $values[2], '<br><hr><br>';
}
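Put together, the two patterns could be wrapped in a small function to run over the records. This is a sketch under assumptions: the function name parse_size, the array keys, and the choice of returning null for values that don't parse are illustrative, not part of the original code.

```php
<?php
// Hypothetical wrapper around the two byte-oriented patterns (no u modifier).
function parse_size(string $raw): ?array
{
    // Keep only digits, the letter x and dots; bytes of multibyte characters are dropped too.
    $clean = preg_replace('/[^0-9x.]+/', '', strtolower($raw));

    if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $clean, $m)) {
        return array('width' => $m[1], 'height' => $m[2]);
    }
    return null; // nothing in "number x number" form was found
}

$samples = array('12x12', '172.61 cm x 28.46 cm', "35\xE2\x80\x9Dx21\xE2\x80\x9D", '31*32');
foreach ($samples as $raw) {
    $size = parse_size($raw);
    echo $raw, ' => ', $size ? $size['width'] . ' x ' . $size['height'] : 'no match', "\n";
}
```

Note that a value like 31*32 uses * instead of x, so these patterns find no match for it; such records would need separate handling.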
However, this can be dangerous with other patterns. For example, this will not remove the first character as expected if it is encoded with several bytes, but only the first byte of the character:
$a = preg_replace('/^./', '', '”abc');
for ($i = 0; $i < strlen($a); $i++) {
    echo ' ', dechex(ord($a[$i]));
}
Returns:
80 9d 61 62 63 # � � a b c
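If the u modifier is not available and you still need to match one whole character, one possible workaround is to describe a UTF-8 sequence at the byte level: a lead byte followed by its continuation bytes (80-BF). A rough sketch, assuming the input is valid UTF-8 (the pattern does not validate it); the function name remove_first_char is hypothetical.

```php
<?php
// Hypothetical helper: removes the first character of a UTF-8 string
// without relying on the u modifier.
function remove_first_char(string $s): string
{
    // Either one ASCII byte, or a lead byte (C0-FF) plus its continuation bytes (80-BF).
    return preg_replace('/^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*)/', '', $s);
}

echo remove_first_char("\xE2\x80\x9Dabc"), "\n"; // U+201D followed by "abc"
echo remove_first_char("abc"), "\n";
```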