Strip out multi-byte white space from a string in PHP
https://stackoverflow.com/questions/13946518/strip-out-multi-byte-white-space-from-a-string-php
I am trying to use preg_replace to eliminate the Japanese full-width white space "　" (U+3000 IDEOGRAPHIC SPACE) from an input string, but I end up with a corrupted multi-byte string.
I would prefer to use preg_replace rather than str_replace. Here is some sample code:
$keywords = ' ラメ単色';
$keywords = str_replace(array(' ', ' '), ' ', urldecode($keywords)); // outputs :'ラメ単色'
$keywords = preg_replace("@[ ]@", ' ',urldecode($keywords)); // outputs :'�� ��単色'
Does anyone have any idea why this happens and how to fix it?
php
regex
utf-8
preg-replace
multibyte
edited Dec 19 '12 at 12:29
alex
asked Dec 19 '12 at 6:10
shawndreck
Is $keywords the same as ' ラメ単色'? –
alex
Dec 19 '12 at 6:13
yea, copied and edited in a haste –
shawndreck
Dec 19 '12 at 6:14
4 Answers
9
Add the u flag to your regex. This makes the regex engine treat both the pattern and the input string as UTF-8.
$keywords = preg_replace("@[ ]@u", ' ',urldecode($keywords));
// outputs :'ラメ単色'
The string gets mangled because, without the u flag, the regex engine does not see the characters in your character class, 20 (space) and e3 80 80 (IDEOGRAPHIC SPACE), as two characters. It sees the separate bytes 20, e3 and 80, and matches any one of them.
The byte sequence of the string being scanned is e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2. We know the first character is an IDEOGRAPHIC SPACE, but because the engine is working byte by byte, it replaces each of the first four bytes individually: e3 80 80, and then the e3 that begins the next character, all match bytes in the character class.
As for the � (REPLACEMENT CHARACTER) in the output, it appears because the byte e3 occurs again further along in the string. e3 is the lead byte of many three-byte Japanese characters, such as e3 83 a9 (KATAKANA LETTER RA). Once that leading e3 is replaced with 20 (space), the orphaned continuation bytes 83 a9 are no longer a valid UTF-8 sequence.
With the u flag enabled, the regex engine treats both the pattern and the string as UTF-8, so the members of your character class are whole characters rather than individual bytes.
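The byte-level behaviour described in this answer can be checked with a short sketch. It builds the sample string from explicit bytes, so the invisible space is unambiguous, and compares hex dumps with and without the u flag. The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE written as its three UTF-8 bytes.
$ideographicSpace = "\xE3\x80\x80";
$keywords = $ideographicSpace . "ラメ単色";

echo bin2hex($keywords), "\n";
// e38080e383a9e383a1e58d98e889b2

// Without the u flag the class is the byte set {20, e3, 80}.
$bytewise = preg_replace('@[ ' . $ideographicSpace . ']@', ' ', $keywords);
echo bin2hex($bytewise), "\n";
// 2020202083a92083a1e58d98e889b2  (83 a9 and 83 a1 are now orphaned bytes)

// With the u flag the class is the character set {U+0020, U+3000}.
$utf8 = preg_replace('@[ ' . $ideographicSpace . ']@u', ' ', $keywords);
echo bin2hex($utf8), "\n";
// 20e383a9e383a1e58d98e889b2  (one space followed by the intact text)
```

In the second dump the first four bytes and the later e3 have all become 20, which is exactly the damage that shows up as � when the result is displayed as UTF-8.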
edited Dec 19 '12 at 12:46
answered Dec 19 '12 at 6:22
alex
I am going to accept yours as the answer since it makes use of my preferred preg_replace. The mb_ereg_replace also does the job though. Thanks! –
shawndreck
Dec 19 '12 at 6:53
2
To avoid additional problems with the mb_* functions solution, also consider setting the internal encoding explicitly:
mb_internal_encoding("UTF-8");
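As a minimal sketch of how this fits with an mb_ereg_replace call, the snippet below sets the encodings before replacing runs of white space. It assumes the mbstring extension is loaded and the file is saved as UTF-8. Setting mb_regex_encoding as well is an extra precaution added here, not part of the answer above, because the mb_ereg_* functions take their encoding from that setting. The sample string and variable names are illustrative.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8'); // encoding used by the mb_ereg_* functions

// U+3000 IDEOGRAPHIC SPACE written as its three UTF-8 bytes.
$ideographicSpace = "\xE3\x80\x80";
$keywords = $ideographicSpace . "ラメ単色" . $ideographicSpace . "abc";

// mb_ereg_replace patterns take no delimiters.
// Collapse each run of ASCII or ideographic spaces into one ASCII space.
$clean = trim(mb_ereg_replace('[ ' . $ideographicSpace . ']+', ' ', $keywords));

echo $clean, "\n"; // ラメ単色 abc
```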
answered Dec 19 '12 at 6:32
lobostome
1
It is always good to dig into the documentation. I found out that the preg_* functions are not optimized for multibyte characters. Instead, the mb_ereg_* and mb_* functions are supposed to be used. I solved this little issue by refactoring the code to something like:
$keywords = ' ラメ単色';
$pattern = " "/*ascii whitespace*/ . " "/*multi-byte whitespace*/;
$keywords = trim(
mb_ereg_replace("[{$pattern}]+", ' ',urldecode($keywords))); // outputs:'ラメ単色'
Thanks all the same!
answered Dec 19 '12 at 6:22
shawndreck
-1
Use this
$keywords = preg_replace('/\s+/', ' ',urldecode($keywords));
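A quick check suggests why this pattern does not help as written. The sketch below assumes the default C locale and a PHP build in which the u flag also enables PCRE's Unicode character properties. Under those assumptions, \s without the u flag matches only ASCII white space and leaves the ideographic space untouched, while \s with the u flag should match it. The variable names are illustrative.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE written as its three UTF-8 bytes.
$ideographicSpace = "\xE3\x80\x80";
$keywords = $ideographicSpace . "ラメ単色";

// Byte mode: none of e3, 80, 80 counts as white space, so nothing changes.
$withoutU = preg_replace('/\s+/', ' ', $keywords);
var_dump($withoutU === $keywords);

// UTF-8 mode: U+3000 is treated as white space and replaced by one ASCII space.
$withU = preg_replace('/\s+/u', ' ', $keywords);
var_dump($withU === ' ラメ単色');
```

If both comparisons hold, adding the u flag turns this answer into a working variant of the accepted one, with the difference that it also collapses tabs and newlines.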