java - Regex: Want to change case of letter following one of a set, except HTML entity
Examples:
rythm&blues -> Rythm&Blues
.. don&apos;t wear white/live -> Don&apos;t Wear White/Live
First I convert the whole string to lowercase (because I want uppercase only at the start of a word).
I am using the split pattern [&/\\.\\s-] and then convert each part's first letter to uppercase.
It works well, except that it also converts HTML entities, of course: e.g. don&apos;t is converted to don&Apos;t, but the entity should be left alone.
While writing this I discovered an additional problem: the initial conversion to lowercase can potentially mess up HTML entities as well. So entities should be left alone entirely (e.g. &Ccedil; is not the same as &ccedil;).
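The failure described above can be reproduced with a minimal sketch. It is not the asker's code: it assumes a character scan over the same delimiter set instead of an actual split, and the class and method names (NaiveCapitalize, naive) are made up for illustration.

```java
public class NaiveCapitalize {
    // Lowercase everything, then uppercase the character that follows
    // a delimiter (&, /, ., -, or whitespace). This is a stand-in for the
    // split-based approach in the question, not a copy of it.
    static String naive(String input) {
        String lower = input.toLowerCase();
        StringBuilder sb = new StringBuilder();
        boolean startOfWord = true;
        for (char c : lower.toCharArray()) {
            sb.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = "&/.-".indexOf(c) >= 0 || Character.isWhitespace(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(naive("rythm&blues"));   // Rythm&Blues
        System.out.println(naive("don&apos;t"));    // Don&Apos;t   (entity name changed)
        System.out.println(naive("gar&ccedil;on")); // Gar&Ccedil;on (now a different entity)
    }
}
```

Because & is itself one of the delimiters, the first letter of every entity name gets uppercased, which is exactly what turns one entity into another.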
An HTML entity is probably matched by this: &[a-z][a-z][a-z]{1,5};
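A quick check of what that entity pattern accepts may help before building on it. The sketch below uses the pattern exactly as written; the class name EntityPatternCheck and the helper isEntity are assumptions for illustration. As written, the pattern requires three to seven lowercase letters, so it rejects uppercase entity names, two-letter names such as &lt;, and numeric references.

```java
import java.util.regex.Pattern;

public class EntityPatternCheck {
    // The entity pattern from the question, unchanged.
    static final Pattern ENTITY = Pattern.compile("&[a-z][a-z][a-z]{1,5};");

    // True only if the whole input is matched by the pattern.
    static boolean isEntity(String s) {
        return ENTITY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"&apos;", "&ccedil;", "&Ccedil;", "&lt;", "&#39;"};
        for (String s : samples) {
            System.out.println(s + " -> " + isEntity(s));
        }
        // &apos; and &ccedil; match; &Ccedil;, &lt; and &#39; do not.
    }
}
```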
I was thinking of doing something with groups, but unfortunately I find it hard to figure out.
This pattern seems to handle your situation:
"\\w+|&#?\\w+;\\w*"
There may be some edge cases, but you can adjust accordingly as they come up.
Pattern breakdown:
\\w+ - match any word
&#?\\w+;\\w* - match an HTML entity (plus any letters directly after the semicolon)
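To see how the two alternatives divide up an input, the tokens the pattern finds can be listed. This is only an illustrative sketch; the class name TokenDemo and the helper tokens are assumptions, and the sample input uses &apos; as the entity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    // Collect every match of the answer's pattern, in order.
    static List<String> tokens(String input) {
        Matcher m = Pattern.compile("\\w+|&#?\\w+;\\w*").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("rythm&blues"));
        // [rythm, blues]
        System.out.println(tokens(".. don&apos;t wear white/live"));
        // [don, &apos;t, wear, white, live]
    }
}
```

A bare & that is not followed by a name and a semicolon matches neither alternative, so it is skipped and the following word is still treated as a word start.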
Code sample:

    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CapitalizeWords {
        public static void main(String[] args) throws Exception {
            String[] lines = {
                "rythm&blues",
                ".. don&apos;t wear white/live"
            };

            Pattern pattern = Pattern.compile("\\w+|&#?\\w+;\\w*");
            for (int i = 0; i < lines.length; i++) {
                Matcher matcher = pattern.matcher(lines[i]);
                while (matcher.find()) {
                    if (matcher.group().startsWith("&")) {
                        // Handle HTML entities.
                        // If there are letters after the semicolon,
                        // only those need to be lowercased.
                        if (!matcher.group().endsWith(";")) {
                            String htmlEntity = matcher.group();
                            int semicolonIndex = htmlEntity.indexOf(";");
                            // Keep the entity, including its semicolon, untouched.
                            lines[i] = lines[i].replace(htmlEntity,
                                    htmlEntity.substring(0, semicolonIndex + 1)
                                    + htmlEntity.substring(semicolonIndex + 1).toLowerCase());
                        }
                    } else {
                        // Uppercase the first letter of the word and lowercase
                        // the rest of the word.
                        lines[i] = lines[i].replace(matcher.group(),
                                Character.toUpperCase(matcher.group().charAt(0))
                                + matcher.group().substring(1).toLowerCase());
                    }
                }
            }

            System.out.println(Arrays.toString(lines));
        }
    }
Results:
[Rythm&Blues, .. Don&apos;t Wear White/Live]
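One of the edge cases worth adjusting for: String.replace substitutes every occurrence of a matched word, so a word that also appears inside a longer word (for example "live" inside "alive") can be rewritten in the wrong place. A possible variation, sketched here under the same pattern, rebuilds the string in a single pass with Matcher.appendReplacement. The class name TitleCaseKeepingEntities and the method titleCase are assumptions, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCaseKeepingEntities {
    private static final Pattern TOKEN = Pattern.compile("\\w+|&#?\\w+;\\w*");

    static String titleCase(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String replacement;
            if (token.startsWith("&")) {
                // Leave the entity (up to and including ';') untouched and
                // lowercase only the letters that follow it.
                int end = token.indexOf(';') + 1;
                replacement = token.substring(0, end) + token.substring(end).toLowerCase();
            } else {
                // Plain word: first letter uppercase, rest lowercase.
                replacement = Character.toUpperCase(token.charAt(0))
                        + token.substring(1).toLowerCase();
            }
            // quoteReplacement keeps '$' and '\' in the token literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(titleCase("rythm&blues"));                   // Rythm&Blues
        System.out.println(titleCase(".. don&apos;t wear white/live")); // .. Don&apos;t Wear White/Live
        System.out.println(titleCase("live alive"));                    // Live Alive
    }
}
```

Each match is replaced at its own position only, so text outside the matches (spaces, slashes, a bare &) is copied through unchanged.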