Efficient replacement for for-loop when splitting strings in R
I have a large data frame (20 columns, >100k rows) and need to split a column of character strings into multiple new columns.
The first 3 observations of the column in question look like this:
scans <- data.frame(scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"))
which looks like this:
                          scan
1 ct cervical sp,ct head plain
2                  ii < 1 hour
3          l-s spine,l-s spine
I need to split this into 5 columns (there is a maximum of 5 substrings in each observation), and for observations with fewer substrings I want the remaining columns filled with NAs. I am using this code:
scans <- data.frame(scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"))
for(i in 1:nrow(scans)){
  scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
  scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
  scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
  scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
  scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}
which works and outputs the desired solution:
                          scan          scan1         scan2 scan3 scan4 scan5
1 ct cervical sp,ct head plain ct cervical sp ct head plain    NA    NA    NA
2                  ii < 1 hour    ii < 1 hour            NA    NA    NA    NA
3          l-s spine,l-s spine      l-s spine     l-s spine    NA    NA    NA
... but it is slow. Looping over tens or hundreds of thousands of observations is time consuming.
Many thanks for any advice.
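Much of the cost in the loop above likely comes from calling strsplit() on the whole column five times per row, so the work grows roughly quadratically with the number of rows. A minimal base R sketch, assuming at most 5 substrings per row as stated in the question, splits the column once and pads each result to length 5. The names n_cols, parts and padded are only illustrative:

```r
scans <- data.frame(
  scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"),
  stringsAsFactors = FALSE
)

n_cols <- 5  # assumed maximum number of substrings per row

# Split the whole column once, instead of once per row and per new column
parts <- strsplit(scans$scan, ",", fixed = TRUE)

# Indexing past the end of a character vector yields NA, which pads short rows
padded <- t(vapply(parts, function(p) p[seq_len(n_cols)], character(n_cols)))

# Attach the padded pieces as new columns scan1 ... scan5
scans[paste0("scan", seq_len(n_cols))] <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Note that a row with more than n_cols substrings would be silently truncated by this sketch, so it is worth checking max(lengths(parts)) first.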
Another way is to use tstrsplit in the devel version of data.table:
library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
#                V1            V2
# 1: ct cervical sp ct head plain
# 2:    ii < 1 hour            NA
# 3:      l-s spine     l-s spine
If you are sure that you have 5 splits at least once, you can create these columns by reference:
setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]
Alternatively, the tidyr package offers similar functionality:
library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
#                           scan          scan1         scan2
# 1 ct cervical sp,ct head plain ct cervical sp ct head plain
# 2                  ii < 1 hour    ii < 1 hour          <NA>
# 3          l-s spine,l-s spine      l-s spine     l-s spine
Or an option using base R:
cbind(scans, read.table(text = as.character(scans$scan), sep = ",", fill = TRUE, na.strings = ''))
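One caveat with the read.table() option: per its documentation, the number of columns is determined from the first five lines of input unless col.names is supplied. With a large data frame, a row with more substrings further down may not be handled as expected. A hedged sketch that fixes the width at 5 columns (an assumption taken from the question) and also switches off quote and comment handling, in case the strings contain ', " or #:

```r
scans <- data.frame(
  scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"),
  stringsAsFactors = FALSE
)

split_cols <- read.table(
  text = scans$scan,
  sep = ",",
  fill = TRUE,                      # pad short rows with blank fields
  na.strings = "",                  # turn those blank fields into NA
  col.names = paste0("scan", 1:5),  # forces 5 columns (assumed maximum)
  quote = "",                       # treat ' and " as ordinary characters
  comment.char = "",                # treat # as an ordinary character
  colClasses = "character",
  stringsAsFactors = FALSE
)

result <- cbind(scans, split_cols)
```

Here result keeps the original scan column and adds scan1 to scan5, so no renaming of V1, V2, ... is needed afterwards.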