performance - Efficient replacement for for-loop when splitting strings in R


I have a large data frame (20 columns, >100k rows) and need to split a column of character strings into multiple new columns.

The first 3 observations of the column in question look like this:

scans <- data.frame(scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"))

which looks like this:

                          scan
1 ct cervical sp,ct head plain
2                  ii < 1 hour
3          l-s spine,l-s spine

I need to split this into 5 columns (there is a maximum of 5 substrings in each observation), and for observations with fewer substrings I want the remaining columns filled with NAs. I am using this code:

scans <- data.frame(scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"),
                    stringsAsFactors = FALSE)  # strsplit() needs character input, not a factor

# Fill the five new columns one row at a time
for (i in 1:nrow(scans)) {
  scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
  scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
  scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
  scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
  scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}

which works and outputs the desired solution:

                          scan          scan1         scan2 scan3 scan4 scan5
1 ct cervical sp,ct head plain ct cervical sp ct head plain    NA    NA    NA
2                  ii < 1 hour    ii < 1 hour            NA    NA    NA    NA
3          l-s spine,l-s spine      l-s spine     l-s spine    NA    NA    NA

... but it is slow. Looping over tens or hundreds of thousands of observations is time consuming.

Many thanks for any advice.

Another way is to use tstrsplit in the devel version of data.table:

library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
#                V1            V2
# 1: ct cervical sp ct head plain
# 2:    ii < 1 hour            NA
# 3:      l-s spine     l-s spine

If you are sure that you have 5 splits at least once, you can create these columns by reference:

setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]
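tstrsplit is documented as a wrapper that runs strsplit and transposes the result into columns. For readers without data.table, the same split-once-then-transpose idea can be sketched in base R. This is only an illustrative sketch, not the data.table implementation: the names n_cols, pieces, padded, new_cols and result are made up here, and the sketch pads every row to five columns regardless of how many splits actually occur.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps `scan` as character.
scans <- data.frame(
  scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"),
  stringsAsFactors = FALSE
)

n_cols <- 5  # maximum number of substrings, as stated in the question (assumed fixed)

# Split the whole column once, instead of once per row and per new column.
pieces <- strsplit(scans$scan, ",", fixed = TRUE)

# Take the first n_cols elements of each split; indexing past the end yields NA.
padded <- vapply(pieces, function(p) p[seq_len(n_cols)], character(n_cols))

# vapply() returns an n_cols x nrow matrix, so transpose before binding.
new_cols <- as.data.frame(t(padded), stringsAsFactors = FALSE)
names(new_cols) <- paste0("scan", seq_len(n_cols))

result <- cbind(scans, new_cols)
```

Compared with the loop in the question, strsplit runs once over the column rather than five times per row, which is where most of the loop's time goes.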

Alternatively, the tidyr package offers similar functionality:

library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
#                           scan          scan1         scan2
# 1 ct cervical sp,ct head plain ct cervical sp ct head plain
# 2                  ii < 1 hour    ii < 1 hour          <NA>
# 3          l-s spine,l-s spine      l-s spine     l-s spine

Or an option using base R:

cbind(scans, read.table(text = as.character(scans$scan), sep = ",", fill = TRUE, na.strings = ""))
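One caveat with the read.table approach: when col.names is not supplied, the number of columns is determined from the first five lines of input, so the sample data above yields only two new columns. A hedged variant that always produces five named columns is sketched below. The col.names, quote and comment.char settings are additions for illustration and are not part of the original answer, and the name `out` is hypothetical.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps `scan` as character.
scans <- data.frame(
  scan = c("ct cervical sp,ct head plain", "ii < 1 hour", "l-s spine,l-s spine"),
  stringsAsFactors = FALSE
)

# Parse the column as if it were a small CSV file.
# col.names forces five columns even when no row has five pieces;
# fill = TRUE pads short rows, and na.strings = "" turns the padding into NA.
# quote = "" and comment.char = "" stop quotes or '#' in the data from being
# interpreted as quoting or comments (an assumption about what real data may contain).
parsed <- read.table(
  text = scans$scan,
  sep = ",",
  fill = TRUE,
  na.strings = "",
  col.names = paste0("scan", 1:5),
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE
)

out <- cbind(scans, parsed)
```

Columns that are entirely empty come back as logical NA rather than character NA, which may matter if the columns are combined with other character data later.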
