R - Keep doubled columns which differ in only 2 letters in a data.frame
I have a data frame in R that consists of around 100 columns. Columns that are doubled differ only in 2 letters. I want to keep these columns and delete the columns that are not doubled.
Here is an example:
234-rgz sk 234-rgz pv 556-gft sk 456-hjk sk 456-hjk pv
The output should be:
234-rgz sk 234-rgz pv 456-hjk sk 456-hjk pv
All columns follow the same naming convention: a number from 2 to 150, then "-", then 4 or 5 letters, a space, and either "sk" or "pv". I thought of using a regular expression, but that doesn't solve the problem of how to get rid of the single columns. Please help!
You can use duplicated() on the column names after removing the suffix part. The output is a logical index that can be used to subset the original dataset.
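v1 <- colnames(df1)
# Strip the trailing suffix (" sk" or " pv") from each column name
v2 <- sub('\\s+[^ ]+$', '', v1)
# TRUE for every name whose prefix occurs more than once
indx <- duplicated(v2) | duplicated(v2, fromLast=TRUE)
v1[indx]
#[1] "234-rgz sk" "234-rgz pv" "456-hjk sk" "456-hjk pv"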
To subset the columns in the data frame,
df1[indx]
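The steps above can be wrapped in a small function. This is only a sketch: the function name keep_paired_columns and the data frame example_df are hypothetical names made up for illustration, and it assumes the suffix is always the last space-separated token of the column name.

```r
# Hypothetical helper (not from the original answer): keep only columns
# whose base name, i.e. the name without its trailing suffix, occurs
# more than once.
keep_paired_columns <- function(df) {
  nms  <- colnames(df)
  base <- sub("\\s+[^ ]+$", "", nms)  # strip the last space-separated token
  keep <- duplicated(base) | duplicated(base, fromLast = TRUE)
  df[, keep, drop = FALSE]            # drop = FALSE keeps a data frame
}

# Small illustrative data frame; the values are arbitrary.
# check.names = FALSE preserves the spaces and hyphens in the names.
example_df <- data.frame(
  "234-rgz sk" = 1:3, "234-rgz pv" = 4:6, "556-gft sk" = 7:9,
  "456-hjk sk" = 10:12, "456-hjk pv" = 13:15,
  check.names = FALSE
)

result <- keep_paired_columns(example_df)
colnames(result)
```

The unpaired column "556-gft sk" is the only one dropped here.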
Another option is to split the column names into substrings and use grep to match the substrings that have a frequency > 1.
tbl <- table(unlist(strsplit(v1, '\\s+.*')))
df1[grep(paste(names(tbl)[tbl>1], collapse="|"), v1)]
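One thing to keep in mind with the grep variant is that it matches the prefixes as regular expressions anywhere in the name, so a repeated prefix that happens to be a substring of another column name would also match that column. The sketch below compares it with exact matching via %in%. The vector col_names is invented for illustration (it is not the poster's data) and was chosen so that "34-rgz" is a substring of "234-rgz".

```r
# Invented column names: "34-rgz" is paired, "234-rgz" is single
col_names <- c("34-rgz sk", "34-rgz pv", "234-rgz sk",
               "456-hjk sk", "456-hjk pv")

prefix   <- sub("\\s+.*$", "", col_names)  # part before the first space
counts   <- table(prefix)
repeated <- names(counts)[counts > 1]      # prefixes occurring more than once

# Exact matching: compares whole prefixes
keep_exact <- prefix %in% repeated

# Pattern matching, as in the grep approach: "34-rgz" also matches "234-rgz sk"
keep_grep <- grepl(paste(repeated, collapse = "|"), col_names)

col_names[keep_exact]
col_names[keep_grep]
```

With names like those in the question, where no prefix is contained in another, both variants select the same columns.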
Data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:9, 5*10, replace=TRUE), ncol=5,
    dimnames=list(NULL, c('234-rgz sk', '234-rgz pv', '556-gft sk',
    '456-hjk sk', '456-hjk pv'))))