Iterate over column names in an R data frame in order to change their type
    library(lubridate)

    # data to build the df
    d1 <- c("1/2/14", "3/5/15", "1/13/11")    # start
    d2 <- c("1/2/15", "4/5/15", "6/18/15")    # stop
    d3 <- c("5/16/08", "1/7/07", "6/22/01")   # start
    d4 <- c("11/29/12", "8/5/14", "1/13/12")  # stop
    a <- c("blah", "blah", "blah")
    b <- c("blah", "blah", "blah")
    c <- c("blah", "blah", "blah")
    f <- c("blah", "blah", "blah")
    colnames <- c("col.a", "col.b", "col.c",
                  "project1.start", "project1.end",
                  "project2.start", "project2.end", "col.f")

    # assemble the df
    df <- data.frame(a, b, c, d1, d2, d3, d4, f)
    names(df) <- colnames

    # change the char cols to POSIX date objects so they play nicely with lubridate
    df$project1.start <- mdy(df$project1.start)
    df$project1.end <- mdy(df$project1.end)
    df$project2.start <- mdy(df$project2.start)
    df$project2.end <- mdy(df$project2.end)
But! I want to do the above mdy calls iteratively on whichever date columns (the dx) I specify. Imagine that instead of d1-d4 I have d1-d142. There must be an elegant, i.e., non-brute-force way of doing this!
So, I tried this. I know I'm doing mdy on too many columns, but I'm just trying to make it work at all. I've tried loops with seq(), etc., but I know I'm missing the vector-based approach R expects.
    f <- function(x) {x <- mdy(x)}
    newdf <- apply(df, 2, f)
But it throws:

    Warning messages:
    1: All formats failed to parse. No formats found.
    ...
    10: All formats failed to parse. No formats found.
And newdf is bad:

         col.a col.b col.c project1.start project1.end project2.start project2.end col.f
    [1,] NA    NA    NA    NA             NA           NA             NA           NA
    [2,] NA    NA    NA    NA             NA           NA             NA           NA
    [3,] NA    NA    NA    NA             NA           NA             NA           NA
         project1.duration project2.duration
    [1,] NA                NA
    [2,] NA                NA
    [3,] NA                NA
What am I doing that's st00pid?
So, once that's done, I want to do date math:

    df$project1.duration <- (df$project1.end - df$project1.start)
    df$project2.duration <- (df$project2.end - df$project2.start)
Same here. I want to be able to iterate over the durations for the dx columns, but perhaps I need to reshape the data to make that happen. How do I take a large number of durations for these different projects, each coded separately, and reassemble them into a df so I can make a plot of the different durations for each project? In the sample df I have 3 different durations, rows 1:3, and I'd like to be able to compare the rows for each project.
Your error is because apply is applying mdy to every column of df, not just the "projectX.{start,end}" ones. Also, df[col] is a data.frame, and mdy needs a vector, so try df[[col]].
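Both points can be checked on a tiny data frame. This is only an illustrative sketch: the two-column df and the name first_values are made up for the demonstration, and nothing here needs lubridate.

```r
# Illustrative two-column data frame (not the full df from the question)
df <- data.frame(
  col.a = c("blah", "blah", "blah"),
  project1.start = c("1/2/14", "3/5/15", "1/13/11"),
  stringsAsFactors = FALSE
)

# Single brackets keep the data.frame wrapper; double brackets extract the vector
class(df["project1.start"])    # "data.frame"
class(df[["project1.start"]])  # "character"

# apply() converts the data frame to a matrix and calls the function on
# every column, including the non-date "blah" column
first_values <- apply(df, 2, function(x) x[1])
print(first_values)
```

Since apply visits col.a as well, a date parser called this way is also handed "blah" values it cannot parse.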
E.g.

    cols <- grep('project', names(df))

    # one-liner
    df[cols] <- lapply(df[cols], mdy)

    # or a loop if you want
    for (col in cols) {
        df[[col]] <- mdy(df[[col]])
    }
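One thing to watch with a pattern like 'project' is that it would also match any projectN.duration columns once those exist. A stricter pattern is sketched below under some assumptions: the column names follow the projectN.start / projectN.end scheme from the question, and base as.Date with an explicit format stands in for mdy so the sketch runs without lubridate. The name date_cols is hypothetical.

```r
# Illustrative data using the naming scheme from the question
df <- data.frame(
  col.a = c("blah", "blah", "blah"),
  project1.start = c("1/2/14", "3/5/15", "1/13/11"),
  project1.end   = c("1/2/15", "4/5/15", "6/18/15"),
  col.f = c("blah", "blah", "blah"),
  stringsAsFactors = FALSE
)

# Match only columns named projectN.start or projectN.end
date_cols <- grep("^project[0-9]+\\.(start|end)$", names(df), value = TRUE)

# Convert just those columns; as.Date is a stand-in for mdy here (assumption)
df[date_cols] <- lapply(df[date_cols], as.Date, format = "%m/%d/%y")

str(df)
```

With lubridate loaded, swapping `as.Date, format = "%m/%d/%y"` for `mdy` should give the same Date columns.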
In regards to calculating per-project data (like duration), you can kludge it like this:
    projects <- paste0('project', 1:2)  # however many projects
    df[paste0(projects, '.duration')] <-
        df[paste0(projects, '.end')] - df[paste0(projects, '.start')]
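The same per-project subtraction can also be written with Map, which pairs each end column with its start column explicitly. This is a minimal sketch, assuming the date columns have already been converted to Date objects; the helper to_date is hypothetical and uses base as.Date in place of mdy so the example runs without lubridate.

```r
# Hypothetical helper standing in for mdy (assumption)
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

# Illustrative data with two projects, already converted to Date
df <- data.frame(
  project1.start = to_date(c("1/2/14", "3/5/15", "1/13/11")),
  project1.end   = to_date(c("1/2/15", "4/5/15", "6/18/15")),
  project2.start = to_date(c("5/16/08", "1/7/07", "6/22/01")),
  project2.end   = to_date(c("11/29/12", "8/5/14", "1/13/12"))
)

projects <- paste0("project", 1:2)
ends   <- df[paste0(projects, ".end")]
starts <- df[paste0(projects, ".start")]

# Subtract column by column; each result is a difftime measured in days
df[paste0(projects, ".duration")] <- Map(`-`, ends, starts)

print(df[paste0(projects, ".duration")])
```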
However, in the long run (particularly if you have lots of projects or want to calculate lots of stats per project, not just duration) you might consider having your data in long format, i.e.
    project  start  end  duration
    1        ...
    1
    1
    2
    2
    2
(probably with some sort of ID variable so you know which project 2 row went with which project 1 row)
Then you can do mydf$duration <- mydf$end - mydf$start, and if you want it in wide format again you can make use of reshape.
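As a rough sketch of the wide-to-long step with base R's reshape: the wide data below is illustrative, the id column is an assumed row identifier, and base as.Date stands in for mdy so nothing beyond base R is needed. Converting after reshaping means only two columns need parsing, however many projects there are.

```r
# Illustrative wide data: one row per id, a start/end pair per project
wide <- data.frame(
  id = 1:3,
  project1.start = c("1/2/14", "3/5/15", "1/13/11"),
  project1.end   = c("1/2/15", "4/5/15", "6/18/15"),
  project2.start = c("5/16/08", "1/7/07", "6/22/01"),
  project2.end   = c("11/29/12", "8/5/14", "1/13/12"),
  stringsAsFactors = FALSE
)

# Stack the start columns into "start" and the end columns into "end",
# recording which project each row came from
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("project1.start", "project2.start"),
                 c("project1.end", "project2.end")),
  v.names = c("start", "end"),
  timevar = "project",
  times = 1:2,
  idvar = "id"
)

# Only two columns to convert now (as.Date used in place of mdy)
long$start <- as.Date(long$start, format = "%m/%d/%y")
long$end   <- as.Date(long$end, format = "%m/%d/%y")
long$duration <- long$end - long$start

print(long)
```

From here, a per-project comparison such as `boxplot(as.numeric(duration) ~ project, data = long)` is one option for plotting.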