How can we remove tweets from specific users (users with a high number of tweets) for sentiment analysis using R?


Aim: to perform sentiment analysis on the historic judgement by the US courts on same-sex marriage. Since the number of tweets from some users is extremely high, they may introduce bias. How can I remove them? Also, why are the numbers of unique tweets in usafull and in total different?

    rm(list = ls())
    library(twitteR)
    library(wordcloud)
    library(tm)

    download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

    consumer_key <- 'key'
    consumer_secret <- 'secret'
    access_token <- 'key'
    access_secret <- 'secret'
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

    usa  <- searchTwitter("#lovewins", n = 1500, lang = "en")
    usa2 <- searchTwitter("#lgbt", n = 1500, lang = "en")
    usa3 <- searchTwitter("#gay", n = 1500, lang = "en")

    # get the text of each tweet
    tusa  <- sapply(usa, function(x) x$getText())
    tusa2 <- sapply(usa2, function(x) x$getText())
    tusa3 <- sapply(usa3, function(x) x$getText())

    # join the texts
    total <- c(tusa, tusa2, tusa3)

    # remove duplicated tweets (compared by text only)
    total <- total[!duplicated(total)]

    # number of unique tweets
    uni <- length(total)

    # merge the 3 sets of tweets into one list
    usafull <- c(usa, usa2, usa3)

    # convert the tweets to a data frame and drop fully identical rows
    usafull <- twListToDF(usafull)
    usafull <- unique(usafull)

    # dates of the tweets (date formatting)
    usafull$date <- format(usafull$created, format = "%Y-%m-%d")
    table(usafull$date)

    # table of the number of tweets per user, in decreasing order
    tdata <- as.data.frame(table(usafull$screenName))
    tdata <- tdata[order(tdata$Freq, decreasing = TRUE), ]
    names(tdata) <- c("user", "tweets")
    head(tdata)

    # plot frequency of tweets over time; bin width is 60 * minutes seconds
    # (1 hour here; set minutes <- 120 for 2-hour windows)
    library(ggplot2)
    minutes <- 60
    ggplot(data = usafull, aes(x = created)) +
      geom_histogram(aes(fill = after_stat(count)), binwidth = 60 * minutes) +
      scale_x_datetime("date") +
      scale_y_continuous("frequency")

    # plot the top 30 of the table above to identify unusual trends
    par(mar = c(5, 10, 2, 2))
    with(tdata[rev(1:30), ], barplot(tweets, names = user, horiz = TRUE, las = 1,
         main = "top 30: tweets per user", col = 1))

    # Twitter users with more than 20 tweets, to be removed to reduce bias
    userid <- tdata[(tdata$tweets > 20), ]
    userid <- userid[, 1]   # keep only the screen names (a vector)

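From your code I understand that you want to remove the tweets from the users in userid. Since userid is already reduced to a vector of screen names by userid <- userid[,1], one way to do this is:

    usafull_nobias <- subset(usafull, !(screenName %in% userid))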
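
To see the filtering step in isolation, here is a minimal sketch on made-up data. The data frame `usafull_demo`, the cut-off `max_tweets`, and the name `heavy_users` are all hypothetical and only stand in for the real `usafull`, the threshold of 20, and `userid` from the question.

```r
# Toy stand-in for the data frame returned by twListToDF(); values are made up.
usafull_demo <- data.frame(
  screenName = c("alice", "bob", "bob", "bob", "carol", "alice"),
  text = c("t1", "t2", "t3", "t4", "t5", "t6"),
  stringsAsFactors = FALSE
)

# Count tweets per user, as in the question.
tdata_demo <- as.data.frame(table(usafull_demo$screenName))
names(tdata_demo) <- c("user", "tweets")

# Hypothetical cut-off: users with more than 2 tweets count as heavy posters.
max_tweets <- 2
heavy_users <- as.character(tdata_demo$user[tdata_demo$tweets > max_tweets])

# Keep only the tweets whose author is not a heavy poster.
usafull_demo_nobias <- subset(usafull_demo, !(screenName %in% heavy_users))
print(usafull_demo_nobias$screenName)
```

Here only "bob" exceeds the cut-off, so the three rows from "alice" and "carol" remain. Converting the factor to character with `as.character()` is not strictly needed for `%in%`, but it makes the comparison explicit.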

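As for the reason why there is a different number of tweets in total and in usafull: in total you are using only the text of the tweets to find duplicates, whereas in usafull you are using the full tweet (every column). The latter takes into account that, e.g., retweets might have the same text but come from different users, have different ids, etc., so fewer rows are treated as duplicates.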
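
The difference can be reproduced on a small made-up example. The data frame `tweets_demo` below is hypothetical: its column names mimic a few of the columns in `usafull`, and the values are invented.

```r
# Toy data: rows 1 and 2 share the same text but differ in id and author
# (like a retweet); rows 3 and 4 are identical in every column.
tweets_demo <- data.frame(
  id = c("1", "2", "3", "3"),
  screenName = c("alice", "bob", "carol", "carol"),
  text = c("Love wins!", "Love wins!", "Historic day", "Historic day"),
  stringsAsFactors = FALSE
)

# De-duplicating on the text alone, as done for `total`.
n_unique_text <- sum(!duplicated(tweets_demo$text))

# De-duplicating on entire rows, as unique() does on a data frame.
n_unique_rows <- nrow(unique(tweets_demo))

cat("unique texts:", n_unique_text, "\n")
cat("unique rows: ", n_unique_rows, "\n")
```

This gives 2 unique texts but 3 unique rows. If you want the data frame to match total, one option (an assumption about what you want, not the only choice) is to de-duplicate on the text column, e.g. `usafull[!duplicated(usafull$text), ]`.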

Hope this helps.

