How can we remove tweets from specific users (users with a high number of tweets) for sentiment analysis using R?
Aim: to perform sentiment analysis on the historic judgement by the USA courts on same-sex marriage. Since the number of tweets is extremely high for some users, it may introduce bias. How can I remove them? Also, why are the numbers of unique tweets in usafull and total different?
rm(list = ls())
library(twitteR)
library(wordcloud)
library(tm)

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

consumer_key <- 'key'
consumer_secret <- 'secret'
access_token <- 'key'
access_secret <- 'secret'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

usa  <- searchTwitter("#lovewins", n = 1500, lang = "en")
usa2 <- searchTwitter("#lgbt", n = 1500, lang = "en")
usa3 <- searchTwitter("#gay", n = 1500, lang = "en")

# get text
tusa  <- sapply(usa, function(x) x$getText())
tusa2 <- sapply(usa2, function(x) x$getText())
tusa3 <- sapply(usa3, function(x) x$getText())

# join texts
total <- c(tusa, tusa2, tusa3)
# remove duplicated tweets (compares the text only)
total <- total[!duplicated(total)]
# no. of unique tweets
uni <- length(total)

# merge the 3 sets of tweets
usafull <- c(usa, usa2, usa3)
# convert tweets to a data frame
usafull <- twListToDF(usafull)
# remove duplicated rows (compares every column, not just the text)
usafull <- unique(usafull)

# to know the dates of the tweets (date formatting)
usafull$date <- format(usafull$created, format = "%Y-%m-%d")
table(usafull$date)

# make a table of the number of tweets per user, in decreasing order
tdata <- as.data.frame(table(usafull$screenName))
tdata <- tdata[order(tdata$Freq, decreasing = TRUE), ]
names(tdata) <- c("user", "tweets")
head(tdata)

# plot frequency of tweets over time
# binwidth is in seconds: 60 * minutes = 1 hour (use minutes <- 120 for 2-hour windows)
# geom_histogram is used because current ggplot2 no longer accepts binwidth in geom_bar
library(ggplot2)
minutes <- 60
ggplot(data = usafull, aes(x = created)) +
  geom_histogram(aes(fill = ..count..), binwidth = 60 * minutes) +
  scale_x_datetime("date") +
  scale_y_continuous("frequency")

# plot the table above for the top 30 users to identify unusual trends
par(mar = c(5, 10, 2, 2))
with(tdata[rev(1:30), ], barplot(tweets, names = user, horiz = TRUE, las = 1, main = "top 30: tweets per user", col = 1))

# twitter users with more than 20 tweets, to be removed to reduce bias
userid <- tdata[(tdata$tweets > 20), ]
userid <- userid[, 1]
From your code, I understand that you want to remove the tweets from the users in userid. One way to do this is:
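# userid is already a plain vector of user names (userid <- userid[, 1]), so it is used directly rather than as userid$user
usafull_nobias <- subset(usafull, !(screenName %in% userid))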
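To make the filtering step easier to try without Twitter credentials, here is a minimal base-R sketch on a made-up data frame. The `tweets` data, the `threshold` value and the names `heavy_users` and `tweets_nobias` are all hypothetical; only the `screenName` column name mirrors what `twListToDF()` returns.

```r
# Toy stand-in for the data frame returned by twListToDF(); all values are made up.
tweets <- data.frame(
  screenName = c("alice", "bob", "bob", "bob", "carol", "bob"),
  text       = c("t1", "t2", "t3", "t4", "t5", "t6"),
  stringsAsFactors = FALSE
)

# Count tweets per user.
counts <- table(tweets$screenName)

# Hypothetical cut-off: users with more than 3 tweets count as heavy posters.
threshold <- 3
heavy_users <- names(counts[counts > threshold])

# Keep only the tweets whose author is not a heavy poster.
tweets_nobias <- tweets[!(tweets$screenName %in% heavy_users), ]
```

Here `heavy_users` plays the role of `userid` in the question, and the last line is the base-R equivalent of the `subset()` call.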
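As for the reason why the number of tweets differs between total and usafull: in total you are using only the text of the tweets to find duplicates, whereas in usafull you are using the full tweet. The latter takes into account that, e.g., retweets might have the same text but come from different users, have different ids, etc., so they are not treated as duplicates.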
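The difference between the two counts can be reproduced on a small made-up example. This is only a sketch: the `tweets` data frame and its values are hypothetical, with column names chosen to resemble the output of `twListToDF()`.

```r
# Toy data: two different users retweet the same tweet, so the text repeats
# while the id and screenName differ. All values are made up.
tweets <- data.frame(
  id         = c("1", "2", "3", "4"),
  screenName = c("alice", "bob", "carol", "carol"),
  text       = c("Love wins!",
                 "RT @alice: Love wins!",
                 "RT @alice: Love wins!",
                 "Great day"),
  stringsAsFactors = FALSE
)

# Duplicates judged on the text alone (what is done for `total`).
n_unique_text <- sum(!duplicated(tweets$text))

# Duplicates judged on the whole row (what unique() does for `usafull`).
n_unique_rows <- nrow(unique(tweets))

# To make the data frame agree with `total`, de-duplicate on the text column.
tweets_dedup <- tweets[!duplicated(tweets$text), ]
```

With this toy data the text-based count is smaller than the row-based one, because the two retweets share their text but not their id or author. Whether retweets should be dropped for the sentiment analysis is a separate choice that depends on the question being asked.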
Hope this helps.