How can we remove tweets from specific users (users with a high number of tweets) for sentiment analysis using R?


Aim: to perform sentiment analysis on tweets about the historic US court judgement on same-sex marriage. Some users have an extremely high number of tweets, which may introduce bias. How can I remove them? Also, why are the numbers of unique tweets in usafull and in total different?

    rm(list = ls())
    library(twitteR)
    library(wordcloud)
    library(tm)

    download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

    consumer_key <- 'key'
    consumer_secret <- 'secret'
    access_token <- 'key'
    access_secret <- 'secret'
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

    usa  <- searchTwitter("#lovewins", n = 1500, lang = "en")
    usa2 <- searchTwitter("#lgbt", n = 1500, lang = "en")
    usa3 <- searchTwitter("#gay", n = 1500, lang = "en")

    # get the text of each tweet
    tusa  <- sapply(usa, function(x) x$getText())
    tusa2 <- sapply(usa2, function(x) x$getText())
    tusa3 <- sapply(usa3, function(x) x$getText())

    # join the texts
    total <- c(tusa, tusa2, tusa3)

    # remove duplicated tweets (compared by text only)
    total <- total[!duplicated(total)]

    # number of unique tweets
    uni <- length(total)

    # merge the 3 sets of tweets
    usafull <- c(usa, usa2, usa3)

    # convert the tweets to a data frame and drop duplicated rows
    usafull <- twListToDF(usafull)
    usafull <- unique(usafull)

    # dates of the tweets (date formatting)
    usafull$date <- format(usafull$created, format = "%Y-%m-%d")
    table(usafull$date)

    # table of the number of tweets per user, in decreasing order
    tdata <- as.data.frame(table(usafull$screenName))
    tdata <- tdata[order(tdata$Freq, decreasing = TRUE), ]
    names(tdata) <- c("user", "tweets")
    head(tdata)

    # plot the frequency of tweets over time
    # (binwidth is in seconds, so 60 * minutes gives 1-hour bins)
    library(ggplot2)
    minutes <- 60
    ggplot(data = usafull, aes(x = created)) +
      geom_histogram(aes(fill = ..count..), binwidth = 60 * minutes) +
      scale_x_datetime("date") +
      scale_y_continuous("frequency")

    # plot the top 30 of the table above to identify unusual trends
    par(mar = c(5, 10, 2, 2))
    with(tdata[rev(1:30), ], barplot(tweets, names = user, horiz = TRUE, las = 1,
        main = "top 30: tweets per user", col = 1))

    # Twitter users with more than 20 tweets, to be removed to reduce bias
    userid <- tdata[(tdata$tweets > 20), ]
    userid <- userid[, 1]

From your code I understand that you want to remove the tweets of the users in userid. One way to do this is:

    # userid is already a vector of user names after userid <- userid[, 1]
    usafull_nobias <- subset(usafull, !(screenName %in% userid))
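The same filtering idea can be tried without Twitter credentials on a small made-up data frame. This is only a sketch: the `tweets` data, the `max_tweets` cut-off and the `heavy_users` name are illustrative assumptions, not part of the original code, and the cut-off is set to 2 here instead of 20 so the toy data triggers it.

```r
# Toy stand-in for usafull (hypothetical data, not real tweets)
tweets <- data.frame(
  screenName = c("alice", "bob", "bob", "bob", "carol", "alice"),
  text = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

max_tweets <- 2  # hypothetical cut-off; the question uses 20

# Count tweets per user and collect the names above the cut-off
counts <- table(tweets$screenName)
heavy_users <- names(counts)[counts > max_tweets]

# Keep only the tweets from users at or below the cut-off
tweets_nobias <- tweets[!(tweets$screenName %in% heavy_users), ]
```

Only "bob" has more than 2 tweets in this toy data, so `tweets_nobias` keeps the three rows from "alice" and "carol".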

As for the reason why there is a different number of tweets in total and in usafull: in total you are using only the text of the tweets to find duplicates, whereas in usafull you are using the full tweet. The full tweet takes into account that, for example, retweets might have the same text but come from different users, have different IDs, and so on.
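A small made-up example may make the difference concrete. The data frame below is an assumption for illustration only (the column names mirror those produced by `twListToDF`, the values are invented): two users post the same text, and one tweet appears twice because it matched two searches.

```r
# Hypothetical tweets: rows 1 and 2 share a text but differ in id and user;
# rows 3 and 4 are the very same tweet returned by two different searches
tweets <- data.frame(
  id = c("1", "2", "3", "3"),
  screenName = c("alice", "bob", "carol", "carol"),
  text = c("Love wins!", "Love wins!", "Great day", "Great day"),
  stringsAsFactors = FALSE
)

# De-duplicating on the text alone, as done for total
n_by_text <- sum(!duplicated(tweets$text))   # 2

# De-duplicating whole rows, as done for usafull
n_by_row <- nrow(unique(tweets))             # 3

# To make the data frame agree with the text-based count,
# de-duplicate it on the text column instead
tweets_by_text <- tweets[!duplicated(tweets$text), ]
```

Whether identical texts from different users should count as one tweet or several depends on the analysis; for sentiment analysis, dropping repeated texts may be what you want, but that is a judgement call rather than something the code decides for you.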

Hope this helps.

