Chris's Blog

Devops Shokunin

Blogspam Analysis with R Part 1

This morning, while checking the comments on this blog, I was surprised by the number of spam comments caught by the Akismet plugin, so I decided to dive in with some logfile analysis using R to see if I could lessen the scourge.

Grab the data from my nginx logs. Since I get very few real comments, we can assume that every request to the comment endpoint is spam.

# Write the CSV header, then append one "IP","Mon/DD" row per comment POST from 2014
echo '"IP","Date"' > ~/tmp/data_analysis/blogspam.csv
zgrep '/wp-comments-post.php' /var/log/nginx/acc* |
perl -ne 'if (m/.*:(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \- \- \[(\d+)\/(\w+)\/2014/)
{print "\"", $1, "\",\"", $3, "/", $2, "\"\n"}' >> ~/tmp/data_analysis/blogspam.csv
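The same extraction can be sketched in R, which is handy for testing the regular expression on a few lines before running it over every log file. This is only an illustration: the sample lines and the helper name parse_log_lines are made up, and it assumes the logs are in nginx's default "combined" format with the file name prefix that zgrep adds when it searches several files.

```r
# Hypothetical sample lines (documentation-range IPs, invented timestamps).
log_lines <- c(
  '/var/log/nginx/access.log.2.gz:203.0.113.7 - - [07/Sep/2014:10:15:32 +0000] "POST /wp-comments-post.php HTTP/1.1" 200 512 "-" "-"',
  '/var/log/nginx/access.log.3.gz:198.51.100.23 - - [02/Oct/2014:04:01:09 +0000] "POST /wp-comments-post.php HTTP/1.1" 200 498 "-" "-"',
  'a line that does not match'
)

# Same pattern as the perl one-liner: file prefix, IP, then day and month of 2014.
pattern <- "^.*:(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) - - \\[(\\d+)/(\\w+)/2014"

# Hypothetical helper: returns one row per matching line, skipping the rest.
parse_log_lines <- function(lines) {
  m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  m <- m[lengths(m) == 4]  # full match plus three capture groups
  data.frame(
    IP   = vapply(m, function(x) x[2], character(1)),
    Date = vapply(m, function(x) paste(x[4], x[3], sep = "/"), character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_log_lines(log_lines)
print(parsed)
```

Lines that do not match the pattern are silently dropped, which mirrors what the perl `if` does.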

Install R and start it

$ sudo apt-get install -y r-base-core
$ R

Load the data into R

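# stringsAsFactors=TRUE keeps IP and Date as factors so summary() counts them (needed on R >= 4.0)
spammers <- read.csv(file="blogspam.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)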
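Before going further it is worth checking that the columns came in as expected. The sketch below uses a made-up two-row sample in the same layout as blogspam.csv, and assumes the columns are named IP and Date as they are used later in this post. Since R 4.0, read.csv no longer converts strings to factors by default, and summary() only prints per-value counts for factors, so the conversion is requested explicitly here.

```r
# Hypothetical two-row sample in the same layout as blogspam.csv.
csv_text <- '"IP","Date"
"203.0.113.7","Sep/07"
"198.51.100.23","Oct/02"'

sample_spam <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

str(sample_spam)                      # column names and types
has_missing <- anyNA(sample_spam)     # TRUE would point to malformed rows
columns_are_factors <- all(vapply(sample_spam, is.factor, logical(1)))
print(c(has_missing = has_missing, columns_are_factors = columns_are_factors))
```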

Let’s find the most active IPs and the heaviest days:

 > summary(spammers)
               IP             Date
                :  2135   Sep/07 : 1364
                :  2069   Oct/02 : 1353
                :  1971   Oct/03 : 1348
                :  1864   Sep/09 : 1344
                :  1819   Oct/01 : 1333
                :  1712   Sep/30 : 1328
 (Other)        : 50435   (Other): 53935

Histogram by IP Frequency

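# Count comments per IP; the result has columns Var1 (the IP) and Freq (the count)
iplist <- as.data.frame(table(spammers$IP))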
hist(iplist$Freq, breaks=100, xlab="ip distribution",
      main="Spammer IPs",  col="darkblue")


This shows that no single IP is causing all of the trouble, so simply blocking one address will not solve the problem.
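To put a rough number on that, the counts from the summary(spammers) output above can be turned into cumulative shares. This is a back-of-the-envelope sketch: it assumes the six rows listed there are the six most active IPs and that "(Other)" covers everything else.

```r
# Counts copied from the summary(spammers) output.
top_ip_counts <- c(2135, 2069, 1971, 1864, 1819, 1712)
other_count   <- 50435

total     <- sum(top_ip_counts) + other_count

# Share of all spam removed by blocking the top 1, 2, ..., 6 IPs.
top_share <- cumsum(top_ip_counts) / total

print(round(100 * top_share, 1))
```

Even blocking all six would remove only about 19% of the spam. With the full table, the same idea works on `sort(iplist$Freq, decreasing = TRUE)`.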

Graph the number of spam comments per day.
Note: sort the data by date first, otherwise the line jumps back and forth and the graph is unreadable.

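# Count comments per day; the result has columns Var1 (the date) and Freq (the count)
dates <- as.data.frame(table(spammers$Date))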
datessorted <- dates[order(as.Date(dates$Var1,format = "%b/%d")),]
plot(as.POSIXct(datessorted$Var1,format = "%b/%d"),
  datessorted$Freq, main="spam comments", xlab="date", ylab="count", type="l")
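A small sketch of why the sort matters: the date labels are plain text, so their default order is alphabetical, not chronological. The four rows below are taken from the summary output, and the data frame is built by hand to mimic what as.data.frame(table(...)) returns. Appending the year 2014 is my own addition (the logs were filtered on that year) so that as.Date gets a complete date, and `%b` assumes an English locale for the month names.

```r
# Hand-built stand-in for the per-day table (columns Var1 and Freq).
dates <- data.frame(
  Var1 = c("Oct/02", "Sep/30", "Oct/01", "Sep/07"),
  Freq = c(1353, 1328, 1333, 1364),
  stringsAsFactors = FALSE
)

# Alphabetical order puts every October day before every September day.
alphabetical <- dates$Var1[order(dates$Var1)]

# Parsing the labels as dates first gives the order the plot needs.
parsed_dates  <- as.Date(paste0(dates$Var1, "/2014"), format = "%b/%d/%Y")
chronological <- dates$Var1[order(parsed_dates)]

print(alphabetical)
print(chronological)
```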


This gives me a basic idea of the problem; further analysis will follow in Part 2.

Note: in the time between clearing out the spam comments and finishing this post, I have received 101 new spam comments.