This morning, while checking the comments on this blog, I was surprised by the number of spam comments caught by the Akismet plugin, so I decided to dive in with some logfile analysis using R to see if I could lessen the scourge.
First, grab the data from my nginx logs. Since I get very few real comments, we can assume that every request to /wp-comments-post.php is spam.
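    # Header row; the column names must match the ones used in R later
    echo '"IP","Date"' > ~/tmp/data_analysis/blogspam.csv
    # Pull the IP, month and day out of every comment POST and append them as CSV rows
    zgrep '/wp-comments-post.php' /var/log/nginx/acc* | perl -ne 'if (m/.*:(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \- \- \[(\d+)\/(\w+)\/2014/) {print "\"", $1, "\",\"", $3, "/", $2, "\"\n"}' >> ~/tmp/data_analysis/blogspam.csv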
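To make it clearer what the Perl one-liner captures, here is a minimal R sketch that applies the same regular expression to a single log line. The log line itself is made up for illustration, and it assumes nginx's default "combined" log format plus the `filename:` prefix that zgrep adds when it searches several files.

```r
# Made-up log line in nginx's default "combined" format. The leading
# "filename:" prefix is what zgrep adds when it searches several files.
line <- paste0(
  "/var/log/nginx/access.log.2.gz:203.0.113.7 - - ",
  "[07/Sep/2014:10:15:32 +0000] ",
  "\"POST /wp-comments-post.php HTTP/1.1\" 200 512 \"-\" \"-\""
)

# Same capture groups as the Perl one-liner: IP, day, month.
pattern <- ".*:(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) - - \\[(\\d+)/(\\w+)/2014"

m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]

# m[1] is the whole match; m[2], m[3], m[4] are IP, day, month.
csv_row <- sprintf("\"%s\",\"%s/%s\"", m[2], m[4], m[3])
cat(csv_row, "\n", sep = "")
# prints: "203.0.113.7","Sep/07"
```

Note that the month comes before the day in the output, which is why the dates are later parsed with the format "%b/%d".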
Install R and start it
    $ sudo apt-get install -y r-base-core
    $ cd ~/tmp/data_analysis
    $ R
Load the data into R
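    # stringsAsFactors=TRUE makes summary() count each value; R 4.0 and later no longer do this by default
    spammers <- read.csv(file="blogspam.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)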
Let’s find the most active IPs and the heaviest days:
    > summary(spammers)
            IP              Date
     1.1.1.1 : 2135   Sep/07 : 1364
     2.2.2.2 : 2069   Oct/02 : 1353
     3.3.3.3 : 1971   Oct/03 : 1348
     4.4.4.4 : 1864   Sep/09 : 1344
     5.5.5.5 : 1819   Oct/01 : 1333
     6.6.6.6 : 1712   Sep/30 : 1328
     (Other) :50435   (Other):53935
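summary() only lists the six most frequent values per column. If you want a longer ranking, one option is to count and sort explicitly. This is a minimal sketch on a made-up six-row sample standing in for the real `spammers` data frame; the names `ip_counts`, `day_counts` and `n` are my own.

```r
# Made-up sample standing in for the real `spammers` data frame.
spammers <- data.frame(
  IP = rep(c("10.0.0.1", "10.0.0.2", "10.0.0.3"), times = c(3, 2, 1)),
  Date = c("Sep/07", "Sep/07", "Oct/01", "Sep/09", "Oct/01", "Oct/01"),
  stringsAsFactors = TRUE
)

# Count rows per IP and per day, largest first.
ip_counts  <- sort(table(spammers$IP), decreasing = TRUE)
day_counts <- sort(table(spammers$Date), decreasing = TRUE)

# Show the top n entries; n = 2 here, raise it for real data.
n <- 2
print(head(ip_counts, n))
print(head(day_counts, n))
```

On the real data, something like n = 20 would give a better feel for how quickly the counts fall off.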
Histogram by IP Frequency
    iplist <- as.data.frame(table(spammers$IP))
    hist(iplist$Freq, breaks=100, xlab="ip distribution", main="Spammer IPs", col="darkblue")
This shows that no single IP is causing all of the trouble: even the busiest address in the summary above accounts for only 2135 of the 62005 requests (about 3.4%), so simply blocking one IP will not solve the problem.
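One way to put a number on this is to look at the cumulative share of spam covered by the top IPs. The sketch below uses made-up counts in place of `iplist$Freq`, and the variable names are only illustrative.

```r
# Made-up per-IP counts standing in for iplist$Freq.
freq <- c(40, 30, 20, 10)

# Largest first, then the running share of all spam covered so far.
ranked <- sort(freq, decreasing = TRUE)
cum_share <- cumsum(ranked) / sum(ranked)

# Smallest number of IPs that would have to be blocked to stop half the spam.
ips_for_half <- which(cum_share >= 0.5)[1]

print(round(cum_share, 2))
print(ips_for_half)
```

If `ips_for_half` comes out large on the real data, blocking by IP address is probably not worth the effort.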
Graph the number of spam comments per day.
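Note: you need to sort the data by date, or your lines will be all over the place and the graph will be unreadable. table() orders the dates alphabetically (Oct before Sep), and type="l" connects the points in row order.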
    dates <- as.data.frame(table(spammers$Date))
    datessorted <- dates[order(as.Date(dates$Var1, format="%b/%d")),]
    plot(as.POSIXct(datessorted$Var1, format="%b/%d"), datessorted$Freq, main="spam comments", xlab="date", ylab="count", type="l")
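Parsing "%b/%d" without a year makes R fill in the current year, which works but is a little fragile. A possible alternative, sketched here on made-up counts, is to attach the year explicitly and keep the parsed date in its own column. The `Day` column name and the pinned locale are my own choices, and the year 2014 is an assumption based on the log filter above.

```r
# Month abbreviations such as "Sep" are parsed according to the LC_TIME
# locale, so pin it to "C" (English names) for this session.
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up daily counts, in the alphabetical order table() would produce.
dates <- data.frame(Var1 = c("Oct/01", "Sep/07", "Sep/30"),
                    Freq = c(3, 5, 2))

# Assumption: all rows are from 2014, as the log filter only matches that year.
dates$Day <- as.Date(paste0("2014/", dates$Var1), format = "%Y/%b/%d")

# Sort chronologically so a line plot connects the days in order.
datessorted <- dates[order(dates$Day), ]
print(datessorted)
```

With the real data, `datessorted$Day` could then be passed as the x argument to plot() in place of the as.POSIXct() call.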
This gives me a basic idea of the problem; further analysis will follow in Part 2.
Note: In the time since I cleared out the spam comments and sat down to write this post, 101 new spam comments have arrived.