Chris's Blog

Devops Shokunin

Blogspam Analysis with R Part 1

Comments Off on Blogspam Analysis with R Part 1

This morning while checking the comments on this blog I was surprised at the amount of spam comments caught by the Akismet plugin, so I decided to dive in with some logfile analysis using R to see if I could lessen the scourge.

Grab the data from my nginx logs, since I get very few comments, we can assume that everything is spam.

echo '"IP", "DATE"' > ~/tmp/data_analysis/blogspam.csv
zgrep '/wp-comments-post.php' /var/log/nginx/acc* |
perl -ne 'if (m/.*:(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \- \- \[(\d+)\/(\w+)\/2014/)
{print "\",$1, "\",\"", $3, "/", $2, "\"\n"}' >>

Install R and start it

$ sudo apt-get install -y r-base-core
$ R

Load the data into R

spammers  <- read.csv(file="blogspam.csv", head=TRUE,sep=",")

Let’s find the biggest IP and heavest days:

 > summary(spammers)
               IP             Date        : 2135   Sep/07 : 1364        : 2069   Oct/02 : 1353        : 1971   Oct/03 : 1348        : 1864   Sep/09 : 1344        : 1819   Oct/01 : 1333        : 1712   Sep/30 : 1328  
 (Other)        :50435   (Other):53935

Histogram by IP Frequency

iplist <-$IP))
hist(iplist$Freq, breaks=100, xlab="ip distribution",
      main="Spammer IPs",  col="darkblue")


This shows that there is no single IP causing all of the trouble, so there is no simple solution of blocking a single IP.

Graph the number of spam comments per day.
Note: you need to sort the data by date or your lines will be all over the place and the graph unreadable

dates <-$Date))
datessorted <- dates[order(as.Date(dates$Var1,format = "%b/%d")),]
plot(as.POSIXct(datessorted$Var1,format = "%b/%d"),
  datessorted$Freq, main="spam comments", xlab="date", ylab="count", type="l")


This gives me a basic idea of the problem and further analysis will be available in Part 2

Note: Since I sat down to write this post after clearing out the spam comments I now have 101 new spam comments.

Plotting Time Series Data with Gnuplot


When dealing with external customers and non-technical people I find it beneficial to provide some sort of visual representation.

Dumping a ton of data on people rarely conveys the message effectively.

My go to tool for generating graphs, especially of time-series data, is gnuplot.  It’s free, flexible and runs everywhere.

Data files httpa.reqs and httpb.reqs are comma separated with the first column as time in epoch seconds and the second the captured value.


create the following file and save as http.gnuplot

set datafile separator ","
set terminal png size 900,400
set title " Web Traffic"
set ylabel "Requests per second"
set xlabel "Date"
set xdata time
set timefmt "%s"
set format x "%m/%d"
set key left top
set grid
plot "httpa.reqs" using 1:2 with lines lw 2 lt 3 title 'hosta', \
     "httpb.reqs" using 1:2 with lines lw 2 lt 1 title 'hostb'

Generate the graph and save it

gnuplot < http.gnuplot  > requests.png

The break down of the lines is as folows:

set datafile separator ","

set the field delimiter to a comma , the default is a space. Just do not include this line for space separated data.

set terminal png size 900,400

Have gnuplot output a PNG file with the size specified. You can also run gnuplot in interactive mode where you do no need this line.

set title " Web Traffic"

Set the title of the graph at the top.

set ylabel "Requests per second"
set xlabel "Date"

Always label your graph so that when it gets passed around to people unfamiliar with the history of the request it is readily apparent to what is going on. This will also keep my high school math teacher quiet.

set xdata time

Tell gnuplot that the x axis will be time data. This allows for more flexible time series manipulations.

set timefmt "%s"

Let gnuplot know what format the time string will be in. “%s” is epoch seconds, “%m/%d/%Y:%H:%M:%S” will read 01/28/2011:00:01:14 in as a time value. Not having to convert date formats with some script first on my data is one of the big wins from using gnuplot.

set format x "%m/%d"

Set the date output format for the x axis.

set key left top

Set the legend to the top left corner.

set grid

Turn on grid lines, they are off by default.

plot "httpa.reqs" using 1:2 with lines lw 2 lt 3 title 'hosta', 
"httpb.reqs" using 1:2 with lines lw 2 lt 1 title 'hostb'

1:2 is the field numbers. The first is the x-axis value and the second the y-axis values.

“with lines” uses lines instead of simply data points.

lw is the line width, with 1 being the default.

lt is the line type or color, gnuplot will pick colors for you automatically if you do not specify them.

title is for the legend.

You can plot multiple data sources on the same graph.

The gnuplot documentation is available here

An excellent gallery of the amazing capabilities of gnuplot is here