February 2014

Gavin Pickin

Dev Ops - From Mail Logs to DB Stats in a CFML Dashboard - Part 4

CFML Language, Server Admin

Welcome back to the fourth post in this mini series on how to take some Mail Logs from our qmail servers, and massage them and move them around, and finish with them on a cfml box ready to massage them a little more, and make some pretty and readable graphs and tables, to keep our Mail Servers in the good books, and off those pesky Spam Blacklists. In post one, two and three, we found the files, identified what we wanted out of the log files, planned out how we were going to use cron to move files, reset perms, and finally grep the important stuff out of the logs, and get it ready for our cfml engine to do its thing.

Today, we're going to look at the start of my scripts (rusty at first, I admit), that read in the log files, and import them into the database. As I mentioned in previous posts, I think I have all the data out of the logs that I need, although, I bet I haven't found all the ways I will want to use this data, so I decided to import the data in raw format for step 1, and step 2, I will crunch the raw data to pull out the records of use. By leaving the raw data, with some vitals information along side that, like Date Time Stamp, I can always go back and crunch that data for another pattern, when I find it in the future, and by storing the Date Time Stamp, I can of course purge old Raw Data as time passes, and I care less and less about it.

To get my scripts up and running, I'm starting in a folder, with some plain and simple single files, no framework, no real organization, just single pages of code, while I experiment and get things working. There are some serious duplication going on, which I will refactor out in time, so I am sure you will see a million things wrong with the code, but try and think about what I'm doing, compared to how I'm doing it. Although, of course, we will refractor this code, and if you want to give advice, please do, just letting you know its raw at this point, and I expect a lot of it to be refactored into functions etc.

As I build this, I am aware that cfml as a language is not built for reading logs and manipulating, although, I think it does pretty well, but to keep track of it, I'm going to throw in some timers, to keep an eye on how things take. Like most people, we usually knock up our own little tick debugging notes, and its pretty effective, but I decided to try out my buddy Adam Cameron's little stopwatch function he wrote about a while ago. It does the same job, but uses an object, stores the information, and makes it easy to add laps, and then dump the whole object when you're done. It takes care of total time, incremental time, and is simple to use. Why invent the wheel, right… so here is the link to his little stopwatch… and I'm including it in the header of my file.

<cfsetting requesttimeout="600">
<cfinclude template="stopwatch.cfm">
<cfset ds = "dev_mydatasource">

<cfparam name="thelogname" default="clean_maillogH.log">
<cfset stopwatch = makeStopwatch()>


First, I give the file a nice long timeout, knowing this will be a pretty long running task. I include the function for the stopwatch, and then set my datasource. Since I'm going to be working on 2 types of files, I want to pass that in, I'll default it to the hourly log file, and then create an instance of the stopwatch to use on my page. No rocket science there.

<cfset stopwatch.start("Begin timing")>
<cfquery name="ind_control" datasource="#ds#">
select * from ind_control
where filename = <cfqueryparam value="#thelogname#" cfsqltype="cf_sql_varchar">


Next, I start my stopwatch, labelled "Begin timing" and then query my control table. My control table is pretty simple, I have the following fields 

  • id - primary key auto increment
  • filename - name of the file, differentiates between daily and hourly log
  • dte_good - date of the first record in the file
  • int_good_end - last record processed in the file
  • int_endoffile - was the end of the file reached

Since I have 2 log files I'm working on, I have 2 rows in my control table. One for clean_maillog.log for daily, and clean_maillogH.log for hourly. As I process records, I am going to check the date of the first entry, if the date matches the date in the log file, we will continue processing from where we left off, the int_good_end value. If the date does not match, the file is a new file, and we must start processing from the first row. End Of File is recorded if we reached the end of the file in our last batch of processing. In the hourly, this will be set hourly, but if the process runs, and finds the the last row processed is no longer the end of the file, this end of file flag is reset, and processing will continue as it did previously.

To show you how good cfml really is with reading files like this, we'll build the file and show you a few numbers as we go… so first, lets read our file, and count the number of lines in the file.

<cfset thecounter = 1>
<cfloop file="/home/maillogagent/#thelogname#" index="thelog" >
    <cfset thecounter = thecounter + 1>
<cfset stopwatch.lap("Counter Loop done")>
<cfdump var="#stopWatch.getTimeline()#">

Running on the current file at the time of writing this, 132322 lines

    It takes 219 milliseconds to increment the counter, and output the counter at the end

    It takes 238 milliseconds to increment the counter, and output the counter for each line and at the end.

That is pretty quick.
Lets try a REALLY big file, and see the difference.

One of the raw, ungrepped files reaches 1.179gb… lets see what our cfml can do with that.

That is almost 10 million records, and it takes under 10 seconds to read and loop through every line, incrementing the counter, and outputting it. The worse week we have had in the several months I have been looking at logs, hits a peak of 400k, and that was with a huge outbreak of mail…. so in the scheme of things, cfml can read the files, and read them pretty quick… on average about 1 second per million lines, without any processing… so lets see what happens when we want to do some basic manipulation and preparing for database activity.

We're going back to our weekly log, of approx 132000 records for the next set of tests.Now, inside the loop, we're going to pull out the date, and create date time objects for each line. When we go to pull the actual data in, we'll pull it in in batches, but for now, for timing, lets see how this builds out.

<cfif find(' independence ', theline) gt 0>
    <cfset cleanedup_val = left(theline, find(' independence ', theline)-1)>
    <cfset cleanedup_val = left(theline, find(' new-independence ', theline)-1)>
<cfset thedate = parsedatetime(dateformat(now(), "yyyy") & " " & cleanedup_val)>
<cfif thedate gt now()>
    <cfset thedate = dateadd("yyyy", -1, thedate)>


First, we pull the first piece of the line off, as it contains the date. We have to use an if statement to find that, as our server name changed as we migrated servers, so some log entries have independence as its name, and others as new-independence… so we find the location of that string, and grab everything to the left of it.

We then add the year to the string, since it didn't have it in the logs, and then parse the concatenated string to get use a date time object. Of course, if I parse anything from Dec 20th, and put the 2014 in front of it, the time is wrong, so if the date is greater than now, the log must be last years, so we just take a year off of it, and we have our date object ready for manipulation.

Lets run our test and see how long the read and date parsing takes.

So cfml loops through a file with 132000 lines, parsing the date on each one, and we're still under 1 second. So far so good, of course, the more we do, the more overhead we're adding to each and every line. Next, lets add a database call to the mix.

Each and every line, we're going to check to see if that date exists in the database already.

<cfquery name="logCheck" datasource="#ds#">
select id from ind_log
where dte_log = #thedate#


Now we're getting into some big numbers, 132000 database calls, across the network to our database server, and this is where you see the bottleneck. Trying to process 132000 records with 1.6 millions records in the table (I have been loading data as I write this) and the request times out, at the servlet level. I assume there is a timeout which I have no overridden in Tomcat, and 132k does not success. I set an if statement, only running the query for the first 100,000 loops, and these are the results

To see how much affect the data has on the timing, I duplicated the table, with only a few records in it, and ran the same code, and these are the results.

So 39512 / 100000 = 0.395        with 1.6 million records to be searched (with an index on the field)

Vs 55188 / 132332 =  0.417      with only a handful of records

Seeing there was not much of a difference in the results, I decided to run the tests again, and interestingly enough, against 1.6 million records, and it completed in 52 seconds, so it looks like my initial test that failed was because of another task, job, or external factor affecting the speed, and putting the request over that 60 second time limit I believe is imposed at the servlet level on my server right now.

The point is, we can crunch quite a lot of data with cfml, and even with 100k-130k of database queries (albeit reads) we can get a lot done.... not too bad I think for a language where this isn't its strong point, but it does enough to be useful in these situations, especially for what I'm doing here.
I think we have covered a lot, next post, we'll continue processing our logs, and look at insertion speeds, and decide whats a safe and logical batch size. 





Blog Search