Blog

07
March 2014

Gavin Pickin

Dev Ops - From Mail Logs to DB Stats in a CFML Dashboard - Part 5

CFML Language, Server Admin

Its been a month since my last post in this series, so if you weren't reading my blog then, jump back into late Jan early Feb, and catch up on where we are at. We've done some testing, and we ready to clean up our log file import into the database.

Again, the code is raw, single page, no framework, not the prettiest thing around, but I think I have it pretty well nailed down now.
I have put the file in a gist, and will walk through it, discussing sections, and why I did what.

https://gist.github.com/gpickin/9406303


<cfsetting requesttimeout="600">
<cfparam name="thelogname" default="clean_maillogH.log">
<cfinclude template="stopwatch.cfm">
<cfset ds = "dev_bktools">
<cfset stopwatch = makeStopwatch()>
 
<cfoutput>
 
<cfset stopwatch.start("Begin timing")>

 

Lines 1 - 9.
As I mentioned last time, sometimes I will pass the log name in, because I have 2 logs I am processing, one updated hourly by our cron scripts, and the other, at the end of the week, in which we just check to make sure we don't miss any log items. I initiate our stopwatch, and begin timing.


<cfquery name="ind_control" datasource="#ds#">
	select * from ind_control
	where filename = <cfqueryparam value="#thelogname#" cfsqltype="cf_sql_varchar">
</cfquery>

 

Line 11 - 14
I am querying my control table, based on the name of the file, we want to see if the log we're processing is the same file we last processed, and if it is, where did we get to last time, so we can continue processing.


<cfset thecounter = 1>

 

Line 16 - Simple counter, so we know what line of the log file we're on.


<cfloop file="/pathtomylog/#thelogname#" index="thelog" >

 

Line 17 - We're going to loop through the log, line by line… as we progress, we'll count each line, and use that to compare to our last processed line, and see if we're at the end of the file, or not.


<cfif thecounter eq 1>

 

Line 18 - If we're on the first line of the file, we want to check and see if we have processed this before or not.


<cfset thedate = parseDateFromLine(thelog)>

 

Line 19 - Calls a UDF to take the line as input, and parse out the date/time, and return a date object.


<cfset batchsize = 50000>

 

Line 21 - We set the batch size to 50,000. It seemed to be quite a good size for reading and inserting into the database, it it should only take a few runs to roll through even large logs.


<cfif ind_control.dte_good neq thedate>
	<cfset startrow = 1>

 

Line 22 - 23 - If the date does not match the start date we had in the control file, this is a new file, so we need to start processing from the beginner no matter what the control file says.


<cfelse>
	<cfset startrow = ind_control.int_good_end + 1>
	<cfif ind_control.int_endoffile is 1>
		<cfset endoffile_found = 1>
	</cfif>	
</cfif>	

 

Line 24 - 29 - We set the starting row as the last row we processed, plus 1, and if last time we were at the end of the file, we set a variable to state that, later, we'll check that, to see if the number of rows have changes, and reset it or not.


<cfset endrow = startrow + batchsize - 1>

 

Line 30 - We just set the endow, for this batch run to be the start row + the batch size - 1, so we process the batch size number of records… since the start row is 1, 1 + 50000 is 50001, take one off, to keep nice round numbers.


Now we're setup the start of the file, we can process the actual logs.


<cfif thecounter gte startrow and thecounter lte endrow and not structkeyexists(variables, "endoffile_found")>

 

Line 33 - If we are supposed to process this row, we'll jump in and do some work… if not we skip all the following processing, and just count the line. 

  • We check for if the counter (row number) is greater than equal to the start row. We don't want to process any we have already processed.
  • We check for see if the counter (row number) is less than or equal to our endow, we only want to process those in our batch size
  • We only want to process if there is not an EOF flag. If there is an EOF flag, we'll just count the rows, and at the end, we'll reset the EOF flag in the control table if we found more rows than we thought the log file had (this is mainly for the hourly log, that grows every hour). 

If it is a row we should process, we jump into the next block of code.


<cfset thedate = parseDateFromLine(thelog)>

 

Line 34 - Get the Date/Time so we have it.


<cfquery name="logCheck" datasource="#ds#">
	select id from ind_log
	where dte_log = #thedate_log#
	and txt_log = <cfqueryparam value="#theline#" cfsqltype="cf_sql_longvarchar" >	
</cfquery>

 

Line 36 - 39 Look in the DB for an entry with the same date time stamp, and the actual log line as well. I know just using the log itself would work, but a date time index is faster, than the text index, so by using that as the first where clause, we can speed the look up dramatically.


<cfif logCheck.recordcount eq 0>
	<cfquery datasource="#ds#">
		insert into ind_log
		set 
		dte_log = #thedate_log#,
		txt_log = <cfqueryparam value="#theline#" cfsqltype="cf_sql_longvarchar" >	
	</cfquery>

 

Line 40 - 47 If that record was not found in our DB then we'll insert it into the db, inserting the date time stamp, and the line itself, for future processing. Note, in the hourly processing, with the code setup this way, it should never find a duplicate… but, at the end of the week, the hourly log gets saved as the weeks log, and we will process through that, checking for any that happened to get missed, so we double check. 


<cfelse>
	<p>Duplicate found</p>		
</cfif>

 

Line 48 - 50 state if a duplicate was found.


<cfset thecounter = thecounter + 1>

 

Line 53 - Increments the counter, whether we tried to process it or not. and Line 54 closes out the loop.


</cfloop>
#thecounter#	
<cfset stopwatch.lap("Loop and Save Data")>

 

Line 55 - 57 - We dump the total count of the rows after looping through them all. Not needed for the scheduled task, but if running manually, its useful. We set the lap on the watch.


<cfif endrow gt thecounter>
	<cfset endrow = thecounter>
	<cfset endoffile = 1>
<cfelse>
	<cfset endoffile = 0>	
</cfif>	

 

Line 59 - 64 If the endow was greater than the counter, we processed all the rows in the file, so we mark the end of the file, and set the endow to the number of rows, so if we start processing again, we start where we ended.
If the endow was not greater, we mark the end of file was not reached, and next time this is run, it will continue processing.


<cfif structkeyexists(variables, "endoffile_found")>
	<cfif thecounter gt ind_control.int_good_end>
		<cfset endoffile = 0>
		<cfset endrow = ind_control.int_good_end>
	</cfif>
</cfif>	

 

Line 67 - 72 We check to see if the counter was gt that the last processed row in the control file… if it is, we reset the endoffile, and make sure the endow was the control. We could probably remove this logic, but its initially how I wrote it, although, we could adjust it in the Line 33 block looking back at it.


<cfquery name="ind_control" datasource="#ds#">
	update ind_control
	set dte_good = #thedate#,
	int_good_end = #endrow#,
	int_endoffile = #endoffile#
	where filename = <cfqueryparam value="#thelogname#" cfsqltype="cf_sql_varchar">
</cfquery>

 

Line 74 - 80 - Update the control file… so we know where we got to, and where we're going.


<cfset stopwatch.lap("Saved to Control File")>	

<cfdump var="#stopWatch.getTimeline()#">

 

Line 82 - Lap the clock
Line 84 - Dump the stopwatch output, so we can see what time was spent processing.


<cffunction name="parseDateFromLine">
	<cfargument name="thelog" required="yes">
	
	<cfif left(arguments.thelog,7) eq "maillog">
			<cfset theline = right(arguments.thelog, ( len(arguments.thelog)-find(":", arguments.thelog) ) )>
		<cfelse>
			<cfset theline = arguments.thelog>
		</cfif>		
		<cfif find(' independence ', theline) gt 0>
			<cfset cleanedup_val = left(theline, find(' independence ', theline)-1)>
		<cfelse>
			<cfset cleanedup_val = left(theline, find(' new-independence ', theline)-1)>
		</cfif>			
		<cfset thedate = parsedatetime(dateformat(now(), "yyyy") & " " & cleanedup_val)>
		<cfif thedate gt now()>
			<cfset thedate = dateadd("yyyy", -1, thedate)>
		</cfif>	
		<cfreturn thedate>
</cffunction>		

 

Line 87 - 105 - The UDF Function that takes the line, and parses out the Date.

Our logs have several names, due to the migration. So we look for the start of the line, if it says maillog, we remove that from the start of the line.
We look for the server name, independence, if its there (without the new- in front of it) we grab everything in front of it.
If not, we grab everything in front of new-independence.
We add the year, since the year is not given in the logs. We guess and add this year
We parse the date… if the date is in the future, then the year we added was off by 1 year, so we remove a year from the date.
And then we return the date.


Its a lot of simple tasks, but it seems to be working pretty well.
As I stated, its not the most elegant code (yeah, its in tags too), but its processing the logs, and prepping them for more crunching. 
The schedule task runs every few minutes, after the files have been transferred by the cron tasks. So every hour, new data is fed into the DB.

Next we'll write a couple of shorter scripts that take these log lines, and pull them apart, and save more refined data into new tables. 
I will add a flag to each log row, so we can tell which log rows have been processed, for each type of mined data.

We're almost ready to make that pretty dashboard. 
We're into ColdFusions world now, Code and DBs.

Check back soon for that.
Gavin

Blog Search