Access logs are simply text files that store information about requests sent to your Apache or WebLogic server. They include values like the requested resource, time of the request, response code, and response time, among others.
Access logs are valuable for gaining insight into what your HTTP server is doing. There are many tools out there that analyze access logs, but we'll focus on how you can do your own ad-hoc log filtering, a technique I've found useful in initial troubleshooting.
There are three access log formats that you can use:
The Common Log Format is the most widely used log format for HTTP servers. It is typically defined as:
LogFormat "%h %l %u %t \"%r\" %>s %b" common
Here’s what this log would contain:
127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001
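If you just want to pull the fields out of a Common Log Format line without any extra tooling, a regular expression will do. Here's a minimal sketch; the pattern and group names are my own, not from any library:

```python
import re

# Regex for the Common Log Format: host, identity, user, time, request, status, size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

line = '127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001'
fields = CLF_PATTERN.match(line).groupdict()
print(fields['host'])    # 127.0.0.1
print(fields['status'])  # 200
```

This is fine for quick one-off checks, but it falls apart on edge cases (escaped quotes in the request line, missing fields), which is part of why a parsing library is worth reaching for.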
The Combined Log Format is very similar to the Common Log Format but includes two additional headers – the referrer and the user agent. Its definition looks as follows:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
An example of the log line:
127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001 "http://www.example.com/" "Mozilla/5.0"
The Custom Log Format lets you define your own directive string. The custom format used in this post builds on the Common Log Format, adding the response time and URL path along with a %R prefix. Its definition looks as follows:
CustomLog
'%R:%h %l %u %t "%r" %>s %b %D "%U"'
An example of the custom log line:
access_log.205188100:127.0.0.1 - - [03/Nov/2023:00:00:00 +0000] "GET /example HTTP/1.1" 200 1517 6170 "-"
This custom format makes a couple of additions to the Common Log Format: %D logs the time taken to serve the request (in microseconds), and %U logs the URL path requested.
To run my own ad-hoc analysis of server access logs, I typically turn to Python for its ease of use and rich ecosystem of libraries. One framework that is particularly useful for log parsing is lars, a small framework for working with httpd logs.
The primary purpose of lars is to provide a set of conversions to ease loading such logs into relational databases, but numerous other possibilities can be realized with a little imagination. Which is to say: lars is not a web log analyzer. However, it is a toolkit that makes it quite easy to construct your own analyzer.
A simple lars script opens a log source and uses the source and target wrappers provided by lars to convert the log entries into another format while potentially filtering the entries. Here's an example:
import io
from lars import apache, csv

with io.open('sample_log_file.log', 'r') as infile, io.open('output.csv', 'wb') as outfile:
    with apache.ApacheSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            target.write(row)
The example above opens 'sample_log_file.log' as an Apache source using the common log format (the default), then iterates through the rows and writes each one to 'output.csv'.
Dynamic Filtering
To dynamically filter the log rows, we'll use the Python eval() function. The expression argument passed to eval() is parsed and evaluated as a Python expression. Note that when both the globals and locals arguments are omitted, the expression is evaluated with the globals and locals of the environment where eval() is called.
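To see how eval() picks up names from the calling scope, here's a tiny standalone example; the row object is a stand-in I've made up, not an actual lars row:

```python
from types import SimpleNamespace

# A stand-in for a parsed log row
row = SimpleNamespace(method='POST', status=200)

# The filter is just a string; eval() resolves 'row' from
# the scope at the point of the eval() call
expr = 'row.method == "POST" and row.status < 400'
print(eval(expr))  # True
```

The usual caveat applies: eval() will execute whatever expression it's given, so this approach is only appropriate for expressions you write yourself, never for untrusted input.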
We can place our eval where the expression has access to each lars row object, which has attributes named after the columns of the source (with characters that cannot appear in Python identifiers replaced with underscores). These row attributes match the descriptions for the Common, Combined, or Custom Log Formats.
Here's an example of how we can filter the access log using the eval() function to return all POST requests.
import io
from lars import apache, datatypes, csv

# Custom Apache log format
CUSTOM_LOG_FORMAT = '%R:%h %l %u %t "%r" %>s %b %D "%U"'

input_file = 'sample_log_file.log'
expr = 'row.request.method == "POST"'

# Row format for output
NewRow = datatypes.row('host', 'path', 'time', 'size', 'time_taken_ms')

rows_found = 0
with io.open(input_file, 'r') as f:
    with apache.ApacheSource(f, CUSTOM_LOG_FORMAT) as source:
        for row in source:
            if eval(expr):
                rows_found += 1
                # Create new output row
                new_row = NewRow(row.remote_host, row.request.url.path_str, row.time, row.size, row.time_taken_ms)
                print(new_row)
print(f'{rows_found} rows found for expression {expr}')
That's interesting, but not compelling, right? Let's add the ability to filter log entries based on datetime. To do so, we'll add the function below.
from datetime import datetime

def is_time_between(begin_time, end_time, check_time=None):
    # If check time is not given, default to current UTC time
    check_time = check_time or datetime.utcnow().time()
    if begin_time < end_time:
        return begin_time <= check_time <= end_time
    else:  # crosses midnight
        return check_time >= begin_time or check_time <= end_time
With the is_time_between function, we can filter log entries with expressions like these:
expr = 'row.time == datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S")'
expr = 'is_time_between(datetime.strptime("09/15/2023 00:00:01", "%m/%d/%Y %H:%M:%S"), datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S"), row.time)'
To get everything working, create a function that wraps the above logic.
def filter_access_log(input_file, output_file, expr):
    # Row format for output
    NewRow = datatypes.row('host', 'path', 'time', 'size', 'time_taken_ms')
    rows_found = 0
    total_rows = 0
    with io.open(input_file, 'r') as f, io.open(output_file, 'wb') as o:
        with apache.ApacheSource(f, CUSTOM_LOG_FORMAT) as source, csv.CSVTarget(o, header=True) as target:
            for row in source:
                total_rows += 1
                if eval(expr):
                    rows_found += 1
                    # Create new output row
                    new_row = NewRow(row.remote_host, row.request.url.path_str, row.time, row.size, row.time_taken_ms)
                    target.write(new_row)
    print(f'{rows_found} rows of {total_rows} found for expression {expr}')
Here are some example filter expressions you can use to find rows that match particular criteria.
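For reference, here are a few expression strings I find myself reusing. The row attribute names follow the custom format parsed above (status for %>s, time_taken_ms for %D), so adjust them if your format differs:

```python
EXAMPLE_EXPRESSIONS = [
    # All POST requests
    'row.request.method == "POST"',
    # Server-side errors
    'row.status >= 500',
    # Slow requests: response time above a threshold
    'row.time_taken_ms > 5000',
    # Requests under a particular path prefix
    'row.request.url.path_str.startswith("/example")',
    # Combined conditions
    'row.status >= 400 and row.request.method == "GET"',
]
```

Any of these can be passed straight to filter_access_log as the expr argument.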
I use this script to run an initial pass at analyzing access logs: to do things like filter a set of requests between two times, or get all POST requests with response times over a certain threshold. It's a great way to make a quick determination about what to do with the logs.
Happy hunting
~sn