iwla
Introduction
iwla (Intelligent Web Log Analyzer) is basically a clone of awstats. The main problem with awstats is that it is a very monolithic project, with everything in one big Perl file. In contrast, iwla has been designed to be very modular: a small analysis core and a lot of filters. It can be viewed as a chain of UNIX pipes. The philosophy of iwla is: add, update, delete! That is the job of each filter: modify the statistics until the final result is reached. It is written in Python.
Nevertheless, iwla focuses only on HTTP logs. It uses data (robot definitions, search engine definitions) and design from awstats. Moreover, it is not dynamic: it only generates static HTML pages (with an optional gzip compression).
Usage
./iwla [-c|--config-file file] [-C|--clean-output] [-i|--stdin] [-f FILE|--file FILE] [-d LOGLEVEL|--log-level LOGLEVEL] [-r|--reset year/month] [-z|--dont-compress] [-p] [-D|--dry-run]
-c : Configuration file to use (default conf.py)
-C : Clean output (database and HTML) before starting
-i : Read data from stdin instead of conf.analyzed_filename
-f : Read data from FILE instead of conf.analyzed_filename. Multiple files can be specified (comma separated); gz files are accepted
-d : Loglevel in ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']
-r : Reset analysis to a specific date (month/year)
-z : Don't compress databases (bigger but faster, not compatible with compressed databases)
-p : Only generate display
-D : Dry run (don't write/update files to disk)
Basic usage
In addition to the command line, iwla reads parameters from default_conf.py. Users can override default values in a conf.py file. Each module requires its own parameters.
Main values to edit are :
- analyzed_filename : web server log
- domain_name : domain name to filter
- pre_analysis_hooks : List of pre analysis hooks
- post_analysis_hooks : List of post analysis hooks
- display_hooks : List of display hooks
- locale : Displayed locale (en or fr)
You can also append an element to an existing default configuration list by using the "_append" suffix. Example : multimedia_files_append = ['xml'] or multimedia_files_append = 'xml' will append 'xml' to the current multimedia_files list.
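The "_append" behaviour described above can be pictured with the following minimal sketch. The function name apply_overrides and the sample default values are hypothetical and only illustrate the merging rule; this is not iwla's actual implementation.

```python
def apply_overrides(defaults, user_conf):
    """Merge user settings into defaults, honouring the "_append" suffix."""
    merged = dict(defaults)
    for key, value in user_conf.items():
        if key.endswith('_append'):
            base = key[:-len('_append')]
            # Accept either a single element or a list of elements.
            extra = value if isinstance(value, list) else [value]
            merged[base] = list(merged.get(base, [])) + extra
        else:
            # A plain key simply replaces the default value.
            merged[key] = value
    return merged


# Hypothetical defaults, for illustration only.
defaults = {'multimedia_files': ['png', 'jpg'], 'locale': 'en'}
as_list = apply_overrides(defaults, {'multimedia_files_append': ['xml']})
as_str = apply_overrides(defaults, {'multimedia_files_append': 'xml'})
print(as_list['multimedia_files'])
```

Both forms give the same list, and the defaults themselves are left untouched.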
Then, you can launch iwla. Output HTML files are created in the output directory by default. To quickly see the result, go into output and type
python -m SimpleHTTPServer 8000
With Python 3, the equivalent command is python3 -m http.server 8000. Then open your favorite web browser at http://localhost:8000. Enjoy!
Warning : The order of the hooks lists is important. Some plugins may require other plugins, and the order of display_hooks is the order of the displayed blocks in the final result.
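As an illustration of the values listed above, a conf.py could look like the sketch below. The path, the domain and the selection and order of hooks are placeholders chosen for the example, not recommended settings; only the setting names and plugin names come from this document.

```python
# conf.py (sketch): every value below is a placeholder.
analyzed_filename = '/var/log/nginx/access.log'  # hypothetical log path
domain_name = 'example.org'                      # hypothetical domain

# Hooks run in list order, and display blocks appear in list order.
pre_analysis_hooks = ['page_to_hit', 'robots']
post_analysis_hooks = ['referers', 'top_pages', 'top_downloads']
display_hooks = ['top_visitors', 'all_visits', 'referers',
                 'top_pages', 'top_downloads']

locale = 'en'

# Extend a default list instead of replacing it.
multimedia_files_append = ['xml']
```

Each display hook that has a post analysis counterpart (referers, top_pages, top_downloads) is listed together with it, since display plugins may require the matching post analysis plugin.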
Interesting default configuration values
- DB_ROOT : Default database directory (default ./output_db)
- DISPLAY_ROOT : Default HTML output directory (default ./output)
- log_format : Web server log format (nginx style). Default is apache log format
- time_format : Time format used in log format
- pages_extensions : Extensions that are considered as an HTML page (or result), as opposed to hits
- viewed_http_codes : HTTP codes that are considered OK (200, 304)
- count_hit_only_visitors : If False, don't count visitors that don't GET a page but only resources (images, rss...)
- multimedia_files : Multimedia extensions (not accounted as downloaded files)
- css_path : CSS path (you can add yours)
- compress_output_files : File extensions to compress with gzip during display build
Plugins
As previously described, plugins act like UNIX pipes : statistics are constantly updated by each plugin to produce the final result. There are three types of plugins :
- Pre analysis plugins : Called before generating days statistics. They are in charge of filtering robots, crawlers, bad pages...
- Post analysis plugins : Called after basic statistics computation. They are in charge of enriching them with their own algorithms
- Display plugins : They are in charge of producing HTML files from statistics.
To use plugins, just insert their file names (without the .py extension) in the pre_analysis_hooks, post_analysis_hooks and display_hooks lists in conf.py.
Statistics are stored in dictionaries :
- month_stats : Statistics of current analysed month
- valid_visitors : A subset of month_stats without robots
- days_stats : Statistics of current analysed day
- visits : All visitors with all of their requests
- meta : Final result of month statistics (by year)
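The pipe-like behaviour can be pictured with a toy example. The sketch below is not iwla code: the hook functions, the run_hooks helper and the shape of the stats dictionary are all hypothetical. It only shows why each hook modifies shared statistics and why the order of the hooks changes the result.

```python
def count_pages(stats):
    # "add": create a new statistic from the current requests
    stats['viewed_pages'] = sum(1 for r in stats['requests'] if r['is_page'])


def drop_robots(stats):
    # "delete": remove requests made by robots
    stats['requests'] = [r for r in stats['requests'] if not r['robot']]


def run_hooks(stats, hooks):
    # Each hook receives the statistics left by the previous one.
    for hook in hooks:
        hook(stats)
    return stats


def sample_stats():
    # Hypothetical, heavily simplified statistics dictionary.
    return {'requests': [
        {'is_page': True, 'robot': False},
        {'is_page': True, 'robot': True},
        {'is_page': False, 'robot': False},
    ]}


filtered_first = run_hooks(sample_stats(), [drop_robots, count_pages])
counted_first = run_hooks(sample_stats(), [count_pages, drop_robots])
print(filtered_first['viewed_pages'], counted_first['viewed_pages'])
```

Filtering robots before counting gives a lower page count than counting first, which is the reason the order of the hooks lists matters.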
Create a Plugin
To create a new plugin, subclass IPlugin (_iplugin.py) in the right directory (plugins/xxx/yourPlugin.py).
Plugins can define required configuration values (self.conf_requires) that must be set in conf.py (or can be optional). They can also define required plugins (self.requires).
There are two functions to override. The first is load(self), which is called after init and must return True if all is good and False otherwise. The second is hook(self), which is the body of the plugin.
For display plugins, a lot of code has been written in display.py to simplify the creation of HTML blocks, tables and bar graphs.
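The general shape of a plugin can be sketched as follows. To keep the example self-contained, IPlugin is replaced here by a stand-in class whose constructor signature is an assumption, and FakeCore and RobotCounter are hypothetical names. A real plugin would import IPlugin from _iplugin.py and use the accessors of the real core object.

```python
class IPlugin(object):
    """Stand-in for the real base class in _iplugin.py (assumed shape)."""

    def __init__(self, iwla):
        self.iwla = iwla          # assumed: the core object is passed in
        self.requires = []        # plugins that must be loaded first
        self.conf_requires = []   # configuration values that must be set

    def load(self):
        return True

    def hook(self):
        pass


class FakeCore(object):
    """Hypothetical minimal core: statistics are plain dictionaries."""

    def __init__(self, visits):
        self.visits = visits
        self.month_stats = {}


class RobotCounter(IPlugin):
    """Hypothetical post analysis plugin counting visitors flagged as robots."""

    def __init__(self, iwla):
        super(RobotCounter, self).__init__(iwla)
        self.requires = ['robots']  # the naming format here is an assumption

    def load(self):
        # Called once after init; return False if the plugin cannot run.
        return isinstance(self.iwla.visits, dict)

    def hook(self):
        # Body of the plugin: add one statistic to month_stats.
        robots = [v for v in self.iwla.visits.values() if v.get('robot')]
        self.iwla.month_stats['nb_robots'] = len(robots)


core = FakeCore({'10.0.0.1': {'robot': True}, '10.0.0.2': {'robot': False}})
plugin = RobotCounter(core)
if plugin.load():
    plugin.hook()
print(core.month_stats)
```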
Plugins
Optional configuration values end with *.
- iwla.py
- plugins/display/all_visits.py
- plugins/display/browsers.py
- plugins/display/feeds.py
- plugins/display/filter_users.py
- plugins/display/hours_stats.py
- plugins/display/ip_to_geo.py
- plugins/display/istats_diff.py
- plugins/display/operating_systems.py
- plugins/display/referers_diff.py
- plugins/display/referers.py
- plugins/display/robot_bandwidth.py
- plugins/display/top_downloads_diff.py
- plugins/display/top_downloads.py
- plugins/display/top_hits.py
- plugins/display/top_pages_diff.py
- plugins/display/top_pages.py
- plugins/display/top_visitors.py
- plugins/post_analysis/browsers.py
- plugins/post_analysis/feeds.py
- plugins/post_analysis/filter_users.py
- plugins/post_analysis/google_console_api.py
- plugins/post_analysis/hours_stats.py
- plugins/post_analysis/ip_to_geo.py
- plugins/post_analysis/iptogeo.py
- plugins/post_analysis/operating_systems.py
- plugins/post_analysis/referers.py
- plugins/post_analysis/reverse_dns.py
- plugins/post_analysis/search_analytics_api_sample.py
- plugins/post_analysis/top_downloads.py
- plugins/post_analysis/top_hits.py
- plugins/post_analysis/top_pages.py
- plugins/pre_analysis/page_to_hit.py
- plugins/pre_analysis/robots.py
iwla
Main class IWLA
Parse logs, compute statistics, call plugins and produce output
For now, only HTTP logs are valid
Plugin requirements :
None
Conf values needed :
analyzed_filename
domain_name
locales_path
compress_output_files
excluded_ip
Output files :
DB_ROOT/meta.db
DB_ROOT/year/month/iwla.db
OUTPUT_ROOT/index.html
OUTPUT_ROOT/year/_stats.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
meta :
last_time
start_analysis_time
stats =>
year =>
month =>
viewed_bandwidth
not_viewed_bandwidth
viewed_pages
viewed_hits
nb_visits
nb_visitors
month_stats :
viewed_bandwidth
not_viewed_bandwidth
viewed_pages
viewed_hits
nb_visits
days_stats :
day =>
viewed_bandwidth
not_viewed_bandwidth
viewed_pages
viewed_hits
nb_visits
nb_visitors
visits :
remote_addr =>
remote_addr
remote_ip
viewed_pages{0..31} # 0 contains total
viewed_hits{0..31} # 0 contains total
not_viewed_pages{0..31}
not_viewed_hits{0..31}
bandwidth{0..31}
last_access
requests =>
[fields_from_format_log]
extract_request =>
http_method
http_uri
http_version
extract_uri
extract_parameters*
extract_referer* =>
extract_uri
extract_parameters*
robot
hit_only
is_page
keep_requests
valid_visitors:
month_stats without robot and hit only visitors (if not conf.count_hit_only_visitors)
Statistics update :
None
Statistics deletion :
None
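The valid_visitors rule above can be illustrated with a short sketch. It assumes that each entry of visits carries robot and hit_only flags directly, which is a reading of the flattened listing above and may not match the real layout. The function name select_valid_visitors is hypothetical.

```python
def select_valid_visitors(visits, count_hit_only_visitors):
    """Keep visitors that are not robots and, unless hit only visitors
    are counted, that viewed at least one page."""
    valid = {}
    for remote_addr, visit in visits.items():
        if visit.get('robot'):
            continue
        if visit.get('hit_only') and not count_hit_only_visitors:
            continue
        valid[remote_addr] = visit
    return valid


# Hypothetical visits, reduced to the two flags used by the rule.
visits = {
    '10.0.0.1': {'robot': False, 'hit_only': False},
    '10.0.0.2': {'robot': True, 'hit_only': False},
    '10.0.0.3': {'robot': False, 'hit_only': True},
}
strict = select_valid_visitors(visits, count_hit_only_visitors=False)
lenient = select_valid_visitors(visits, count_hit_only_visitors=True)
print(sorted(strict), sorted(lenient))
```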
plugins.display.all_visits
Display hook
Create All visits page
Plugin requirements :
None
Conf values needed :
display_visitor_ip*
Output files :
OUTPUT_ROOT/year/month/all_visits.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.browsers
Display hook
Create browsers page
Plugin requirements :
post_analysis/browsers
Conf values needed :
max_browsers_displayed*
create_browsers_page*
Output files :
OUTPUT_ROOT/year/month/browsers.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.feeds
Display hook
Display feeds parsers
Plugin requirements :
post_analysis/feeds
Conf values needed :
create_all_feeds_page*
Output files :
OUTPUT_ROOT/year/month/index.html
OUTPUT_ROOT/year/month/all_feeds.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.filter_users
Display hook
Filter users
Plugin requirements :
None
Conf values needed :
create_filtered_page*
Output files :
OUTPUT_ROOT/year/month/index.html
OUTPUT_ROOT/year/month/filtered_users.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.hours_stats
Display hook
Display statistics by hour/week day
Plugin requirements :
post_analysis/hours_stats
Conf values needed :
None
Output files :
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.ip_to_geo
Display hook
Add geo statistics
Plugin requirements :
post_analysis/ip_to_geo
Conf values needed :
create_geo_page*
Output files :
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.istats_diff
Display hook interface
Highlight new and updated statistics
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.operating_systems
Display hook
Add operating systems statistics
Plugin requirements :
post_analysis/operating_systems
Conf values needed :
create_families_page*
Output files :
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.referers_diff
Display hook
Highlight new and updated key phrases in all_key_phrases.html
Plugin requirements :
display/referers
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.referers
Display hook
Create Referers page
Plugin requirements :
post_analysis/referers
Conf values needed :
max_referers_displayed*
create_all_referers_page*
max_key_phrases_displayed*
create_all_key_phrases_page*
Output files :
OUTPUT_ROOT/year/month/referers.html
OUTPUT_ROOT/year/month/key_phrases.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.robot_bandwidth
Display hook
Display top 10 robot bandwidth use
Plugin requirements :
None
Conf values needed :
display_visitor_ip*
create_all_robot_bandwidth_page*
Output files :
OUTPUT_ROOT/year/month/top_robots_bandwidth.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_downloads_diff
Display hook
Highlight new and updated downloads in top_downloads.html
Plugin requirements :
display/top_downloads
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_downloads
Display hook
Create TOP downloads page
Plugin requirements :
post_analysis/top_downloads
Conf values needed :
max_downloads_displayed*
create_all_downloads_page*
Output files :
OUTPUT_ROOT/year/month/top_downloads.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_hits
Display hook
Create TOP hits page
Plugin requirements :
post_analysis/top_hits
Conf values needed :
max_hits_displayed*
create_all_hits_page*
Output files :
OUTPUT_ROOT/year/month/top_hits.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_pages_diff
Display hook
Highlight new and updated pages in top_pages.html
Plugin requirements :
display/top_pages
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_pages
Display hook
Create TOP pages page
Plugin requirements :
post_analysis/top_pages
Conf values needed :
max_pages_displayed*
create_all_pages_page*
Output files :
OUTPUT_ROOT/year/month/top_pages.html
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.display.top_visitors
Display hook
Create TOP visitors block
Plugin requirements :
None
Conf values needed :
display_visitor_ip*
Output files :
OUTPUT_ROOT/year/month/index.html
Statistics creation :
None
Statistics update :
None
Statistics deletion :
None
plugins.post_analysis.browsers
Post analysis hook
Detect browser information from requests
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
visits :
remote_addr =>
browser
month_stats :
browsers =>
browser => count
Statistics update :
None
Statistics deletion :
None
plugins.post_analysis.feeds
Post analysis hook
Find feeds parsers (first hit in feeds conf value and no viewed pages if it's a robot)
If there is only one hit per day to a feed, merge feeds parsers with the same user agent
as it must be the same person with a different IP address.
Plugin requirements :
None
Conf values needed :
feeds
feeds_referers*
merge_one_hit_only_feeds_parsers*
Output files :
None
Statistics creation :
remote_addr =>
feed_parser
Statistics update :
None
Statistics deletion :
None
plugins.post_analysis.filter_users
Post analysis hook
Filter users with given user conditions
Plugin requirements :
None
Conf values needed :
filtered_users : list of filters
filtered_ip : list of ip (string)
create_filtered_page*
A filter can be a function or a list of filter descriptions combined by the AND operator
A filter description can be a function or a list of 3 elements :
* Field to match in visits
* Operator '=', '==', '!=', '>', '>=', '<', '<=' for int value
* Operator '=', '==', '!=', 'in', 'match' for str value
* Target value
For easier configuration, you can use either 'remote_addr' or 'ip' in the field element
The function prototype is func(iwla, hit); it must return True or False
Example :
def my_filter(iwla, hit):
    return True

filtered_users = [
    [['viewed_pages', '>=', '5'], ['viewed_hits', '>=', '5']],
    [['viewed_hits', '>=', '5'], my_filter],
    my_filter,
]
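How such filters might be evaluated is sketched below. This is not the plugin's code and rests on several assumptions: a visitor is filtered as soon as one filter of the list matches, 'in' means "the target is contained in the field value", 'match' uses re.match, and the visit fields are flat values rather than per-day tables. The names check_condition and is_filtered are hypothetical.

```python
import operator
import re

INT_OPERATORS = {
    '=': operator.eq, '==': operator.eq, '!=': operator.ne,
    '>': operator.gt, '>=': operator.ge, '<': operator.lt, '<=': operator.le,
}
STR_OPERATORS = {
    '=': operator.eq, '==': operator.eq, '!=': operator.ne,
    # Assumed direction: the target is searched inside the field value.
    'in': lambda value, target: target in value,
    'match': lambda value, target: re.match(target, value) is not None,
}


def check_condition(iwla, hit, condition):
    """Evaluate one filter description: a function or [field, op, target]."""
    if callable(condition):
        return bool(condition(iwla, hit))
    field, op, target = condition
    value = hit[field]
    if isinstance(value, int):
        # Targets are written as strings in the example, so convert them.
        return INT_OPERATORS[op](value, int(target))
    return STR_OPERATORS[op](str(value), str(target))


def is_filtered(iwla, hit, filtered_users):
    """True if any filter matches; conditions inside a filter are ANDed."""
    for user_filter in filtered_users:
        conditions = [user_filter] if callable(user_filter) else user_filter
        if all(check_condition(iwla, hit, c) for c in conditions):
            return True
    return False


# Hypothetical, simplified visit entry.
visit = {'remote_addr': 'crawler.example.net', 'viewed_pages': 7, 'viewed_hits': 2}
rules = [[['viewed_pages', '>=', '5'], ['remote_addr', 'in', 'example.net']]]
print(is_filtered(None, visit, rules))
```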
Output files :
None
Statistics creation :
visits :
remote_addr =>
filtered
geo_location
Statistics update :
visits :
remote_addr =>
keep_requests
Statistics deletion :
None
plugins.post_analysis.google_console_api
Post analysis hook
Extract key phrases from Google console API
Plugin requirements :
None
Conf values needed :
domain_name
Output files :
None
Statistics creation :
None
Statistics update :
month_stats :
key_phrases =>
phrase => count
Statistics deletion :
None
plugins.post_analysis.hours_stats
Post analysis hook
Count pages, hits and bandwidth by hour/week day
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
month_stats:
hours_stats =>
00 .. 23 =>
pages
hits
bandwidth
days_stats =>
0 .. 6 =>
pages
hits
bandwidth
Statistics update :
None
Statistics deletion :
None
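The hours_stats and days_stats layout above can be illustrated with a small sketch. The input format (a list of tuples), the function name and the choice of Monday as day 0 (Python's datetime.weekday() convention) are assumptions. Pages and hits are counted separately here, which may differ from the plugin.

```python
from datetime import datetime


def count_by_hour_and_weekday(requests):
    """requests: iterable of (datetime, is_page, size_in_bytes) tuples."""
    def empty():
        return {'pages': 0, 'hits': 0, 'bandwidth': 0}

    hours_stats = {hour: empty() for hour in range(24)}
    days_stats = {day: empty() for day in range(7)}
    for when, is_page, size in requests:
        key = 'pages' if is_page else 'hits'
        # Update both the hour bucket and the week day bucket.
        for bucket in (hours_stats[when.hour], days_stats[when.weekday()]):
            bucket[key] += 1
            bucket['bandwidth'] += size
    return hours_stats, days_stats


# Hypothetical requests: two on a Monday afternoon, one on a Sunday night.
sample = [
    (datetime(2024, 1, 1, 13, 5), True, 1000),
    (datetime(2024, 1, 1, 13, 40), False, 200),
    (datetime(2024, 1, 7, 0, 1), True, 50),
]
hours_stats, days_stats = count_by_hour_and_weekday(sample)
print(hours_stats[13], days_stats[0])
```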
plugins.post_analysis.ip_to_geo
Post analysis hook
Get country code from IP address
Plugin requirements :
None
Conf values needed :
iptogeo_remote_addr*
iptogeo_remote_port*
Output files :
None
Statistics creation :
geo =>
country_code => count
Statistics update :
valid_visitors:
country_code
Statistics deletion :
None
plugins.post_analysis.iptogeo
plugins.post_analysis.operating_systems
Post analysis hook
Detect operating systems from requests
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
visits :
remote_addr =>
operating_system
month_stats :
operating_systems =>
operating_system => count
os_families =>
family => count
Statistics update :
None
Statistics deletion :
None
plugins.post_analysis.referers
Post analysis hook
Extract referers and key phrases from requests
Plugin requirements :
None
Conf values needed :
domain_name
Output files :
None
Statistics creation :
None
Statistics update :
month_stats :
referers =>
pages => count
hits => count
robots_referers =>
pages => count
hits => count
search_engine_referers =>
pages => count
hits => count
key_phrases =>
phrase => count
Statistics deletion :
None
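The idea of separating plain referers from search engine referers, and of extracting key phrases from the latter, can be sketched as follows. The SEARCH_ENGINES table, the function name and the rule used to recognise the site's own domain are hypothetical. The plugin itself relies on search engine definitions taken from awstats.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical table: search engine host => name of the query parameter.
SEARCH_ENGINES = {'www.google.com': 'q', 'duckduckgo.com': 'q'}


def classify_referer(referer, domain_name):
    """Return (kind, key_phrase) for one referer URL."""
    parsed = urlparse(referer)
    host = parsed.netloc
    # Referers coming from the analysed site itself are not counted.
    if not host or host == domain_name or host.endswith('.' + domain_name):
        return ('internal', None)
    parameter = SEARCH_ENGINES.get(host)
    if parameter is not None:
        phrases = parse_qs(parsed.query).get(parameter)
        return ('search_engine', phrases[0] if phrases else None)
    return ('referer', None)


result = classify_referer(
    'https://www.google.com/search?q=web+log+analyzer', 'example.org')
print(result)
```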
plugins.post_analysis.reverse_dns
Post analysis hook
Replace IP by reverse DNS names
Plugin requirements :
None
Conf values needed :
reverse_dns_timeout*
Output files :
None
Statistics creation :
None
Statistics update :
valid_visitors:
remote_addr
dns_name_replaced
dns_analyzed
Statistics deletion :
None
plugins.post_analysis.search_analytics_api_sample
from __future__ import print_function

import argparse
import sys
from googleapiclient import sample_tools

# Declare command-line flags.
argparser = argparse.ArgumentParser(add_help=False)
argparser.add_argument('property_uri', type=str,
                       help=('Site or app URI to query data for (including '
                             'trailing slash).'))
argparser.add_argument('start_date', type=str,
                       help=('Start date of the requested date range in '
                             'YYYY-MM-DD format.'))
argparser.add_argument('end_date', type=str,
                       help=('End date of the requested date range in '
                             'YYYY-MM-DD format.'))


def main(argv):
    service, flags = sample_tools.init(
        argv, 'webmasters', 'v3', __doc__, __file__, parents=[argparser],
        scope='https://www.googleapis.com/auth/webmasters.readonly')

    # First run a query to learn which dates we have data for. You should always
    # check which days in a date range have data before running your main query.
    # This query shows data for the entire range, grouped and sorted by day,
    # descending; any days without data will be missing from the results.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['query']
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Available dates')
    # Note: this return makes every query below unreachable.
    return

    # Get totals for the date range.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Totals')

    # Get top 10 queries for the date range, sorted by click count, descending.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['query'],
        'rowLimit': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Top Queries')

    # Get top 11-20 mobile queries for the date range, sorted by click count, descending.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'device',
                'expression': 'mobile'
            }]
        }],
        'rowLimit': 10,
        'startRow': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Top 11-20 Mobile Queries')

    # Get top 10 pages for the date range, sorted by click count, descending.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['page'],
        'rowLimit': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Top Pages')

    # Get the top 10 queries in India, sorted by click count, descending.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['query'],
        'dimensionFilterGroups': [{
            'filters': [{
                'dimension': 'country',
                'expression': 'ind'
            }]
        }],
        'rowLimit': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Top queries in India')

    # Group by both country and device.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['country', 'device'],
        'rowLimit': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Group by country and device')

    # Group by total number of Search Appearance count.
    # Note: It is not possible to use searchAppearance with other
    # dimensions.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['searchAppearance'],
        'rowLimit': 10
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Search Appearance Features')


def execute_request(service, property_uri, request):
    """Executes a searchAnalytics.query request.

    Args:
      service: The webmasters service to use when executing the query.
      property_uri: The site or app URI to request data for.
      request: The request to be executed.

    Returns:
      An array of response rows.
plugins.post_analysis.top_downloads
Post analysis hook
Count TOP downloads
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
month_stats:
top_downloads =>
uri => count
Statistics deletion :
None
plugins.post_analysis.top_hits
Post analysis hook
Count TOP hits
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
month_stats:
top_hits =>
uri => count
Statistics deletion :
None
plugins.post_analysis.top_pages
Post analysis hook
Count TOP pages
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
month_stats:
top_pages =>
uri => count
Statistics deletion :
None
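The three top_* post analysis plugins above all maintain a "uri => count" mapping in month_stats. A minimal sketch of that kind of counting is given below. The function name, the input format and the use of collections.Counter are choices made for the example, not a description of the plugins' code.

```python
from collections import Counter


def count_top(uris, previous=None):
    """Add one count per requested URI to an existing uri => count mapping."""
    top = Counter(previous or {})
    for uri in uris:
        top[uri] += 1
    return top


# Hypothetical data: counts from an earlier run plus three new requests.
top_pages = count_top(['/', '/about', '/'], previous={'/': 3})
print(top_pages.most_common(1))
```

Starting from the previous mapping rather than from an empty one matches the fact that these plugins are listed under "Statistics update" and not "Statistics creation".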
plugins.pre_analysis.page_to_hit
Pre analysis hook
Change page into hit and hit into page in statistics
Plugin requirements :
None
Conf values needed :
page_to_hit_conf*
hit_to_page_conf*
Output files :
None
Statistics creation :
None
Statistics update :
visits :
remote_addr =>
is_page
Statistics deletion :
None
plugins.pre_analysis.robots
Pre analysis hook
Filter robots
Plugin requirements :
None
Conf values needed :
None
Output files :
None
Statistics creation :
None
Statistics update :
visits :
remote_addr =>
robot
keep_requests
Statistics deletion :
None