A web server log file records users' transactions on the web. Usually, the web log
file contains information about the user's IP address, the requested page, the time of
the request, the size of the requested page, its referrer, and other useful information.
The web log file can have different formats, but there is a common log file
format that is the most widely used. The common log file has the following format:
remotehost rfc931 authuser [date] "request" status bytes
Here, remotehost represents the remote hostname (or the IP address if the DNS hostname is not
available), rfc931 represents the remote logname of the user, authuser represents
the username under which the user has authenticated, [date] represents the date
and time of the request, "request" represents the request line exactly as it came
from the client, status represents the HTTP status code returned to the client,
and finally bytes represents the content-length of the document transferred. The
World Wide Web Consortium (W3C) presented an extended format for web server log files
that can record a wider range of data to support more advanced analysis of
the web log file. The web log file is the main source of data for analysis in web mining.
A lot of preprocessing needs to be performed in order to prepare the web
log file for mining, as we will see in the next sections.
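
To make the field layout concrete, here is a minimal Python sketch that splits one common log format line into the seven fields described above. The regular expression, the function name parse_clf_line, and the sample line are illustrative assumptions rather than part of any standard tool. The date layout (day/month/year:time zone) is also an assumption based on what servers commonly write, and a "-" in the bytes field is treated as zero.

```python
import re
from datetime import datetime

# One named group per field: remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_clf_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Assumed date layout, e.g. 10/Oct/2000:13:55:36 -0700
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Servers often log "-" when no content was transferred
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Made-up sample line for illustration only
sample = '192.0.2.1 - alice [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["remotehost"], entry["request"], entry["status"], entry["bytes"])
```

Returning None for lines that do not match, instead of raising an error, lets a preprocessing step count or discard malformed entries before mining.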