HttpAgent.pm
- NAME
- SYNOPSIS
- PACKAGE VARIABLES
- API
load_cache($dbh)
update_cache($dbh)
- identify_crawler
- identify_and_check_crawler
register_agent($dbh,$agent_string,$ip,$id,$status)
- IP ADDRESS MATCHING
NAME
Communiware::HttpAgent - recognition of HTTP agents.
SYNOPSIS
load_cache($dbh);
($item_id,$status,$template) = identify_crawler($user_agent_string,$ip_address);
register_agent($dbh,$user_agent_string,$ip_address);
update_cache($dbh);
DESCRIPTION
This module contains routines that deal with HTTP agent (especially crawler) identification in Communiware.
PACKAGE VARIABLES
@KnownCrawlers - list of non-interactive user-agents
@KnownBrowsers - list of interactive user-agents
API
load_cache($dbh)
Loads information about all known HTTP_AGENTs from the Communiware database.
update_cache($dbh)
Loads records that have appeared in the database since the cache was last loaded or updated.
identify_crawler
($agent, $status, $template) = identify_crawler($agent_string, $ip);
Checks whether the given user agent identification string and IP address
correspond to any known agent. If so, returns the agent's ID, status and an
appropriate template. If not, $agent is guaranteed to be undefined.
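The calling convention above can be sketched as follows. This is only an illustration: identify_crawler is replaced here by a stand-in so the snippet runs on its own, and the agent ID, the template name, the 'ExampleBot' pattern and the describe_agent helper are all hypothetical. The point is that the caller tests $agent for definedness before using the other return values.

```perl
use strict;
use warnings;

# Stand-in for identify_crawler, for illustration only. The real function
# consults the cache filled by load_cache/update_cache; the ID, status and
# template returned here are made-up values.
sub identify_crawler {
    my ($agent_string, $ip) = @_;
    return (42, 'AUTODETECT', 'crawler_template')
        if $agent_string =~ /ExampleBot/;
    return;    # unknown agent: $agent stays undefined in the caller
}

# Hypothetical helper showing how a caller can branch on the result.
sub describe_agent {
    my ($agent_string, $ip) = @_;
    my ($agent, $status, $template) = identify_crawler($agent_string, $ip);
    return 'unknown agent' unless defined $agent;
    return "agent $agent, status $status, template $template";
}

print describe_agent('ExampleBot/1.0',  '192.0.2.10'), "\n";
print describe_agent('SomeBrowser/5.0', '192.0.2.11'), "\n";
```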
identify_and_check_crawler
($agent, $status, $template) = identify_and_check_crawler($global_context);
Checks the identity of a crawler defined by the 'USER_AGENT' and 'REMOTE_IP'
attributes of $global_context. If a crawler is not identified, $agent is
guaranteed to be undefined. If a crawler is identified, it is checked against
its status and the server's load average (LA), and if it is currently
forbidden, an appropriate exception is raised. $global_context's Apache
request, if defined, is used to set appropriate headers, if any.
This function relies on the special context 'document_item' being set. If it
is not set, the function assumes that the document item is not an error document.
register_agent($dbh,$agent_string,$ip,$id,$status)
Receives the agent identification string and IP address along with the id and status returned by a previous identify_crawler call.
If the status was undef, creates a new agent record with AUTODETECT status. If the status was INTERACTIVE, creates a new agent record with TEMPORARY status.
For any other status, updates the AGENT_LAST_VISIT field.
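The three cases above can be summarised in a small decision function. This is a sketch, not the module's code: registration_action is a hypothetical helper, the statuses are assumed to be plain strings, and no database access is performed.

```perl
use strict;
use warnings;

# Hypothetical helper: given the status returned by a previous
# identify_crawler call, decide what register_agent would do.
# Returns a pair (action, new_status); new_status is undef for updates.
sub registration_action {
    my ($status) = @_;
    return ('create', 'AUTODETECT') unless defined $status;
    return ('create', 'TEMPORARY')  if $status eq 'INTERACTIVE';
    return ('update_last_visit', undef);
}

for my $status (undef, 'INTERACTIVE', 'AUTODETECT') {
    my ($action, $new_status) = registration_action($status);
    printf "%-12s -> %s%s\n",
        defined $status ? $status : '(undef)',
        $action,
        defined $new_status ? " ($new_status)" : '';
}
```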
IP ADDRESS MATCHING
ip_in_range($ip,$range_start,$range_end)
Returns true if the given IP is in the given range. The first argument is specified in dotted decimal form. The range boundaries are specified as 32-bit integers.
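One way such a check could work is sketched below. It is an illustration under stated assumptions, not the module's actual implementation: ip_to_int is a hypothetical helper, only IPv4 addresses are handled, and the range boundaries are assumed to be inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a dotted decimal IPv4 address to a
# 32-bit integer. Returns undef if the string is not well formed.
sub ip_to_int {
    my ($ip) = @_;
    return unless defined $ip
        && $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
    my @octets = ($1, $2, $3, $4);
    return if grep { $_ > 255 } @octets;
    # Pack four bytes, then read them back as one big-endian 32-bit value.
    return unpack('N', pack('C4', @octets));
}

# Sketch of ip_in_range: the IP is a dotted decimal string, the range
# boundaries are 32-bit integers (assumed inclusive here).
sub ip_in_range {
    my ($ip, $range_start, $range_end) = @_;
    my $n = ip_to_int($ip);
    return 0 unless defined $n;
    return ($n >= $range_start && $n <= $range_end) ? 1 : 0;
}

# 192.168.1.0 .. 192.168.1.255 expressed as integers
my $start = ip_to_int('192.168.1.0');
my $end   = ip_to_int('192.168.1.255');
print ip_in_range('192.168.1.42', $start, $end) ? "in range\n" : "out of range\n";
print ip_in_range('10.0.0.1',     $start, $end) ? "in range\n" : "out of range\n";
```

Comparing integers avoids string comparison of dotted addresses, which would order '10.0.0.1' after '192.168.1.0' character by character rather than numerically.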
process_robots_txt
process_robots_txt($r, $user_agent, $remote_ip);
Processes a request to the '/robots.txt' URI. Returns through $r content that
disallows indexing of pictures and scripts, and counts the client as a crawler
according to crawler rules.