HttpAgent.pm


NAME

 Communiware::HttpAgent - recognition of HTTP agents.


SYNOPSIS

  load_cache($dbh);
  ($item_id,$status,$template) = identify_crawler($user_agent_string,$ip_address);
  register_agent($dbh,$user_agent_string,$ip_address);
  update_cache($dbh);
 
DESCRIPTION

This module contains routines for identifying HTTP agents (especially crawlers) in Communiware.


PACKAGE VARIABLES

@KnownCrawlers - list of non-interactive user-agents

@KnownBrowsers - list of interactive user-agents


API

load_cache($dbh)

Loads information about all known HTTP_AGENTs from the Communiware database.

update_cache($dbh)

Loads records that have appeared in the database since the cache was last loaded or updated.

identify_crawler

        ($agent, $status, $template) = identify_crawler($agent_string, $ip);

Checks whether the given user agent identification string and IP address correspond to any known agent. If so, returns the agent's ID, status and an appropriate template. If not, $agent is guaranteed to be undefined.

identify_and_check_crawler

        ($agent, $status, $template) = identify_and_check_crawler($global_context);

Checks the identity of a crawler defined by the `USER_AGENT' and `REMOTE_IP' attributes of $global_context. If the crawler is not identified, $agent is guaranteed to be undefined. If the crawler is identified, it is checked against its status and the server's load average (LA); if access is currently forbidden, an appropriate exception is raised. If $global_context's Apache request is defined, it is used to set the appropriate headers, if any.

This function relies on the special context 'document_item' being set. If it is not set, the function assumes that the document item is not an error document.

register_agent($dbh,$agent_string,$ip,$id,$status)

Receives the agent identification string and IP address, along with the ID and status returned by a previous identify_crawler call.

If the status was undef, creates a new agent record with AUTODETECT status; if the status was INTERACTIVE, creates a new agent record with TEMPORARY status.

For any other status, updates the AGENT_LAST_VISIT field.
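
The status handling described above can be sketched as a small decision function. This is an illustration only, not the module's code: the helper name registration_action is hypothetical, statuses are assumed to be plain strings, and 'CRAWLER' is an invented stand-in for "any other status".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Communiware::HttpAgent): maps the status
# returned by a previous identify_crawler call to the documented action.
# Assumption: statuses are represented as plain strings.
sub registration_action {
    my ($status) = @_;
    return 'create AUTODETECT record' unless defined $status;
    return 'create TEMPORARY record'  if $status eq 'INTERACTIVE';
    return 'update AGENT_LAST_VISIT';
}

# 'CRAWLER' is an invented example of "any other status".
for my $status (undef, 'INTERACTIVE', 'CRAWLER') {
    my $label = defined($status) ? $status : 'undef';
    printf "%-12s => %s\n", $label, registration_action($status);
}
```

The undef check comes first so that the string comparison never sees an undefined value.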


IP ADDRESS MATCHING

ip_in_range($ip,$range_start,$range_end)

Returns true if the given IP is in the given range. The first argument is specified in dotted-decimal form. The range borders are specified as 32-bit integers.
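
The mixed argument forms (dotted-decimal address, integer borders) can be illustrated with a minimal sketch. It is not the module's implementation: ip_to_int and ip_in_range_sketch are hypothetical names, only IPv4 is handled, and both borders are assumed to be inclusive.

```perl
use strict;
use warnings;

# Convert a dotted-decimal IPv4 address to an unsigned 32-bit integer
# (network byte order, so '0.0.1.0' becomes 256).
sub ip_to_int {
    my ($ip) = @_;
    my @octets = split /\./, $ip;
    return unpack('N', pack('C4', @octets));
}

# Hypothetical stand-in for ip_in_range.
# Assumption: both range borders are inclusive.
sub ip_in_range_sketch {
    my ($ip, $range_start, $range_end) = @_;
    my $n = ip_to_int($ip);
    return ($n >= $range_start && $n <= $range_end) ? 1 : 0;
}

my $start = ip_to_int('192.168.0.0');
my $end   = ip_to_int('192.168.0.255');
print ip_in_range_sketch('192.168.0.42', $start, $end)
    ? "in range\n"
    : "out of range\n";
```

Comparing integers avoids octet-by-octet string comparison, which would sort '10.0.0.9' after '10.0.0.10'.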

process_robots_txt

        process_robots_txt($r, $user_agent, $remote_ip);

Processes a request to the `/robots.txt' URI. Returns through $r content that disallows indexing of pictures and scripts, and accounts the client as a crawler according to the crawler rules.
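
A response body of the kind described might be built as in the sketch below. The helper robots_txt_body is hypothetical, and the paths '/images/' and '/cgi-bin/' are invented placeholders; the locations Communiware actually disallows are not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a robots.txt body that disallows the given
# path prefixes for all user agents.
sub robots_txt_body {
    my @paths = @_;
    my $body = "User-agent: *\n";
    $body .= "Disallow: $_\n" for @paths;
    return $body;
}

# Invented example locations for pictures and scripts.
print robots_txt_body('/images/', '/cgi-bin/');
```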

16 October 2007 13:45