

11.16. Program: Finding Fresh Links

Example 11-6, fresh-links.php, is a modification of the program in Recipe 11.15 that produces a list of links and their last modified times. If the server that hosts a URL doesn't provide a last modified time, the program reports the time the URL was requested instead. If the program can't retrieve the URL successfully, it prints the HTTP status code it received. Run the program by passing it a URL to scan for links:

% fresh-links.php http://www.oreilly.com
http://www.oreilly.com/index.html: Fri Aug 16 16:48:34 2002
http://www.oreillynet.com: Mon Aug 19 10:18:54 2002
http://conferences.oreilly.com: Fri Aug 16 19:41:46 2002
http://international.oreilly.com: Fri Mar 29 18:06:32 2002
http://safari.oreilly.com: 302
http://www.oreilly.com/catalog/search.html: Tue Apr  2 19:05:57 2002
http://www.oreilly.com/oreilly/press/: 302
...

This output is from a run of the program at about 10:20 A.M. EDT on August 19, 2002. The link to http://www.oreillynet.com is very fresh, but the others are of varying ages. The link to http://www.oreilly.com/oreilly/press/ doesn't have a last modified time next to it; instead, it has an HTTP status code (302). This means it's been moved elsewhere, as reported by the output of stale-links.php in Recipe 11.15.

The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same pc_link_extractor( ) function from Recipe 11.10; however, it uses the HTTP_Request class instead of cURL to retrieve URLs. The code to get the base URL specified on the command line is inside a loop so that it can follow any redirects that are returned.
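The redirect-following idea can be sketched separately from any particular HTTP library. The following is a minimal illustration, not the code from the example: follow_redirects( ), its $fetch argument, and the $max_hops limit are hypothetical names introduced here, and the canned responses stand in for real requests. It assumes PHP 5.3 or later for the anonymous function.

```php
<?php
// Hypothetical helper: follow 3xx responses up to $max_hops times.
// $fetch is any callable that takes a URL and returns
// array('code' => int, 'location' => string or null).
function follow_redirects($url, $fetch, $max_hops = 5) {
    for ($hop = 0; $hop <= $max_hops; $hop++) {
        $response = $fetch($url);
        $is_redirect = (intval($response['code'] / 100) == 3);
        if ($is_redirect && !empty($response['location'])) {
            // try again at the new location
            $url = $response['location'];
            continue;
        }
        // anything else ends the chain
        return array('url' => $url, 'code' => $response['code']);
    }
    // gave up: more than $max_hops redirects in a row
    return false;
}

// canned responses standing in for real HTTP requests
$canned = array(
    'http://a.example/' => array('code' => 302, 'location' => 'http://b.example/'),
    'http://b.example/' => array('code' => 200, 'location' => null),
);
$fetch = function ($url) use ($canned) { return $canned[$url]; };

$result = follow_redirects('http://a.example/', $fetch);
print $result['url'] . ' ' . $result['code'] . "\n";
// prints "http://b.example/ 200"
```

The hop limit is a design choice of this sketch: it keeps two URLs that redirect to each other from trapping the program in an endless loop.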

Once a page has been retrieved, the program uses the pc_link_extractor( ) function to get a list of links in the page. Then, after prepending a base URL to each link if necessary, it calls sendRequest( ) on each link found in the original page. Because only the headers of these responses are needed, the program uses the HEAD method instead of GET. Instead of printing out a new location for moved links, however, it prints out a formatted version of the Last-Modified header if it's available.
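The handling of the Last-Modified header can be tried out without making any requests. This is a small sketch under stated assumptions: format_last_modified( ) is a hypothetical helper, and it formats with gmdate( ) rather than strftime('%c') so that the result is in UTC and doesn't depend on the local time zone or locale.

```php
<?php
// Hypothetical helper: turn a Last-Modified header value into a display
// string, falling back to the request time when the header is missing
// or can't be parsed.
function format_last_modified($header, $request_time) {
    $timestamp = $header ? strtotime($header) : false;
    if ($timestamp === false) {
        $timestamp = $request_time;
    }
    // same layout as the sample output, but always in UTC
    return gmdate('D M j H:i:s Y', $timestamp);
}

$now = time();
// a header in the usual HTTP date format
print format_last_modified('Fri, 16 Aug 2002 20:48:34 GMT', $now) . "\n";
// prints "Fri Aug 16 20:48:34 2002"

// no header at all: the request time is reported instead
print format_last_modified(null, $now) . "\n";
```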

Example 11-6. fresh-links.php

require 'HTTP/Request.php';

function pc_link_extractor($s) {
    $a = array();
    if (preg_match_all('/<A\s+.*?HREF=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/A>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        foreach($matches as $match) {
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}

$url = $_SERVER['argv'][1];

// retrieve URLs in a loop to follow redirects 
$done = 0;
while (! $done) {
    $req = new HTTP_Request($url);
    $req->sendRequest();
    if ($response_code = $req->getResponseCode()) {
        if ((intval($response_code/100) == 3) &&
            ($location = $req->getResponseHeader('Location'))) {
            $url = $location;
        } else {
            $done = 1;
        }
    } else {
        // no response code means the request itself failed
        fwrite(STDERR, "Can't retrieve $url\n");
        exit(1);
    }
}

// compute base url from url
// this doesn't pay attention to a <base> tag in the page 
$base_url = preg_replace('{^(.*/)([^/]*)$}','\\1',$req->_url->getURL());

// keep track of the links we visit so we don't visit each more than once
$seen_links = array();

if ($body = $req->getResponseBody()) {
    $links = pc_link_extractor($body);
    foreach ($links as $link) {
        // skip https and mailto URLs, which can't be checked with a plain HTTP request
        if (preg_match('{^(https|mailto):}',$link[0])) {
            continue;
        }
        // resolve relative links
        if (! (preg_match('{^(http|mailto):}',$link[0]))) {
            $link[0] = $base_url.$link[0];
        }
        // skip this link if we've seen it already
        // (isset() avoids an undefined-index notice for new links)
        if (isset($seen_links[$link[0]])) {
            continue;
        }
        
        // mark this link as seen
        $seen_links[$link[0]] = true;

        // print the link we're visiting
        print $link[0].': ';
        flush();
        
        // visit the link
        $req2 = new HTTP_Request($link[0],
                                 array('method' => HTTP_REQUEST_METHOD_HEAD));
        $now = time();
        $req2->sendRequest();
        $response_code = $req2->getResponseCode();
        
        // if the retrieval is successful
        if ($response_code == 200) {
            // get the Last-Modified header
            if ($lm = $req2->getResponseHeader('Last-Modified')) {
                $lm_utc = strtotime($lm);
            } else {
                // or set Last-Modified to now
                $lm_utc = $now;
            }
            print strftime('%c',$lm_utc);
        } else {
            // otherwise, print the response code
            print $response_code;
        }
        print "\n";
    }
}
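The example resolves relative links by prepending the base URL, which works for links such as catalog/index.html but not for root-relative links such as /catalog/. One possible way to cover both cases is sketched below. resolve_link( ) is a hypothetical helper, not part of the example; it doesn't normalize ../ segments, handle links that begin with //, or honor a <base> tag.

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it came from.
// Handles absolute, root-relative, and simple relative links only.
function resolve_link($page_url, $link) {
    // already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('{^[a-z][a-z0-9+.-]*:}i', $link)) {
        return $link;
    }
    $parts = parse_url($page_url);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // root-relative: replace the whole path
    if (substr($link, 0, 1) == '/') {
        return $root . $link;
    }
    // relative: keep the directory part of the page's path
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

print resolve_link('http://www.example.com/a/b.html', 'c.html') . "\n";
// prints "http://www.example.com/a/c.html"
print resolve_link('http://www.example.com/a/b.html', '/d/') . "\n";
// prints "http://www.example.com/d/"
```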




Copyright © 2003 O'Reilly & Associates. All rights reserved.