Fetching a page with PHP cURL and extracting all of its links

Author: phper
Date: 2015-02-28
Category: PHP functions

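Extracting links is simple in principle, and the example below is a fairly complete one. Let's walk through using PHP cURL to fetch a page and extract every link on it.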

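This post follows on from the previous two and reuses the functions introduced there to build a simple URL collector. PHP code usually fetches remote data with file_get_contents, file, or cURL. cURL is often said to be faster than file_get_contents and file and better suited to scraping, so today we use cURL to get all the links on a web page. The example: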

<?php
/*
 * Use cURL to collect all links on the page at www.111cn.net.
 */
include_once ('function.php');
// Page to fetch; also the base used to complete relative links
$host = 'http://www.111cn.net/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $host);
// Leave the HTTP response headers out of the result; only the body is parsed
curl_setopt($ch, CURLOPT_HEADER, 0);
// Return the result as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
// Transfer details (status code, timings); not used further here
$info = curl_getinfo($ch);
if ($html === false) {
    echo "cURL Error: " . curl_error($ch);
    curl_close($ch);
    exit(1);
}
curl_close($ch);
$linkarr = _striplinks($html);
$linkresult = array();
if (is_array($linkarr)) {
    foreach ($linkarr as $k => $v) {
        $linkresult[$k] = _expandlinks($v, $host);
    }
}
printf("<p>All links on this page:</p><pre>%s</pre>\n", htmlspecialchars(var_export($linkresult, true)));
?>
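
The script above relies on cURL's defaults, so it does not follow redirects and sets no timeouts. As a rough sketch only, the request could be wrapped in a more defensive helper. The name `fetch_page` and the timeout values are assumptions for illustration, not part of the original article.

```php
<?php
/**
 * Fetch a URL with cURL and return the response body, or null on failure.
 * fetch_page() is a hypothetical helper name; the limits below are arbitrary.
 */
function fetch_page($url, $timeout = 10)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body as a string
        CURLOPT_HEADER         => false,    // keep response headers out of the body
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,        // but not indefinitely
        CURLOPT_CONNECTTIMEOUT => 5,        // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole transfer
    ));
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $body;
}
```

With such a helper, the main script would call `$html = fetch_page($host);` and stop early when the result is `null`.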

The contents of function.php are as follows (simply the two functions from the previous two posts combined):

<?php
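function _striplinks($document) {
    // Match <a ... href=...>: group 2 holds quoted values, group 3 unquoted ones.
    // The backslash is doubled because "\1" inside double quotes is an octal escape, not a backreference.
    preg_match_all("/<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s>]+))/isx", $document, $links);
    $match = array();
    // concatenate the non-empty matches from the conditional subpattern
    foreach ($links[2] as $val) {
        if (!empty($val)) $match[] = $val;
    }
    foreach ($links[3] as $val) {
        if (!empty($val)) $match[] = $val;
    }
    // return the links (an empty array if none were found)
    return $match;
}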
/*===================================================================*
 Function: _expandlinks
 Purpose:  expand each link into a fully qualified URL
 Input:    $links  the link (string) or links (array) to qualify
           $URI    the full URI to get the base from
 Output:   $expandedLinks  the expanded link(s)
*===================================================================*/
function _expandlinks($links, $URI) {
    $URI_PARTS = parse_url($URI);
    $host = $URI_PARTS["host"];
    // Base directory of $URI: drop the query string, a trailing file name, and a trailing slash
    preg_match('/^[^?]+/', $URI, $match);
    $match = preg_replace('|/[^/.]+\.[^/.]+$|', '', $match[0]);
    $match = preg_replace('|/$|', '', $match);
    $match_part = parse_url($match);
    $match_root = $match_part["scheme"] . "://" . $match_part["host"];
    $search = array(
        '|^http://' . preg_quote($host, '|') . '|i', // strip our own scheme and host
        '|^(/)|i',                                    // root-relative link
        '|^(?!https?://)(?!mailto:)|i',               // any other relative link
        '|/\./|',                                     // collapse "/./"
        '|/[^/]+/\.\./|'                              // collapse "/dir/../"
    );
    $replace = array(
        "",
        $match_root . "/",
        $match . "/",
        "/",
        "/"
    );
    $expandedLinks = preg_replace($search, $replace, $links);
    return $expandedLinks;
}
?>
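
The regular expression in _striplinks() works for typical markup, but it can be thrown off by unusual attribute order or quoting. As an alternative sketch (a different technique from the one used in this article), the same href values can be collected with PHP's DOM extension. The function name `extract_links_dom` is hypothetical, and the sample HTML is made up for illustration.

```php
<?php
/**
 * Collect the href of every <a> element using DOMDocument.
 * extract_links_dom() is a hypothetical name, not part of function.php.
 */
function extract_links_dom($html)
{
    $links = array();
    if (trim($html) === '') {
        return $links; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href; // anchors without an href are skipped
        }
    }
    return $links;
}

// Made-up sample markup: two links and one anchor without an href
$sample = '<p><a href="/a.html">A</a> <a href=\'http://example.com/b\'>B</a> <a name="top">top</a></p>';
print_r(extract_links_dom($sample));
```

The returned array can be passed to _expandlinks() in the same way as the output of _striplinks().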

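To compare this against file_get_contents, you can use the Linux time command to see how long each approach takes to run. In my tests so far, cURL has been somewhat faster. Finally, here are the introductions to the two functions used above.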
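
Besides the time command, the comparison can also be made inside PHP. The following is only a rough sketch: `average_seconds` and the file name `compare.php` are hypothetical, and the numbers it prints depend heavily on the network, DNS caching, and the remote server, so they should be read as indicative at best.

```php
<?php
/**
 * Average wall-clock seconds taken by $fn over $runs calls.
 * average_seconds() is a hypothetical helper, not from the original article.
 */
function average_seconds($fn, $runs = 5)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        call_user_func($fn);
    }
    return (microtime(true) - $start) / $runs;
}

// Run as: php compare.php http://www.111cn.net/
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $url = $argv[1];
    $viaCurl = function () use ($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_exec($ch);
    };
    $viaFgc = function () use ($url) {
        @file_get_contents($url); // suppress warnings on failed requests
    };
    printf("cURL:              %.4f s\n", average_seconds($viaCurl));
    printf("file_get_contents: %.4f s\n", average_seconds($viaFgc));
}
```

Averaging over several runs smooths out one-off delays such as the first DNS lookup, which a single run under time would include.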

Link-matching function: function _striplinks()

Relative-to-absolute path conversion: function _expandlinks()

Article URL: http://www.phprm.com/function/79228.html

Feel free to repost, but please include the article URL :-)
