PHP Content Scraper ("Thief Program") Example Code

Author: php100
Date: 2013-09-05
Category: PHP frameworks

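A so-called "thief program" (a scraper) uses a few specific PHP functions to fetch content from someone else's website, parses it with regular expressions, and saves the parts we want to a local database. This article walks through how to implement one in PHP, for anyone who needs a reference.

The key function in the scraping code below is file_get_contents, so let's look at its signature first: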

string file_get_contents ( string $filename [, bool $use_include_path = false [, resource $context [, int $offset = -1 [, int $maxlen ]]]] )

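This works like file(), except that file_get_contents() returns the file's contents as a single string. Reading starts at the position given by offset and continues for up to maxlen bytes. On failure, file_get_contents() returns FALSE.

file_get_contents() is the preferred way to read the contents of a file into a string. If the operating system supports it, memory mapping is used to improve performance.

Example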
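To make the offset and maxlen parameters and the FALSE return value concrete, here is a minimal sketch that reads from a temporary local file instead of a remote URL. The scratch file and its contents are assumptions made purely for illustration.

```php
<?php
// Write a small sample file so the example does not depend on the network.
$path = tempnam(sys_get_temp_dir(), 'fgc'); // hypothetical scratch file
file_put_contents($path, "Hello, scraper world!");

// Read the whole file into a string.
$all = file_get_contents($path);

// Skip the first 7 bytes, then read at most 7 bytes.
$part = file_get_contents($path, false, null, 7, 7);

// On failure the function returns false (the warning is silenced with @ here).
$missing = @file_get_contents($path . '.does-not-exist');

echo $all, "\n";              // Hello, scraper world!
echo $part, "\n";             // scraper
var_dump($missing === false); // bool(true)

unlink($path);
```

Comparing against false with === matters, because an empty file yields an empty string, which is also falsy.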

<?php
$homepage = file_get_contents('http://www.hzhuti.com/');
echo $homepage;
?>

$homepage now holds the content of the fetched page. With that background covered, let's get started.

Example
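Remote fetches fail more often than local reads, so it can help to pass a stream context that sets a timeout and a User-Agent, and to check the result before using it. The sketch below is one possible way to do that; the helper names build_fetch_context() and fetch_or_null(), the User-Agent string, and the 10-second timeout are all assumptions, not part of the original code.

```php
<?php
// Build a stream context carrying a timeout and a User-Agent header.
// The function name and default values are illustrative assumptions.
function build_fetch_context($userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)', $timeout = 10)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeout,   // seconds before the read gives up
            'user_agent' => $userAgent, // sent as the User-Agent header
        ),
    ));
}

// Fetch a URL, returning null instead of false when the request fails.
function fetch_or_null($url, $context)
{
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

$context = build_fetch_context();
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";

// Usage (needs network access):
// $homepage = fetch_or_null('http://www.hzhuti.com/', $context);
// if ($homepage === null) { echo "fetch failed\n"; }
```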

<?php
// Fetch the raw contents of a page; returns "" on failure
function fetch_urlpage_contents($url) {
    $c = file_get_contents($url);
    return $c === false ? "" : $c;
}
// Return the text between $begin and $end, or "" if there is no match
function fetch_match_contents($begin, $end, $c) {
    $begin = change_match_string($begin);
    $end = change_match_string($end);
    // eregi() was removed in PHP 7, so preg_match() is used instead;
    // i: case-insensitive, s: let "." match newlines as well
    $p = "#{$begin}(.*){$end}#is";
    if (preg_match($p, $c, $rs)) {
        return $rs[1];
    } else {
        return "";
    }
}
// Escape regular-expression metacharacters in a delimiter string
function change_match_string($str) {
    // preg_quote() also escapes "#", which is used as the pattern delimiter here
    return preg_quote($str, "#");
}
// Scrape a page
function pick($url, $ft, $th = array()) {
    $c = fetch_urlpage_contents($url);
    $rs = array();
    foreach ($ft as $key => $value) {
        $rs[$key] = fetch_match_contents($value["begin"], $value["end"], $c);
        if (isset($th[$key]) && is_array($th[$key])) {
            foreach ($th[$key] as $old => $new) {
                $rs[$key] = str_replace($old, $new, $rs[$key]);
            }
        }
    }
    return $rs;
}
$url = "http://www.phprm.com"; // URL to scrape
$ft["title"]["begin"] = "<title>"; // start delimiter
$ft["title"]["end"] = "</title>"; // end delimiter
$th["title"]["中山"] = "广东"; // replacement applied to the extracted text
$ft["body"]["begin"] = "<body>"; // start delimiter
$ft["body"]["end"] = "</body>"; // end delimiter
$th["body"]["中山"] = "广东"; // replacement applied to the extracted text
$rs = pick($url, $ft, $th); // start scraping
echo $rs["title"];
echo $rs["body"]; // output
?>
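The begin/end extraction used above can also be tried offline, which makes it easier to see what the delimiters select. The sketch below uses plain string functions rather than a regular expression; extract_between() and the inline sample page are hypothetical and only serve as illustration.

```php
<?php
// Return the text between the first $begin and the next $end, or "" if either is missing.
// extract_between() is a hypothetical helper, not part of the article's code.
function extract_between($html, $begin, $end)
{
    $start = stripos($html, $begin);
    if ($start === false) {
        return "";
    }
    $start += strlen($begin);
    $stop = stripos($html, $end, $start);
    if ($stop === false) {
        return "";
    }
    return substr($html, $start, $stop - $start);
}

// Inline sample page so the example runs without network access.
$html = "<html><head><title>中山 News</title></head>\n<body>\n<p>Hello</p>\n</body></html>";

$title = extract_between($html, "<title>", "</title>");
$title = str_replace("中山", "广东", $title); // same replacement step as in pick()
$body  = extract_between($html, "<body>", "</body>");

echo $title, "\n";      // 广东 News
echo trim($body), "\n"; // <p>Hello</p>
```

Note that a literal delimiter such as "<body>" will not match a tag that carries attributes, for example a body tag with a class; in that case a shorter delimiter or an HTML parser is the safer choice.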

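The following code is adapted from the example above and is meant for extracting every hyperlink, e-mail address, or other specific piece of content from a page.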

<?php
// Fetch the raw contents of a page; returns "" on failure
function fetch_urlpage_contents($url) {
    $c = file_get_contents($url);
    return $c === false ? "" : $c;
}
// Return every match between $begin and $end
function fetch_match_contents($begin, $end, $c) {
    $begin = change_match_string($begin);
    $end = change_match_string($end);
    $p = "#{$begin}(.*){$end}#iU"; // i: case-insensitive, U: non-greedy matching
    if (preg_match_all($p, $c, $rs)) {
        return $rs; // $rs[0]: full matches, $rs[1]: text between the delimiters
    } else {
        return array();
    }
}
// Escape regular-expression metacharacters in a delimiter string
function change_match_string($str) {
    // preg_quote() also escapes "#", which is used as the pattern delimiter here
    return preg_quote($str, "#");
}
// Scrape a page
function pick($url, $ft, $th = array()) {
    $c = fetch_urlpage_contents($url);
    $rs = array();
    foreach ($ft as $key => $value) {
        $rs[$key] = fetch_match_contents($value["begin"], $value["end"], $c);
        if (isset($th[$key]) && is_array($th[$key])) {
            foreach ($th[$key] as $old => $new) {
                // $rs[$key] holds lists of matches, so replace inside each list
                foreach ($rs[$key] as $i => $matches) {
                    $rs[$key][$i] = str_replace($old, $new, $matches);
                }
            }
        }
    }
    return $rs;
}
$url = "http://www.phprm.com"; // URL to scrape
$ft["a"]["begin"] = '<a'; // start delimiter
$ft["a"]["end"] = '>'; // end delimiter
$th = array(); // no replacements in this example
$rs = pick($url, $ft, $th); // start scraping
print_r($rs["a"]);
?>
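Matching "<a" up to ">" returns the raw attribute text of each tag, which still has to be picked apart to get at the actual URLs. As an alternative sketch, assuming the DOM extension is available, the hrefs can be read with DOMDocument, and e-mail addresses can be collected with a deliberately simple pattern. extract_links(), extract_emails(), and the sample HTML are hypothetical.

```php
<?php
// Collect the href of every <a> element in an HTML string.
// extract_links() is a hypothetical helper name.
function extract_links($html)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Find strings that look like e-mail addresses (a simple, non-exhaustive pattern).
function extract_emails($text)
{
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $text, $m);
    return array_values(array_unique($m[0]));
}

$html = '<p><a href="/about.html">About</a> <a class="x" href="http://example.com/">Ex</a>'
      . ' <a name="top">anchor</a> Contact: admin@example.com</p>';

print_r(extract_links($html));  // the two href values
print_r(extract_emails($html)); // admin@example.com
```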

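Tip: requests made with file_get_contents are easily blocked by anti-scraping measures. cURL can imitate a real visitor's request, which makes it a considerably more capable option. file_get_contents() is somewhat less efficient and fails fairly often, while cURL performs well and can run several transfers concurrently; the only requirement is that the curl extension is enabled. On Windows, the steps to enable it are:

1. Copy the three files php_curl.dll, libeay32.dll, and ssleay32.dll from the PHP folder to system32;

2. In php.ini (in the C:\WINDOWS directory), remove the leading semicolon from ;extension=php_curl.dll;

3. Restart Apache or IIS.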
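After restarting the server, a quick way to confirm that the extension was actually picked up is to ask PHP directly. This is a small sketch, not part of the original steps:

```php
<?php
// Report whether the curl extension is available to this PHP runtime.
$hasCurl = extension_loaded('curl');

if ($hasCurl) {
    $info = curl_version(); // array describing the linked libcurl
    echo "curl is enabled, libcurl version: ", $info['version'], "\n";
} else {
    echo "curl is NOT enabled; check the extension line in php.ini\n";
    echo "php.ini in use: ", php_ini_loaded_file() ?: '(none)', "\n";
}
```

Printing the path of the loaded php.ini helps when several copies exist and the wrong one was edited.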

A simple page-fetching function that can also spoof the Referer and User-Agent headers:

<?php
// Fetch the given page and return its contents (false on failure)
function GetSources($Url, $User_Agent = '', $Referer_Url = '')
{
    // $Url: address of the page to fetch
    // $User_Agent: User-Agent string to send, e.g. "baiduspider" or "googlebot"
    // $Referer_Url: Referer header to send
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_USERAGENT, $User_Agent);
    curl_setopt($ch, CURLOPT_REFERER, $Referer_Url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $MySources = curl_exec($ch);
    curl_close($ch);
    return $MySources;
}
$Url = "http://www.phprm.com"; // page to fetch
$User_Agent = "baiduspider+(+http://www.baidu.com/search/spider.htm)";
$Referer_Url = 'http://www.jb51.net/';
echo GetSources($Url, $User_Agent, $Referer_Url);
?>
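The tip above mentions that cURL can handle several requests at once. In PHP this is done through the curl_multi_* functions; the following is a minimal sketch under the assumption that the curl extension is enabled. fetch_all() and the 15-second timeout are hypothetical choices, not something the original article specifies.

```php
<?php
// Fetch several URLs concurrently; returns url => body, or false for a failed transfer.
// fetch_all() is a hypothetical helper and assumes the curl extension is enabled.
function fetch_all(array $urls, $userAgent = '')
{
    $multi = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15); // assumed per-request limit in seconds
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    // Note which transfers finished with an error code.
    $failed = array();
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = in_array($ch, $failed, true) ? false : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $results; // handles are released when they go out of scope
}

// Usage (needs network access):
// $pages = fetch_all(array('http://www.phprm.com', 'http://www.jb51.net/'), 'Mozilla/5.0');
```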

Article URL: http://www.phprm.com/frame/php1005359.html

Feel free to repost, but please include the article URL :-)
