thinkphp自动采集怎么实现

博主最大Lv63

thinkphp自动采集怎么实现

thinkphp实现自动采集功能的三种方法：

方法一：QueryList

个人感觉比较好用，采集详情比较不错的选择，但是采集复杂一点的列表，不好用。具体使用：

控制器示例：

public function index(){

// 使用采集类

// 使用手册：http://www.php.cn/php/php-QueryList3-ThinkPHP.html

import('Org.QL.QueryList');

$url = "http://www.zyctd.com/gqqg/";

$reg = array();

$reg['title'] = array('.sulist_title','text');

$reg['shuliang'] = array('.su_li1','html');

$obj = new \QueryList($url,$reg);

$data = $obj->jsonArr;

// foreach($data as $v){

// echo " ".$v['title'].'___'.$v['shuliang']." ";

// }

p($data);

}

相关推荐：《ThinkPHP教程》

方法二：simple_html_dom

这个方法比较适合采集一点结构简单的页面，HTML标签的类名比较明确的页面，还不错。具体使用：

控制器示例：

public function index(){

// 参考文档：http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart

// 下载地址：https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php

// 使用方法：http://www.thinkphp.cn/topic/21635.html

import("Org.Util.simple_html_dom", '', '.php');

$html = file_get_html('http://www.zyctd.com/gqqg/');

$ret = $html->find('.supply_list_box ul',0)->first_child();

foreach($ret as $v){

echo $v;

};

}

方法三：获取页面HTMl，进行正则匹配采集

举例一个Demo：

采集一个页面：

http://www.zyctd.com/gqqg/

我要获取上面的四个信息：标题，数量，时间，跳转链接。

获取这些信息，通过上面两种方法都采集不到，最后才选用的正则来采集。具体方法：

public function index(){

$url = "http://www.zyctd.com/gqqg/";

// http://www.zyctd.com/gqqg-p1.html

$supplyDB = M('supply');

$urlList = array();

$array = array();

for($x=1; $x<=1; $x++) {

array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html");

};

foreach($urlList as $v){

$curPageList = $this->getInfo($v);

array_push($array,$curPageList);

};

foreach($array as $v){

foreach($v as $vv){

//echo $vv['title']."__".$vv['weight']."__".$vv['time']." ";

$data = array();

$data['title'] = $vv['title'];

$data['weight'] = $vv['weight'];

$data['add_time'] = $vv['add_time'];

$data['url'] = $vv['url'];

//$res = $supplyDB->add($data);

//echo $res;

echo "".$vv['title']."

".$vv['weight']."

".$vv['add_time']."

".$vv['url']."";

}

// 获取信息

//$curPageList = $this->getInfo($html);

//p($curPageList);

}

private function getInfo($url){

$html = $this->getHtml($url);

$array = array();

// 匹配所有的标题

preg_match_all("#<divclass=\"sulist_title\">(.*?)</div>#",$html,$matches);

$all_title = $matches[1];

preg_match_all("#发布时间：(.*?)#",$html,$matches);

// 匹配所有的发布时间

$all_time = $matches[1];

// 匹配所有的求购数量

preg_match_all("#求购数量：(.*?)#",$html,$matches);

$all_weight = $matches[1];

// 匹配跳转链接

preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches);

$all_url = $matches[1];

// 组合

foreach($all_title as $k => $v){

$arr = array();

$arr['title'] = $v;

$arr['weight'] = $all_weight[$k];

$arr['add_time'] = $all_time[$k];

$arr['url'] = $all_url[$k];

array_push($array,$arr);

}

return $array;

}

private function getHtml($url){

$html = file_get_contents($url);

$html = preg_replace("#\n#","",$html);

$html = preg_replace("#\r#","",$html);

$html = preg_replace("#\\s#","",$html);

return $html;

}

以上就是thinkphp自动采集怎么实现的详细内容

0 已被阅读了1184次楼主 2020-06-23 13:27:56

回复列表

默认热门正序倒序