References
WebMagic's architecture is modeled on Scrapy.
Project home page: http://webmagic.io/
GitHub repository: https://github.com/code4craft/webmagic
Documentation: http://webmagic.io/docs/zh/
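To make the Scrapy-style architecture concrete, here is a minimal sketch of the loop such frameworks are built around: a scheduler hands out URLs, a downloader fetches them, a page processor extracts fields and new links, and a pipeline stores the results. It uses only the JDK and is not WebMagic's code. The class name `MiniCrawlLoop`, the in-memory `FAKE_WEB` map and its "title|links" body format are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class MiniCrawlLoop {

    // Stand-in for the network: URL -> "title|comma-separated links". Entirely made up.
    static final Map<String, String> FAKE_WEB = new HashMap<>();
    static {
        FAKE_WEB.put("a", "Page A|b,c");
        FAKE_WEB.put("b", "Page B|c");
        FAKE_WEB.put("c", "Page C|");
    }

    // Downloader role: fetch the raw body for a URL (null if it does not exist).
    static String download(String url) {
        return FAKE_WEB.get(url);
    }

    // Crawl from a seed and return the titles in the order they were collected.
    static List<String> crawl(String seed) {
        Queue<String> scheduler = new ArrayDeque<>(); // Scheduler role: pending URLs
        Set<String> seen = new HashSet<>();           // URLs already queued once
        List<String> collected = new ArrayList<>();   // Pipeline role: where results end up
        scheduler.add(seed);
        seen.add(seed);
        while (!scheduler.isEmpty()) {
            String body = download(scheduler.poll());
            if (body == null) {
                continue; // nothing to process for an unknown URL
            }
            // PageProcessor role: extract a field and discover follow-up links.
            String[] parts = body.split("\\|", 2);
            collected.add(parts[0]);
            if (parts.length > 1 && !parts[1].isEmpty()) {
                for (String link : parts[1].split(",")) {
                    if (seen.add(link)) {
                        scheduler.add(link);
                    }
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // prints [Page A, Page B, Page C]
    }
}
```

In WebMagic the same roles are separate, pluggable components; the spider written later in this post only supplies the page-processor part and picks two pipelines.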
Create a new Maven project in IntelliJ IDEA
1. Dependency configuration: WebMagicSpider/pom.xml
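<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>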
2. Logging configuration: WebMagicSpider/src/main/resources/log4j.properties
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
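The conversion pattern above is terse, so here is a small JDK-only sketch of what it lays out: `%d` is the timestamp (log4j 1.x defaults to the ISO8601-style `yyyy-MM-dd HH:mm:ss,SSS`), `%p` the level, `%c` the logger name, `%m` the message and `%n` a line break. This only imitates the layout and does not use log4j itself. The class name `PatternDemo` and the sample message are made up.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class PatternDemo {

    // Imitates the layout "%d %p [%c] - %m%n" for one log event.
    static String render(Date when, String level, String logger, String message) {
        // %d with no explicit format: yyyy-MM-dd HH:mm:ss,SSS in the local time zone
        String timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS").format(when);
        return timestamp + " " + level + " [" + logger + "] - " + message + System.lineSeparator();
    }

    public static void main(String[] args) {
        // Hypothetical event, only to show the shape of a rendered line.
        System.out.print(render(new Date(), "WARN", "us.codecraft.webmagic.Spider", "sample message"));
    }
}
```

Because the root logger is set to WARN, only WARN and ERROR events reach the console with this configuration; INFO and DEBUG output is suppressed.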
Building the project
1. Write the spider: WebMagicSpider/src/main/java/BaiduPageProcessor.java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class BaiduPageProcessor implements PageProcessor {

    // Crawl settings: retry a failed download once, wait 1000 ms between
    // requests, and decode pages as UTF-8.
    private Site site = Site.me()
            .setRetryTimes(1)
            .setSleepTime(1000)
            .setCharset("utf-8");

    // Extract the text of the <title> element and store it under "title".
    @Override
    public void process(Page page) {
        page.putField("title", page.getHtml().css("title", "text").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new BaiduPageProcessor())
                .addUrl("http://www.baidu.com/")
                // Print the extracted fields to the console.
                .addPipeline(new ConsolePipeline())
                // Also write them as JSON files under this directory.
                .addPipeline(new JsonFilePipeline("/Users/qmp/myproject/WebMagicSpider"))
                .thread(1)
                .run();
    }
}
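The key step in the processor is pulling the text out of the page's `<title>` element. As a rough illustration of that extraction, the sketch below does the same thing with a JDK regular expression. It is not how WebMagic implements `css("title", "text")`, which relies on a real HTML parser. The class name `TitleExtractDemo` and the sample HTML are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractDemo {

    // Matches the content between <title ...> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when the page has no <title>.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Domain</title></head><body></body></html>";
        System.out.println("title: " + extractTitle(html)); // prints "title: Example Domain"
    }
}
```

A regular expression is good enough for a single well-formed tag like this one, but for anything beyond that a selector-based approach such as the one in the spider is more robust.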
2. Run the program
Console output
get page: http://www.baidu.com/
title: 百度一下,你就知道
File output
{"title":"百度一下,你就知道"}
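The JSON above lands in a file somewhere under the directory passed to `JsonFilePipeline`. To my understanding, the pipeline creates a subdirectory named after the site's domain and names each file after the MD5 hex digest of the page URL, with a `.json` extension. Treat that layout as an assumption and check the directory after a run. The sketch below computes the path that would result under this assumption; the class name `OutputPathDemo` and its helpers are made up, and only the JDK is used.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OutputPathDemo {

    // Lower-case, zero-padded MD5 hex digest of a string's UTF-8 bytes.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumed layout: <baseDir>/<domain>/<md5 of url>.json
    static String expectedPath(String baseDir, String domain, String url) {
        return baseDir + "/" + domain + "/" + md5Hex(url) + ".json";
    }

    public static void main(String[] args) {
        System.out.println(expectedPath(
                "/Users/qmp/myproject/WebMagicSpider", "www.baidu.com", "http://www.baidu.com/"));
    }
}
```

If the file does not show up at the printed path, list the base directory to see which layout your WebMagic version actually uses.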