Yet we cannot rest idle

Time is cruel: it mottles memory and erodes the years,

Before right and wrong could even be argued, we had already paid the due price for our choices,

Perhaps this is time's debt: what we owe is a past we cannot return to, decisions we cannot change,

In correcting those decisions again and again, the value we come to feel is experience,

Experience wears down the sharp edges of the heart: we no longer brood over the past, no longer waver over decisions,

Perhaps this is what growing up is. Yet we cannot rest idle; don't let laziness squander the time.

2015-08-06, Hangzhou, Xiasha

Some visual analysis of Douban movies | movie.douban.com data visualization

 

I recently noticed a Douban movie crawler I wrote back when I was playing with Scrapy (https://github.com/DormyMo/movieman/tree/master/crawler) and never put to use. With a bit of a drought in things to watch, I figured I'd crawl some data, dig up under-the-radar good films, and do a little analysis on the side. Douban's movie catalogue still isn't all that large, and the metadata for early years isn't very consistent, but the results are still somewhat interesting. Next time I'll crawl IMDb and compare.

Data as of: 2015-08-01
(Chart: number of films on Douban by year)

Continue reading

Meeting all kinds of crawling needs with Scrapy

I've been playing with crawlers for quite a while but never kept any notes, so I keep falling into the same pits. Time to write them down.

1. ImportError: Error loading object 'scrapy.core.downloader.handlers.s3.S3DownloadHandler': No module named win32api

Install pywin32: http://sourceforge.net/projects/pywin32/

2. In newer versions (1.0+), a spider that still inherits from scrapy.Spider will not execute its rules. This tripped me up for a long time; after finally reading the official docs, it turns out quite a lot changed, including several package paths:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
The spider also has to inherit from CrawlSpider; if you don't override parse, it will crawl according to the rules automatically.
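
A minimal sketch of such a spider under 1.0+ (the spider name, domain, and link pattern below are made-up placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    # name, domain and URLs are placeholders for illustration
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Follow links matching the pattern and hand each page to parse_item.
    # Do not override parse(), or the rules will be bypassed.
    rules = (
        Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}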

3. When start_urls is not enough to initialize the crawl and you need to configure the Requests yourself, override this method:

def start_requests(self):
    yield request  # build and yield your own Request objects here
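
A slightly fuller sketch, assuming the initial requests need custom headers or cookies (the URLs and header values are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'  # placeholder

    def start_requests(self):
        # Build the initial Requests by hand instead of relying on start_urls
        urls = ['http://example.com/page/1', 'http://example.com/page/2']
        for url in urls:
            yield scrapy.Request(url,
                                 headers={'User-Agent': 'my-crawler'},
                                 cookies={'session': 'xxx'},
                                 callback=self.parse)

    def parse(self, response):
        pass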

4. To reuse a request in a middleware or in parse while changing its URL:

request.replace(url = new_url)
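
For instance, inside a downloader middleware (a sketch; the http-to-https rewrite is just an illustration). Note that replace() returns a new Request, so it has to be returned rather than expecting the original to change:

class UrlRewriteMiddleware(object):
    def process_request(self, request, spider):
        # Returning a new Request makes Scrapy reschedule it with the changed URL
        if request.url.startswith('http://'):
            return request.replace(url=request.url.replace('http://', 'https://', 1))
        return None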

5. To read the settings or tweak the crawler configuration, override this class method. cls is the current class; if it is an extension you wrote yourself, it also has to be declared in settings.py:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    return cls(*args, **kwargs)
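
A sketch of a small home-made extension built around this hook (the extension name, setting key and module path are placeholders):

from scrapy import signals

class MyExtension(object):
    def __init__(self, some_value):
        self.some_value = some_value

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # crawler.settings exposes the merged project configuration
        ext = cls(crawler.settings.get('MY_SETTING', 'default'))
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('MyExtension loaded with %s' % self.some_value)

# settings.py (placeholder module path)
# EXTENSIONS = {'myproject.extensions.MyExtension': 500}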

6. Getting the project settings from inside a spider:

from scrapy.utils import project
self.settings = project.get_project_settings()

7. Getting the crawler's stats information:

crawler.stats.get_stats()

8. Rotating proxies and user agents randomly: define your own downloader middleware, set the reserved proxy key in request.meta, set the user agent via request.headers, and return None so the request keeps going through the chain:

def process_request(self, request, spider):
    request.meta['proxy'] = "http://127.0.0.1:8888"   # proxy goes into the reserved meta key
    request.headers.setdefault('User-Agent', ua)      # ua is your chosen user-agent string
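
A fuller sketch of the random rotation (the proxy list, user-agent list and middleware path are placeholders):

import random

PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']   # placeholders
USER_AGENTS = ['Mozilla/5.0 ...', 'Mozilla/4.0 ...']           # placeholders

class RandomProxyUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy and user agent for every outgoing request
        request.meta['proxy'] = random.choice(PROXIES)
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # let the request continue through the remaining middlewares

# settings.py (placeholder module path)
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyUserAgentMiddleware': 543}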

9. Deploying scrapyd requires creating the config file /etc/scrapyd/scrapyd.conf:

[scrapyd]
eggs_dir    = {your dir}/eggs
logs_dir    = {your dir}/logs
items_dir   = {your dir}/items
jobs_to_keep = 100
dbs_dir     = {your dir}/dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 10
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
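
Once scrapyd is up, jobs are driven through the JSON endpoints listed above; for example, scheduling and listing runs from Python (project and spider names are placeholders):

import requests

# Schedule a run through scrapyd's HTTP API
r = requests.post('http://localhost:6800/schedule.json',
                  data={'project': 'myproject', 'spider': 'myspider'})
print r.json()  # {"status": "ok", "jobid": "..."} on success

# List pending/running/finished jobs for the project
print requests.get('http://localhost:6800/listjobs.json',
                   params={'project': 'myproject'}).json()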

10. Running a spider under scrapyd fails with:
scrapyd CRITICAL: Unhandled error in Deferred:

The workaround that worked for me was disabling these two extensions in the EXTENSIONS setting of settings.py:

EXTENSIONS = {
    'scrapy.telnet.TelnetConsole': None,
    'scrapy.extensions.feedexport.FeedExporter': None,
}

Auto check-in script for zimuzu.tv (Python)

Python 2.7; requires the requests library.

#coding:utf8
'''
@author modm
http://www.7mdm.com
'''
import requests
import uuid
import time
import re

def sign(account, password):
    print 'signing...'
    # Fake a browser-like request: random client id plus common headers
    client_id = str(uuid.uuid1())
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-DevTools-Emulate-Network-Conditions-Client-Id': client_id,
        'X-Requested-With': 'XMLHttpRequest',
        'X-FirePHP-Version': '0.0.6',
        'Host': 'www.zimuzu.tv',
        'Origin': 'http://www.zimuzu.tv',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36',
        'Referer': 'http://www.zimuzu.tv/user/sign',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8'
    }
    # Grab a PHPSESSID from the login page's Set-Cookie header
    resHeaders = requests.get('http://www.zimuzu.tv/user/login', headers=headers).headers
    session = resHeaders['set-cookie'][10:36]
    headers['Cookie'] = 'PHPSESSID=' + session + '; CNZZDATA1254180690=111511151-1430530153-%7C1430530153'
    # Log in through the ajax endpoint
    data = {
        'account': account,
        'password': password,
        'remember': '1',
        'url_back': 'http://www.zimuzu.tv/user/sign'
    }
    res = requests.post('http://www.zimuzu.tv/User/Login/ajaxLogin', data=data, headers=headers)
    # Extract the GINFO / GKEY auth cookies from the login response
    cookie = res.headers['set-cookie']
    cookie = cookie.replace('GINFO=deleted;', '').replace('GKEY=deleted;', '')
    GINFO = re.search('GINFO=uid[^;]+', cookie).group(0) + ";"
    GKEY = re.search('GKEY=[^;]+', cookie).group(0) + ";"
    CPS = 'yhd%2F' + str(int(time.time())) + ";"
    # Assemble the session cookie for the check-in requests
    headers['Cookie'] = 'PHPSESSID=' + session + '; ' + CPS + (GINFO + GKEY) * 3
    # The site requires staying on the sign page for a while before checking in
    requests.get("http://www.zimuzu.tv/user/sign", headers=headers)
    print 'wait for 20 seconds...'
    time.sleep(20)
    content = requests.get("http://www.zimuzu.tv/user/sign/dosign", headers=headers).json()
    if content['data'] != False:
        print "sign success!", content['status']
    else:
        # data is False when you have already checked in today
        print "signed already!", content['status']

sign('username', 'password')

Uploading files with a POST request in Python

First, let's look at the form structure of a typical image-upload page:

<form action="images_upload.php" method="post" name="form_1" id="form_1" enctype="multipart/form-data">
<input name="file" class="div3" id="file" type="file" onchange="load_from()" title="点击更改">
<button type="submit" value="上传"/>
</form>

You can see that the form's action (the submit URL) is images_upload.php, the method is post, and the enctype is multipart/form-data.

Since I didn't know how to upload an image this way, I googled around and found the following libraries and approaches:

Uploading files via POST in Python (poster)

http://oldj.net/article/python-upload-file-via-form-post/

using requests post a multipart encoded file(requests)

http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file

Simulating an HTTP POST file upload in Python (urllib2)

http://www.cnblogs.com/chy710/p/3791317.html
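
For reference, the requests approach (the second link) boils down to something like this sketch, using the field names from the form above (the URL is a placeholder):

import requests

files = {'file': ('1.jpg', open('1.jpg', 'rb'), 'image/jpeg')}
data = {'type': 'img1'}
r = requests.post('http://example.com/images_upload.php', files=files, data=data)
print r.text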

Opening Chrome's Network panel to inspect the request, the request payload looked like this:

------WebKitFormBoundary6W5jJP9OUllB3OgE
Content-Disposition: form-data; name="file"; filename="1.jpg"
Content-Type: image/jpeg


------WebKitFormBoundary6W5jJP9OUllB3OgE
Content-Disposition: form-data; name="type"

img1
------WebKitFormBoundary6W5jJP9OUllB3OgE--

Since the first two approaches didn't work for the page I was dealing with, I tried the third one:

Put the image's binary data right after the Content-Type part header, set up the body and headers to match the captured payload, and the request succeeds and returns response data.

The implementation looks roughly like this:
import time
import urllib2

# Build the multipart/form-data body by hand
boundary = '----%s' % hex(int(time.time() * 1000))
data = []
fr = open(your_img_path, 'rb')   # path to the image file (placeholder)
data.append('--%s' % boundary)
data.append('Content-Disposition: form-data; name="file"; filename="1.jpg"')
data.append('Content-Type: %s\r\n' % 'image/jpeg')
data.append(fr.read())           # raw image bytes go right after the part headers
fr.close()
data.append('--%s' % boundary)
data.append('Content-Disposition: form-data; name="type"')
data.append('')
data.append('img1')              # value of the "type" field, as seen in the captured payload
data.append('--%s--' % boundary)

http_url = 'images_upload.php'
http_body = '\r\n'.join(data)
try:
    # build the HTTP request
    req = urllib2.Request(http_url, data=http_body)
    req.unverifiable = True
    # headers
    req.add_header('Content-Type', 'multipart/form-data; boundary=%s' % boundary)
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
    req.add_header('Referer', 'somewhere')
    req.add_header('Cookie', your_cookie)   # your own cookie string (placeholder)
    # post the data to the server
    resp = urllib2.urlopen(req, timeout=5)
    # read the response
    qrcont = resp.read()
    print qrcont

except Exception, e:
    print 'http error', e

Notes on setting up and using the DSpace digital repository

Setup

 

Continue reading

Debugging XPath with Chrome

Anyone who has played with crawlers knows libraries like lxml (Python) that make it convenient to extract data from HTML with XPath. But when you actually use them, the trouble starts: pinning down one element often takes several rounds of debugging, and you have to write code first. Below are a few tools, and how to test XPath directly in Chrome.
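
For context, the lxml side of things looks roughly like this minimal sketch (the HTML string is made up for illustration):

from lxml import html

doc = html.fromstring('<html><head><title>demo</title></head>'
                      '<body><a href="/movie/1">A movie</a></body></html>')
print doc.xpath('//title/text()')  # ['demo']
print doc.xpath('//a/@href')       # ['/movie/1']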

1. XPath Helper

https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl

1. Open a new tab and navigate to your favorite webpage.
2. Hit Ctrl-Shift-X to open the XPath Helper console.
3. Hold down Shift as you mouse over elements on the page. The query box will continuously update to show the full XPath query for the element below the mouse pointer. The results box to its right will show the evaluated results for the query.
4. If desired, edit the XPath query directly in the console. The results box will immediately reflect any changes.
5. Hit Ctrl-Shift-X again to close the console.

2. Xpath Finder (an XPath lookup helper)

https://chrome.google.com/webstore/detail/xpath-finder/ijaobnmmgonppmablhldddpfmgpklbfh/related

Adds a panel to Chrome DevTools that helps you quickly find the elements matching an XPath expression.

1. Matching nodes are highlighted with a red border.
2. You can inspect the corresponding node directly from the results list.
3. Keeps a history of your XPath queries for easy review.

3. Chrome console

F12 -> $x("//title")

[<title>Online Tools for Software Developers (Free)</title>]