HDU ACM Robot

Author: 汪小祯  Date: 2016-07-27

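First, the result. I ran the crawler on my own VPS for one afternoon, and my rank went from around 250,000th to 22nd. Of the roughly 4,700 problems on the site, it got more than 2,000 accepted (AC).
[Screenshot: acmacrobot]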

Related links

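Blog post by the author of the original code: 听说你叫爬虫(11) —— 也写个AC自动机
HDU's online judge (OJ): HDU ACM (http://acm.hdu.edu.cn)
Source of the solutions: CSDN search (http://so.csdn.net/so/search/s.do)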

Overall approach

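Simulate login, crawl the problem, search for a solution, extract the code, and submit it. On success, move on to the next problem; on failure, keep going through the search results.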
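The loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `solve_problem` and its three callable parameters are hypothetical names, chosen so the control flow can be read (and run) without touching the network.

```python
def solve_problem(pid, find_candidates, submit, is_accepted):
    """Try candidate solutions for one problem until one is accepted.

    All three callables are hypothetical stand-ins:
      find_candidates(pid)        -> iterable of (language, code) pairs
      submit(pid, language, code) -> None
      is_accepted(pid)            -> bool
    """
    for language, code in find_candidates(pid):
        submit(pid, language, code)
        if is_accepted(pid):
            return True   # success: the caller moves on to the next problem
    return False          # every search result was tried without an accept


def solve_range(start, end, find_candidates, submit, is_accepted):
    """Run solve_problem over problem IDs start..end (inclusive)."""
    solved = []
    for number in range(start, end + 1):
        pid = str(number)
        if solve_problem(pid, find_candidates, submit, is_accepted):
            solved.append(pid)
    return solved
```

In the real script, the roles of these callables are played by the CSDN search, the POST to the submit URL, and the check against the user's status page.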

Test results

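I first ran the original author's code on my VPS and found that, for some problems, the regular expressions do not cover every case, so the script raises an error.
In addition, it always starts at problem 1000 and runs straight through to the last problem by default.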
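One common way this kind of failure shows up is calling `.group(1)` on the result of `re.search` when nothing matched, which raises `AttributeError` because the result is `None`. The sketch below shows a defensive version of the title extraction; `extract_title` is a hypothetical helper name, and the pattern is the same `<title>` pattern the script uses, with case-insensitive matching added as an assumption.

```python
import re

# Same <title> pattern as the script, compiled once; re.I is an added assumption.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.S | re.I)


def extract_title(html):
    """Return the lowercased page title, or None if the page has no <title>."""
    match = TITLE_RE.search(html)
    if match is None:
        return None  # caller can skip this page instead of crashing
    return match.group(1).strip().lower()
```

A caller can then write `if title is None: continue` and move on to the next search result.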

Changes

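I wrapped the search step in a try/except, so that when an exception occurs the script skips that problem and keeps running.
The starting and ending problem numbers can now be set by the user.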
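A minimal sketch of these two changes together, assuming a hypothetical `handle(pid)` callable that does the per-problem work. The point to note is that the counter advances whether or not the handler raised; otherwise a deterministic error would make the loop retry the same problem forever.

```python
def run_range(start, end, handle):
    """Call handle(pid) for every problem ID in [start, end].

    handle is a hypothetical callable; IDs whose handler raises are
    recorded and skipped so the run continues with the next problem.
    """
    failed = []
    for number in range(start, end + 1):
        pid = str(number)
        try:
            handle(pid)
        except Exception as exc:  # deliberately broad: any scraping error means "skip"
            print('skipping %s: %r' % (pid, exc))
            failed.append(pid)
    return failed
```

Catching `Exception` rather than using a bare `except:` keeps Ctrl-C (`KeyboardInterrupt`) working, which matters for a script meant to run for hours.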

Source code

Runs on Python 2.7; the third-party library it depends on (requests) must be installed first.

# coding: utf-8
import requests, re, os, HTMLParser, time, getpass

host_url = 'http://acm.hdu.edu.cn'
post_url = 'http://acm.hdu.edu.cn/userloginex.php?action=login'
sub_url = 'http://acm.hdu.edu.cn/submit.php?action=submit'
csdn_url = 'http://so.csdn.net/so/search/s.do'
head = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'}
html_parser = HTMLParser.HTMLParser()
s = requests.session()


def login(usr, psw):
    # Visit the home page first so the session picks up its cookies, then post the login form.
    s.get(host_url)
    data = {'username': usr, 'userpass': psw, 'login': 'Sign In'}
    s.post(post_url, data=data)


def check_lan(lan):
    if 'java' in lan:
        return '5'
    return '0'


def parser_code(code):
    return html_parser.unescape(code).encode('utf-8')


def is_ac(pid, usr):
    tmp = requests.get('http://acm.hdu.edu.cn/userstatus.php?user=' + usr).text
    accept = re.search(
        'List of solved problems</font></h3>.*?<p align=left><script language=javascript>(.*?)</script><br></p>', tmp,
        re.S)
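    if accept is None:
        # The status page did not match the expected layout; treat the problem as unsolved.
        return False
    if pid in accept.group(1):
        print '%s was solved' % pid
        return True
    return False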


def search_csdn(PID, usr):
    get_data = {'q': 'HDU ' + PID, 't': 'blog', 'o': '', 's': '', 'l': 'null'}
    search_html = requests.get(csdn_url, params=get_data).text
    linklist = re.findall('<dd class="search-link"><a href="(.*?)" target="_blank">', search_html, re.S)
    for l in linklist:
        print l
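        tm_html = requests.get(l, headers=head).text
        title_match = re.search('<title>(.*?)</title>', tm_html, re.S)
        if title_match is None:
            # No <title> on this page; skip it instead of raising AttributeError.
            continue
        title = title_match.group(1).lower()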
        if PID not in title:
            continue
        if 'hdu' not in title:
            continue
        tmp = re.search('name="code" class="(.*?)">(.*?)</pre>', tm_html, re.S)
        if tmp is None:
            print 'code not found'
            continue
        LAN = check_lan(tmp.group(1))
        CODE = parser_code(tmp.group(2))
        # Only submit text that looks like C/C++ or Java source.
        if 'include' not in CODE and 'import java' not in CODE:
            continue
        print PID, LAN
        print '--------------'
        submit_data = {'check': '0', 'problemid': PID, 'language': LAN, 'usercode': CODE}
        s.post(sub_url, headers=head, data=submit_data)
        time.sleep(5)
        if is_ac(PID, usr):
            break


if __name__ == '__main__':
    usr = raw_input('input your username:')
    psw = getpass.getpass('input your password:')
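    # In Python 2, input() evaluates what the user types; read text and convert explicitly.
    start = int(raw_input('start problem(1000-5762):'))
    end = int(raw_input('end problem:'))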
    login(usr, psw)
    pro_cnt = start
    while pro_cnt <= end:
        PID = str(pro_cnt)
        if is_ac(PID, usr):
            pro_cnt += 1
            continue
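        try:
            search_csdn(PID, usr)
        except Exception as e:
            # Skip this problem on any error and fall through, so pro_cnt still advances.
            print 'error on %s: %s' % (PID, e)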
        pro_cnt += 1
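The script above targets Python 2.7, where `HTMLParser.HTMLParser().unescape` is available. For anyone porting it, the small helpers could look roughly like this on Python 3, where the standard library provides `html.unescape`. This is only a sketch: `looks_like_source` is a hypothetical name for the inline `include` / `import java` check, and the language codes `'5'` and `'0'` are copied from the script as-is, not checked against the judge's submit form.

```python
import html


def parser_code(code):
    """Turn HTML entities from the blog page back into plain source text."""
    return html.unescape(code)


def check_lan(lan):
    """Map the <pre> class attribute to the language code the script submits."""
    # '5' and '0' are the values used by the original script (assumed, not verified).
    return '5' if 'java' in lan else '0'


def looks_like_source(code):
    """Hypothetical helper: same heuristic as the script's include/import check."""
    return 'include' in code or 'import java' in code
```

For example, `parser_code('#include &lt;cstdio&gt;')` gives back `#include <cstdio>`, which is what has to be submitted instead of the escaped text scraped from the page.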