FOFA Asset Scraping Script

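This script fetches FOFA search results in bulk. It relies on a non-public FOFA API that is not intended for end users, so using it is essentially the same as scraping the web pages. Scraping the web pages does not require a premium membership, and as long as you do not fetch too much, it should not cause problems.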

Environment:

Python 2

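Usage:

Edit the base64-encoded query string in the script (the qbase64 assignment, around line 12). To choose whether links, IPs, or hosts are collected, change the field name in the regular expression to ip, link, or host (the "link" regex in spider(), around line 53). Also place a config.txt file in the same directory as the script, containing your own fofa_token.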
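The query string described above does not have to be edited by hand. A minimal sketch, assuming the qbase64 value is simply the FOFA query, base64-encoded and then URL-escaped (this matches the sample value in the script, which decodes to app="WUZHICMS"). The helper name encode_query is hypothetical and not part of the original script:

```python
import base64

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3


def encode_query(query):
    """Base64-encode a FOFA query, then URL-escape it (hypothetical helper)."""
    raw = base64.b64encode(query.encode("utf-8")).decode("ascii")
    # safe="" so that "+", "/" and "=" are all percent-encoded
    return quote(raw, safe="")


print(encode_query('app="WUZHICMS"'))  # YXBwPSJXVVpISUNNUyI%3D
```

The printed value can be pasted into the script as the new qbase64 string.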

Script:

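import sys
# Python 2 only: make UTF-8 the default encoding so non-ASCII results can be written to result.txt.
defaultencoding = 'utf-8'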
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)

import requests
from lxml import etree
import re

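# URL-escaped base64 of the FOFA query; this value decodes to app="WUZHICMS"
qbase64 = "YXBwPSJXVVpISUNNUyI%3D"
# config.txt holds the fofa_token on its first line
with open('config.txt', 'r') as config:
    cookie_config = config.readline().strip()
header = {
    'Authorization': cookie_config
}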

def request(url):
    # Return the response body as text, or None if the request fails.
    try:
        return requests.get(url, headers=header, timeout=10).text
    except (requests.exceptions.Timeout,
            requests.exceptions.ConnectionError) as e:
        # ConnectionError also covers ProxyError and ConnectTimeout.
        print(e)
        return None

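def pn_count(url):
    # Number of result pages, at 10 results per page.
    text = request(url)
    total_number = re.findall(r'"total":(\d+)', text or '')
    if not total_number:
        sys.exit("could not read the result total from the response")
    total_number = int(total_number[0])
    # Ceiling division: a partial last page still counts as a page.
    return (total_number + 9) // 10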

def spider():
    current_url = "https://api.fofa.so/v1/search?qbase64=" + qbase64
    pn = pn_count(current_url)
    print("spider website is :" + current_url)
    print("The results are {} pages in total".format(pn))
    stop_page = int(raw_input("please input stop page: \n"))
    # Never request more pages than the search actually has.
    stop_page = min(stop_page, pn)
    with open("result.txt", "w") as doc:
        for i in range(1, stop_page + 1):
            print("Now write " + str(i) + " page")
            text = request('https://api.fofa.so/v1/search?pn=' + str(i) + '&qbase64=' + qbase64)
            if text is None:
                print("error!!")
                continue
            # Change "link" to "ip" or "host" to collect a different field.
            for j in re.findall('"link":"(.*?)"', text):
                doc.write(j + "\n")
    print("OK,Spider is End .")

def main():
    spider()

if __name__ == '__main__':
    main()
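The two pieces of parsing logic in the script, the page count and the field extraction, can be tried offline. A minimal sketch, assuming the response contains "total":N and "field":"value" pairs as the script's regular expressions imply. The sample response text (including its "assets" key) and the names page_count and extract_field are made up for illustration and are not taken from FOFA or from the original script:

```python
import re

PAGE_SIZE = 10  # the script assumes 10 results per page


def page_count(text):
    """Pages needed for the "total" found in the response text (0 if absent)."""
    match = re.search(r'"total":(\d+)', text)
    total = int(match.group(1)) if match else 0
    return (total + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division


def extract_field(text, field):
    """Return every value of one field: "ip", "link" or "host"."""
    if field not in ("ip", "link", "host"):
        raise ValueError("field must be ip, link or host")
    return re.findall('"%s":"(.*?)"' % field, text)


# Made-up response fragment, for illustration only.
sample = ('{"total":23,"assets":['
          '{"ip":"192.0.2.1","host":"a.example.com","link":"http://a.example.com"},'
          '{"ip":"192.0.2.2","host":"b.example.com","link":"http://b.example.com"}]}')

print(page_count(sample))             # 3
print(extract_field(sample, "ip"))    # ['192.0.2.1', '192.0.2.2']
```

Passing the field name as an argument is one way to avoid editing the regular expression each time a different field is needed.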

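The script is not particularly polished. It is just a small tool for my own use that I tweak slightly whenever I need it, so I did not add command-line argument parsing. After all, scraping someone else's site too heavily is not appropriate.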
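If command-line parsing were wanted, it could look roughly like the following. This is only a sketch under assumptions: the option names (--query, --field, --stop-page) and the default of 5 pages are hypothetical and not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical options for the spider; none exist in the original."""
    parser = argparse.ArgumentParser(description="Fetch FOFA search results")
    parser.add_argument("--query", required=True,
                        help='FOFA query, e.g. app="WUZHICMS"')
    parser.add_argument("--field", choices=["ip", "link", "host"],
                        default="link", help="which field to save")
    parser.add_argument("--stop-page", type=int, default=5,
                        help="last page to fetch (kept small on purpose)")
    return parser.parse_args(argv)


# Example with an explicit argument list instead of sys.argv.
args = parse_args(["--query", 'app="WUZHICMS"', "--field", "ip"])
print("{} {} {}".format(args.query, args.field, args.stop_page))
```

Keeping the default stop page small follows the author's point about not fetching too much.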

PS: If the folks at Baimaohui (白帽汇) think this is inappropriate, contact me and I will take down the article and the script.