Design and Implementation of a Distributed Data Collection Platform Based on Scrapy-Redis


English title: Design and implementation of distributed data collection system based on Scrapy-Redis
Authors: YAN Hui (严慧); PENG Xu-fu (彭绪富); ZHU Xiao-wan (朱小婉); XIONG Xu-hui (熊旭辉); DONG Ye-hao (董叶豪)
Affiliations: College of Computer Science and Technology, Hubei Normal University; College of Arts and Science, Hubei Normal University; College of Educational Science, Hubei Normal University
Keywords: micro-blog platform; data collection; distributed; web crawler; Scrapy-Redis
Journal code: HBSF
Journal: Journal of Hubei Normal University (Natural Science)
Publication date: 2019-03-25
Publisher: Journal of Hubei Normal University (Natural Science)
Year: 2019
Issue: v.39; No.167
Funding: Hubei Provincial Program for Outstanding Young and Middle-aged Science and Technology Innovation Teams in Higher Education Institutions (T201430)
Language: Chinese
Page: HBSF201901004
Page count: 7
CN: 01
ISSN: 42-1891/N
Classification number: 23-29

Abstract

This paper addresses current problems in collecting, mining, and analyzing big data from micro-blog platforms. It reviews the theory and techniques underlying a collection platform and studies four technical aspects: the platform's functional structure and back-end database design, page crawling and parsing, countermeasures against anti-crawler mechanisms, and the distribution strategy. On this basis, a distributed micro-blog data collection platform with a master-slave architecture is designed and implemented. A user only needs to enter the ID of the micro-blog page to be crawled and select the type of data to collect in order to obtain the required data. Testing shows that the system is inexpensive to set up and delivers high crawl performance, and that it can supply the base data for public opinion analysis and data surveys built on micro-blog data.
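
The master-slave arrangement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only models the general idea behind a Scrapy-Redis style crawler, in which one node seeds a shared request queue and several workers pull from it, with a shared fingerprint set preventing duplicate requests. The names `SharedQueue`, `seed`, and `run_workers` and the URL pattern `https://weibo.example/...` are hypothetical, and plain in-memory structures stand in for the Redis list and set that a real deployment would share across machines.

```python
import hashlib
from collections import deque


class SharedQueue:
    """In-memory stand-in (assumption) for a Redis-backed queue plus fingerprint set."""

    def __init__(self):
        self._pending = deque()   # requests waiting to be crawled
        self._seen = set()        # fingerprints of every URL ever enqueued

    def push(self, url):
        """Enqueue a URL unless its fingerprint was already seen."""
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        self._pending.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


def seed(queue, page_ids, data_type):
    """Master side: turn user input (page IDs, data type) into start URLs.

    The URL pattern is hypothetical and used only for illustration.
    """
    added = 0
    for page_id in page_ids:
        url = f"https://weibo.example/{page_id}/{data_type}"
        if queue.push(url):
            added += 1
    return added


def run_workers(queue, worker_names):
    """Worker side: workers take turns popping URLs until the queue drains."""
    handled = {name: [] for name in worker_names}
    turn = 0
    while True:
        url = queue.pop()
        if url is None:
            break
        handled[worker_names[turn % len(worker_names)]].append(url)
        turn += 1
    return handled


if __name__ == "__main__":
    shared = SharedQueue()
    # The repeated ID "1001" is filtered out by the fingerprint set.
    seed(shared, ["1001", "1002", "1001", "1003"], "posts")
    print(run_workers(shared, ["worker-a", "worker-b"]))
```

Because the workers share nothing except the queue and the fingerprint set, adding a worker requires no coordination with the others, which is the usual reason for choosing this layout.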

