Code Clone Detection Method Based on the Ad-Sim Algorithm

English title: Code cloning detection method based on Ad-Sim algorithm
Authors: WANG Weihong (王卫红); GU Yongliang (谷永亮); MAO Yiwei (毛怡伟); ZHANG Zhenghao (张政豪)
Keywords: Simhash; Hamming distance; TF-IDF; Markov model
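Chinese journal title: ZJGD
English journal title: Journal of Zhejiang University of Technology
Institution: College of Computer Science and Technology, Zhejiang University of Technology
Publication date: 2019-06-14
Publisher: 浙江工业大学学报 (Journal of Zhejiang University of Technology)
Year: 2019
Issue: v.47; No.200
Funding: National Natural Science Foundation of China special fund project (61340058); Zhejiang Provincial Natural Science Foundation key project (LZ14F020001)
Language: Chinese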
Pages: ZJGD201904010
Page count: 6
CN: 04
ISSN: 33-1193/T
Classification number: 61-66

Abstract

Code clone detection plays an important role in code plagiarism detection, code review, software evolution, and bug detection. To improve the accuracy of code clone detection, this paper proposes Ad-Sim, an improved Simhash algorithm that combines TF-IDF with a Markov model. The algorithm first normalizes the code in a preprocessing step. Next, while Simhash computes the fingerprint signature, TF-IDF is used to compute the weight of each keyword, and a Markov model is used to optimize those weights. Finally, the Hamming-distance similarity between the fingerprint signatures of the code under test is compared to decide whether the code is a clone. Experimental results show that Ad-Sim improves on Simhash in both precision and recall for code clone detection, with a more pronounced gain in detection accuracy on small amounts of code.
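The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It shows only the generic parts (token normalization, TF-IDF-weighted Simhash, and Hamming-distance similarity) and leaves out the Markov-model weight optimization, because the abstract does not describe it. The 64-bit fingerprint length, the SHA-256 token hash, the smoothed IDF formula, and the helper names `tokenize`, `tfidf_weights`, `simhash`, and `hamming_similarity` are all assumptions made for illustration.

```python
import hashlib
import math
import re
from collections import Counter

BITS = 64  # assumed fingerprint length; the abstract does not state one


def tokenize(code: str) -> list[str]:
    """Crude normalization (an assumption): drop // and # line comments,
    lowercase the text, and split it into identifier, number, and symbol tokens."""
    code = re.sub(r"(//|#).*", "", code)
    return re.findall(r"[a-z_]\w*|\d+|[^\s\w]", code.lower())


def tfidf_weights(tokens: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """TF-IDF weight for each distinct token of one document."""
    doc_sets = [set(doc) for doc in corpus]
    n_docs = len(doc_sets)
    weights = {}
    for tok, count in Counter(tokens).items():
        df = sum(1 for doc in doc_sets if tok in doc)
        # Smoothed IDF (an assumption) so every weight stays positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[tok] = (count / len(tokens)) * idf
    return weights


def simhash(weights: dict[str, float]) -> int:
    """Weighted Simhash: each token adds +w or -w to every bit position."""
    acc = [0.0] * BITS
    for tok, w in weights.items():
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_similarity(a: int, b: int) -> float:
    """1 minus the fraction of differing bits between two fingerprints."""
    return 1.0 - bin(a ^ b).count("1") / BITS


if __name__ == "__main__":
    a = "int sum(int a, int b) { return a + b; } // add"
    b = "int sum(int a, int b) { return a + b; }"
    c = "while (queue.size() > 0) { process(queue.pop()); }"
    docs = [tokenize(s) for s in (a, b, c)]
    fa, fb, fc = (simhash(tfidf_weights(d, docs)) for d in docs)
    print(hamming_similarity(fa, fb))  # a and b are identical after normalization
    print(hamming_similarity(fa, fc))
```

In a detector built along these lines, a pair of snippets would be flagged as a clone when the similarity exceeds a chosen threshold. The abstract gives no threshold, so it would have to be tuned on labeled data.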
