scigen detection

SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence.

One useful purpose for such a program is to auto-generate submissions to conferences that you suspect might have very low submission standards. A prime example, which you may recognize from spam in your inbox, is SCI/IIIS and its dozens of co-located conferences (check out the very broad conference description on the WMSCI 2005 website). There's also a list of known bogus conferences. Using SCIgen to generate submissions for conferences like this gives us pleasure to no end. In fact, one of our papers was accepted to SCI 2005! See Examples for more details.

SCIgen is an automatic paper generator: it uses a context-free grammar to produce nonsensical English-language research papers, complete with figures, tables, flow charts, and references. Some SCIgen papers have even been accepted by real venues (WMSCI 2005, for example). This post looks at how to tell SCIgen-generated fake papers apart from real ones, i.e. SCIgen detection.

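I. Goal

Given one or more papers, decide whether each is a real paper, i.e. one not generated by SCIgen.

II. Data preprocessing

(1) Obtaining training samples

Positive samples: assorted real papers provided by my instructor
Negative samples: papers crawled from the MIT SCIgen site

The Python crawler is below (papers are saved to the j://unrealpdf/ folder, 200 per run):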

# -*- coding: utf-8 -*-
'''
author:  hrwhipser
Purpose: download papers generated by SCIgen
'''
import requests
from bs4 import BeautifulSoup

def download_file(url, local_filename):
    # Stream the response to disk in 1 KB chunks
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename

savePath = 'j://unrealpdf/'
url = 'http://pdos.csail.mit.edu/cgi-bin/sciredirect.cgi?author=&author=&author=&author=&author='
for i in range(200):
    try:
        r = requests.get(url)
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            # The second <a> on the generated page is taken as the PDF link
            link = soup.find_all('a')[1]
            new_link = 'http://scigen.csail.mit.edu' + link.get('href')
            # Drop the last four characters of the page name and append "pdf"
            filename = r.url.split('/')[-1][:-4] + 'pdf'
            download_file(new_link, savePath + filename)
            print(str(i + 1) + '  downloading: ' + filename)
        else:
            print('errors: ' + str(i))
    except Exception as e:
        print(e)

print('ok')

(2) Converting PDF to TXT

This step is still part of preprocessing; its purpose is to make the papers easier to process.

The conversion results were quite good.
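
One way to do this conversion, assuming the `pdftotext` command-line tool from Poppler is installed, is sketched below in Python 3. This is an assumption for illustration and may not be the tool used to produce the results in this post; the helper names `txt_path_for` and `convert_all` are hypothetical.

```python
import os
import subprocess


def txt_path_for(pdf_path, out_dir):
    """Map e.g. 'pdfs/scimakelatex.123.none.pdf' to '<out_dir>/scimakelatex.123.none.txt'."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, base + '.txt')


def convert_all(pdf_dir, out_dir):
    """Run `pdftotext <pdf> <txt>` on every PDF in pdf_dir (assumes pdftotext is on PATH)."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(pdf_dir)):
        if name.lower().endswith('.pdf'):
            src = os.path.join(pdf_dir, name)
            subprocess.run(['pdftotext', src, txt_path_for(src, out_dir)], check=True)
```

Calling `convert_all('j://unrealpdf/', './testpaper/')` would then write one text file per downloaded paper. Like most PDF-to-text converters, it emits only the text layer, so embedded figures do not appear in the output.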

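(3) Document preprocessing

I split the cleanup into the following steps:

Remove figures and tables
1. Figures and tables in a paper mainly serve to explain and illustrate, and they are hard to judge automatically.
2. How: the PDF-to-TXT conversion already "removes" them.
Remove the title and author information
1. The author list only records who owns the paper and has no effect on its meaning.
2. How: start reading the paper at "Abstract".
Remove the references
1. References are mostly names of people and publications, so removing them has almost no effect on classification (SCIgen can add such entries too, after all).
2. How: stop reading the paper once "references" is reached.
Remove citation markers []
1. Square brackets in the text mostly mark citations and can be deleted.
2. How: regular expression.
Replace other brackets and punctuation with spaces
1. Parentheses mainly add explanation, and removing them barely changes the meaning. Other punctuation mainly marks sentence breaks and can also be considered for removal.
2. How: regular expression.
Convert all words to lowercase
1. A word is the same word in either case; the only difference is whether it starts a sentence. Without normalization, Smart and smart would be counted as two different words.
2. How: Python's built-in lowercasing method.
Normalize numbers
1. Numbers in a paper usually just express quantity or size, so they can all be mapped to one marker symbol, or to one specific number, to reduce redundancy.
2. How: regular expression.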

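# -*- coding: utf-8 -*-
'''
author:  hrwhipser
Purpose: preprocess the txt files converted from PDF, e.g. strip author information
'''
import re, os
refer = re.compile(r'\[\d*\]')       # citation markers such as [12]
other = re.compile(r'[(),<>\-_"]')   # other brackets and punctuation; "-" is escaped so it is not read as a range
num = re.compile(r'\d+(\.\d+)?')     # integers and decimals; "." is escaped so it matches only a decimal point
txtpath = r'./testpaper/'
outpath = r'./testpaper/'

files = next(os.walk(txtpath))[-1]
for name in files:
    print(name)
    with open(txtpath + name, 'r') as f:
        content = f.read().lower()
    # Drop title and author information: keep only what follows "abstract"
    temp_index = content.find('abstract')
    if temp_index != -1: content = content[temp_index + 8:]
    # Drop the reference list: cut at the last occurrence of "references"
    temp_index = content.rfind('references')
    if temp_index != -1: content = content[:temp_index]
    with open(outpath + name, 'w') as wf:
        wf.write(num.sub('1', other.sub(' ', refer.sub('', content))))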
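
To see what these steps do to a piece of text, here is a small Python 3 sketch that applies them to an in-memory string instead of files. The function name `clean` and the sample text are made up for illustration; the patterns escape the hyphen and the decimal point so that both match literally.

```python
import re

REFER = re.compile(r'\[\d*\]')        # citation markers such as [12]
OTHER = re.compile(r'[(),<>\-_"]')    # other brackets and punctuation
NUM = re.compile(r'\d+(\.\d+)?')      # integers and decimals


def clean(text):
    """Lowercase, keep the span between 'abstract' and the last 'references', then normalize."""
    text = text.lower()
    start = text.find('abstract')
    if start != -1:
        text = text[start + len('abstract'):]
    end = text.rfind('references')
    if end != -1:
        text = text[:end]
    # Order matters: remove [12] first, then blank out leftover punctuation, then map numbers to 1
    return NUM.sub('1', OTHER.sub(' ', REFER.sub('', text)))


sample = ('A Title\nAlice\nAbstract\n'
          'Smart caches (see [12]) cut latency by 3.5 ms.\n'
          'References\n[12] Bob.')
print(clean(sample).split())
```

The title, author line, and reference list disappear, the citation marker and parentheses are dropped, and `3.5` becomes `1`.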

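III. Exploring detection methods

Detecting SCIgen is essentially a binary classification problem: given a paper, decide whether it was generated by SCIgen.

The steps above converted the samples from PDF to plain text and preprocessed them.

What remains is to extract features from the documents that separate real papers from fake ones.

(1) Word co-occurrence network

1. Introduction to word co-occurrence networks and network construction

Word co-occurrence network: if two words co-occur within the same unit (for example, in adjacent positions or in the same paragraph), they are considered related.

The idea is that each word corresponds to one node in the network. If two words in a sentence are within a distance of n of each other, they co-occur, and an edge connects them in the network.

Extensive practice shows that n=2 is a suitable choice. It is quite common for two related words to appear next to each other in a sentence.

For example, in "lovely girl" the distance is 1, and in "buy an apple" the distance between buy and apple is 2.

If n is too large, many unrelated word pairs are introduced, which increases the complexity of the model.

In the network I built, paragraphs are treated as independent of one another, while the content within a paragraph is related, so two adjacent sentences in a paragraph are linked. For two adjacent sentences in a paragraph, the last word of the first sentence and the first word of the second are treated as being at distance 2.

For a paper's word co-occurrence network, we can compute the number of nodes, the number of edges, the average degree, the average path length, the network diameter, and the clustering coefficient, and use these six features to represent the network.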
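
As a rough illustration of the construction above, here is a minimal Python 3 sketch using only the standard library. The function names and the input layout (a list of paragraphs, each a list of sentences, each a list of words) are assumptions made for this example, not the code behind the results in this post. Two further assumptions: average path length and diameter are taken over connected node pairs only, and the clustering coefficient is the average of the local coefficients.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(paragraphs, n=2):
    """Return {word: set(neighbours)}; words at distance <= n in a paragraph are linked."""
    graph = {}
    for sentences in paragraphs:
        # Words in a sentence sit at consecutive positions; a sentence boundary adds one
        # extra step, so for n=2 only "last word / first word" pairs cross a boundary.
        placed, pos = [], 0
        for sentence in sentences:
            for word in sentence:
                placed.append((pos, word))
                graph.setdefault(word, set())
                pos += 1
            pos += 1
        for i, (pos_a, a) in enumerate(placed):
            for j in range(i + 1, len(placed)):
                pos_b, b = placed[j]
                if pos_b - pos_a > n:
                    break
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def network_features(graph):
    """[nodes, edges, average degree, average path length, diameter, clustering coefficient]."""
    nodes = len(graph)
    edges = sum(len(nb) for nb in graph.values()) // 2
    avg_degree = 2.0 * edges / nodes if nodes else 0.0
    total = pairs = diameter = 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    avg_path = total / pairs if pairs else 0.0
    coeffs = []
    for nb in graph.values():
        k = len(nb)
        links = sum(1 for a, b in combinations(nb, 2) if b in graph[a])
        coeffs.append(2.0 * links / (k * (k - 1)) if k > 1 else 0.0)
    clustering = sum(coeffs) / nodes if nodes else 0.0
    return [nodes, edges, avg_degree, avg_path, diameter, clustering]


graph = build_cooccurrence_graph([[["buy", "an", "apple"]]])
print(sorted(graph["buy"]), network_features(graph))
```

In the "buy an apple" example, buy is linked to both an (distance 1) and apple (distance 2). The breadth-first search from every node is quadratic in the vocabulary size, which is acceptable for single papers but would need a graph library for larger corpora.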

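2. Choosing a classifier

I tried both SVM and KNN for training and classification. The results:

[Figure: SVM results]

[Figure: KNN results]

On a test set of 894 papers, SVM reached 97.87% accuracy and KNN reached 99.33%.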

# -*- coding: utf-8 -*-
'''
author:  hrwhipser
Purpose: classify papers with KNN / SVM
'''

from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

knn = False   # True: use KNN; False: use SVM
realPath = './data/paper.data'
unRealPath = './data/scigen.data'
testPath = './data/test.data'
realNum = 176
unrealNum = 176

def getTestCase(path, isReal):
    # Each line: file name followed by space-separated feature values
    test_input, fileNames = [], []
    with open(path, 'r') as f:
        for line in f:
            content = line.strip().split(' ')
            fileNames.append(content[0])
            # content[1:-1]: the fields after the file name, excluding the last one
            test_input.append([float(i) for i in content[1:-1]])
    return fileNames, test_input, [isReal] * len(test_input)

fileNames, input1, output1 = getTestCase(realPath, True)
fileNames, input2, output2 = getTestCase(unRealPath, False)
train_input = input1[:realNum] + input2[:unrealNum]
train_output = output1[:realNum] + output2[:unrealNum]
# The isReal flag is only a placeholder here; test labels are not used below
fileNames, test_input, test_output = getTestCase(testPath, True)

clf = KNeighborsClassifier(n_neighbors=3) if knn else svm.SVC(gamma=0.4, C=2)
clf.fit(train_input, train_output)
predicted = clf.predict(test_input)
for filename, result in zip(fileNames, predicted):
    print(filename, result)

(2) Hierarchical clustering

1. Document distance
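
Written out as a formula, the distance computed by the `calPaperDistance` function in the listing at the end of this section looks like this. The notation is introduced here for illustration: $a_w$ and $b_w$ are the counts of word $w$ in papers $A$ and $B$, and $N_A$, $N_B$ are their total word counts.

```latex
N_A = \sum_{w \in A} a_w, \qquad N_B = \sum_{w \in B} b_w

d(A, B) = \frac{1}{2 N_A} \left(
      \sum_{w \in A \cap B} \left| a_w - \frac{N_A}{N_B}\, b_w \right|
    + \sum_{\substack{w \in A \setminus B \\ a_w \ge 2}} a_w
    + \sum_{w \in B \setminus A} b_w
\right)
```

For example, with $A = \{x: 2,\ y: 2\}$ and $B = \{x: 1,\ z: 1\}$ we get $N_A = 4$, $N_B = 2$, and the three sums are $|2 - 2 \cdot 1| = 0$, $2$, and $1$, so $d(A, B) = 3 / 8 = 0.375$. Note that the measure is not symmetric: in general $d(A, B) \ne d(B, A)$.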

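2. Hierarchical clustering

Hierarchical clustering builds a hierarchy of groups by repeatedly merging the two most similar groups.

[Figure: illustration of the hierarchical clustering process]

Using the distance measure described above, I ran hierarchical clustering on the positive and negative samples, with the result shown here (with 200 samples the effect is just as pronounced):

[Figure: scigen-detection-hierarchical-clustering-result]

As the result shows, the fake papers generated by SCIgen all cluster together.

This suggests that the distance measure is reliable.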
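
The merging procedure can be sketched in a few lines of Python 3. This is a minimal illustration, not the code used for the figure: the function name `agglomerative` is hypothetical, average linkage is an assumption (the linkage rule is not specified here), and the toy distance in the example is symmetric.

```python
def agglomerative(names, dist, target_clusters=1):
    """Repeatedly merge the two closest clusters until target_clusters remain.

    names: list of item labels
    dist:  function(label_a, label_b) -> float
    Returns (clusters, merges), where merges records each merge in order.
    """
    clusters = [[name] for name in names]
    merges = []

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the two clusters
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Toy example: four "papers" placed on a line; two sit near 0 and two near 5
points = {'real1': 0.0, 'real2': 0.2, 'scigen1': 5.0, 'scigen2': 5.1}
clusters, merges = agglomerative(list(points), lambda a, b: abs(points[a] - points[b]),
                                 target_clusters=2)
print(clusters)
```

Stopping at two clusters separates the two groups in this toy example. With real papers, `dist` would be a document distance such as the bag-of-words one used in this section.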

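3. Classification method

Assigning a paper to the class of its single nearest neighbour did not work very well, so I borrowed the idea behind KNN: compute the distances, sort them, and see which class the paper is closer to.

[Figure: classification results]

The recognition rate reaches 99.55%.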

# -*- coding: utf-8 -*-
'''
author:  hrwhipser
Purpose: classify test papers using bag-of-words distances
'''

import os
import re
from math import fabs

truePath = r'./data/true'
falsePath = r'./data/false'
testPaper = r'./testpaper/'
separator = re.compile(r'[!?.\n\t\r]')

def calFrequent(path, num, isTrain=True):
    frequencies = {}
    files = next(os.walk(path))[-1]
    if isTrain: files = files[:num]   # training: use only the first num files
    for fileName in files:
        with open(path + '/' + fileName, 'r') as f:
            content = f.read()
        words = separator.sub(' ', content).split(' ')
        curDic = {}
        for word in words:
            if word:
                curDic[word] = curDic.get(word, 0) + 1
        frequencies[fileName] = curDic
    # {fileName: {word: wordCnt}}
    return frequencies

def calPaperDistance(dicA, dicB):
    NA = sum(dicA.values())
    NB = sum(dicB.values())
    # An empty document would divide by zero; treat it as infinitely far away
    if NA == 0 or NB == 0: return float('inf')
    rate = NA * 1.0 / NB   # scales B's counts to A's length
    dis = 0
    for word in dicA:
        if word in dicB: dis = dis + fabs(dicA[word] - dicB[word] * rate)
        elif dicA[word] >= 2: dis = dis + dicA[word]   # words only in A count if they occur at least twice
    for word in dicB:
        if word not in dicA: dis = dis + dicB[word]
    return dis * 1.0 / (NA * 2)

trainNum = 176
trueFre = calFrequent(truePath, trainNum, isTrain=True)
falseFre = calFrequent(falsePath, trainNum, isTrain=True)
train_data = dict(trueFre, **falseFre)
test_data = calFrequent(testPaper, trainNum, isTrain=False)

k = 3
for fileName, wordCnt in test_data.items():
    distances = [(fileTrain, calPaperDistance(wordCnt, wordCntTrain)) for fileTrain, wordCntTrain in train_data.items()]
    distances.sort(key=lambda x: x[1])
    truePaper = falsePaper = 0
    # Majority vote among the k nearest training papers
    for name, dis in distances[:k]:
        # SCIgen downloads are named scimakelatex.*, so the file name serves as the label
        if 'scimakelatex' in name: falsePaper += 1
        else: truePaper += 1
    print(fileName, truePaper > falsePaper)

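IV. Summary

This time I used classifiers to judge whether papers are real or fake, building on the two earlier experiments.

The first experiment read through SCIgen's source code to understand how it generates papers (a context-free grammar).

The second focused on collecting samples and preprocessing the data.

Whether the word co-occurrence network or hierarchical clustering is used, the underlying reason both give good results is that SCIgen papers draw on a limited vocabulary and still differ noticeably from natural language.

All code for this post is on GitHub: https://github.com/hrwhisper/scigen-detection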