
PDF to TXT (win32com programming)

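I recently needed to batch-convert PDF files to TXT. The conversion tools I tried gave poor results, then I remembered that Word 2013 can open PDF files. I tried it, and the result was quite good.

Word can then save the opened document as TXT.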

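So how do we automate this? Word ships with a COM automation interface, so Python can drive it directly through win32com. It is best to call SaveAs2, which is the API in Word 2010 and 2013. Earlier versions use SaveAs.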
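To make the SaveAs2 / SaveAs distinction concrete, here is a minimal sketch of a helper that prefers SaveAs2 and falls back to SaveAs. The helper name `save_as_text` and the constant `WD_FORMAT_TEXT` are hypothetical names introduced for illustration. It assumes `doc` is a late-bound Word document object (as returned by `Documents.Open`) for which looking up a missing method raises `AttributeError`; I have not verified this on every Word version.

```python
WD_FORMAT_TEXT = 2  # wdFormatText, see the appendix table


def save_as_text(doc, target_path):
    """Save an open Word document as plain text.

    Prefers SaveAs2 (Word 2010 and later) and falls back to SaveAs.
    Returns the name of the method that was used.
    """
    # Assumption: a missing COM method shows up as a missing attribute.
    save = getattr(doc, "SaveAs2", None)
    if save is not None:
        save(target_path, FileFormat=WD_FORMAT_TEXT)
        return "SaveAs2"
    doc.SaveAs(target_path, FileFormat=WD_FORMAT_TEXT)
    return "SaveAs"
```

In the script below, a call such as `save_as_text(doc, fileName)` could stand in for the direct `doc.SaveAs2(...)` call if the same script has to run against older Word installations.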

The full script is as follows:

# -*- coding: utf-8 -*-
# author: hrwhisper
# blog: hrwhisper.me
import os

from win32com import client as wc


class PdfTotxt:
    def __init__(self, pdfPath, savePath):
        self.pdfPath = pdfPath
        self.savePath = savePath
        self.word = wc.Dispatch('Word.Application')
        # Run in the background: no window, no alert dialogs
        self.word.Visible = 0
        self.word.DisplayAlerts = 0

    def startChange(self):
        for path, subdirs, files in os.walk(self.pdfPath):
            for pdfFile in files:
                pdfFullName = os.path.join(path, pdfFile)
                dotIndex = pdfFile.rfind(".")
                fileSuffix = pdfFile[(dotIndex + 1):]

                if fileSuffix == "pdf":
                    try:
                        doc = self.word.Documents.Open(pdfFullName)
                        # Skip documents with fewer than 200 words
                        if doc.Words.Count < 200:
                            doc.Close()
                            continue

                        fileName = pdfFile[:dotIndex] + ".txt"
                        fileName = os.path.join(self.savePath, fileName)
                        print(pdfFullName + " ====> " + fileName)
                        # SaveAs method is used in versions before Word 2007. If you use Office 2010, I suggest you try Document.SaveAs2 Method.
                        # https://social.msdn.microsoft.com/Forums/en-US/a4f00910-cb6e-4861-bf96-97b0cfc6cf8f/convert-word-files-from-doc-to-docx-using-python?forum=worddev
                        doc.SaveAs2(fileName, FileFormat=2)
                        doc.Close()
                    except Exception as e:
                        print('********************ERROR ' + pdfFullName + ' ' + str(e))
        # Quit Word once every file has been processed
        self.word.Quit()


pdfPath = r'J:\realpdf\test'
savePath = r'J:\realpdf\test\out'
task = PdfTotxt(pdfPath, savePath)
task.startChange()
print('ok')

 

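Appendix

Other output formats

Only the FileFormat argument of SaveAs2 needs to change. I save as TXT with FileFormat=2; for filtered HTML use 10, and for standard HTML use 8.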

Name                                 Value  Description
wdFormatDocument                     0      Microsoft Office Word 97-2003 binary file format.
wdFormatDOSText                      4      Microsoft DOS text format.
wdFormatDOSTextLineBreaks            5      Microsoft DOS text with line breaks preserved.
wdFormatEncodedText                  7      Encoded text format.
wdFormatFilteredHTML                 10     Filtered HTML format.
wdFormatFlatXML                      19     Open XML file format saved as a single XML file.
wdFormatFlatXMLMacroEnabled          20     Open XML file format with macros enabled saved as a single XML file.
wdFormatFlatXMLTemplate              21     Open XML template format saved as a single XML file.
wdFormatFlatXMLTemplateMacroEnabled  22     Open XML template format with macros enabled saved as a single XML file.
wdFormatOpenDocumentText             23     OpenDocument Text format.
wdFormatHTML                         8      Standard HTML format.
wdFormatRTF                          6      Rich text format (RTF).
wdFormatStrictOpenXMLDocument        24     Strict Open XML document format.
wdFormatTemplate                     1      Word template format.
wdFormatText                         2      Microsoft Windows text format.
wdFormatTextLineBreaks               3      Windows text format with line breaks preserved.
wdFormatUnicodeText                  7      Unicode text format.
wdFormatWebArchive                   9      Web archive format.
wdFormatXML                          11     Extensible Markup Language (XML) format.
wdFormatDocument97                   0      Microsoft Word 97 document format.
wdFormatDocumentDefault              16     Word default document file format. For Word 2010, this is the DOCX format.
wdFormatPDF                          17     PDF format.
wdFormatTemplate97                   1      Word 97 template format.
wdFormatXMLDocument                  12     XML document format.
wdFormatXMLDocumentMacroEnabled      13     XML document format with macros enabled.
wdFormatXMLTemplate                  14     XML template format.
wdFormatXMLTemplateMacroEnabled      15     XML template format with macros enabled.
wdFormatXPS                          18     XPS format.

For details, see https://msdn.microsoft.com/en-us/library/ff839952.aspx
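As a rough sketch of how the table above could be used to switch output formats, the snippet below maps a short format key to a file extension and its FileFormat value, then builds the output path for a given PDF. The names `FORMATS` and `target_for` are hypothetical and not part of the original script; the numeric values are copied from the table.

```python
import os

# key -> (file extension, FileFormat value from the table above)
FORMATS = {
    "txt": (".txt", 2),     # wdFormatText
    "rtf": (".rtf", 6),     # wdFormatRTF
    "html": (".html", 10),  # wdFormatFilteredHTML
    "docx": (".docx", 16),  # wdFormatDocumentDefault
}


def target_for(pdf_name, save_dir, kind):
    """Return (output path, FileFormat value) for one PDF file."""
    ext, file_format = FORMATS[kind]
    # Drop the directory part and the ".pdf" suffix, keep the base name
    stem = os.path.splitext(os.path.basename(pdf_name))[0]
    return os.path.join(save_dir, stem + ext), file_format
```

With this in place, the save step of the script could read `fileName, fmt = target_for(pdfFile, self.savePath, "html")` followed by `doc.SaveAs2(fileName, FileFormat=fmt)`.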
