Last updated: January 15, 2025

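Extract Text from PDF Files Using Python
In this article, we show how to extract text from PDF files using Python.
PDF stands for Portable Document Format, a popular digital document format. It is designed so that documents can be viewed and shared easily and reliably regardless of software, hardware, or operating system. PDF files use the **.pdf** extension.
Two libraries commonly used to extract text from PDF files in Python are pypdf and PyMuPDF. We show how to extract text from a PDF with each of them.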
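How to Extract Text from a PDF File Using pypdf in Python
The steps are as follows.
- Install pypdf
- Run the code provided in this article
- Check the output
Install pypdf
You can install pypdf with the following command: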
pip install pypdf
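After installing, it can help to confirm that the package is visible to the Python interpreter you plan to run the example with. The following is a minimal sketch using only the standard library; the helper name `is_installed` is ours, for illustration, and is not part of pypdf.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(module_name) is not None


# Prints True if pypdf is importable from this interpreter, False otherwise.
print("pypdf installed:", is_installed("pypdf"))
```

If this prints False right after a successful install, pip and python probably belong to different environments.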
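Sample Code to Extract Text from a PDF Using pypdf
sample.pdf - download link (this sample PDF is used in the code, but you can also use your own PDF.)
Screenshot of sample.pdf
Code
The following is a complete code example that extracts text from a PDF using pypdf.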
from pypdf import PdfReader
# Specify the PDF file path
pdf_file_path = "sample.pdf"
# Create a PDF reader object
reader = PdfReader(pdf_file_path)
# Initialize a variable to store the extracted text
extracted_text = ""
# Iterate through the pages and extract text
for page in reader.pages:
    extracted_text += page.extract_text()
# Print the extracted text
print("Extracted Text using pypdf:")
print(extracted_text)
Output
The following is the output of the sample code above.
Extracted Text using pypdf:
This is a sample pdf. Page 1
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the
leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with
the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 2
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 3
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
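The pypdf example above appends each page's text directly to one string. If you would rather keep an explicit separator between pages, one possible variation is sketched below. The function names `join_page_texts` and `extract_text_pypdf` and the `separator` parameter are our own, for illustration; only `PdfReader`, `reader.pages`, and `page.extract_text()` come from pypdf.

```python
def join_page_texts(pages, separator="\n"):
    """Extract text from each page-like object and join the results.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader(...).pages from pypdf.
    """
    # `or ""` is a defensive guard in case a page yields no text at all.
    return separator.join((page.extract_text() or "") for page in pages)


def extract_text_pypdf(pdf_file_path, separator="\n"):
    """Open a PDF with pypdf and return the text of all pages."""
    # Imported here so join_page_texts itself has no pypdf dependency.
    from pypdf import PdfReader

    reader = PdfReader(pdf_file_path)
    return join_page_texts(reader.pages, separator)


# Example usage (assumes sample.pdf is in the working directory):
# print(extract_text_pypdf("sample.pdf"))
```

Building the result with `join` also avoids repeated string concatenation, which may matter for documents with many pages.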
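How to Extract Text from a PDF File Using PyMuPDF in Python
The steps are as follows.
- Install PyMuPDF
- Run the code provided in this article
- Check the output
Install PyMuPDF
Install PyMuPDF with the following command. The library is imported in code under the name fitz.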
pip install pymupdf
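Sample Code to Extract Text from a PDF Using PyMuPDF
We use the same PDF file as before.
sample.pdf - download link (this sample PDF is used in the code, but you can also use your own PDF.)
Code
The following is a complete code example that extracts text from a PDF using PyMuPDF.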
import fitz  # PyMuPDF library
# Specify the PDF file path
pdf_file_path = "sample.pdf"
# Open the PDF file
pdf_document = fitz.open(pdf_file_path)
# Initialize a variable to store the extracted text
extracted_text = ""
# Iterate through all the pages and extract text
for page_number in range(len(pdf_document)):
    page = pdf_document[page_number]  # Get the page
    extracted_text += page.get_text()  # Extract text from the page
# Close the PDF file
pdf_document.close()
# Print the extracted text
print("Extracted Text using PyMuPDF:")
print(extracted_text)
Output
The following is the output of the sample code above.
Extracted Text using PyMuPDF:
This is a sample pdf. Page 1
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the
leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with
the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 2
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 3
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
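The PyMuPDF example above opens the document, loops over page indices, and calls `close()` at the end. A possible alternative, sketched here, relies on two PyMuPDF features: a document can be used in a `with` statement, and iterating over it yields its pages in order. The function names `collect_text` and `extract_text_pymupdf` are ours, for illustration only.

```python
def collect_text(document):
    """Return the concatenated text of every page in an opened document.

    `document` is any iterable of objects with a get_text() method,
    such as the object returned by fitz.open(...).
    """
    return "".join(page.get_text() for page in document)


def extract_text_pymupdf(pdf_file_path):
    """Open a PDF with PyMuPDF and return the text of all pages."""
    # Imported here so collect_text itself has no PyMuPDF dependency.
    import fitz  # PyMuPDF

    # The with statement closes the document even if extraction raises.
    with fitz.open(pdf_file_path) as document:
        return collect_text(document)


# Example usage (assumes sample.pdf is in the working directory):
# print(extract_text_pymupdf("sample.pdf"))
```

The behavior should match the original loop; the main difference is that the file is released automatically, without an explicit `close()` call.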
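Conclusion
In this article, we provided sample Python code, a sample file, and the resulting output to show how to extract text from a PDF using two libraries: pypdf and PyMuPDF.
If you have any questions or run into any problems running the code, feel free to post in our forum!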