Last updated: January 15, 2025

Extract Text from PDF Files Using Python

This article shows how to extract text from PDF files using Python.

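PDF stands for Portable Document Format, a popular format for digital documents. It is designed so that documents can be viewed and shared easily and reliably regardless of software, hardware, or operating system. PDF files use the **.pdf** extension.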
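The extension alone does not guarantee that a file really is a PDF, so a quick check before extraction can save a confusing error later. The sketch below assumes that a well-formed PDF begins with the bytes `%PDF-`. The helper name `looks_like_pdf` is made up for this illustration, and some files that PDF readers tolerate may not pass such a strict check.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # bytes expected at the very start of a PDF file


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # Read only as many bytes as the signature is long.
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

For example, `looks_like_pdf("sample.pdf")` could be called before passing the file to either library.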

The following libraries are commonly used to extract text from PDF files in Python. We will show how to extract text from a PDF with each of them.

  1. pypdf
  2. PyMuPDF

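How to extract text from a PDF file using pypdf in Python

The steps are as follows.

  1. Install pypdf
  2. Run the code provided in this article
  3. Check the output

Install pypdf

You can install pypdf with the following command: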

pip install pypdf
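To confirm that the installation worked, one option is to ask Python for the installed package version. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is chosen here for illustration only.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pypdf")
print(f"pypdf version: {version}" if version else "pypdf is not installed")
```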

Sample code to extract text from a PDF using pypdf

sample.pdf - download link (this sample PDF is used in the code, but you can use your own PDF instead.)

Screenshot of sample.pdf

Code

The following is a complete code example that extracts text from a PDF using pypdf.

from pypdf import PdfReader

# Specify the PDF file path
pdf_file_path = "sample.pdf"

# Create a PDF reader object
reader = PdfReader(pdf_file_path)

# Initialize a variable to store the extracted text
extracted_text = ""

# Iterate through the pages and extract text
for page in reader.pages:
    extracted_text += page.extract_text()

# Print the extracted text
print("Extracted Text using pypdf:")
print(extracted_text)

Output

The following is the output of the sample code above.

Extracted Text using pypdf:
This is a sample pdf. Page 1
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the
leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with
the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 2
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 3
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
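The output above runs the pages together, which makes it hard to tell where one page ends and the next begins. A possible variation, sketched below, keeps each page's text separate and adds a numbered header. The function names `format_pages` and `extract_with_page_headers` and the header format are assumptions made for this example, not part of pypdf.

```python
def format_pages(page_texts):
    """Join per-page text, prefixing each page with a numbered header."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a missing or empty page as an empty string.
        body = (text or "").strip()
        sections.append(f"--- Page {number} ---\n{body}")
    return "\n\n".join(sections)


def extract_with_page_headers(pdf_file_path):
    """Extract text with pypdf and label each page."""
    # Imported here so format_pages can be reused without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_file_path)
    return format_pages(page.extract_text() for page in reader.pages)


# Demonstration with plain strings, so it runs without a PDF file.
print(format_pages(["first page", "second page"]))
```

With a real file, `print(extract_with_page_headers("sample.pdf"))` would produce one labeled section per page.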

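How to extract text from a PDF file using PyMuPDF in Python

The steps are as follows.

  1. Install PyMuPDF
  2. Run the code provided in this article
  3. Check the output

Install PyMuPDF

Install PyMuPDF, which is imported in code under the name fitz, with the following command: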

pip install pymupdf

Sample code to extract text from a PDF using PyMuPDF

We use the same sample.pdf file as in the pypdf example.

Code

The following is a complete code example that extracts text from a PDF using PyMuPDF.

import fitz  # PyMuPDF library

# Specify the PDF file path
pdf_file_path = "sample.pdf"

# Open the PDF file
pdf_document = fitz.open(pdf_file_path)

# Initialize a variable to store the extracted text
extracted_text = ""

# Iterate through all the pages and extract text
for page_number in range(len(pdf_document)):
    page = pdf_document[page_number]  # Get the page
    extracted_text += page.get_text()  # Extract text from the page

# Close the PDF file
pdf_document.close()

# Print the extracted text
print("Extracted Text using PyMuPDF:")
print(extracted_text)

Output

The following is the output of the sample code above.

Extracted Text using PyMuPDF:
This is a sample pdf. Page 1
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the
leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with
the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 2
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
This is sample pdf. Page 3
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been
the
industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type
and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the
release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing
software like Aldus PageMaker including versions of Lorem Ipsum.
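In the PyMuPDF output above, some lines break in the middle of a sentence (for example, "been" and "the" end up on separate lines). If flowing paragraphs are preferred, the extracted text can be post-processed. The sketch below is one simple approach in plain Python; the helper name `unwrap_lines` is an assumption for this example. It treats only blank lines as paragraph boundaries, so a heading that sits on its own line without a blank line after it is merged into the following text.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    # Split on blank lines, which are kept as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse every run of whitespace to one space.
    return "\n\n".join(" ".join(part.split()) for part in paragraphs)


wrapped = "Lorem Ipsum has been\nthe\nindustry's standard dummy text"
print(unwrap_lines(wrapped))
# prints: Lorem Ipsum has been the industry's standard dummy text
```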

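Conclusion

In this article, we provided sample Python code, a sample file, and the resulting output to show how to extract text from a PDF using two libraries: pypdf and PyMuPDF.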
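Since both libraries follow the same open, iterate, and extract pattern, the two examples can be wrapped behind a single function. The following is a minimal sketch under that assumption; the function name `extract_text` and the `backend` parameter are invented for this illustration and are not part of either library.

```python
def extract_text(pdf_file_path, backend="pypdf"):
    """Extract all text from a PDF with the chosen backend."""
    if backend == "pypdf":
        from pypdf import PdfReader  # requires: pip install pypdf

        reader = PdfReader(pdf_file_path)
        return "".join(page.extract_text() for page in reader.pages)

    if backend == "pymupdf":
        import fitz  # requires: pip install pymupdf

        # The context manager closes the document even if extraction fails.
        with fitz.open(pdf_file_path) as pdf_document:
            return "".join(page.get_text() for page in pdf_document)

    raise ValueError(f"Unknown backend: {backend!r}")
```

For example, `extract_text("sample.pdf", backend="pymupdf")` would return the same text as the PyMuPDF example, and an unsupported backend name raises `ValueError` before any file is opened.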

If you have any questions or run into any problems while running the code, feel free to post in our forum!

See also