If you are a Java developer building document processing applications, you may want to read Word documents and extract their text programmatically. The DOCX4J API lets you read DOCX files and extract their text from within your Java application.
In this article, we show how to use the DOCX4J API to extract text from DOCX files.
How to Read and Extract Text from Word Documents in Java?
Before you start writing code to extract text from a DOCX file using the DOCX4J API, you must have the DOCX4J API configured in your development environment. If you haven’t already installed and configured it, you can have a look at our article on how to install DOCX4J API.
Extract Text from Word Document in Java
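At this stage, we assume that you have set up your development environment and are ready to start using the DOCX4J API for extracting text from Word documents. The following sample code can be used for this purpose. You can copy it into the main method of a console-based Java application. Make sure the DOCX4J classes it uses are imported, and declare main with throws Exception, because loading the document and evaluating the XPath expression both throw checked exceptions.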
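// Imports (place these at the top of your source file)
import java.io.File;
import java.util.List;
import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Text;

// Load document
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File("FileFormat.docx"));
// Get the main document part
MainDocumentPart mainDocumentPart = wordMLPackage.getMainDocumentPart();
// Select all text (w:t) nodes
String textNodesXPath = "//w:t";
List<Object> textNodes = mainDocumentPart.getJAXBNodesViaXPath(textNodesXPath, true);
// Print text
for (Object obj : textNodes) {
    // unwrap() returns the wrapped value if the node is a JAXBElement
    Text text = (Text) XmlUtils.unwrap(obj);
    String textValue = text.getValue();
    System.out.println(textValue);
}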
Let’s have a look at how this code works.
The Word document is loaded using the WordprocessingMLPackage class, and its main content is accessed through the MainDocumentPart class of the DOCX4J API. The XPath expression //w:t then selects every text node in the main document part. The loop goes through the returned list, obtains the Text object for each node (the nodes typically come back wrapped in a JAXBElement), and prints its value.
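To make the //w:t expression more concrete, here is a minimal sketch that evaluates the same XPath against a small, hand-written WordprocessingML fragment using only standard JDK classes. It is not how DOCX4J works internally, and the class name WtXPathSketch, the method extractText, and the sample XML are all hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of DOCX4J.
class WtXPathSketch {

    // Namespace URI that the "w" prefix stands for in WordprocessingML.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the content of every w:t element in the XML, in document order.
    static List<String> extractText(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so that prefixed XPath steps match
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Tell the XPath engine which namespace the "w" prefix refers to.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        NodeList nodes = (NodeList) xpath.evaluate("//w:t", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample: one paragraph (w:p) containing two runs (w:r).
        String xml =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t>Hello, </w:t></w:r>"
            + "<w:r><w:t>World</w:t></w:r>"
            + "</w:p></w:body></w:document>";
        for (String s : extractText(xml)) {
            System.out.println(s);
        }
    }
}
```

The sample paragraph is split into two runs, so the expression matches two separate w:t elements. The same applies to the DOCX4J code above: it prints one line per text node, and a single sentence in the document may be spread over several of them.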
Conclusion
Working with Word documents from within your Java application is easy with the DOCX4J API. The API helps you add document processing functionality to your Java applications without getting into the internal details of the underlying file format. Keep following this blog for more examples of working with the DOCX4J API.