Three Open-Source .NET APIs for Word Processing Documents

To automate the manipulation of documents within our applications, we need reliable APIs. The market offers both Open Source Software (OSS) and Closed Source Software (CSS) for working with word processing documents. Closed source APIs are often costly, but there are several free APIs with both basic and advanced features. A few of them are covered below.

Getting Started with Free APIs

Let’s get started with the installation and basic usage of APIs.


Open XML SDK requires .NET Framework 3.5 or above. You can install the library from NuGet using the following command.

Install-Package DocumentFormat.OpenXml

Once the installation is complete, you can open an existing DOCX document and append a paragraph to it using the following code.

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Open an existing word processing document for editing
using (WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open("fileformat.docx", true))
{
    Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
    // Add a paragraph containing a single run of text
    Paragraph para = body.AppendChild(new Paragraph());
    Run run = para.AppendChild(new Run());
    run.AppendChild(new Text("File Format Developer Guide"));
}

For details please visit this link.


NPOI is a .NET version of the POI Java project. Just like Open XML SDK, you can install it using NuGet.

Install-Package NPOI -Version 2.4.1

Creating a document with NPOI is just as simple. You can create a DOCX file using a few lines of code.

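// Requires: using System.IO; using NPOI.XWPF.UserModel;
XWPFDocument doc = new XWPFDocument();
using (FileStream sw = File.Create("fileformat.docx"))
    doc.Write(sw); // write the (still empty) document to the file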

For details please visit this link.


Using DocX you can manipulate Word 2007/2010/2013 files easily. To get started with DocX, install it using the following command.

Install-Package DocX -Version 1.5.0

As with Open XML SDK and NPOI, creating a document with DocX is simple.

using (DocX document = DocX.Create("fileformat.docx"))
{
    // Add a new Paragraph to the document.
    Paragraph paragraph = document.InsertParagraph();
    // Append some text.
    paragraph.Append("File Format Developer Guide").Font("Arial Black");
    // Save the document.
    document.Save();
}

For details please visit this link.

Posted in File Formats

Create a Word Document using PHPWord

PHPWord is a powerful open-source API, written in PHP to create and read file-formats including DOC, DOCX, ODT, RTF, HTML, and PDF. Using the API you can create a document, set document properties, insert images, insert charts and more. Let’s get started with creating a simple DOCX file using PHPWord.


To create a Word document using PHPWord, you need the following resources installed on your system:

composer require zendframework/zend-escaper
composer require zendframework/zend-stdlib

How to Install PHPWord

Once your prerequisites are ready, you can install PHPWord using a simple Composer command:

composer require phpoffice/phpword

Create a Word Document using PHP

Creating a Word document is simple. You create a new document by instantiating the PhpWord class, add a section using the addSection() method, and add text to it using the addText() method. The following code snippet creates a simple Word document.

require_once 'vendor/phpoffice/phpword/bootstrap.php';

// Create the new document
$phpWord = new \PhpOffice\PhpWord\PhpWord();

// Add an empty Section to the document
$section = $phpWord->addSection();
// Add Text element to the Section
$section->addText(
    'File Format Developer Guide - '
    . 'Learn about computer files that you come across in '
    . 'your daily work at:'
);
// Save document (the output file name used here is illustrative)
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$objWriter->save('fileformat.docx');

The following is the output document:

Posted in File Formats, Word Processing

Getting Started with Apache POI – Java API for Documents

Often, we need to automate our processes and manipulate documents programmatically. We need to create documents in bulk, then read, process, and save the resulting documents, frequently across several file formats at once. Luckily, Java developers have an open-source API for working with Word, spreadsheet, presentation, email, and diagram file formats: Apache POI. This cross-platform API is designed to work with Java Virtual Machine (JVM) based languages.

How to Install

Installing Apache POI is straightforward. All you need to do is add the dependency to your Maven-based project. Add the following dependency to your pom.xml to get started with Apache POI.

<!-- -->

Create a Word Document

Using Apache POI you can create a word document using XWPFDocument and insert a paragraph in it using the XWPFParagraph class. The following code snippet shows how to create a Word document using the API.

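import java.io.File;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// initialize a blank document
XWPFDocument document = new XWPFDocument();
// create a new paragraph and add a run of text to it
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText("File Format Developer Guide -  " +
            "Learn about computer files that you come across in " +
            "your daily work at: ");
// create a new file and write the document to it
try (FileOutputStream out = new FileOutputStream(new File("createdocument.docx"))) {
    document.write(out);
}
System.out.println("Document created successfully");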

The following is the resultant output document:

Posted in File Formats

Difference Between XLS and XLSX

XLS and XLSX extensions represent popular Excel file formats that Microsoft introduced as part of its Office suite over time. XLS, the older and still widely used file type, is also known as the Excel 97-2003 file format. The XLSX file format was introduced as a replacement for XLS with the launch of Excel 2007. Many users may not know the underlying differences between the two file formats, but XLS differs from XLSX in several ways, as detailed below.


So what is it that is actually different between XLS and XLSX? Following is a list of differences between the XLS and XLSX file formats.

The File Format Difference

The underlying file format is what makes the main difference between the XLS and XLSX files.

XLS files are based on the Binary Interchange File Format (BIFF) and store information in binary format as per XLS File Format Specifications.  Data is arranged in an XLS file as binary streams in the form of a compound file as described in [MS-XLS].

In contrast, an XLSX file is based on the Office Open XML format, which stores data as XML files compressed in a ZIP archive. The underlying structure and files can be examined by simply unzipping the .xlsx file: when a sample XLSX file is renamed to .zip and extracted, its contents appear as an ordinary folder of files.
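
As a rough illustration of the point above, the following Java sketch lists the parts inside a ZIP-based package using only the standard java.util.zip classes. The class name XlsxPartLister and the in-memory stand-in package are assumptions made for this example; the stand-in only mimics entry names commonly seen in an .xlsx file and is not a valid workbook.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class XlsxPartLister {

    // Returns the names of all entries in a ZIP-based package such as an .xlsx file.
    static List<String> listParts(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Builds a tiny stand-in package in memory so the sketch runs without a real file.
    // The entry names are typical examples only; the content is a placeholder.
    static byte[] samplePackage() throws IOException {
        String[] names = {"[Content_Types].xml", "xl/workbook.xml", "xl/worksheets/sheet1.xml"};
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String part : listParts(new ByteArrayInputStream(samplePackage()))) {
            System.out.println(part);
        }
    }
}
```

To inspect a real file, a FileInputStream opened on the .xlsx file can be passed to listParts in place of the in-memory sample.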

Support for Macros

XLS files, being the older format, support macros, which are programs written by end users to automate tasks such as opening files or comparing data. Macros help users automate tasks, but they can also be risky because they can run as soon as you open an Excel file.

In contrast, XLSX files do not support Macros. If you need to embed and execute Macros, you will have to save your file as XLSM which is an Excel Open XML Macro-Enabled spreadsheet file format.
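
One way to see this difference in practice is to look for the VBA project part inside the package. The Java sketch below assumes the conventional part name xl/vbaProject.bin, which Excel normally uses for macro-enabled workbooks; the class and method names are hypothetical, and the check is only a heuristic because the name is a convention rather than a guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class MacroPartCheck {

    // Conventional (assumed) part name for the VBA project in macro-enabled workbooks.
    static final String VBA_PART = "xl/vbaProject.bin";

    // Returns true if the ZIP package contains an entry with the conventional VBA part name.
    static boolean hasVbaProject(byte[] packageBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (VBA_PART.equals(e.getName())) {
                    return true;
                }
            }
        }
        return false;
    }

    // Builds a stand-in package with the given entry names (not a valid workbook).
    static byte[] fakePackage(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(0); // single placeholder byte
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasVbaProject(fakePackage("xl/workbook.xml")));           // false
        System.out.println(hasVbaProject(fakePackage("xl/workbook.xml", VBA_PART))); // true
    }
}
```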

Excel Supportability

XLS files can be opened with all versions of Excel due to backward compatibility. However, XLSX files can be opened only with Excel 2007 and later versions.

Have any further queries about the internal details of XLS or XLSX file formats? You can get in touch with file format experts over the file format forum to have guidance for your questions.

Posted in Spreadsheet

Excel File Formats: XLSX, XLSM, XLS, XLTX, XLTM

A file with an XLSX, XLSM, XLS, XLTX or XLTM extension is a Microsoft Excel file that uses a specific standard file format. You can show file extensions on Windows from Folder Options. MS Excel lets you save files in any of these file formats using the Save As option. These Excel file formats serve different purposes when working with spreadsheet files, as explained in this article.

In addition to standard file formats, Excel indirectly uses other file formats for a set of different operations. For example, it uses Windows Metafile Format (WMF) or Windows Enhanced Metafile Format (EMF) when a Windows metafile picture is copied and pasted into an Excel worksheet.


What is XLSX file?

An XLSX file is the default file format for Microsoft Excel that was introduced with Office 2007. It is based on Office Open XML standard that can be opened by a number of applications as well as APIs. The contents inside an XLSX file can be viewed by renaming XLSX extension to ZIP and opening it with any archiving software.

What is XLS file?

An XLS file is a spreadsheet file that is created in Excel Binary Interchange File Format (BIFF) and is proprietary to Microsoft. It can be created with Excel 2003 and earlier versions. An XLS file can be opened in the latest version of Microsoft Excel and can be saved as the latest version of spreadsheet file format i.e. XLSX. Microsoft Excel viewer provides the capability to open these files in read-only mode for reading purpose.

What is XLSM file?

An XLSM file is a macro-enabled spreadsheet file that can store instructions recording steps that are performed repeatedly. Macros are programmed in Microsoft Visual Basic for Applications (VBA) from within the Excel workbook. The Visual Basic editor is used to record and run macros in Excel.

XLSM files are similar to XLM file formats but are based on the Open XML format introduced in Microsoft Office 2007. In other words, XLSM are XLSX files but with support of macros. By default, Excel itself provides several macros for common use. However, you can also record your own macros with required functions.

What is XLTX file?

An XLTX file is an Excel template file that preserves user-defined settings. Excel 2007 and above can open XLTX files to create new XLSX files that retain the settings from the template. The XLTX file format is based on the Office Open XML standard, and its contents can be viewed by renaming its extension to ZIP. Excel also comes with predefined templates that can be opened and populated with spreadsheet data.

What is XLTM File?

An XLTM file is a macro-enabled template file created with Microsoft Excel. These files are similar to XLTX but add support for macros. Such template files are used to set the layout, formatting, and other settings, along with the macros, to make it easier to create similar spreadsheet files afterwards.

Posted in Spreadsheet

Markup Language File Formats – A Survey

A markup language is a computer language that separates the elements of a document by tags. It is stored in a human-readable format and can be opened with almost any text editor. Because elements are defined by tags, such a file can define a wide range of elements. These tags don’t have anything to do with the graphical representation of the data, nor are they used to specify user-defined settings such as fonts, dimensions, etc.

There are quite a number of markup languages available for use these days. Some of these are discussed here for general awareness.


HTML – Hypertext Markup Language

HTML (Hypertext Markup Language) is the format for web pages created for display in browsers. Known as the language of the web, HTML has evolved as new kinds of information needed to be displayed in web pages. The latest variant is known as HTML5, which gives a lot of flexibility for working with the language. HTML pages are either received from the server where they are hosted or loaded from the local system. Each HTML page is made up of HTML elements such as forms, text, images, animations, links, etc. These elements are represented by tags such as <img>, <a>, <p> and several others, where most tags have a start and an end. A page can also embed scripts written in languages such as JavaScript, as well as style sheets (CSS) for overall layout and presentation.

XML – Extensible Markup Language

XML stands for Extensible Markup Language that is similar to HTML but different in using tags for defining objects. The whole idea behind creation of XML file format was to store and transport data without being dependent on software or hardware tools. Its popularity is due to it being both human as well as machine readable. This enables it to create common data protocols in the form of objects to be stored and shared over network such as World Wide Web (WWW). The “X” in XML is for extensible which implies that the language can be extended to any number of symbols as per user requirements. It is for these features that many standard file formats make use of it such as Microsoft Open XML, LibreOffice OpenDocument, XHTML and SVG.
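
To make the idea of extensibility concrete, here is a small Java sketch that parses a document whose tag names (library, book, title) are invented for this example and are not part of any standard vocabulary. It uses the DOM parser that ships with the JDK; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlExtensibilityDemo {

    // A document using made-up tags: XML lets the author define the vocabulary.
    static final String SAMPLE =
            "<library><book isbn=\"123\"><title>File Formats</title></book></library>";

    // Parses the XML text and returns the text of the first <title> element.
    static String firstTitle(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        return title.getTextContent();
    }

    // Returns the value of the isbn attribute on the first <book> element.
    static String firstIsbn(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element book = (Element) doc.getElementsByTagName("book").item(0);
        return book.getAttribute("isbn");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstTitle(SAMPLE)); // File Formats
        System.out.println(firstIsbn(SAMPLE));  // 123
    }
}
```

The parser needs no prior knowledge of the tag names, which is what allows the same tooling to read any XML-based format.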

XHTML – Extensible HyperText Markup Language

XHTML is a text-based file format with markup in XML, using a reformulation of HTML 4.0. These files are well suited to being opened or viewed in a web browser. XHTML was designed to be more structured, more generic, less reliant on scripting, and more device independent, while using all the existing facilities of XML. XHTML provides a generally useful set of elements and attributes, with extension options in combination with style sheets. The attributes are used from the metadata attributes collection. XHTML provides flexibility and accessibility by subordinating all HTML presentation elements to style sheets. Style sheets are more versatile than these presentational elements. Specifications for HTML 4.01, HTML5 and XHTML are being dynamically developed by the World Wide Web Consortium (W3C).

XAML – Extensible Application Markup Language

XAML (Extensible Application Markup Language) files describe the user interface elements for software applications based on Windows Presentation Foundation (WPF). Though it is a language, it doesn’t need to be programmed in the usual sense, as it is based on the standard XML format, which is easy to use and understand. XAML (pronounced “zammel”) was developed by Microsoft with the specific aim of creating user interfaces. Its acronym originally stood for Extensible Avalon Markup Language, where Avalon was the code name for WPF. XAML files are sometimes saved with the XOML extension as well.

A few other markup formats include MHTML, HTM and XOML, which build on the base markup languages discussed above. The choice of markup language depends on the purpose. If the content is meant for display, HTML, MHTML and HTM are used. However, if data description is the need, markup languages such as XML and those based on XML are used.

Posted in Web

EPUB vs PDF: E-Publishing File Formats

With the increase in usage of smart devices, digital documents are replacing printed copies. The ease of reading content on your smartphone or tablet frees you from carrying hard copies everywhere. Several digital reading formats are available, with eBooks taking an important role. PDF and EPub are two of the most popular eBook file formats and are widely used for reading digital content.

In this article, we’ll try to present a brief overview of both these types and then present some comparisons from several different perspectives.


PDF (Portable Document Format) is a well-known and widely used standard for representing digital documents. Adobe introduced PDF in 1993, and a series of standardization efforts followed, leading to a family of PDF standards including PDF/A, PDF/E, PDF/UA, PDF/VT and PDF/X. PDF is, in effect, a digital representation of a paper document with a fixed layout. Having a PDF is like holding a printed copy of the document via a screen.


EPub (short for electronic publication) files are digital representations of documents designed with reading on mobile devices in mind. Compared to PDF, EPub files are flexible because their content is reflowable, and they are considered the primary choice for creating ebooks. The format adjusts the document layout according to the device screen, making it more convenient for reading.

EPub vs PDF

The Commonalities

The choice of EPub vs PDF depends on a number of factors. Since both formats are used for the digital representation of documents, it is important to understand both their differences and their commonalities before opting for one. The two formats have the following in common:

  • Multiplatform Support: Both formats are readable on multiple platforms and can be opened with a variety of readers.
  • Security: PDF offers you security of content by applying a password on the file so that it can’t be opened without password. EPub provides content security via digital rights management (DRM) that protects the work from reproduction.

The Differences

With commonalities come the differences that give priority to one format over the other. Following are the differences between these two types.

  • Rich Media: Though widely in use, PDF doesn’t support rich interactive media such as video and audio. In contrast, EPub supports embedding video and audio links that make the content rich with these media types.
  • Editability: PDFs can be edited using publicly available applications as well as APIs. EPub files are generally read-only and can’t be edited.
  • Reading Experience: EPubs are reflowable as compared to PDF which makes them the obvious choice of readability on mobile devices and tablets. The auto adjustment of contents to fit the screen and around the images makes it the choice of reading on smart devices. In contrast, a PDF file is a fixed layout file format that constantly requires you to zoom, pinch and scroll for readability. However, if the relationship of text to image is essential (like in a children’s story book), PDF dominates.
  • Developer Perspective: From application developer’s perspective, EPub is more flexible than PDF. Based on standard XML and XHTML languages, EPub is easy to use with most types of software. In contrast, PDF is based on strict conforming rules and developers find it hard to write applications for writing PDF files.
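
From the developer’s perspective described above, the two formats can also be told apart by how they begin. The Java sketch below is a minimal heuristic, assuming that a PDF file starts with the bytes "%PDF-" and that an EPub is a ZIP container whose first entry is named mimetype and contains application/epub+zip. The class and method names are hypothetical, and InputStream.readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class EbookSniffer {

    // A PDF file is expected to begin with the ASCII marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // An EPub is expected to be a ZIP whose first entry is "mimetype"
    // holding the text "application/epub+zip".
    static boolean looksLikeEpub(byte[] data) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry first = zip.getNextEntry(); // null when the data is not a ZIP
            if (first == null || !"mimetype".equals(first.getName())) {
                return false;
            }
            String type = new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
            return "application/epub+zip".equals(type);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfStart = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));  // true
        System.out.println(looksLikeEpub(pdfStart)); // false
    }
}
```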

EPub or PDF: Which format to use?

The choice of EPub or PDF depends on user requirements. If the purpose is to write and publish books, EPub is the obvious choice. However, if your business requires content that needs to be printed, PDF should be preferred.

Posted in EBook, PDF

Doc to Docx – A change worth considering to switch!

In the latest Microsoft Word versions, the default file format for saving a document is DOCX. As time moves on, upcoming generations working in the technology domain won’t even know how the DOCX format replaced the DOC file format, which was the default format for Word 2003 and before. By moving from DOC to DOCX, Microsoft fulfilled its promise of open file format standards, long demanded by companies providing support for Word documents.


Those who don’t know the technical details may ask whether it is really worth changing from DOC to DOCX. The answer is yes! Microsoft had supported DOC files since the beginning, and new features were added from time to time. However, the DOC file format’s limitations had a large impact on how quickly new features could be introduced.

Older Office file formats such as DOC and XLS were stored to disc as binary data and that is why the speed of storing and loading such files was quick. However, the binary file formats had their own limitations due to which it was becoming difficult to manage these with the passage of time. A short comparison of DOC vs DOCX below shows the need of switching from older file format to the new one.

  • DOC file format stores data to disc in binary format that is quicker but results in large file size. DOCX, on the other hand, is based on Office Open XML standards and provides a structured file format that is based on XML and encapsulated in ZIP archive, resulting in small file size.
  • The binary structure of the DOC file format had to retain its interfaces with every new version released in order to avoid crashes. The DOCX file format, based on XML, avoids this by having a well-structured and organized layout that understands the older formats and supports backward compatibility, which was otherwise difficult and tedious with the DOC file format.
  • Being binary in nature, managing Object Linking and Embedding was subject to backward incompatibility if embedded object such as XLS chart was of different version than the supported one, resulting in conversion issues. DOCX, on the other hand, can support both backward and forward compatibility due to its XML structure and conversion issues due to version difference can be easily handled.
  • Older formats such as DOC and XLS are prone to the attacks of malware due to binary nature of their file structure, resulting in becoming a source of spreading virus. This is not the case with DOCX as malicious binary code can not be injected inside the documents. 
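
The binary versus ZIP distinction in the comparison above is visible in the first few bytes of a file. The following Java sketch is a simple illustration, assuming the well-known signatures of the compound binary container used by DOC and XLS (D0 CF 11 E0 A1 B1 1A E1) and of a ZIP local file header (50 4B 03 04). The class name and the returned labels are made up for this example.

```java
import java.util.Arrays;

class OfficeContainerSniffer {

    // Signature of the compound binary container used by legacy DOC and XLS files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header, as found at the start of DOCX and XLSX files.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Returns true if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classifies the container from the leading bytes of a file.
    static String containerOf(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "binary";
        }
        if (startsWith(header, ZIP)) {
            return "zip";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(containerOf(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00})); // zip
        System.out.println(containerOf(new byte[] {0x00, 0x01}));                         // unknown
    }
}
```

A signature check only identifies the container, not the document type: a ZIP signature is shared by DOCX, XLSX and ordinary archives alike.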

How to Open DOCX on old Microsoft Word Versions

Microsoft Word 2003 or before can not open DOCX files. However, Microsoft provides a compatibility pack that can be installed and used to open the DOCX file format on older versions of Microsoft Word. In addition, there are free online converters available that can help convert files from DOCX to DOC file format. 

Posted in Word Processing

Survey: Image File Formats for Web

The importance of images can easily be gauged from the famous quote that says “An image is worth a thousand words”. The presence of images on a webpage plays an important role in attracting visitors by giving an idea about the contents of the page. It won’t be wrong to say that the contents of a page go hand in glove with images to give a clear idea of what it is all about, which is why several image file formats have been introduced over time.

Image File Formats

When we talk about digital images, we come across a variety of image types in our daily routine, such as the well-known BMP, PNG, GIF, JPG, SVG, TIFF, WebP and several others. The use of a particular image type in web pages can affect page performance, such as loading time, which is considered one of the important factors in a page’s ranking.

The prime competitors for usage on the web include PNG, GIF, SVG, and JPG, which have been around for decades now. A recent survey by web technologies shows that the lion’s share of web usage is held by the PNG and JPEG image file formats.

Percentage of websites using various image file formats

Let’s have a look at some of the most popular image formats, their applications and usage worldwide.


GIF (Graphics Interchange Format) was introduced in 1987 and uses lossless compression to retain image quality. GIF allows up to 8 bits per pixel, so up to 256 colours can be used across the image. GIFs also support animation, a characteristic that sets the format apart from most other image file formats. An animated GIF combines numerous images or frames into a single file and displays them in sequence to produce an animated clip or a short video. The limit of 256 colours applies to each frame, which makes GIF poorly suited to reproducing photographs and other images with colour gradients.


PNG (Portable Network Graphics) is a widely used image file format that was created in 1995 to replace GIF. PNG uses lossless compression and does not support animations. It is supported on almost all operating systems by now. PNG gives you flexibility in working with complex images and supports up to 16 million colors, which is one of the reasons behind its comparatively large file size. Some advantages that make PNG superior to GIF include:


JPEG (Joint Photographic Experts Group) was introduced to reduce image file size by using lossy compression techniques. The output image is therefore a trade-off between storage size and image quality. JPG is the obvious choice where storage is the main concern or where speed is required over slow networks. Users can adjust the compression level to achieve the desired balance of quality and file size. JPG, however, doesn’t support transparency or animation, and can’t be used on the web where such features are required. The format has been the choice for storing and transmitting photographic images on the web. shares the details of JPEG file format specifications.


SVG (Scalable Vector Graphics) files use an XML-based text format for describing the appearance of an image. It is one of the most used formats for building website and print graphics where scalability is needed. SVG achieves scalability through the mathematically declared shapes and curves that it uses for drawing images, which is also why SVG is independent of resolution.

SVG file size is large as compared to GIF and PNG as it lies in the category of lossless image compression file formats. SVG files can be viewed/opened in almost all modern browsers including Chrome, Internet Explorer, Firefox, and Safari. Brief description of SVG file format can be found as detailed by


WebP is a modern raster web image file format that supports both lossless and lossy compression. The format focuses on keeping image quality while reducing image size for a faster web experience. As per Google, WebP lossless images are 26% smaller in size compared to PNGs, while WebP lossy images are 25-34% smaller than comparable JPEG images.

WebP is comparatively new file format and is supported on Chrome and Opera browser. It will take some time for this new file format to be commonly used across the web.
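
Since file extensions can be misleading, the raster formats described above are often recognised by their leading bytes instead. The Java sketch below assumes the commonly documented signatures: "GIF87a" or "GIF89a" for GIF, 89 50 4E 47 0D 0A 1A 0A for PNG, FF D8 FF for JPEG, and "RIFF" followed at offset 8 by "WEBP" for WebP. SVG is plain XML text with no fixed signature, so it is reported as unknown here. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ImageSniffer {

    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    // Returns true if data contains the expected bytes starting at the given offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (data.length < offset + expected.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(data, offset, offset + expected.length), expected);
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Guesses the image format from the first bytes of a file.
    static String detect(byte[] data) {
        if (matchesAt(data, 0, ascii("GIF87a")) || matchesAt(data, 0, ascii("GIF89a"))) {
            return "GIF";
        }
        if (matchesAt(data, 0, PNG)) {
            return "PNG";
        }
        if (matchesAt(data, 0, JPEG)) {
            return "JPEG";
        }
        if (matchesAt(data, 0, ascii("RIFF")) && matchesAt(data, 8, ascii("WEBP"))) {
            return "WebP";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detect(ascii("GIF89a")));        // GIF
        System.out.println(detect(ascii("<svg></svg>")));   // unknown
    }
}
```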

Usage on Web

As mentioned earlier, the choice of image type on the web depends on the requirements. If the page needs to present content in animated form, GIF should be used. JPEG is the obvious choice if file size restrictions are a consideration. PNG helps when more detailed, high-quality images are required. SVGs are scalable and can be used if file size isn’t a concern.

The latest file format introduced by Google, WebP, is the obvious choice of use over the web once it is commonly used. An important factor considered, while using selected image file format on web, is file size which affects website loading time and plays an important role to improve SEO.

Posted in Image

PDF File Formats at

Portable Document Format (PDF) is a widely used page layout file format that is gaining popularity day by day. Introduced in the early 1990s by Adobe, it stores a complete document in one file. The PDF file format was initially used for desktop publishing of documents such as posters, flyers, and other similar types of files for physical printing. With the passage of time, Adobe not only introduced a free PDF reader, but also enhanced the format to be lightweight and compatible enough to become the file standard for fixed documents. is your one stop for guidance about notes taking file formats. Its unique combination of file format wiki, news and support forums gives you the opportunity to get knowledge about file formats and engage in fruitful discussions with file format community.

PDF Standards

The PDF file format category at includes file format standards introduced with the passage of time. These PDF standards were created in accordance with the industrial needs and have certain limitations and restrictions to fulfil specific requirements.


PDF/A is an ISO standard format for archiving electronic documents in PDF format. It came into being primarily to meet the requirements of long-term archiving. The standard ensures that archived files can be opened even after a long time by imposing certain limitations on a document’s integral parts to achieve conformance. The format is now widely adopted across all industries. PDF/A viewers, like Adobe Acrobat Reader, ensure that files saved in this format can be opened in the future in accordance with the information specified by the standard.


The “E” in PDF/E stands for Engineering. PDF/E was published as ISO 24517 in 2008 as a standard for creating PDF based Engineering documents to be used in a variety of application areas. Key areas making use of PDF/E file format include geospatial, construction and manufacturing workflows. The PDF/E standard provides a mechanism for the exchange and archiving of engineering documents based on the PDF format. PDF/E comes with the support of interactive media, including animations and 3D engineering model data.


PDF/VT, published as ISO 16612-2 in August 2010, is designed to enable variable data printing (VDP) in a variety of environments. The standard takes variable data and transactional printing as its basis. Variable data printing is used where part of the information is different for each recipient of the content. Transactional printing includes invoices, statements and other documents that combine billing information with marketing information. This results in a mix of improved processing of images, text and other content types. PDF/VT enables reliable and dynamic management of pages for High Volume Transactional Output (HVTO) print data by using the document part metadata (DPM) concept. PDF/VT files can be opened in Adobe Acrobat viewer without the need of adding any other component.


PDF/X is an ISO 15930 standard, published in 2001, that defines a subset of PDF functionality. The standard was devised to meet the specific and diverse requirements of the printing and publishing industries. PDF/X requires conforming files to be complete, i.e. self-contained, which means that elements such as the fonts used in a page must be part of the document. Content such as 3D or video cannot be part of a PDF/X document. The information contained in a PDF/X document is required to be accurate.

See Also

  • File Format News – Your one stop for all the news related to file formats from around the world
  • File Format Forums – Post your queries in file format forums to get useful information from file format experts and community users
  • File Format Wiki – Explore file format categories for information about various file formats
Posted in PDF