2 Perfect solutions to extract text from Docx word document

Problem: Extract the text content from a Word document.

I wanted to extract text from a docx Word document. I tried several approaches while looking for a simple and convenient way to do it, and ended up with two that work, so I have decided to share both in this tutorial.

With this article I want to share those two ways so that other developers can get this extraction job done easily.

As you know, Word documents are always complex to work with from the backend.
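Part of that complexity is that a .docx file is not a single stream of text but a ZIP archive of XML parts, with the main body stored in word/document.xml. As a rough illustration only (it is not part of either solution), the sketch below lists the entries of such an archive with PHP's ZipArchive class. The function name listDocxEntries is a hypothetical helper, and the sketch assumes the zip extension is enabled.

```php
<?php

// Hypothetical helper: returns the names of all entries inside a .docx (ZIP) file.
function listDocxEntries(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return []; // not a readable ZIP archive
    }

    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();

    return $names;
}

// Usage, assuming test.docx exists next to this script:
// print_r(listDocxEntries('test.docx'));
```

Both solutions in this article read only the word/document.xml entry from that list.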

Solution 1: Extract Document using PHP (Docx to Text)

The first solution is really simple with PHP, and I find it very useful because it keeps the document's structure, e.g. paragraphs and new lines.

To implement this solution, all you have to do is create a new PHP file containing the following class. Then, to extract a document, create a new object of the class with the document path and call the convertToText method.

DocxToTextConversion.php:

<?php

class DocxToTextConversion
{
    private $document;

    public function __construct($DocxFilePath)
    {
        $this->document = $DocxFilePath;
    }

    public function convertToText()
    {
        if (empty($this->document) || !file_exists($this->document)) {
            return 'File does not exist.';
        }

        // Compare the extension case-insensitively; files without one are rejected.
        $fileInformation = pathinfo($this->document);
        $extension = strtolower($fileInformation['extension'] ?? '');
        if ($extension == 'doc') {
            return $this->extract_doc();
        } elseif ($extension == 'docx') {
            return $this->extract_docx();
        }

        return 'Invalid File Type, please use doc or docx word document file.';
    }

    private function extract_doc()
    {
        // Read the whole binary .doc file, then split it on carriage returns.
        $fileHandle = fopen($this->document, 'r');
        $allLines = @fread($fileHandle, filesize($this->document));
        fclose($fileHandle);
        $lines = explode(chr(0x0D), $allLines);
        $document_content = '';
        foreach ($lines as $line) {
            // Keep only non-empty chunks that contain no NUL bytes (likely plain text).
            if (strpos($line, chr(0x00)) === false && strlen($line) > 0) {
                $document_content .= $line . ' ';
            }
        }
        $document_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", '', $document_content);
        return $document_content;
    }

    private function extract_docx()
    {
        // A .docx file is a ZIP archive; the body text lives in word/document.xml.
        // ZipArchive replaces the zip_open() family, which is deprecated since PHP 8.0.
        $zip = new ZipArchive();

        if ($zip->open($this->document) !== true) {
            return false;
        }

        $content = $zip->getFromName('word/document.xml');
        $zip->close();

        if ($content === false) {
            return false;
        }

        // Separate table cells with a space and end every paragraph with a line break.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
        $content = str_replace('</w:p>', "\r\n", $content);

        return strip_tags($content);
    }
}

A sample of using the DocxToTextConversion class to extract a document:

<?php

include "DocxToTextConversion.php";

$converter = new DocxToTextConversion('test.docx');

echo $converter->convertToText();
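Note that the extracted text marks paragraphs with plain line breaks, which a browser collapses when the result is echoed into an HTML page. A minimal sketch of one way to keep them visible, assuming the output is rendered as HTML; the helper name textToHtml is hypothetical and not part of the class above:

```php
<?php

// Hypothetical helper: escapes plain text and turns its line breaks into <br /> tags.
function textToHtml(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

// Usage with the converter from the example above:
// echo textToHtml($converter->convertToText());
```

When the script runs from the command line or the text is written to a file, the line breaks are preserved as they are and no such step is needed.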

Solution 2: Using unzip package

It is a simple and quick solution if you only want the content of the document, without considering formatting or line breaks.

To implement this solution you will need to have the unzip package installed on your server.

You can test this from the command line as shown in the example below:

$ unzip -p test.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

Implementing the above solution with PHP:

<?php

// shell_exec() returns the complete output; exec() would return only its last line.
echo shell_exec("unzip -p test.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");
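When the file name comes from a variable or from user input, it should be quoted before it reaches the shell. The following is a minimal sketch of that idea using escapeshellarg and shell_exec (shell_exec returns the complete output, whereas exec returns only its last line). buildExtractCommand is a hypothetical helper name, and the sketch assumes unzip and sed are available on a Unix-like server.

```php
<?php

// Hypothetical helper: builds the unzip | sed pipeline for a given .docx path.
function buildExtractCommand(string $path): string
{
    // Same sed expressions as in the command above: strip tags, then non-printable characters.
    $sed = 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g';

    return 'unzip -p ' . escapeshellarg($path)
        . ' word/document.xml | sed -e ' . escapeshellarg($sed);
}

// Usage, assuming test.docx exists and unzip/sed are installed:
// echo shell_exec(buildExtractCommand('test.docx'));
```

Building the command in a separate function keeps the quoting in one place and makes it easy to inspect the final string before it is executed.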

Let me know in the comments below if you find either of these solutions useful.

Yogesh Koli

Software engineer and blogger living in India, with 8+ years of experience in front-end and back-end web application development.

Comments

  • When I run solution 1, the output does not contain any formatting, and it is just one long string, without paragraphs or linefeeds. Is it possible to get the line feeds?
