Problem: Extract text content from a Word document.
I wanted to extract the text from a .docx Word document. I tried several approaches while looking for a simple and convenient one, and ended up with two that work, so I decided to share both in this tutorial.
I hope this article helps other developers get this extraction job done easily.
As you know, Word documents are always complex to work with on the backend.
Solution 1: Extract Document using PHP (Docx to Text)
The first solution is really simple with PHP, and I find it very useful because it keeps the document's structure, e.g. paragraphs and new lines.
To implement it, create a new PHP file containing the following class. To extract a document, create a new object of the class with the document path and call the convertToText method.
DocxToTextConversion.php:
<?php
class DocxToTextConversion
{
    // Path to the .doc or .docx file to convert.
    private $document;

    public function __construct($DocxFilePath)
    {
        $this->document = $DocxFilePath;
    }

    public function convertToText()
    {
        // Reject a missing path as well as a path that does not exist on disk.
        if (!isset($this->document) || !file_exists($this->document)) {
            return 'File Does Not exists';
        }
        // pathinfo() omits the 'extension' key when the file name has none.
        $fileInformation = pathinfo($this->document);
        $extension = isset($fileInformation['extension']) ? strtolower($fileInformation['extension']) : '';
        if ($extension == 'doc') {
            return $this->extract_doc();
        } elseif ($extension == 'docx') {
            return $this->extract_docx();
        }
        return 'Invalid File Type, please use doc or docx word document file.';
    }
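
    private function extract_doc()
    {
        // Heuristic: this scans the raw bytes of the binary .doc file,
        // so some files may produce garbage instead of readable text.
        $allLines = @file_get_contents($this->document);
        if ($allLines === false) {
            return false;
        }
        // Split on carriage returns (0x0D).
        $lines = explode(chr(0x0D), $allLines);
        $document_content = '';
        foreach ($lines as $line) {
            // Skip empty chunks and chunks that contain NUL bytes (binary data).
            if (strlen($line) == 0 || strpos($line, chr(0x00)) !== false) {
                continue;
            }
            $document_content .= $line . ' ';
        }
        // Drop every character outside a small whitelist.
        $document_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", '', $document_content);
        return $document_content;
    }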

    private function extract_docx()
    {
        // ZipArchive replaces the procedural zip_open()/zip_read() functions,
        // which are deprecated as of PHP 8.0.
        $zip = new ZipArchive();
        if ($zip->open($this->document) !== true) {
            return false;
        }
        // The main body of a .docx file is stored in word/document.xml.
        $content = $zip->getFromName('word/document.xml');
        $zip->close();
        if ($content === false) {
            return false;
        }
        // Separate adjacent table cells with a space and end paragraphs with a line break.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        return strip_tags($content);
    }
}
Example of using the DocxToTextConversion class to extract a document:
<?php
include "DocxToTextConversion.php";
$converter = new DocxToTextConversion('test.docx');
echo $converter->convertToText();
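The tag replacement in extract_docx() only produces a line break when a paragraph ends with exactly `</w:r></w:p>`, so paragraphs that end differently run together. The following is a minimal sketch of an alternative, assuming PHP's DOM extension is available: it parses word/document.xml and collects the text of each `w:p` paragraph. The function name `docxXmlToText` is hypothetical and not part of the class above, and nested paragraphs (for example inside text boxes) are not handled specially.

```php
<?php
// Hypothetical helper, not part of the original class: converts the raw
// contents of word/document.xml into plain text, one line per paragraph.
function docxXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    // An empty or malformed document simply yields an empty string.
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    // Main WordprocessingML namespace used by w:p, w:t, w:tab and w:br.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Visit text runs, tabs and manual line breaks in document order.
        foreach ($xpath->query('.//*[self::w:t or self::w:tab or self::w:br]', $p) as $node) {
            if ($node->localName === 't') {
                $text .= $node->textContent;
            } elseif ($node->localName === 'tab') {
                $text .= "\t";
            } else {
                $text .= "\n";
            }
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}
```

To try it on a real file, pass it the string returned by ZipArchive::getFromName('word/document.xml'). If the result is echoed into an HTML page, the line breaks only become visible after something like nl2br(htmlspecialchars($text)), because browsers collapse plain newlines.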
Solution 2: Using the unzip package
This is a simple and quick solution if you only care about the document's content and not its formatting or line breaks.
To implement this solution, you need the unzip package installed on your server.
You can test it from the command line as shown below:
$ unzip -p test.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
Implementing the above solution with PHP:
<?php
// shell_exec() returns the complete output; exec() would return only its last line.
echo shell_exec("unzip -p test.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");
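The command above has the file name hard-coded. Here is a minimal sketch of wrapping it in a function that accepts any path, assuming a Unix-like server with unzip and sed on the PATH. The names `buildDocxTextCommand` and `docxToTextViaUnzip` are hypothetical; the path is passed through escapeshellarg() so that spaces or shell metacharacters in a file name cannot alter the command.

```php
<?php
// Hypothetical helpers, not part of the original article.

// Build the shell pipeline for a given file. escapeshellarg() quotes the
// path so it is always treated as a single argument.
function buildDocxTextCommand(string $docxPath): string
{
    return 'unzip -p ' . escapeshellarg($docxPath) . ' word/document.xml'
        . " | sed -e 's/<[^>]\\{1,\\}>//g; s/[^[:print:]]\\{1,\\}//g'";
}

// Run the pipeline and return everything it printed, or '' on failure.
function docxToTextViaUnzip(string $docxPath): string
{
    if (!is_file($docxPath)) {
        return '';
    }
    $output = shell_exec(buildDocxTextCommand($docxPath));
    return is_string($output) ? trim($output) : '';
}

echo docxToTextViaUnzip('test.docx');
```

Keeping the command builder separate from the function that runs it makes the quoting easy to inspect without touching the shell.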
Let me know in the comment box below if you find either of the above solutions useful.
Hi,
Thanks for this article. Unfortunately, when I try to read .doc files using Solution 1, it returns some strange text like:
g8xlB5Lj96LDrgW5Skr8fDJx3W) CZvq2Io(NMmSW0w-MAaP L 5ZQUiu/O/U52zzpY5S_tpQSf3X(rv9LotKOw65K,xflD_.qQt,.LJg o-qvi@l6F8VqlDM5,A3Kq6uCk8 nlYMKIArQs0,a F4ZYWCbXV iSOOz(NlZ7E8WuHwfraa8pgzO6OiO5US u8ft
How can we get the text out of it? Please help.
When I run solution 1, the output does not contain any formatting, and it is just one long string, without paragraphs or linefeeds. Is it possible to get the line feeds?