How do you read content from PDF and Word document files using PHP? I ran into this question while working on an interesting PHP project, so I decided to share my solution.
My task was to get the content of a PDF or Word document file and store it in a MySQL database. Here I will show how to extract the content from a PDF file as well as a Word document file and print it.
I am using Ubuntu 14.04 with PHP 5 installed, and I will install a few extra packages over the course of this tutorial.
Let's get started:
1. Read Content from PDF file:
Our first step is to install the XPDF package, which will help us extract text from PDF files:
$ sudo apt-get update
$ sudo apt-get install xpdf
Once the XPDF package is installed, run `pdftotext` from the terminal to verify that it is available.
Run without any arguments, pdftotext should print its version and usage information.
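The same check can also be made from PHP before relying on the tool. The snippet below is only a minimal sketch: `commandExists()` is a hypothetical helper name (it is not part of PHP or of the XPDF package), and it assumes that `shell_exec()` is enabled and that the system shell supports `command -v`, as the default shell on Ubuntu does.

```php
<?php
// Hypothetical helper (not from the original article): returns true when
// the given command can be found on the PATH of the system shell.
function commandExists($name)
{
    // "command -v" prints the resolved path of the command, or nothing at all
    // if the command is not found. escapeshellarg() quotes the name safely.
    $path = shell_exec('command -v ' . escapeshellarg($name));

    // shell_exec() does not return a string when there is no output.
    return is_string($path) && trim($path) !== '';
}
```

Calling `commandExists('pdftotext')` before the extraction step makes it possible to show a clear error message instead of silently getting empty output.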
We are now ready to write a PHP script that executes the shell command and extracts text from PDF files:
Create a new file and add the following code. Make sure the PDF file path is correct. You can also use dynamically uploaded files: just replace the file path and you're good to go.
<?php
// Path to the PDF file; replace this with the path of an uploaded file if needed
$pdf_file = __DIR__ . '/filename-01.pdf';
// Convert the PDF to text; the trailing "-" sends the result to standard output
$content = shell_exec('pdftotext ' . escapeshellarg($pdf_file) . ' -');
// Print the extracted text
echo $content;
If you run this file, you should see the text from the PDF file.
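If the file path comes from an upload, it may help to wrap the call in a small function that escapes the path and reports failures. The following is a minimal sketch rather than a definitive implementation: `extractPdfText()` is a hypothetical name, and it assumes `pdftotext` is on the PATH and that `exec()` is enabled on your server.

```php
<?php
// Hypothetical helper (not from the original article): wraps the pdftotext
// call with argument escaping and basic error checks.
function extractPdfText($pdfFile)
{
    if (!is_readable($pdfFile)) {
        throw new InvalidArgumentException("Cannot read file: $pdfFile");
    }

    // escapeshellarg() quotes the path so spaces or shell metacharacters
    // in the file name cannot break or alter the command.
    // The trailing "-" tells pdftotext to write the text to standard output.
    $command = 'pdftotext ' . escapeshellarg($pdfFile) . ' -';

    // exec() appends each output line to $lines (without trailing whitespace)
    // and stores the exit status of the command in $exitCode.
    $lines = array();
    $exitCode = 0;
    exec($command, $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }

    return implode("\n", $lines);
}
```

You could then call `echo extractPdfText(__DIR__ . '/filename-01.pdf');` inside a `try` block and handle the exception if the file is missing or the conversion fails.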
2. Read Content from Word Document file:
As mentioned earlier, we need an extra package to read Word documents: `antiword`. Before moving on, please note that antiword does not support docx files. If you need to read a docx file, convert it from docx to doc format first, and then it will work.
Let's start with the first step, installing the `antiword` package:
$ sudo apt-get install antiword
The package is now installed and ready to use, so let's move on to the next step: writing a PHP script that reads the content of a Word document file.
Create a new file and add the following script:
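<?php
// Path to the Word document (.doc) file
$word_file = __DIR__ . '/doc-file-01.doc';
// Convert the document to text; antiword writes the result to standard output
$content = shell_exec('antiword ' . escapeshellarg($word_file));
// Print the extracted text
echo $content;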
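Since both scripts follow the same pattern, they can be combined into one helper that picks the tool by file extension. This is only a sketch under a few assumptions: `buildExtractCommand()` and `extractText()` are hypothetical names, both `pdftotext` and `antiword` are installed as described above, and the file extension is trusted to reflect the real file type.

```php
<?php
// Hypothetical helper (not from the original article): builds the shell
// command for a file based on its extension.
function buildExtractCommand($file)
{
    $extension = strtolower(pathinfo($file, PATHINFO_EXTENSION));

    switch ($extension) {
        case 'pdf':
            // The trailing "-" makes pdftotext write to standard output.
            return 'pdftotext ' . escapeshellarg($file) . ' -';
        case 'doc':
            // antiword writes the text to standard output by default.
            return 'antiword ' . escapeshellarg($file);
        default:
            // docx and other formats are not handled by these two tools.
            throw new InvalidArgumentException("Unsupported file type: $extension");
    }
}

// Hypothetical helper: runs the command and always returns a string.
function extractText($file)
{
    $output = shell_exec(buildExtractCommand($file));

    // shell_exec() does not return a string on failure or empty output.
    return is_string($output) ? $output : '';
}
```

With this in place, `echo extractText(__DIR__ . '/doc-file-01.doc');` and `echo extractText(__DIR__ . '/filename-01.pdf');` go through the same code path, and a docx file produces a clear exception instead of empty output.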
Run the script and you should see your document's content on the screen. If you run into any issues, you can always leave a comment in the comment section below. Thanks!
Hi Yogesh Koli,
Is there any way that I could format the output with Antiword?