How to read Content from PDF and Word Document files using PHP?

I ran into this question while working on an interesting PHP project, so I decided to share my solution.

My task was to get the content from a PDF or Word document file and store it in a MySQL database. Here I will show how to extract the content from PDF and Word document files and print it.

I am using Ubuntu 14.04 with PHP 5 installed, and I will install a few extra packages during this tutorial.

Let’s get started:

1. Read Content from PDF file:

Our first step is to install the XPDF package, which will help us extract text from PDF files:

XPDF Installation:

$ sudo apt-get update
$ sudo apt-get install xpdf

Once the XPDF package is installed, run `pdftotext` from the terminal to verify the installation.

When run without arguments, pdftotext should print its usage information.
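If you would rather do the same check from PHP, here is a minimal sketch. The helper name `command_exists` is hypothetical (it is not a built-in PHP function), and the sketch assumes a POSIX shell where `command -v` is available, as it is on Ubuntu.

```php
<?php

// Hypothetical helper: returns true when $name can be found by the shell.
// Assumes a POSIX shell, where `command -v` prints the path of a known command.
function command_exists($name)
{
    // escapeshellarg() quotes the name so it cannot inject extra shell syntax.
    $path = shell_exec('command -v ' . escapeshellarg($name) . ' 2>/dev/null');

    // shell_exec() returns null when the command produced no output.
    return is_string($path) && trim($path) !== '';
}

if (!command_exists('pdftotext')) {
    echo "pdftotext was not found, please install it first\n";
}
```

The same check can be reused for any other command-line tool a script depends on.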

We are now ready to write a PHP script that executes the shell command and extracts text from PDF files:

Create a new file and add the following code to extract text from your PDF file; make sure the PDF file path is correct. You can also use dynamically uploaded files: just replace the file path and you’re good to go.

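<?php

// Path to the PDF file; you can replace this with the path of an uploaded file
$pdf_file = __DIR__ . '/filename-01.pdf';

// Convert the PDF to text; the trailing "-" sends the text to standard output.
// escapeshellarg() keeps spaces or special characters in the path from breaking the command.
$content = shell_exec('pdftotext ' . escapeshellarg($pdf_file) . ' -');

echo $content; // print the extracted text

?>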

If you run this file, you should see the text from the PDF file.
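One limitation of `shell_exec` is that it does not tell you whether the command succeeded. The sketch below is one possible way to catch failures with `exec`, which also reports the exit code. The function names `run_extractor` and `pdf_to_text` are hypothetical helpers made up for this example, not part of PHP or XPDF.

```php
<?php

// Hypothetical helper: runs a command-line tool and returns what it printed.
// Throws a RuntimeException when the tool exits with a non-zero status.
function run_extractor(array $args)
{
    // Quote every part so spaces or shell metacharacters in a path are safe.
    $command = implode(' ', array_map('escapeshellarg', $args));

    // exec() fills $lines with the output lines and $exit_code with the status.
    exec($command . ' 2>/dev/null', $lines, $exit_code);

    if ($exit_code !== 0) {
        throw new RuntimeException("Command failed with exit code $exit_code: $command");
    }

    // exec() strips trailing whitespace from each line; join them back together.
    return implode("\n", $lines);
}

// Hypothetical wrapper around the pdftotext call used in this tutorial.
function pdf_to_text($pdf_file)
{
    if (!is_readable($pdf_file)) {
        throw new InvalidArgumentException("Cannot read file: $pdf_file");
    }

    // "-" tells pdftotext to write the text to standard output.
    return run_extractor(array('pdftotext', $pdf_file, '-'));
}
```

Calling `pdf_to_text(__DIR__ . '/filename-01.pdf')` inside a try/catch block then lets you show a friendly error message instead of printing an empty string when something goes wrong.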

2. Read content from Word Document file

As mentioned, we need to install another package for Word documents: to read text from a .doc file we will install a package called `antiword`. Before moving on, please note that it does not support .docx files. If you still need to read .docx files, simply convert them from .docx to .doc format first and it will work.

Let’s get started with the first step, installing the `antiword` package:

$ sudo apt-get install antiword

With the package installed and ready to use, let’s move on to the next step: writing a PHP script to read the content of a Word document file.

Create a new file and add the following script:

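<?php

// Path to the Word document (.doc only, antiword does not read .docx)
$word_file = __DIR__ . '/doc-file-01.doc';

// Convert the document to text; escapeshellarg() protects paths with spaces or special characters
$content = shell_exec('antiword ' . escapeshellarg($word_file));

echo $content; // print the extracted text

?>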
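If you want some control over how the text comes out, antiword accepts a few command-line options. The sketch below assumes the `-f` option (formatted text, where bold, italic and underlined text are marked up) and the `-w` option (line width, where 0 keeps each paragraph on a single line) as described in antiword's manual page; please check `man antiword` on your system before relying on them. The function name `antiword_command` is a hypothetical helper made up for this example.

```php
<?php

// Hypothetical helper: builds an antiword command line with formatting options.
// Assumes the -f and -w options described in antiword's manual page.
function antiword_command($word_file, $width = 0, $formatted = true)
{
    $parts = array('antiword');

    if ($formatted) {
        $parts[] = '-f'; // mark up bold, italic and underlined text
    }

    $parts[] = '-w';
    $parts[] = (string) (int) $width; // line width; 0 = one paragraph per line

    // Quote the path so spaces or special characters cannot break the command.
    $parts[] = escapeshellarg($word_file);

    return implode(' ', $parts);
}

$word_file = __DIR__ . '/doc-file-01.doc';

// Only call antiword when the example file is actually there.
if (is_readable($word_file)) {
    echo shell_exec(antiword_command($word_file));
}
```

Building the command in its own function keeps the options in one place, so you can try different widths without touching the rest of the script.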

Run the antiword script and your document’s content should appear on the screen. If you run into any issues, you can always leave a comment in the comment section below. Thanks!

Yogesh Koli

Software engineer and blogger living in India, with 6+ years of experience in front-end and back-end web app development.

View Comments

  • Hi Yogesh Koli,

    Is there any way that I could format the output with Antiword?

    Kind regards,
    Prakash
