How can I get the number of words in files of all formats (.doc, .docx, .pdf and images) using PHP and JavaScript?
Comments
Sergey Alexandrovich Kryukov 26-Jun-12 13:55pm    
One little word "all" makes this question totally invalid.
--SA

The simple answer is that you cannot do so. Getting the word count from .doc/.docx files will be tough without Word installed on the server. It will get even tougher when you try to get a word count out of an image, since you will need to perform OCR on the image first.

Using PHP and JavaScript to perform the word count on even one of these formats will be difficult. You will need to create a separate mechanism for each format you want to support.
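As a rough illustration of the "separate mechanism per format" idea, the sketch below guesses a file's type from its leading bytes so that each type can be handed to its own routine. The function name `detectFormat` and the returned labels are made up for this example. Note that a zip signature only suggests a .docx (any zip archive starts the same way), and an OLE signature only suggests a .doc.

```javascript
// Hypothetical helper: guess the container type from the first bytes of a file.
// The labels ("pdf", "zip", "ole", ...) are illustrative names, not a standard.
const SIGNATURES = [
  { format: "pdf",  bytes: [0x25, 0x50, 0x44, 0x46] },   // "%PDF"
  { format: "zip",  bytes: [0x50, 0x4b, 0x03, 0x04] },   // zip archive, possibly .docx
  { format: "ole",  bytes: [0xd0, 0xcf, 0x11, 0xe0,
                            0xa1, 0xb1, 0x1a, 0xe1] },   // OLE compound file, possibly .doc
  { format: "png",  bytes: [0x89, 0x50, 0x4e, 0x47] },   // "\x89PNG"
  { format: "jpeg", bytes: [0xff, 0xd8, 0xff] },
];

function detectFormat(buffer) {
  for (const { format, bytes } of SIGNATURES) {
    const matches =
      buffer.length >= bytes.length &&
      bytes.every((b, i) => buffer[i] === b);
    if (matches) {
      return format;
    }
  }
  return "unknown";
}
```

In Node.js you could pass it the result of `fs.readFileSync(path)`; each label would then be routed to a format-specific counter, which is where the real work lies.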
 
 
You can't.
For starters, image formats do not have any words. Lots and lots of pixels, but no words.

Each format is different. Some are text based, others are XML based, others are binary based.
There is nothing you can use to read all formats and get a word count, in PHP, JavaScript, VB, C# or Martian.
 
 
The only format I can give you any help with here is PDF, and for that you can extract text with Xpdf. However, getting an accurate word count from some PDFs may be impossible, depending on how the program that created the file laid out its output. Just because something appears as a word in a PDF viewer does not mean it was stored as a word in the document; PDF is a very complex format.
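Once the text has been extracted (for example with the `pdftotext` command-line tool that ships with Xpdf), the counting itself is the easy part. Here is a minimal sketch, assuming the extracted text is already in a string; `countWords` is a name invented for this example, and "word" here simply means a whitespace-separated token containing at least one letter or digit.

```javascript
// Hypothetical helper: count whitespace-separated tokens that contain
// at least one letter or digit (so stray punctuation is not counted).
function countWords(text) {
  const tokens = text.trim().split(/\s+/);
  return tokens.filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
}

// Text as it might come out of a well-behaved PDF.
const clean = "Getting a word count is easy here.";

// Text from a PDF whose producer positioned every glyph separately:
// the viewer shows one word, the extractor may emit four tokens.
const spaced = "w o r d";

console.log(countWords(clean));
console.log(countWords(spaced));
```

The second input shows the accuracy problem described above: the counter cannot tell that the four tokens were meant to be a single word, so the result depends entirely on how the PDF was produced.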

As has been mentioned here, getting a word count from an image would require OCR, but I don't know enough about it to give you a recommendation. I do know, however, that once again you may be unable to get an accurate word count with OCR.

.docx documents are essentially a zipped collection of XML files, and shouldn't be too difficult to work with. But I don't know enough about the format to help beyond that.
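To give an idea of what working with those XML files might look like: the main body text normally lives in `word/document.xml` inside the archive, with visible text in `<w:t>` elements grouped into `<w:p>` paragraphs. The sketch below assumes you have already pulled that file out as a string with a zip library of your choice (that step is not shown). The name `countWordsInDocumentXml` is invented for this example, and it uses regular expressions rather than a real XML parser, so treat it as a starting point only.

```javascript
// Hypothetical helper: count words in the XML string taken from
// word/document.xml of a .docx file. Regex-based, not a full XML parser.
function countWordsInDocumentXml(xml) {
  // Matches either a text run (<w:t ...>text</w:t>) or a tab/line-break element.
  const piecePattern =
    /<w:t(?:\s[^>]*)?>[^<]*<\/w:t>|<w:(?:tab|br|cr)\b[^>]*\/>/g;
  let total = 0;

  // Count paragraph by paragraph so words in different paragraphs never merge.
  for (const paragraph of xml.split("</w:p>")) {
    const pieces = paragraph.match(piecePattern) || [];
    const text = pieces
      .map((piece) =>
        piece.endsWith("</w:t>")
          ? piece.replace(/<[^>]+>/g, "") // keep only the run's text
          : " "                           // tabs and breaks separate words
      )
      .join("") // runs are joined directly: one word may span several runs
      .replace(/&lt;/g, "<")
      .replace(/&gt;/g, ">")
      .replace(/&quot;/g, '"')
      .replace(/&apos;/g, "'")
      .replace(/&amp;/g, "&");

    const tokens = text.trim().split(/\s+/);
    total += tokens.filter((t) => /[\p{L}\p{N}]/u.test(t)).length;
  }
  return total;
}
```

Headers, footers, footnotes and text boxes are stored in other parts of the archive, so this count will not necessarily match what Word itself reports.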

.doc documents are a different matter: they are not zip archives but an older binary format (an OLE compound file), and I don't know anything about its internal layout.

I think your best bet is to pick a single file type and stick with it.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


