Word count in all formats

Question


See more:

Javascript

PHP

How can I get the number of words in files of all formats (.doc, .docx, .pdf, and images) using PHP and JavaScript?

Posted 26-Jun-12 7:51am

D.sujith kumar


Comments

Sergey Alexandrovich Kryukov 26-Jun-12 13:55pm

One little word "all" makes this question totally invalid.
--SA

3 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Tim Corey · Answer 1 · 2012-06-26T08:02:00

The simple answer is that you cannot do so. Getting the word count from .doc/.docx files will be tough without Word installed on the server. It will get even tougher when you try to get a word count out of an image, since you will need to perform OCR on the image first.

Using PHP and JavaScript to perform the word count on even one of these formats will be difficult. You will need to create a separate mechanism for each format you want to support.
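To illustrate the "separate mechanism for each format" point, here is a minimal sketch in JavaScript that only decides which kind of extraction a file would need. The table `EXTRACTION_STRATEGY` and the function `strategyFor` are hypothetical names made up for this example, and the extension list is an assumption; the extraction work itself is not implemented here.

```javascript
// Hypothetical mapping from file extension to the kind of extraction it needs.
// The entries are illustrative assumptions, not an exhaustive list.
const EXTRACTION_STRATEGY = {
  ".docx": "unzip the archive and read its XML parts",
  ".doc": "parse the legacy binary Word format",
  ".pdf": "run a PDF text extractor",
  ".png": "run OCR on the image",
  ".jpg": "run OCR on the image",
};

// Return the strategy description for a file name, or throw if the
// extension is not one we have a mechanism for.
function strategyFor(filename) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot).toLowerCase();
  const strategy = EXTRACTION_STRATEGY[ext];
  if (strategy === undefined) {
    throw new Error(`Unsupported format: ${ext || "(none)"}`);
  }
  return strategy;
}
```

Each entry would need its own code path behind it, which is why supporting every format at once is so much work.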

OriginalGriff · Answer 2 · 2012-06-26T08:04:00

You can't.
For starters, image formats do not have any words. Lots and lots of pixels, but no words.

Each format is different. Some are text based, others are XML based, others are binary based.
There is nothing you can use to read all formats and get a word count, in PHP, JavaScript, VB, C# or Martian.

lewax00 · Answer 3 · 2012-06-26T08:20:00

The only format I can give you any help with here is PDF: you can extract the text with XPdf. However, getting an accurate word count from some PDFs may be impossible, depending on how the program that created the file decided to lay out its output. PDF is a very complex format, and just because something appears as a word in a PDF viewer does not mean it was stored as a word in the document.
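As a rough sketch of the counting step only: assuming the text has already been pulled out of the PDF by an external extractor, a whitespace-based count in JavaScript could look like the following. The function name `countWords` is made up for this example, and splitting on whitespace is an assumed definition of "word".

```javascript
// Count whitespace-separated tokens in already-extracted plain text.
// Assumption: a "word" is any run of non-whitespace characters.
function countWords(text) {
  const tokens = text.match(/\S+/g);
  return tokens === null ? 0 : tokens.length;
}

// Text that the extractor returned cleanly.
const clean = "Word count example";
// The same kind of text when the PDF stored one word as separately
// positioned letters and the extractor put spaces between them.
const fragmented = "W o r d count example";

console.log(countWords(clean));      // 3
console.log(countWords(fragmented)); // 6
```

The second call shows the accuracy problem: the counter is only as good as the text the extractor hands it.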

As has been mentioned here, getting a word count from an image would require OCR, but I don't know enough about it to give you a recommendation. I do know, however, that once again you may be unable to get an accurate word count with OCR.

.docx documents are essentially a zipped collection of XML files and shouldn't be too difficult to work with, but I don't know enough about the format to help beyond that.
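For what it's worth, here is a minimal sketch of the counting part in JavaScript. It assumes you have already unzipped the .docx and read the main part (`word/document.xml`) into a string; unzipping is not shown. It relies on the body text living in `<w:t>` elements inside `<w:p>` paragraphs, uses regular expressions rather than a real XML parser, and ignores headers, footers, footnotes, and text boxes. The function name `countWordsInDocumentXml` is made up for this example.

```javascript
// Count words in the contents of word/document.xml (passed in as a string).
// Regex-based sketch, not a full XML parser.
function countWordsInDocumentXml(xml) {
  // Treat tabs and line breaks as whitespace between words.
  const withBreaks = xml.replace(/<w:(?:tab|br|cr)\b[^>]*\/>/g, "<w:t> </w:t>");

  let total = 0;
  // Handle one paragraph at a time so words in adjacent paragraphs stay separate.
  for (const paragraph of withBreaks.split(/<\/w:p>/)) {
    // A single word can be split across several runs, so the text of all
    // <w:t> elements in the paragraph is joined without adding spaces.
    let text = "";
    for (const match of paragraph.matchAll(/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g)) {
      text += match[1];
    }
    const tokens = text.match(/\S+/g);
    total += tokens === null ? 0 : tokens.length;
  }
  return total;
}

const sample =
  '<w:p><w:r><w:t>Hel</w:t></w:r><w:r><w:t xml:space="preserve">lo world</w:t></w:r></w:p>' +
  "<w:p><w:r><w:t>again</w:t></w:r></w:p>";
console.log(countWordsInDocumentXml(sample)); // 3
```

The result may still differ from what Word itself reports, since Word applies its own counting rules.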

.doc documents are not zip archives, though. They use an older binary container format, and I don't know anything about the format of the data stored within.

I think your best bet is to pick a single file type and stick with it.
