I have a C# function that reads paragraphs from .doc/.docx files using the familiar Microsoft Office Interop library. The problem is that reading a normal-sized file takes several hours, and reading a 100 MB file takes all day, during which I can't use the PC for anything else. Word itself opens the same files quickly, so Microsoft clearly has a fast way to do this. Enough of the jokes, Microsoft; you have exceeded my limit. Solutions?

Retrieve the Number of Pages (takes about 15 minutes):
C#
Microsoft.Office.Interop.Word.WdStatistic MyWdStatistic = Microsoft.Office.Interop.Word.WdStatistic.wdStatisticPages;
int pages = MyWordDocument.ComputeStatistics(MyWdStatistic, ref miss);


I have updated the code I use.
Read all the paragraphs (for a 20 MB file it takes about 4 hours):
(I use the same approach as in the example above.)
C#
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
object miss = System.Reflection.Missing.Value;
object path = @"YourFilepath\file.docx";
object readOnly = true;
Microsoft.Office.Interop.Word.Document MyWordDocument = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss,
            ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

// Total page count of the document.
int Total_Pages = MyWordDocument.ComputeStatistics(Microsoft.Office.Interop.Word.WdStatistic.wdStatisticPages, ref miss);

foreach (Microsoft.Office.Interop.Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
{
    Microsoft.Office.Interop.Word.Range MyRange = MyParagraph.Range;
    string Text = MyRange.Text;
    // Page on which the paragraph ends.
    int Page = (int)MyRange.Information[Microsoft.Office.Interop.Word.WdInformation.wdActiveEndPageNumber];
}

// Close the document that was opened above, then shut down Word.
MyWordDocument.Close();
word.Quit();


What I have tried:

Searching the internet.
Updated 1-Nov-21 18:07pm
Comments
Maciej Los 27-Oct-21 2:44am    
What version of MS Office is installed on your OS?
What version of the PIA do you use?
See: Office primary interop assemblies - Visual Studio (Windows) | Microsoft Docs[^]
Member 14890678 28-Oct-21 23:29pm    
ATTENTION: Word file reader to base on planet earth. I am taking hours to read the paragraphs in a Word file. That's the problem. I have shared the test docx files and the code I use to read them. I need to know: 1.- The total number of pages. 2.- The page number of each paragraph. 3.- Read the entire file in seconds / minutes (the same as Word). Possible solution: OpenXml but I don't know how to know the page number of each paragraph. UNSOLVED PROBLEM. CALLING TO EARTH. Anyone there? Thanks.

Quote:
C# read a word file in the fastest way

Your problem is not reading the file, your problem is building the result.
C#
for (int i = 0; i < docs.Paragraphs.Count; i++)
{
    //Determine the beginning of an entire paragraph and intercept the table name
    //Get the column name
    //......
    totaltext +=  docs.Paragraphs[i + 1].Range.Text.ToString();
}

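Building totaltext this way costs O(n²), where n is the number of paragraphs: strings are immutable, so every += copies the whole text accumulated so far, and the total copying grows roughly like 1 + 2 + … + n = n(n+1)/2.
Instead of building totaltext and then writing it to the result file, you can write each paragraph directly to the result file.
An alternative is to use a StringBuilder.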
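Both suggestions can be sketched as follows. This is only a minimal illustration, assuming the paragraph texts are already available as a sequence of strings (in the real code they would come from Paragraphs[i].Range.Text). ParagraphJoiner, JoinWithBuilder and WriteDirectly are made-up names for this example, not part of any library.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class ParagraphJoiner
{
    // Collects every paragraph in one StringBuilder. The builder grows its
    // internal buffer instead of copying the whole accumulated text on each append.
    public static string JoinWithBuilder(IEnumerable<string> paragraphs)
    {
        var sb = new StringBuilder();
        foreach (string p in paragraphs)
        {
            sb.Append(p);
        }
        return sb.ToString();
    }

    // Writes each paragraph straight to the output, so the full text
    // is never held in memory as one string.
    public static void WriteDirectly(IEnumerable<string> paragraphs, TextWriter output)
    {
        foreach (string p in paragraphs)
        {
            output.Write(p);
        }
    }
}
```

To write to a file, WriteDirectly can be given a StreamWriter created in a using block, so the file is flushed and closed when the loop finishes.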
 
Comments
Maciej Los 27-Oct-21 10:56am    
5ed!
Patrice T 27-Oct-21 11:18am    
Thank you
Member 14890678 27-Oct-21 11:57am    
I have discovered:
1.- If I check the number of pages it takes 10-15 minutes:
int Pages = MyWordDocument.ComputeStatistics (MyWdStatistic, ref Miss);

2.- If I don't check it, and then check the paragraph page, the first time it takes 4 minutes, then 1 second. If I query it, it always takes 1 second:
foreach (Microsoft.Office.Interop.Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
{ Page = MyRange.Information [Microsoft.Office.Interop.Word.WdInformation.wdActiveEndPageNumber];

3.- If I open the Word document in reading mode, then it reads the paragraphs faster, but to read 20mb it may take 40 minutes.

4.- If I use OpenXml to read a .docx file, I need to extract all the paragraphs knowing the page number of each of them. I also need to know the total number of pages. I can't find how to know the Page Number to which the read paragraph belongs.
BillWoodruff 28-Oct-21 9:21am    
voted #3; i do not believe this post addresses the key issue for the OP, which involves Office interop stuff in ASP. You ignore the possibility that OpenXML use may be involved (mentioned in the OP's comments), and assume the OP knows how to use Office Interop when it is clear he does not know how to use it.
Patrice T 28-Oct-21 15:25pm    
Downvoting an accepted solution is rather rude IMO.
My solution concentrates on the string concatenation part, which is a very well-known runtime bottleneck. I think it fits the speed problem rather well.
There are a few things you should know:

1. String is immutable. See: String Class (System) | Microsoft Docs[^]
So, you have to use the StringBuilder class[^] to build up a large amount of text.

2. Do NOT use Interop to read MS Word file via ASP.NET. Use OpenXML instead.
See: Preview Word files (docx) in HTML using ASP.NET, OpenXML and LINQ to XML - Maarten Balliauw {blog}[^]
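On the page-number question raised in the comments, here is a rough sketch of one possible approach, not a definitive one. It does not use the OpenXML SDK; it reads the .docx package (a ZIP file) with plain .NET classes. It relies on two values that Word caches when it saves a file: the w:lastRenderedPageBreak markers in the main document part, and the Pages element in docProps/app.xml. Both can be missing or stale, so the results are estimates. The part names word/document.xml and docProps/app.xml are assumed to be the usual locations, nested paragraphs (for example in text boxes) are not handled specially, and binary .doc files are not covered. DocxPageEstimator and its methods are made-up names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper; names are illustrative only.
public static class DocxPageEstimator
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Loads one XML part from the .docx package, e.g. "word/document.xml".
    public static XDocument LoadPart(string docxPath, string partName)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry(partName);
            if (entry == null) throw new FileNotFoundException(partName);
            using (Stream s = entry.Open())
            {
                return XDocument.Load(s);
            }
        }
    }

    // Returns the text of each paragraph with the estimated page on which it ends.
    // The page counter advances on every cached w:lastRenderedPageBreak marker.
    public static List<(int Page, string Text)> ParagraphPages(XDocument documentXml)
    {
        var result = new List<(int Page, string Text)>();
        int page = 1;
        foreach (XElement p in documentXml.Descendants(W + "p"))
        {
            var text = new StringBuilder();
            foreach (XElement e in p.Descendants())
            {
                if (e.Name == W + "lastRenderedPageBreak") page++;
                else if (e.Name == W + "t") text.Append(e.Value);
            }
            result.Add((page, text.ToString()));
        }
        return result;
    }

    // Returns the page count cached in docProps/app.xml, or null if it is absent.
    public static int? CachedPageCount(XDocument appXml)
    {
        XElement pages = appXml.Root?.Element(Ep + "Pages");
        return pages == null ? (int?)null : int.Parse(pages.Value);
    }
}
```

A typical call would be ParagraphPages(LoadPart(path, "word/document.xml")) together with CachedPageCount(LoadPart(path, "docProps/app.xml")). Because nothing is repaginated, this only reports what Word stored at the last save; if exact page numbers are required, Word itself still has to lay out the document.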
 
Comments
Member 14890678 27-Oct-21 6:38am    
If I use OpenXml to read a .docx file, I need to extract all the paragraphs knowing the page number of each of them. I also need to know the total number of pages. I can't find how to know the Page Number to which the read paragraph belongs.
BillWoodruff 27-Oct-21 7:43am    
Get busy researching what facilities OpenXML offers, and experiment; then ask questions.
BillWoodruff 27-Oct-21 7:52am    
Hi Maciej, why not use interop with the app? i can use interop in WinForms to get the paragraph count, get the text of a specific paragraph, etc. i don't use ASP right now, and don't have the Office interop stuff installed, so i can't test that.
Maciej Los 27-Oct-21 11:00am    
Bill, the OP is using the Response.Write method, which suggests that he is using ASP.NET rather than WinForms.
I've seen a few articles stating that using Interop there is not a safe approach. I'll improve my answer ASAP.
BillWoodruff 28-Oct-21 9:17am    
voted #3; i do not believe this post addresses the key issue for the OP which involves how to use Office interop stuff in ASP.

You tell the OP to use OpenXML, but, you do not say why, or indicate you have actually done this yourself.

Using StringBuilder is always good advice, but in this case, if the OP can use Office interop, they may not need to use it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)