How can I split a paragraph using the following methods?
First Method =  {A-Z} <br>
Second Method =  {A-Z}<br>
Third Method =  {A-Z}.<br>
Fourth Method =  {A-Z}.\r\n



Note: The input may contain any combination of the above methods.

Please help me find the correct regex for splitting the paragraphs below.

Input paragraphs:
C#
{A}<br>
Vasa. <a href="http:///full/87/7/540" target="_blank">"Nitric Oxide Activates Telomerase and Delays Endothelial Cell Senescence"</a> Circulation Research. 2000;87:540-542
<br><br>
For a more detailed discussion of the mechanisms underlying the relationship between nitric oxide and telomerase activation, see this study:<br>
Farsetti.  <a href="http://quot;>The telomerase tale in vascular aging"</a>  Journal of Applied Physiology January 2009 vol. 106 no. 1 333-337
<br><br>
{B} <br>
Chauhan. <a href="amp;3P945X0-8&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=10&md5=c130cff602472f25bd5680ea3047490c" target="_blank">"Aging-Associated Endothelial Dysfunction in Humans Is Reversed by L-Arginine"</a> Journal of the American College of Cardiology Volume 28, Issue 7, December 1996, Pages 1796-1804
<br><br>
{C}.<br>
Monajemi H.<a href="http://11487030" target="_blank">Gene Expression in Atherogenesis"</a>.  <i>Thromb Haemost</i>. 2001 Jul;86(1):404-12.
<br>
{D}.
Britten M. The role of endothelial function of ischemic manifestations of coronary atherosclerosis
<br>
Kimura Y. Impaired endothelial function in hypertensive elderly patients evaluated by high..
<br><br>
{E}.<br>
9. In Cells, Aging and Human Disease, page 170, Michael Fossel writes:
<br>
In comparing young normal human aortic endothelial cells to senescent endothelial cells and endothelial cells imoortalized with hTERT, we find differences. Compared to young endothelial cells, senescent endothelial cells show a decreased production and activity of NO, changes critial in atherogenesis and hypertension. Similarly, senescent endothelial cells demonstrate increased monocyte adhesion, again implicated in atherogenesis. [..] In all cases, these differences are amerliorated or normalized by hTERT immortalization.
<br><br>
{F}.<br>
Chang E, Harley CB. Telomere length and replicative aging in human vascular tissues.
Updated 17-Jan-11 1:43am
Comments
Dalek Dave 17-Jan-11 4:33am    
Edited for Readability.
Also, there is some question as to the viability of Telomerase Augmentation using NO2.
Some Catalyst Transcriptase has been shown to be effective.
Henry Minute 17-Jan-11 6:07am    
I am not fluent in regex but I think your fourth 'method' is going to be problematic. Not all full-stops denote a paragraph break. I think you will either have to forget that one or identify another character in addition to the stop.
justinonday 18-Jan-11 3:29am    
Thanks for suggestion Mr.Henry
#realJSOP 17-Jan-11 7:18am    
I changed your tags because your question is about all versions of C#, which means the single tag "C#" is adequate. Further, since it is C#, you don't also need to specify the ".Net" tag.
justinonday 18-Jan-11 3:30am    
ok thanks Mr.John Simmons

First, I would normalize the string's representation of line breaks and remove special characters:

C#
string myHtml = "...your html example goes here";
// normalize the line breaks
myHtml = myHtml.Replace("<br />", "<br>").Replace("<br/>", "<br>");
// remove special characters
myHtml = myHtml.Replace("\t", "").Replace("\r\n", "");

Then I would do a split:

C#
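// Split(string) does not compile on .NET Framework; use the string[] overload
string[] parts = myHtml.Split(new[] { "<br>" }, StringSplitOptions.None);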

At this point, each "paragraph" is a separate array element. Given your sample, some elements will be empty (wherever there are consecutive line-break tags), so skip those when you process the parts array.
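
The two steps above can be combined into one helper. This is only a minimal sketch: the class and method names (`BrSplitSketch`, `SplitOnBreaks`) are made up for illustration, and dropping empty elements with `StringSplitOptions.RemoveEmptyEntries` plus trimming is an assumption about what the processing step should do. Note that it splits on every line-break tag, not only on the {A-Z} markers.

```csharp
using System;
using System.Linq;

public static class BrSplitSketch
{
    // Hypothetical helper: normalizes <br> variants, strips tabs and
    // line breaks, then splits on "<br>" and drops empty pieces.
    public static string[] SplitOnBreaks(string html)
    {
        string normalized = html
            .Replace("<br />", "<br>")
            .Replace("<br/>", "<br>")
            .Replace("\t", "")
            .Replace("\r\n", "")
            .Replace("\n", "");

        return normalized
            // The string[] overload works on .NET Framework and later runtimes.
            .Split(new[] { "<br>" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(part => part.Trim())
            // Trimming can leave whitespace-only pieces empty, so filter again.
            .Where(part => part.Length > 0)
            .ToArray();
    }
}
```

For example, `BrSplitSketch.SplitOnBreaks("{A}<br>\r\nVasa.<br /><br/>{B} <br>Chauhan.")` yields the four elements `{A}`, `Vasa.`, `{B}` and `Chauhan.`.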
 
Comments
justinonday 17-Jan-11 7:19am    
Thank you, sir, but it's not working.
The following code can be used to split the paragraph as specified in the question:
C#
void Main()
{
    // Assume the paragraph text is stored in C:\ParaGraphText.txt
    string fileText = System.IO.File.ReadAllText(@"C:\ParaGraphText.txt");
    // By default the .NET regex "." matches every character except \n.
    // As a workaround, replace \n with the special character » (Alt+175).
    fileText = fileText.Replace("\n", "»");
    // {\s*[A-Z]\s*} matches a marker such as {A}, allowing whitespace around the letter.
    // (.*?)(?={\s*[A-Z]\s*}) lazily captures the text up to the next marker;
    // the lookahead (?=...) keeps that next marker out of the match.
    // The alternative {\s*[A-Z]\s*}(.*) handles the last paragraph,
    // which is not followed by another marker.
    MatchCollection matches = System.Text.RegularExpressions.Regex.Matches(fileText, @"({\s*[A-Z]\s*}(.*?)(?={\s*[A-Z]\s*})|{\s*[A-Z]\s*}(.*))", RegexOptions.Multiline);
    string splitParaGraphText = "";
    foreach (Match mat in matches)
    {
        if (mat.Captures.Count > 0)
            splitParaGraphText += mat.Captures[0].Value +
                // Indicate paragraph separation
                "\r\n=============================================================\r\n";
    }
    // Replace the special character with \n again
    splitParaGraphText = splitParaGraphText.Replace("»", "\n");
    // Save the split paragraph text to a file
    System.IO.File.WriteAllText(@"C:\SplitParaGraphText.txt", splitParaGraphText);
}

NOTE: For a quick test:
Create a text file and paste in the paragraph text given in the question.
Run the code in LINQPad, which can be downloaded from http://www.linqpad.net/, with the "C# Program" option selected in the Language combo box.
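
As a variation, the four marker forms listed in the question can be matched directly, so that an ordinary full stop inside the text never counts as a break. The sketch below is one possible approach, not the code from the answer above: the names `MarkerSplitSketch` and `Split` are hypothetical, and the pattern assumes a marker is always a single capital letter in braces, optionally followed by a period, and then either a `<br>` tag or a line break.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkerSplitSketch
{
    // Matches "{A} <br>", "{A}<br>", "{A}.<br>" and "{A}." followed by a line break.
    private static readonly Regex Marker = new Regex(
        @"\{\s*(?<id>[A-Z])\s*\}[ \t]*\.?[ \t]*(?:<br\s*/?>|\r?\n)");

    // Returns (marker letter, section text) pairs in document order.
    public static List<KeyValuePair<string, string>> Split(string text)
    {
        var sections = new List<KeyValuePair<string, string>>();
        MatchCollection markers = Marker.Matches(text);
        for (int i = 0; i < markers.Count; i++)
        {
            // A section runs from the end of its marker to the start of the next one.
            int start = markers[i].Index + markers[i].Length;
            int end = (i + 1 < markers.Count) ? markers[i + 1].Index : text.Length;
            string body = text.Substring(start, end - start).Trim();
            sections.Add(new KeyValuePair<string, string>(
                markers[i].Groups["id"].Value, body));
        }
        return sections;
    }
}
```

Each section keeps any `<br>` tags that appear inside it; only leading and trailing whitespace is trimmed. Text before the first marker is ignored.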
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


