I have a program that speaks text:

Text To Speech For Windows

I want a Regex to detect sources like

(Lee & others, 2012)

or

(Lee & others, 2012a)

in the text, so that the program can skip speaking them.

What I have tried:

I have tried this Regex, but it doesn't work well:

\(.*?\,\s\d\d\d\d([a-d]?)\)

I want to replace the .*? part with something that matches the fewest possible characters, so that normal text is not left unspoken. As it is, my Regex sometimes matches from the open parenthesis of the first source all the way through the second source.

The .*? part is not greedy, but it still matches far too much. I want the shortest possible match between the two parentheses.
Posted
Updated 5-Jul-17 22:21pm
v2

1 solution

I think you need to check that: when I try your Regex in Expresso, or a simple app:
string input = @"I want a Regex to detect sources like (Lee & others, 2012) I want a Regex to detect sources like (Lee & others, 2012) I want a Regex to detect sources like (Lee & others, 2012)";
string output = Regex.Replace(input, @"\(.*?\,\s\d\d\d\d([a-d]?)\)", "%%");
Console.WriteLine(output);
I get exactly what I expect, and what you ask for:
I want a Regex to detect sources like %% I want a Regex to detect sources like %% I want a Regex to detect sources like %%
I.e. it detects each bracket pair correctly (the "%%" are only there so you can see what was removed - it works fine with "" as well).

So have a look at your sample data, and try cutting it down until you have a minimum subset which still displays the problem - it may not be what you think it is!

"Would you please try on this:

and her colleagues
(2013) used carbon dating to measure the
growth of new neurons in the brains of
people aged 19 to 92. More than 50 years
ago, aboveground testing of nuclear bombs
released carbon-14, or C-14, into the atmosphere. Since these nuclear tests were
banned in 1963, levels of C-14 in the atmosphere have declined at a regular and
well-known rate. Measuring the amount of C-14 concentration in neurons therefore
provided a “time-stamp” for the neurons, allowing researchers to determine when
the neurons had been generated.
The carbon-14 signature was unmistakable: Spalding and her colleagues found
clear evidence of neurogenesis after birth and, in fact, throughout the lifespan. They
were also able to calculate the rate of neurogenesis in a specific region of the brain,
called the hippocampus, a brain region involved in learning and memory. It turned
out that an average of 1,400 new neurons were being generated each day. The rate
declined only slightly with age. Other regions of the brain, however, did not show
evidence of neurogenesis.
Research on neurogenesis in humans and animals has uncovered a number of
intriguing findings. It is now generally accepted that newborn neurons develop into
mature functioning neurons in at least two regions of the human brain—the hippocampus
and the olfactory bulb, responsible for odor perception (Lee & others, 2012).
These newly generated"


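Well, that works as well - it removes everything from the first open bracket to the first close bracket that follows a comma, a space, and four digits (plus an optional letter) - which is exactly what you told it to do! Making .*? lazy only shortens the match from the right-hand end; it does not stop the match from starting at the earlier "(2013)" bracket.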
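To see that over-match in isolation, here is a minimal sketch. The class name LazyMatchDemo, the method FirstMatch, and the shortened single-line input are made up for illustration; only the pattern comes from the question. It assumes there is no line break between the two brackets, because by default "." does not match "\n" in .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // The pattern from the question: lazy ".*?" between "(" and ", YYYY)".
    public const string LazyPattern = @"\(.*?\,\s\d\d\d\d([a-d]?)\)";

    // Returns the text of the first match (empty string when nothing matches).
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, LazyPattern).Value;
    }

    public static void Main()
    {
        // Shortened, single-line stand-in for the sample passage (an assumption).
        string input = "Spalding and her colleagues (2013) used carbon dating. "
                     + "New neurons mature in the olfactory bulb (Lee & others, 2012). These";

        // The match starts at "(2013" because that is the leftmost "(",
        // then ".*?" grows until ", 2012)" finally satisfies the tail.
        Console.WriteLine(FirstMatch(input));
        // prints: (2013) used carbon dating. New neurons mature in the olfactory bulb (Lee & others, 2012)
    }
}
```

The regex engine always prefers the leftmost starting position, so laziness alone cannot make it skip "(2013)" and start at "(Lee".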
One way to solve this is to use Balancing Groups - which I'm *not* going to explain :laugh: - and try and remove anything inside the brackets that ends with four digits:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\d{4}\)

But ... a regex is probably the wrong approach - you may need a more complicated natural language processor or you will possibly miss some edge cases.

I'd suggest that you get a copy of Expresso - it's free, and it examines and generates Regular expressions.

"I use Expresso, sorry, the regex you gave doesn't work at all..."

:doh: I missed that it captured the first one (2013) but not the second ... had the digits in the wrong place :O
Try this:
\((?>\((?<c>)|[^()]*\d{4}|\)(?<-c>))*(?(c)(?!))\)
Which should catch both cases. Sorry about that ...

"I'm sorry, I din't quite understand. Is it hard for you to do without explaining: adding the possibility of 1 letter after the year number? I once worked with Regex Matches and groups, but they're not needed now..."

Try:
\((?>\((?<c>)|[^()]*\d{4}[a-zA-Z]?|\)(?<-c>))*(?(c)(?!))\)
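As a rough check of that last pattern, the sketch below runs it over a shortened, single-line version of the sample text. The names CitationFilter, Strip, and SimplePattern are hypothetical, and so is the input. SimplePattern is not part of the answer above: it is an assumed simpler alternative that replaces .*? with [^()]* so a match can never run past a bracket. It seems to behave the same whenever the citations contain no nested brackets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CitationFilter
{
    // Final pattern from the answer above (balancing groups are a .NET feature).
    public const string BalancedPattern =
        @"\((?>\((?<c>)|[^()]*\d{4}[a-zA-Z]?|\)(?<-c>))*(?(c)(?!))\)";

    // Assumed alternative, not from the answer: "(", any non-bracket text,
    // four digits, an optional letter, ")".
    public const string SimplePattern = @"\([^()]*\d{4}[a-zA-Z]?\)";

    // Replaces every match of the pattern with the marker.
    public static string Strip(string text, string pattern, string marker)
    {
        return Regex.Replace(text, pattern, marker);
    }

    public static void Main()
    {
        // Shortened stand-in for the sample passage (an assumption).
        string input = "Spalding and her colleagues (2013) used carbon dating (see above). "
                     + "New neurons mature in the olfactory bulb (Lee & others, 2012a). These";

        // Both calls leave "(see above)" alone, since it does not end in a year.
        Console.WriteLine(Strip(input, BalancedPattern, "%%"));
        Console.WriteLine(Strip(input, SimplePattern, "%%"));
        // each prints: Spalding and her colleagues %% used carbon dating (see above). New neurons mature in the olfactory bulb %%. These
    }
}
```

Pass "" instead of "%%" as the marker to drop the citations entirely before handing the text to the speech engine.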
 
Comments
john1990_1 6-Jul-17 4:22am    
Would you please try on this:

OriginalGriff 6-Jul-17 5:04am    
Answer updated
john1990_1 6-Jul-17 4:51am    
I made it with a "for" and scanned the text and it seems to work, I would be happy to see the answer though...
john1990_1 6-Jul-17 5:18am    
I use Expresso, sorry, the regex you gave doesn't work at all...
OriginalGriff 6-Jul-17 5:33am    
Answer updated

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


