I have thousands of lines of transcribed text data that look like this:
NOTE Confidence: 0.25722745

a786057b-398f-4c16-8408-17ef2d53e3b6
00:00:07.810 --> 00:00:08.520
Right?

NOTE Confidence: 0.90559334

133f78b8-c7ea-47a7-b714-e09f5411f162
00:00:09.570 --> 00:00:14.003
Alright, so if you don't mind if
you could just start, I start

NOTE Confidence: 0.90559334

33416ed1-681b-4598-90f9-50462483cb3c
00:00:14.003 --> 00:00:15.367
out by giving us.

NOTE Confidence: 0.84606636

40827527-7e85-425e-a833-d9aefffe1b23
00:00:16.900 --> 00:00:19.770
A brief introduction. I'll talk
a little bit about your

NOTE Confidence: 0.84606636

80ad5cae-bed2-4cfa-b61f-2ff5658a41f8
00:00:19.770 --> 00:00:22.640
background and your career
experience and sort of what has

NOTE Confidence: 0.84606636


Also, for some odd reason there are line breaks in the middle of sentences, so if I paste the text into Word the formatting is crazy. I'd like to go through and get rid of those as well.


What I desperately need to do is delete the junk and keep the words. I've only written code in MATLAB and have no idea where to start. Can anyone please help?

What I have tried:

Pasting into word
Exporting into Excel
Find/delete

I know that if I could write something to run from the command line it would work, but I don't have the coding skills in Visual Basic.
Updated 25-Mar-21 9:21am

It looks like you have three types of "noise":

- A string of hex digits and hyphens: hhhhhhhh-hhhh-hhhh-hhhh-hhhhhhhhhhhh
- Time intervals: nn:nn:nn.nnn --> nn:nn:nn.nnn
- Automated voice recognition confidence levels: NOTE Confidence: 0.nnnnnnnn

It's easy to write three functions, one to detect each kind of noise. I would read the file line by line, skip any line that one of the functions recognizes (along with the blank lines), and add everything else to the transcript. Before you add a line, you can also strip out any full stops and convert upper case to lower case if you want.
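
The approach described above could be sketched in Python roughly as follows. This is only an illustration, not a tested tool: the function names (`is_id`, `is_time_interval`, `is_confidence`, `extract_transcript`) are made up for this example, and the patterns are assumptions inferred from the sample in the question. The kept lines are joined with single spaces, which also removes the stray line breaks. Punctuation and case are left untouched here.

```python
import re

# Patterns inferred from the sample data in the question (assumptions).
ID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
TIME_RE = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
CONFIDENCE_RE = re.compile(r"NOTE Confidence: \d+(\.\d+)?")


def is_id(line):
    """True if the whole line is an 8-4-4-4-12 hex identifier."""
    return ID_RE.fullmatch(line) is not None


def is_time_interval(line):
    """True if the whole line is 'hh:mm:ss.mmm --> hh:mm:ss.mmm'."""
    return TIME_RE.fullmatch(line) is not None


def is_confidence(line):
    """True if the whole line is 'NOTE Confidence: <number>'."""
    return CONFIDENCE_RE.fullmatch(line) is not None


def extract_transcript(lines):
    """Drop blank and noise lines, join the rest with single spaces."""
    kept = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if is_id(line) or is_time_interval(line) or is_confidence(line):
            continue
        kept.append(line)
    return " ".join(kept)


sample = """NOTE Confidence: 0.25722745

a786057b-398f-4c16-8408-17ef2d53e3b6
00:00:07.810 --> 00:00:08.520
Right?

NOTE Confidence: 0.90559334

133f78b8-c7ea-47a7-b714-e09f5411f162
00:00:09.570 --> 00:00:14.003
Alright, so if you don't mind if
you could just start, I start
"""

print(extract_transcript(sample.splitlines()))
# prints: Right? Alright, so if you don't mind if you could just start, I start
```

To run this on a real file, the lines could come from something like `open("transcript.txt", encoding="utf-8")` instead of `sample.splitlines()`; the file name is a placeholder.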
 
 
Comments
Kim Drnec 26-Mar-21 5:01am    
Thank you all for your help

Maybe you can use this library: GitHub - antonmilev/CText: C++ advanced text processing library
You probably need to look at the RegEx routines; also see: best-regex-testing-tools
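
As a rough idea of what a regex-only solution might look like, here is a minimal Python sketch that uses the standard `re` module rather than the CText library mentioned above (so it does not show that library's API). The name `clean` and the pattern itself are assumptions based on the sample in the question: one combined expression matches a whole noise line, and whatever remains is collapsed onto a single line.

```python
import re

# One pattern for all three kinds of noise line (an assumption based on
# the sample data). In VERBOSE mode literal spaces must be escaped.
NOISE_RE = re.compile(
    r"""^(?:
        NOTE\ Confidence:\ [0-9.]+                                  # confidence note
      | [0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}        # hex identifier
      | \d{2}:\d{2}:\d{2}\.\d{3}\ -->\ \d{2}:\d{2}:\d{2}\.\d{3}     # time interval
    )[ \t\r]*$""",
    re.MULTILINE | re.VERBOSE,
)


def clean(text):
    """Remove noise lines, then collapse all whitespace runs to one space."""
    without_noise = NOISE_RE.sub("", text)
    return " ".join(without_noise.split())


sample = (
    "NOTE Confidence: 0.84606636\n"
    "\n"
    "40827527-7e85-425e-a833-d9aefffe1b23\n"
    "00:00:16.900 --> 00:00:19.770\n"
    "A brief introduction. I'll talk\n"
    "a little bit about your\n"
)

print(clean(sample))
# prints: A brief introduction. I'll talk a little bit about your
```

The pattern is anchored to whole lines, so a timestamp or number that appears inside a spoken sentence is left alone.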
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


