I'm writing a list of regular expressions to extract company names from text.
This is what the text looks like-->
Summer Intern
Genisup India Pvt. Ltd., Hosur, Tamil Nadu
June 2021 – Aug 2021 1⁄2 Remote
• Internship on the topic NLP: Topic Modeling to assign the
theme or topic for any news article on internet using Machine
Learning techniques.
• Worked on proxy rotation and Web Scraping
• Performed LDA Topic Modeling on "The Hindu" news articles
and obtained precision score of 0.906.
Intern Trainee
VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh
May 2021 – June 2021 1⁄2 Remote
• Built an OCR using Pytesseract and NER Text Classification
Model to categorize detected text into Name, E-mail Address,
Phone number and Date using NLTK,SpaCy and BERT
• Created an OCR for Handwritten text [A-Z, 0-9] using CNN
architecture
• Built a Face Recognition Model and Face Mask Detection
Model using OpenCV and Haar Cascade Classifier.
Expected output-->
['Genisup India Pvt. Ltd.' 'VUGS Technologies Pvt. Ltd.']
Observed output-->
['Genisup India Pvt. Ltd.' 'S Technologies Pvt. Ltd.']
Why is "VUGS" cut down to "S" instead of being matched in full?
What I have tried:
import re
import numpy as np

# `text` holds the resume excerpt shown above.
sub_patterns = [
    '[A-Z][a-z]* [A-Z][a-z]* Private Limited',
    '[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd.',
    '[A-Z][a-z]* [A-Z][a-z]* Inc.',
    '[A-Z][a-z]* [A-Z][a-z]* Corporation',
    '[A-Z][a-z]* [A-Z][a-z]* Technologies',
    '[A-Z][a-z]* [A-Z][a-z]* Company',
    '[A-Z][a-z]* [A-Z][a-z]* Solutions',
    '[A-Z][a-z]* [A-Z][a-z]* Services',
]
pattern = '({})'.format('|'.join(sub_patterns))
comp = re.findall(pattern, text)
comp_name = np.array(comp)
comp_un = np.unique(comp_name)
print(comp_un)
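
A possible explanation, sketched here rather than stated as the definitive fix: [A-Z][a-z]* accepts one capital letter followed only by lowercase letters, so in the all-caps word "VUGS" the only place a match can start is the final "S". The snippet below reproduces that on two lines from the excerpt and then tries a looser word pattern. The names word, suffixes and fixed are made up for illustration, and the idea that every company name is exactly two capitalised words plus a suffix is an assumption carried over from the patterns above.

```python
import re

# Two lines copied from the excerpt in the question, used as sample input.
text = (
    "Genisup India Pvt. Ltd., Hosur, Tamil Nadu\n"
    "VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh\n"
)

# Original style: one capital, then lowercase letters only.
# "V", "U" and "G" are each followed by another capital instead of a space,
# so the first position where the pattern can start is the "S".
original = r'[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd.'
truncated = re.findall(original, text)
print(truncated)  # ['Genisup India Pvt. Ltd.', 'S Technologies Pvt. Ltd.']

# Assumed fix: allow either case after the first capital, anchor the match
# at a word boundary, and escape the dots so they match a literal "." only.
word = r'[A-Z][A-Za-z]*'
suffixes = [
    r'Private Limited', r'Pvt\. Ltd\.', r'Inc\.', r'Corporation',
    r'Technologies', r'Company', r'Solutions', r'Services',
]
fixed = r'\b{w} {w} (?:{s})'.format(w=word, s='|'.join(suffixes))

# set() drops duplicates, sorted() gives a stable order (like np.unique).
companies = sorted(set(re.findall(fixed, text)))
print(companies)  # ['Genisup India Pvt. Ltd.', 'VUGS Technologies Pvt. Ltd.']
```

The suffix group is non-capturing, (?:...), so re.findall returns the whole match instead of just the suffix. Names with digits, ampersands or more than two words before the suffix would still be missed by this sketch.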