Hello! I am a researcher and I have a database of about 20,000 pages across 700+ PDFs. The PDFs are searchable at a rudimentary level, but I need a coding tool (a crawler?) that could quickly search through them. Additionally, it would be great if the software could filter out noise, visualize results, and aggregate data. Any suggestions for pre-existing software, or for where to get something custom?

What I have tried:

I've looked into OCR, but it doesn't seem to be quite what I'm looking for. I want something more like Kibana.
Comments
RedDk 7-Sep-18 18:56pm    
Why Kibana exactly, since you bring it up?

1 solution

You're talking "ETL" (extract, transform, load).

You're still only at the "extract" phase; the rest (filter, aggregate, visualize) only comes after that.

You need to be more specific about the content you're trying to pull out of those PDFs.

A "simple" "text" scanner can take a few minutes to develop; and even less to run.

(PDFs can contain embedded text, which is why yours are already searchable at a basic level.)

ANTLR, a parser and lexer generator, is one way to build that scanner: http://www.antlr.org/
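
If a full ANTLR grammar is more than you need, a plain PDF text library gets you to the same "raw text" starting point. Purely as an illustration (not part of the original answer), here is a minimal sketch of the "extract" step in Python, assuming the third-party pypdf package; the folder path and search term below are placeholders.

# Extract step: dump the embedded text of every PDF in a folder and
# report which pages mention a given term.
# Requires the third-party "pypdf" package (pip install pypdf).
from pathlib import Path
from pypdf import PdfReader

PDF_DIR = Path("./pdfs")        # placeholder: wherever the 700+ PDFs live
TERM = "example keyword"        # placeholder: whatever you are searching for

for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""   # empty if the page is image-only
        if TERM.lower() in text.lower():
            print(f"{pdf_path.name}  p.{page_number}")

Pages that come back empty are image-only scans; those are the ones where OCR would actually come into play.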

Once you've extracted the correct raw data, you can start transforming and filtering it.
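
As a rough sketch of what that next step could look like (again just an illustration with placeholder names, building on the extraction snippet above): count how often the term appears in each document and write the totals to a CSV, which a spreadsheet or a Kibana-style dashboard could then chart.

# Transform / filter step (sketch): aggregate hit counts per document
# into a CSV for later charting. Builds on the extraction snippet above;
# paths and the search term are placeholders.
import csv
from collections import Counter
from pathlib import Path
from pypdf import PdfReader

PDF_DIR = Path("./pdfs")
TERM = "example keyword"

hits = Counter()
for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
    for page in PdfReader(pdf_path).pages:
        text = (page.extract_text() or "").lower()
        hits[pdf_path.name] += text.count(TERM.lower())

with open("term_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["document", "occurrences"])
    for name, count in hits.most_common():
        writer.writerow([name, count])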