Is there a program or function that, for lack of a better term, eats books?

function Consume( "Lord of the Rings" ) {
 for( until end of book ) {
  letter found, analyzed, and documented.
  word found, analyzed, and documented.
  operator found, analyzed, and documented.
  }

 print( "Basic Info" );
  Language(s) found: []
  List of characters by popularity: []
  List of words in alphabetical order: []
  List of words by popularity: []
  List of words by (whatever) : []

  Number of letters found: big int
  Number of words found: big int
  Number of sentences found: big int

 "Advanced Info"
 "List of Names found: []"
 "List of Speech found: []"
}
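
A minimal runnable sketch of the "Basic Info" part of the pseudocode above, written in Python as an assumption (the pseudocode is not in any real language). The function name `consume`, the dictionary keys, and the sample sentence are made up for illustration. The sentence splitting is deliberately naive and will miscount abbreviations such as "Mr."; names and speech detection are not attempted.

```python
import re
from collections import Counter


def consume(text):
    """Collect basic letter, word, and sentence statistics for a text."""
    # Count letters case-insensitively; str.isalpha() is Unicode-aware.
    letters = Counter(ch.lower() for ch in text if ch.isalpha())
    # A "word" here is a run of letters, optionally with inner apostrophes.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    word_counts = Counter(words)
    # Naive sentence split: any run of '.', '!' or '?' ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "letters_by_popularity": letters.most_common(),
        "words_alphabetical": sorted(word_counts),
        "words_by_popularity": word_counts.most_common(),
        "letter_count": sum(letters.values()),
        "word_count": len(words),
        "sentence_count": len(sentences),
    }


sample = "The road goes ever on and on. Down from the door where it began!"
stats = consume(sample)
print(stats["words_by_popularity"][:3])
print(stats["word_count"], stats["sentence_count"])
```

For a whole book, the same function could be fed the contents of a text file read with `open(path, encoding="utf-8").read()`, assuming the digital copy is plain text.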


There are communities dedicated to building a better sorting algorithm, and some of their work is amazing, but before I tackle an overwhelming project: does such a function already exist? English is my native language, so working with [a-z] comes naturally, but how challenging is it to break down other languages? For example, could Chinese characters be broken down into pinyin, the way Japanese kanji can be written out in hiragana or romaji?

I recall the Star Wars movie being re-rendered in alphabetical order, but after a little research, it turns out its creator did a lot of that work manually. Voice recognition is much more challenging than mapping letter characters, but is that also true for other languages?

What I have tried:

I'm very new to databases and web programming. I wrote a program using PHP, JavaScript, and HTML that takes a list of words and adds them to the database; no fancy analysis yet. I thought I'd reach out to the community before reinventing the wheel in a very barbaric way. So far, my program has been incredibly slow for such a simple process.
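
One common cause of slow word-by-word inserts is committing (or making a separate round trip) for every single word, although nothing here confirms that this is the cause in this case. The sketch below shows the usual alternative: tally the words in memory, then write all rows in one transaction. It uses Python's built-in `sqlite3` module purely for illustration; the table name `words`, the column names, and the function name `store_word_counts` are hypothetical, and it assumes none of the words are already in the table.

```python
import sqlite3
from collections import Counter


def store_word_counts(conn, words):
    """Tally words in memory, then write all rows in a single transaction."""
    counts = Counter(w.lower() for w in words)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    # 'with conn' commits once at the end (or rolls back on error), so the
    # database is not asked to commit after every single word.
    with conn:
        conn.executemany(
            "INSERT INTO words (word, occurrences) VALUES (?, ?)",
            counts.items(),
        )
    return len(counts)


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
stored = store_word_counts(conn, ["on", "and", "on", "The", "the"])
rows = conn.execute(
    "SELECT word, occurrences FROM words ORDER BY occurrences DESC, word"
).fetchall()
print(stored, rows)
```

The same idea (one transaction, one batched statement) applies to PHP with a MySQL database, though the exact calls differ.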
Updated 21-Nov-18 0:56am
Comments
Richard MacCutchan 21-Nov-18 7:14am    
What do you mean by "eats books"? It would help if you explain your problem in clearer terms.
madifier 21-Nov-18 19:15pm    
Were you not capable of reading the rest of my question? A function that reads through a book (eats it), processes the characters it finds, and stores them in a database (digests it).
Richard MacCutchan 22-Nov-18 4:07am    
So where would you get the text of the book from?
madifier 23-Nov-18 21:23pm    
I'm referring to digital copies. You're making this harder than it really needs to be. The hypothetical function "Consume()" would accept a file, "Lord of the Rings", whether on the hard drive or at a URL, "read" through it, and process the information accordingly. Lowercase English letters are chr(97) through chr(122), i.e. 'a' through 'z'. I'm not familiar enough with other languages to identify their important characters. Other languages have different sentence structures too, so I can't simply look for the period character; I'd have to look for whatever they use. Maybe they don't have one. Some symbols are a combination of characters. A single Chinese character (hanzi) isn't anywhere close to English, where letters simply sit side by side. So the function 'Consume()', given 'Dream of the Red Chamber' as its argument, would identify the language as Chinese and sort the characters, words, and other data accordingly. I just wanted to know whether a function already exists that reads through a book and sorts the words by popularity, and perhaps a little more. The more languages, the better.
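
The points in the comment above about character codes, non-English characters, and sentence endings can be sketched in Python using only the standard library. This is a rough heuristic, not real language identification: it labels each letter by the first word of its Unicode character name, so it can tell Latin letters from CJK ideographs or hiragana, but it cannot tell English from French, or Chinese text from Japanese kanji. The function names `script_counts` and `split_sentences` are made up for this example, and the set of sentence terminators is an assumption covering only ASCII and the common full-width CJK marks.

```python
import re
import unicodedata
from collections import Counter


def script_counts(text):
    """Roughly tally letters by script, using the first word of each Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # "LATIN SMALL LETTER A" -> "LATIN"; "CJK UNIFIED IDEOGRAPH-4E00" -> "CJK"
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


def split_sentences(text):
    """Split on ASCII and full-width CJK sentence terminators (a naive rule)."""
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


# Lowercase English letters occupy code points 97 ('a') through 122 ('z').
print(ord("a"), ord("z"), chr(97), chr(122))

print(script_counts("Dream of the Red Chamber 红楼梦"))
print(script_counts("ひらがな"))
print(split_sentences("你好。再见！"))
```

Note that Chinese and Japanese are usually written without spaces between words, so splitting text into words for those languages needs a dedicated segmentation step that this sketch does not attempt.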
David_Wimbley 2-Dec-18 0:42am    
You need to check your attitude towards people attempting to understand your issue. Telling them they are making it harder than it needs to be is insulting. If you had been clearer the first time (as you were in the comment above), you might not be getting so frustrated because someone asked for clarification on an issue that isn't their problem to solve.

You use terminology that makes sense to you but sounds ridiculous to others. I've never heard parsing text (whether from a book or a PDF) described as "eating it".

Looking over your explanation, what you want to look into is called natural language processing (NLP):

https://en.wikipedia.org/wiki/Natural_language_processing

This is not a simple topic to master, and given that you are new to development, it is likely far beyond what you are ready for just yet. People get PhDs in this line of data science work, so be prepared to dig heavily into this topic.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


