I am trying to classify text files

Question

I have 15 text files containing articles on different topics, and another 5 text files, each containing token words for one topic: politics, sports, weather, literature, or history. I want full (not just a sample) C++ code that reads the five token files and the 15 article files, classifies each article under the appropriate topic using the token files, and prints each article together with the token file it was classified to. The classification should be based on the frequency of each word in the article compared with its frequency in the token file.

What I have tried:

#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

using namespace std;

const int N = 20; // number of text files
const int M = 5;  // number of token files

// function to read a text file and return a vector of words
vector<string> readTextFile(string filename) {
    vector<string> words;
    string word;
    ifstream infile(filename);
    while (infile >> word) {
        words.push_back(word);
    }
    return words;
}

// function to read a token file and return a map of words to their frequencies
unordered_map<string, int> readTokenFile(string filename) {
    unordered_map<string, int> wordFreqs;
    string word;
    int freq;
    ifstream infile(filename);
    while (infile >> word >> freq) {
        wordFreqs[word] = freq;
    }
    return wordFreqs;
}

int main() {
    // read all the text files and store them in a vector
    vector<vector<string>> textFiles(N);
    for (int i = 0; i < N; i++) {
        string filename = "text" + to_string(i) + ".txt";
        textFiles[i] = readTextFile(filename);
    }

    // read all the token files and store them in a vector
    vector<unordered_map<string, int>> tokenFiles(M);
    for (int i = 0; i < M; i++) {
        string filename = "token" + to_string(i) + ".txt";
        tokenFiles[i] = readTokenFile(filename);
    }

    // classify each text file based on the token files
    for (int i = 0; i < N; i++) {
        // calculate the similarity score of the text file with each token file
        vector<double> scores(M);
        for (int j = 0; j < M; j++) {
            double score = 0;
            for (string word : textFiles[i]) {
                if (tokenFiles[j].count(word)) {
                    score += tokenFiles[j][word];
                }
            }
            scores[j] = score;
        }

        // find the token file with the highest similarity score
        int maxIdx = 0;
        for (int j = 1; j < M; j++) {
            if (scores[j] > scores[maxIdx]) {
                maxIdx = j;
            }
        }

        // print the text file and the token file it was classified to
        cout << "text" << i << ".txt" << " classified to token" << maxIdx << ".txt" << endl;
    }

    return 0;
}

Posted 28-Dec-22 5:31am

عصام مجدي السيد عبدالفتاح G 7 2022

Updated 28-Dec-22 6:15am

Comments

OriginalGriff 28-Dec-22 11:50am

And?
What does it do that you didn't expect, or not do that you did?
What have you tried to do to find out why?
Are there any error messages, and if so, where and when? What did you do to make them happen?

This is not a good question - we cannot work out from that little what you are trying to do.
Remember that we can't see your screen, access your HDD, or read your mind - we only get exactly what you type to work with.
Use the "Improve question" widget to edit your question and provide better information.

عصام مجدي السيد عبدالفتاح G 7 2022 28-Dec-22 11:53am

This is the output, but I cannot find the problem. It runs without any syntax errors:
text0.txt classified to token0.txt
text1.txt classified to token0.txt
text2.txt classified to token0.txt
text3.txt classified to token0.txt
text4.txt classified to token0.txt
text5.txt classified to token0.txt
text6.txt classified to token0.txt
text7.txt classified to token0.txt
text8.txt classified to token0.txt
text9.txt classified to token0.txt
text10.txt classified to token0.txt
text11.txt classified to token0.txt
text12.txt classified to token0.txt
text13.txt classified to token0.txt
text14.txt classified to token0.txt
text15.txt classified to token0.txt
text16.txt classified to token0.txt
text17.txt classified to token0.txt
text18.txt classified to token0.txt
text19.txt classified to token0.txt

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2022-12-28T06:12:00

Quote:
this the output but i can not find the problem, it run with out any syntax error :
text0.txt classified to token0.txt
text1.txt classified to token0.txt
text2.txt classified to token0.txt
text3.txt classified to token0.txt
text4.txt classified to token0.txt
text5.txt classified to token0.txt
text6.txt classified to token0.txt
text7.txt classified to token0.txt
text8.txt classified to token0.txt
text9.txt classified to token0.txt
text10.txt classified to token0.txt
text11.txt classified to token0.txt
text12.txt classified to token0.txt
text13.txt classified to token0.txt
text14.txt classified to token0.txt
text15.txt classified to token0.txt
text16.txt classified to token0.txt
text17.txt classified to token0.txt
text18.txt classified to token0.txt
text19.txt classified to token0.txt

Syntax errors only happen when you compile: if you have any syntax errors an executable is not produced and so you cannot run it.

But syntax errors are the tip of the iceberg: compiling successfully does not mean your code is right!
Think of the development process as writing an email: compiling successfully means that you wrote the email in the right language - English, rather than German for example - not that the email contained the message you wanted to send.

So now you enter the second stage of development (in reality it's the fourth or fifth, but you'll come to the earlier stages later): Testing and Debugging.

Start by looking at what it does do, and how that differs from what you wanted. This is important, because it gives you information as to why it's doing it. For example, if a program is intended to let the user enter a number and it doubles it and prints the answer, then if the input / output was like this:

Input   Expected output    Actual output
  1            2                 1
  2            4                 4
  3            6                 9
  4            8                16

Then it's fairly obvious that the problem is with the bit which doubles it - it's not adding itself to itself, or multiplying it by 2, it's multiplying it by itself and returning the square of the input.
So with that, you can look at the code and it's obvious that it's somewhere here:

C++

int Double(int value)
   {
   return value * value;
   }
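
The expected-versus-actual comparison in the table can also be automated. The following is only a sketch: `Double` is the function shown in this answer with its bug left in on purpose, while `failingInputs` and `assertDoubleIsFixed` are hypothetical helper names introduced here for illustration.

```cpp
#include <cassert>
#include <vector>

// The function from the answer, with the bug deliberately left in.
int Double(int value) {
    return value * value;
}

// Hypothetical helper: collects every input for which Double(x) differs
// from the expected result 2 * x.
std::vector<int> failingInputs(const std::vector<int>& inputs) {
    std::vector<int> failures;
    for (int x : inputs) {
        if (Double(x) != 2 * x) {
            failures.push_back(x);
        }
    }
    return failures;
}

// Hypothetical debug check: aborts in debug builds while the bug is present,
// and passes silently once Double returns value * 2.
void assertDoubleIsFixed() {
    assert(failingInputs({1, 2, 3, 4}).empty());
}
```

With the buggy version, `failingInputs({1, 2, 3, 4})` reports 1, 3 and 4. The input 2 passes only because 2 * 2 and 2 + 2 happen to agree, which is why a single test value is rarely enough.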

Once you have an idea what might be going wrong, start using the debugger to find out why. Put a breakpoint on the first line of the method, and run your app. When it reaches the breakpoint, the debugger will stop, and hand control over to you. You can now run your code line-by-line (called "single stepping") and look at (or even change) variable contents as necessary (heck, you can even change the code and try again if you need to).
Think about what each line in the code should do before you execute it, and compare that to what it actually did when you use the "Step over" button to execute each line in turn. Did it do what you expect? If so, move on to the next line.
If not, why not? How does it differ?
Hopefully, that should help you locate which part of that code has a problem, and what the problem is.
This is a skill, and it's one which is well worth developing as it helps you in the real world as well as in development. And like all skills, it only improves by use!
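
As one possible way to apply this to the program in the question, the sketch below reuses the parsing loop from `readTokenFile` but reads from any `std::istream`, so its behaviour can be checked without files on disk. The names `parseTokens`, `wordsOnlyParsesEmpty` and `requireUsable` are hypothetical and not part of the original code. It is only a guess about the data, since the token files were not posted, but if a token file holds words with no counts, `in >> word >> freq` fails on the first read and the map stays empty. Every score is then 0 and `maxIdx` never moves from 0, which would match the posted output. A file that fails to open has the same effect.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Same loop as readTokenFile in the question, but reading from any stream.
std::unordered_map<std::string, int> parseTokens(std::istream& in) {
    std::unordered_map<std::string, int> wordFreqs;
    std::string word;
    int freq;
    // Stops at the first entry that is not a "word number" pair.
    while (in >> word >> freq) {
        wordFreqs[word] = freq;
    }
    return wordFreqs;
}

// Demonstration: a words-only token list parses to an empty table,
// because the second word cannot be read as an int.
bool wordsOnlyParsesEmpty() {
    std::istringstream wordsOnly("vote election parliament");
    return parseTokens(wordsOnly).empty();
}

// Hypothetical debug check: aborts (in debug builds) if nothing was parsed,
// because scoring against an empty table always gives 0.
void requireUsable(const std::unordered_map<std::string, int>& tokens) {
    assert(!tokens.empty() && "token table is empty: check the file name and format");
}
```

Calling something like `requireUsable` on each loaded table, or simply printing its `size()`, is a quick first check before stepping through the scoring loop in the debugger.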

CPallini · Answer 2 · 2022-12-28T06:15:00

As far as I can understand, the "classify each text file based on the token files" step is wrong.
There you should compute the frequency of the words in the articles (that is, for each article, create a map similar to the one used for the token files) and then compare such frequencies with the ones listed in the token files.
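
A minimal sketch of that idea follows. It makes several assumptions: the answer does not name a comparison metric, so cosine similarity is used here purely as an illustration, and `FreqMap`, `countWords`, `cosineSimilarity` and `classify` are hypothetical names that do not appear in the question. It also returns -1 when an article shares no words with any token file, instead of silently defaulting to the first one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using FreqMap = std::unordered_map<std::string, int>;

// Build a word -> count map for one article, mirroring the token-file maps.
FreqMap countWords(const std::vector<std::string>& words) {
    FreqMap freqs;
    for (const std::string& w : words) {
        ++freqs[w];
    }
    return freqs;
}

// Cosine similarity between two frequency maps (an assumed metric).
// Returns 0 when either map is empty or the maps share no words.
double cosineSimilarity(const FreqMap& a, const FreqMap& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& entry : a) {
        normA += static_cast<double>(entry.second) * entry.second;
        const auto it = b.find(entry.first);
        if (it != b.end()) {
            dot += static_cast<double>(entry.second) * it->second;
        }
    }
    for (const auto& entry : b) {
        normB += static_cast<double>(entry.second) * entry.second;
    }
    if (normA == 0.0 || normB == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Index of the best-matching topic, or -1 if no topic scores above 0.
int classify(const FreqMap& article, const std::vector<FreqMap>& topics) {
    assert(!topics.empty());
    int best = -1;
    double bestScore = 0.0;
    for (std::size_t j = 0; j < topics.size(); ++j) {
        const double score = cosineSimilarity(article, topics[j]);
        if (score > bestScore) {
            bestScore = score;
            best = static_cast<int>(j);
        }
    }
    return best;
}
```

In the question's `main`, each `textFiles[i]` could be passed through `countWords` and the result handed to `classify` together with `tokenFiles`. Normalising by the length of both vectors keeps a token file with large counts from winning simply because of its size.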
