I have a large amount of data (more than 10,000 rows) stored in a text file, like the following:
ID  Name  employment age  work_experience_year
12    XXX   UUU        22    BBBB

and so on. One method I tried in C# is to read all the lines, put them in a 2D array, and then do a linear search through the lines. I faced two problems: first, reading all the lines increases memory usage and slows the computer down; second, searching takes a long time.

I want to be able to search by any column: age, name, and so on.
What method can I use that doesn't require storing everything in a 2D array?
Also, would using SQL be better, and if so, what should I use?

What I have tried:

The linear search over a 2D array was too slow.
Posted
Updated 22-Aug-20 6:28am
Comments
PIEBALDconsult 21-Aug-20 21:07pm    
Put the data in a DataTable and use the RowFilter property of the DefaultView.
PIEBALDconsult 22-Aug-20 0:22am    
You can also build any number of indices using Dictionaries.
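A minimal sketch of that Dictionary-index idea; the row type and sample values here are illustrative, not taken from the question's actual file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative row type and sample data (not the question's real file)
public class Person
{
    public int Id;
    public string Name;
    public string Employment;
    public int Age;
}

public static class Demo
{
    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, Name = "XXX",  Employment = "UUU",     Age = 22 },
            new Person { Id = 2, Name = "John", Employment = "Manager", Age = 40 },
        };

        // One pass builds an index per searchable column; afterwards each
        // lookup is an O(1) hash probe instead of a full scan of the rows.
        Dictionary<string, List<Person>> byName =
            people.GroupBy(p => p.Name).ToDictionary(g => g.Key, g => g.ToList());
        Dictionary<int, List<Person>> byAge =
            people.GroupBy(p => p.Age).ToDictionary(g => g.Key, g => g.ToList());

        Console.WriteLine(byName["John"][0].Employment); // prints "Manager"
        Console.WriteLine(byAge[22].Count);              // prints 1
    }
}
```

The trade-off is memory: each index holds references to every row, so you pay for each extra searchable column up front at load time.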

Quote:
I have large data stored in text file

Yes, text files and CSV files are good for space-efficient storage, but not for fast data retrieval.
That is the reason databases exist: they are less space efficient, but they are fast at retrieving data.
Quote:
Also, would using SQL be better, and if so, what should I use?

SQL or not is your choice among the possible solutions.
You need to learn about databases; it is a vast subject that we can't teach you in the space of this forum.
Quote:
The linear search over a 2D array was too slow.

A 2D array in memory is only viable if you combine it with database techniques such as indexes.
Quote:
more than 10,000

Databases commonly handle millions of records without noticeable slowdown.
 
Comments
Member 12224804 23-Aug-20 18:19pm    
Ah OK, so a database is the best solution? But which do you recommend (SQL Server, MySQL, ...) for a WPF C# app, and for a Python web-based app?
That depends on what you are prepared to accept. 10,000 rows isn't that many, so loading them into memory shouldn't be too bad.
I can read 10,000 rows from a CSV in 312ms on an i5-3330 @ 2.00GHz with 8GB RAM, and 1,000,000 in 18,994ms (the latter used just 345MB of process memory).
That's using CsvHelper[^] which may be my new favourite CSV reader!
C#
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;   // NuGet: CsvHelper

List<Foo> records;
using (var reader = new StreamReader(@"D:\Test data\SR1M.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    records = csv.GetRecords<Foo>().ToList();
}
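The Foo class isn't shown in the answer; a minimal sketch, assuming properties matching the columns used in the search below. The sample CSV's headers contain spaces (e.g. "Item Type"), so CsvHelper's [Name] attribute is used to map them; the exact header text is an assumption about the file:

```csharp
using System;
using CsvHelper.Configuration.Attributes;   // NuGet: CsvHelper

// Hypothetical shape of Foo: property names inferred from the search
// criteria below; the [Name] mappings assume the sample file's headers.
public class Foo
{
    public string Region { get; set; }
    public string Country { get; set; }
    [Name("Item Type")] public string ItemType { get; set; }
    [Name("Sales Channel")] public string SalesChannel { get; set; }
    [Name("Order Priority")] public string OrderPriority { get; set; }
    [Name("Order Date")] public DateTime OrderDate { get; set; }
}
```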

For searching, I don't see any problem: in the 10K-row data set it can find a record in 1ms:
C#
Stopwatch s2 = new Stopwatch();
s2.Start();
Foo found = records.FirstOrDefault(r => r.Region == "Asia" &&
                                        r.Country == "Mongolia" &&
                                        r.ItemType == "Snacks" &&
                                        r.SalesChannel == "Offline" &&
                                        r.OrderPriority == "M" &&
                                        r.OrderDate == new DateTime(2016, 4, 12));
s2.Stop();
Console.WriteLine(s2.ElapsedMilliseconds);

For the 1,000,000 rows version, it took 75ms to find the same row.
In both cases, it was irrelevant where in the file the row was located.

Testing was performed using the 10K and 1M rows sample CSV's from here: Downloads 18 - Sample CSV Files / Data Sets for Testing (till 1.5 Million Records) - Sales | E for Excel | Awakening Microsoft Excel Student Inside You[^]

I suspect any significantly excessive time or memory usage is down to how you are reading and processing the data, not an intrinsic feature of the file based data format!

Do bear in mind that switching to something like SQL Server will require a lot of resources - disk, memory, and processor - so if your existing code is running on poorly specced hardware, that won't improve the problem!
 
 
Comments
PIEBALDconsult 22-Aug-20 8:52am    
For reading CSV I've always used my Rive function, but recently I learned that .NET has a CSV reader/writer built in.
George Swan 23-Aug-20 11:51am    
Have you tried the NuGet package SoftCircuits.CsvParser? It has a very similar modus operandi to CsvHelper but allocates less memory and is about 25% faster on reads.
In addition to the other solutions, you can use Lucene indexing to build an index of the data and then search for it. Lucene is powerful but a bit clunky to use.

CSV file

ID,Name,Employment,Age,Work_experience_year
1,XXX,UUU,22,BBBB
2,John,Manager,40,CCCC
3,Dave,CEO,50,DDDD
4,Peter Jones,Janitor,32,EEEE
5,John Simon,Tester,43,FFFF


Class to hold each row of data

C#
public class RowData
{
    public int ID { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
    public string Employment { get; set; }
}


Indexing function

C#
static void IndexFile(string filename)
{
    var AppLuceneVersion = Lucene.Net.Util.Version.LUCENE_30;

    var indexLocation = @"C:\temp\Index";
    var dir = FSDirectory.Open(indexLocation);

    using var fileStream = new StreamReader(filename);
    // Read and discard the header row
    string line = fileStream.ReadLine();

    var splitChar = new char[] { ',' };

    using (var analyzer = new StandardAnalyzer(AppLuceneVersion))
    using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        writer.DeleteAll();
        while ((line = fileStream.ReadLine()) != null)
        {
            var values = line.Split(splitChar);

            Document document = new Document();

            document.Add(new Field("id", values[0], Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            document.Add(new Field("name", values[1], Field.Store.NO, Field.Index.ANALYZED));
            document.Add(new Field("name_exact", values[1], Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.Add(new Field("employment", values[2], Field.Store.YES, Field.Index.ANALYZED));
            var ageField = new NumericField("age", Field.Store.YES, true);
            ageField.OmitNorms = false;
            ageField.OmitTermFreqAndPositions = false;
            ageField.SetLongValue(long.Parse(values[3]));
            document.Add(ageField);

            writer.AddDocument(document);
        }
    }
}


Search function

C#
static List<RowData> Search(int id = 0, string name = null, string exactName = null, int age = 0, int maxAge = 0)
{
    var indexLocation = @"C:\temp\Index";
    var dir = FSDirectory.Open(indexLocation);

    IndexSearcher searcher = new IndexSearcher(dir);

    var query = new BooleanQuery();

    if (id > 0)
    {
        Term searchTerm = new Term("id", id.ToString());
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (!string.IsNullOrWhiteSpace(name))
    {
        Term searchTerm = new Term("name", name.ToLowerInvariant());
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (!string.IsNullOrWhiteSpace(exactName))
    {
        Term searchTerm = new Term("name_exact", exactName);
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (age > 0 && maxAge == 0)
    {
        Term searchTerm = new Term("age", NumericUtils.LongToPrefixCoded(age));
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (age > 0 && maxAge > 0)
    {
        // search for any age between age and maxAge, inclusive
        var nrq = NumericRangeQuery.NewLongRange("age", age, maxAge, true, true);
        query.Add(nrq, Occur.MUST);
    }

    var hits = searcher.Search(query, 1000);

    var results = new List<RowData>();

    foreach (var hit in hits.ScoreDocs)
    {
        var doc = searcher.Doc(hit.Doc);

        results.Add(new RowData {
            ID = int.Parse(doc.Get("id")),
            Name = doc.Get("name_exact"),
            Employment = doc.Get("employment"),
            Age = int.Parse(doc.Get("age"))
        });
    }

    return results;
}


Usage

C#
static void Main(string[] args)
{
    // You only need to index the file once, or every time it changes
    IndexFile(@"C:\Temp\data.csv");

    // Search by ID
    var results = Search(id: 1);

    ShowResults("Search by ID", results);

    // Do a partial search on a name
    results = Search(name: "John");

    ShowResults("Search by name", results);

    // Do a literal search on a name
    results = Search(exactName: "John Simon");

    ShowResults("Search by exact name", results);

    // Do a search for an exact age
    results = Search(age: 22);

    ShowResults("Search by age", results);

    // Do a search for an age range
    results = Search(age: 30, maxAge: 40);

    ShowResults("Search by age range", results);

    // Or combine any params you want
    results = Search(name: "Dave", age: 30, maxAge: 999);

    ShowResults("Search by name and age", results);

    Console.ReadLine();
}

static void ShowResults(string name, List<RowData> results)
{
    Console.WriteLine($"\r\n{name}\r\n");

    foreach (var result in results)
    {
        Console.WriteLine($"{result.ID} {result.Name} ({result.Age}), {result.Employment}");
    }
}
 
 
Comments
Member 12224804 2-Sep-20 9:14am    
So you made an index file? Or do you read the whole file into memory?
F-ES Sitecore 2-Sep-20 10:48am    
Yeah, it puts a bunch of files in the directory you specify (not sure I mentioned it, but "temp\index" is a directory, not a file), so it's persistent. When you run queries, they run against the index in that folder; it's not loading all the data into memory. These indexes can hold many millions of rows.
Member 12224804 2-Sep-20 17:32pm    
Ah OK, yes, I see it does the indexing just one time, since the data won't change. This is great in my case.
Also, if I want to use a database instead, which one is better for offline and online servers?
F-ES Sitecore 3-Sep-20 7:02am    
For your uses you'll probably want something like SQLite, as it's lighter weight, redistributable, etc. When you go to "proper" databases like SQL Server, you need to install them and so on; they're just trickier to get going.
Member 12224804 3-Sep-20 8:22am    
So can I use SQLite for both an offline and an online server? My software will be web-based and also an offline desktop app, with the same database.
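For reference, here is a minimal sketch of the SQLite route using the Microsoft.Data.Sqlite NuGet package; the file name, table layout, and sample values are all illustrative:

```csharp
using Microsoft.Data.Sqlite;   // NuGet: Microsoft.Data.Sqlite

// Load the rows into a SQLite file once, then query by any column.
using var conn = new SqliteConnection("Data Source=people.db");
conn.Open();

using (var create = conn.CreateCommand())
{
    // An index on age makes searches by that column fast, just like
    // the Lucene index above does for its fields.
    create.CommandText =
        @"CREATE TABLE IF NOT EXISTS people (
              id INTEGER PRIMARY KEY, name TEXT, employment TEXT, age INTEGER);
          CREATE INDEX IF NOT EXISTS ix_people_age ON people(age);";
    create.ExecuteNonQuery();
}

using (var insert = conn.CreateCommand())
{
    insert.CommandText =
        "INSERT INTO people (name, employment, age) VALUES ($n, $e, $a)";
    insert.Parameters.AddWithValue("$n", "John");
    insert.Parameters.AddWithValue("$e", "Manager");
    insert.Parameters.AddWithValue("$a", 40);
    insert.ExecuteNonQuery();
}

using (var query = conn.CreateCommand())
{
    query.CommandText = "SELECT name FROM people WHERE age = $a";
    query.Parameters.AddWithValue("$a", 40);
    using var reader = query.ExecuteReader();
    while (reader.Read())
        System.Console.WriteLine(reader.GetString(0));
}
```

The same .db file can back both the desktop app and a small web app, but SQLite allows only one writer at a time, so for a busy online server a client-server database (SQL Server, MySQL, PostgreSQL) is usually the better fit.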

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)