I have a large amount of data (more than 10,000 rows) stored in a text file, like the following:
ID  Name  employment age  work_experience_year
12    XXX   UUU        22    BBBB

and so on. One method I tried in C# is to read all the lines, put them in a 2D array, and then do a linear search through the lines. I faced two problems: first, reading all the lines increases memory usage and slows the computer down; second, searching takes a long time.

I want to be able to search by any column: age, name, and so on.
What method can I use that doesn't require storing everything in a 2D array?
Also, would using SQL be better, and if so, what should I use?

What I have tried:

The linear search over a 2D array was too slow.
Posted
Updated 22-Aug-20 6:28am
Comments
PIEBALDconsult 21-Aug-20 21:07pm    
Put the data in a DataTable and use the RowFilter property of the DefaultView.
PIEBALDconsult 22-Aug-20 0:22am    
You can also build any number of indices using Dictionaries.
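A minimal sketch of that Dictionary-index idea; the row type and sample values here are illustrative, not taken from the question's actual file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative row type and sample data (not the question's real file)
public class Person
{
    public int Id;
    public string Name;
    public string Employment;
    public int Age;
}

public static class Demo
{
    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, Name = "XXX",  Employment = "UUU",     Age = 22 },
            new Person { Id = 2, Name = "John", Employment = "Manager", Age = 40 },
        };

        // One pass builds an index per searchable column; afterwards each
        // lookup is an O(1) hash probe instead of a full scan of the rows.
        Dictionary<string, List<Person>> byName =
            people.GroupBy(p => p.Name).ToDictionary(g => g.Key, g => g.ToList());
        Dictionary<int, List<Person>> byAge =
            people.GroupBy(p => p.Age).ToDictionary(g => g.Key, g => g.ToList());

        Console.WriteLine(byName["John"][0].Employment); // prints "Manager"
        Console.WriteLine(byAge[22].Count);              // prints 1
    }
}
```

The trade-off is memory: each index holds references to every row, so you pay for each extra searchable column up front at load time.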

Quote:
I have large data stored in text file

Yes, text files and CSV files are good for space-efficient storage, but not for fast data retrieval.
That is the reason databases exist: they are less space efficient, but they are fast at retrieving data.
Quote:
Also, would using SQL be better, and if so, what should I use?

SQL or not is your choice among the possible solutions.
You need to learn about databases; it is a vast subject that we can't teach you in the space of this forum.
Quote:
The linear search over a 2D array was too slow.

A 2D array in memory is only viable if you combine it with database techniques such as indexes.
Quote:
more than 10,000

Databases commonly handle millions of records without noticeable slowdown.
 
Comments
Member 12224804 23-Aug-20 18:19pm    
Ah OK, so a database is the best solution? But which do you recommend (SQL Server, MySQL, ...) for a WPF C# app, and for a Python web-based app?
That depends on what you are prepared to accept. 10,000 rows isn't that many, so loading them into memory shouldn't be too bad.
I can read 10,000 rows from a CSV in 312ms on an i5-3330 @ 2.00GHz with 8GB RAM, and 1,000,000 in 18,994ms (the latter used just 345MB of process memory).
That's using CsvHelper[^] which may be my new favourite CSV reader!
C#
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;   // NuGet: CsvHelper

List<Foo> records;
using (var reader = new StreamReader(@"D:\Test data\SR1M.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    records = csv.GetRecords<Foo>().ToList();
}
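The Foo class isn't shown in the answer; a minimal sketch, assuming properties matching the columns used in the search below. The sample CSV's headers contain spaces (e.g. "Item Type"), so CsvHelper's [Name] attribute is used to map them; the exact header text is an assumption about the file:

```csharp
using System;
using CsvHelper.Configuration.Attributes;   // NuGet: CsvHelper

// Hypothetical shape of Foo: property names inferred from the search
// criteria below; the [Name] mappings assume the sample file's headers.
public class Foo
{
    public string Region { get; set; }
    public string Country { get; set; }
    [Name("Item Type")] public string ItemType { get; set; }
    [Name("Sales Channel")] public string SalesChannel { get; set; }
    [Name("Order Priority")] public string OrderPriority { get; set; }
    [Name("Order Date")] public DateTime OrderDate { get; set; }
}
```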

For searching, I don't see any problem: in the 10K-row data set it can find a record in 1ms:
C#
Stopwatch s2 = new Stopwatch();
s2.Start();
Foo found = records.FirstOrDefault(r => r.Region == "Asia" &&
                                        r.Country == "Mongolia" &&
                                        r.ItemType == "Snacks" &&
                                        r.SalesChannel == "Offline" &&
                                        r.OrderPriority == "M" &&
                                        r.OrderDate == new DateTime(2016, 4, 12));
s2.Stop();
Console.WriteLine(s2.ElapsedMilliseconds);

For the 1,000,000 rows version, it took 75ms to find the same row.
In both cases, it was irrelevant where in the file the row was located.

Testing was performed using the 10K and 1M rows sample CSV's from here: Downloads 18 - Sample CSV Files / Data Sets for Testing (till 1.5 Million Records) - Sales | E for Excel | Awakening Microsoft Excel Student Inside You[^]

I suspect any significantly excessive time or memory usage is down to how you are reading and processing the data, not an intrinsic feature of the file based data format!

Do bear in mind that switching to something like SQL Server will require a lot of resources - disk, memory, and processor - so if your existing code is running on poorly specced hardware, that won't improve the problem!
 
 
Comments
PIEBALDconsult 22-Aug-20 8:52am    
For reading CSV I've always used my Rive function, but recently I learned that .NET has a CSV reader/writer built in.
George Swan 23-Aug-20 11:51am    
Have you tried the NuGet package SoftCircuits.CsvParser? It has a very similar modus operandi to CsvHelper but allocates less memory and is about 25% faster on reads.
In addition to the other solutions, you can use Lucene indexing to build an index of the data and then search for it. Lucene is powerful but a bit clunky to use.

CSV file

ID,Name,Employment,Age,Work_experience_year
1,XXX,UUU,22,BBBB
2,John,Manager,40,CCCC
3,Dave,CEO,50,DDDD
4,Peter Jones,Janitor,32,EEEE
5,John Simon,Tester,43,FFFF


Class to hold each row of data

C#
public class RowData
{
    public int ID { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
    public string Employment { get; set; }
}


Indexing function

C#
static void IndexFile(string filename)
{
    var AppLuceneVersion = Lucene.Net.Util.Version.LUCENE_30;

    var indexLocation = @"C:\temp\Index";
    var dir = FSDirectory.Open(indexLocation);

    using var fileStream = new StreamReader(filename);
    // Read and discard the header row
    string line = fileStream.ReadLine();

    var splitChar = new char[] { ',' };

    using (var analyzer = new StandardAnalyzer(AppLuceneVersion))
    using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        writer.DeleteAll();
        while ((line = fileStream.ReadLine()) != null)
        {
            var values = line.Split(splitChar);

            Document document = new Document();

            document.Add(new Field("id", values[0], Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            document.Add(new Field("name", values[1], Field.Store.NO, Field.Index.ANALYZED));
            document.Add(new Field("name_exact", values[1], Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.Add(new Field("employment", values[2], Field.Store.YES, Field.Index.ANALYZED));
            var ageField = new NumericField("age", Field.Store.YES, true);
            ageField.OmitNorms = false;
            ageField.OmitTermFreqAndPositions = false;
            ageField.SetLongValue(long.Parse(values[3]));
            document.Add(ageField);

            writer.AddDocument(document);
        }
    }
}


Search function

C#
static List<RowData> Search(int id = 0, string name = null, string exactName = null, int age = 0, int maxAge = 0)
{
    var indexLocation = @"C:\temp\Index";
    var dir = FSDirectory.Open(indexLocation);

    IndexSearcher searcher = new IndexSearcher(dir);

    var query = new BooleanQuery();

    if (id > 0)
    {
        Term searchTerm = new Term("id", id.ToString());
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (!string.IsNullOrWhiteSpace(name))
    {
        Term searchTerm = new Term("name", name.ToLowerInvariant());
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (!string.IsNullOrWhiteSpace(exactName))
    {
        Term searchTerm = new Term("name_exact", exactName);
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (age > 0 && maxAge == 0)
    {
        Term searchTerm = new Term("age", NumericUtils.LongToPrefixCoded(age));
        query.Add(new BooleanClause(new TermQuery(searchTerm), Occur.MUST));
    }

    if (age > 0 && maxAge > 0)
    {
        // search for any age between age and maxAge, inclusive
        var nrq = NumericRangeQuery.NewLongRange("age", age, maxAge, true, true);
        query.Add(nrq, Occur.MUST);
    }

    var hits = searcher.Search(query, 1000);

    var results = new List<RowData>();

    foreach (var hit in hits.ScoreDocs)
    {
        var doc = searcher.Doc(hit.Doc);

        results.Add(new RowData {
            ID = int.Parse(doc.Get("id")),
            Name = doc.Get("name_exact"),
            Employment = doc.Get("employment"),
            Age = int.Parse(doc.Get("age"))
        });
    }

    return results;
}


Usage

C#
static void Main(string[] args)
{
    // You only need to index the file once, or every time it changes
    IndexFile(@"C:\Temp\data.csv");

    // Search by ID
    var results = Search(id: 1);

    ShowResults("Search by ID", results);

    // Do a partial search on a name
    results = Search(name: "John");

    ShowResults("Search by name", results);

    // Do a literal search on a name
    results = Search(exactName: "John Simon");

    ShowResults("Search by exact name", results);

    // Do a search for an exact age
    results = Search(age: 22);

    ShowResults("Search by age", results);

    // Do a search for an age range
    results = Search(age: 30, maxAge: 40);

    ShowResults("Search by age range", results);

    // Or combine any params you want
    results = Search(name: "Dave", age: 30, maxAge: 999);

    ShowResults("Search by name and age", results);

    Console.ReadLine();
}

static void ShowResults(string name, List<RowData> results)
{
    Console.WriteLine($"\r\n{name}\r\n");

    foreach (var result in results)
    {
        Console.WriteLine($"{result.ID} {result.Name} ({result.Age}), {result.Employment}");
    }
}
 
 
Comments
Member 12224804 2-Sep-20 9:14am    
So you made an index file? Or do you read the whole file into memory?
F-ES Sitecore 2-Sep-20 10:48am    
Yeah, it puts a bunch of files in the directory you specify (not sure I mentioned it, but "temp\index" is a directory, not a file), so it's persistent. When you run queries, they run against the index in that folder; it's not loading all the data into memory. These indexes can hold many millions of rows.
Member 12224804 2-Sep-20 17:32pm    
Ah OK, yes, I see it does the indexing just one time, since the data won't change. This is great in my case.
Also, if I want to use a database instead, which one is better for offline and online servers?
F-ES Sitecore 3-Sep-20 7:02am    
For your uses you'll probably want something like SQLite, as it's lighter weight, redistributable, etc. When you go to "proper" databases like SQL Server, you need to install them and so on; they're just trickier to get going.
Member 12224804 3-Sep-20 8:22am    
So can I use SQLite for both an offline and an online server? My software will be web-based and also an offline desktop app, with the same database.
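For reference, here is a minimal sketch of the SQLite route using the Microsoft.Data.Sqlite NuGet package; the file name, table layout, and sample values are all illustrative:

```csharp
using Microsoft.Data.Sqlite;   // NuGet: Microsoft.Data.Sqlite

// Load the rows into a SQLite file once, then query by any column.
using var conn = new SqliteConnection("Data Source=people.db");
conn.Open();

using (var create = conn.CreateCommand())
{
    // An index on age makes searches by that column fast, just like
    // the Lucene index above does for its fields.
    create.CommandText =
        @"CREATE TABLE IF NOT EXISTS people (
              id INTEGER PRIMARY KEY, name TEXT, employment TEXT, age INTEGER);
          CREATE INDEX IF NOT EXISTS ix_people_age ON people(age);";
    create.ExecuteNonQuery();
}

using (var insert = conn.CreateCommand())
{
    insert.CommandText =
        "INSERT INTO people (name, employment, age) VALUES ($n, $e, $a)";
    insert.Parameters.AddWithValue("$n", "John");
    insert.Parameters.AddWithValue("$e", "Manager");
    insert.Parameters.AddWithValue("$a", 40);
    insert.ExecuteNonQuery();
}

using (var query = conn.CreateCommand())
{
    query.CommandText = "SELECT name FROM people WHERE age = $a";
    query.Parameters.AddWithValue("$a", 40);
    using var reader = query.ExecuteReader();
    while (reader.Read())
        System.Console.WriteLine(reader.GetString(0));
}
```

The same .db file can back both the desktop app and a small web app, but SQLite allows only one writer at a time, so for a busy online server a client-server database (SQL Server, MySQL, PostgreSQL) is usually the better fit.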

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)