Dear all,

I asked a question a few weeks ago about memory usage, as I feared I had a leak. After some profiling it turned out I didn't have a leak - I was simply using far more system memory than I thought possible, and as a result I quickly realised that the app I'm writing needed a change of direction.

The application is a datalogger capable of displaying lots of channels of data on a screen in a manner similar to an oscilloscope. I have determined that the correct course of action to reduce memory usage to a minimum is to keep in system memory only those points of data that are displayed on screen. So now, when the logger is asked to record, structures of the following form are saved to disk:

C#
[StructLayout(LayoutKind.Explicit, Size=14)]
public struct LogPoints
{
    [FieldOffset(0)] public float xtime;   //time of the sample
    [FieldOffset(4)] public byte BusId;    //bus the channel lives on
    [FieldOffset(5)] public UInt32 Ident;  //message identifier
    [FieldOffset(9)] public byte ItemNo;   //item within the message
    [FieldOffset(10)] public float yval;   //data value
}


BusId, Ident and ItemNo allow me to determine which channel of data a point belongs to; xtime and yval are, of course, the time and data value.
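For reference, here is a minimal sketch of how one such record could be written to disk and read back with BinaryWriter/BinaryReader, writing the fields in layout order so each record occupies exactly 14 bytes (the helper class is illustrative, not part of the app):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

//The LogPoints struct from above, repeated so this sketch is self-contained.
[StructLayout(LayoutKind.Explicit, Size = 14)]
public struct LogPoints
{
    [FieldOffset(0)] public float xtime;
    [FieldOffset(4)] public byte BusId;
    [FieldOffset(5)] public UInt32 Ident;
    [FieldOffset(9)] public byte ItemNo;
    [FieldOffset(10)] public float yval;
}

//Hypothetical helper matching the 14-byte layout.
public static class LogPointIO
{
    public const int Size = 14;

    public static void Write(BinaryWriter w, LogPoints p)
    {
        w.Write(p.xtime);   //offset 0,  4 bytes
        w.Write(p.BusId);   //offset 4,  1 byte
        w.Write(p.Ident);   //offset 5,  4 bytes
        w.Write(p.ItemNo);  //offset 9,  1 byte
        w.Write(p.yval);    //offset 10, 4 bytes
    }

    public static LogPoints Read(BinaryReader r)
    {
        return new LogPoints
        {
            xtime = r.ReadSingle(),
            BusId = r.ReadByte(),
            Ident = r.ReadUInt32(),
            ItemNo = r.ReadByte(),
            yval = r.ReadSingle()
        };
    }
}
```

Because the layout is explicit and packed, a record's file offset is simply its index multiplied by 14, which is what makes the chunked seeking below possible.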

The data is not sorted by channel, but it is written to disk in chronological order. On average, data is written to the file at a constant rate - i.e. in any given second, the number of points written to the file is roughly constant.

As an example, a 10 minute log spooling data to the file could exceed 100MB in size.

The problem came when I tried to get data out of this file and onto the screen quickly enough.

To display the data, the user can create a graphic display. For each display, the user can select whichever channels of data he/she wants to display - you might choose to select the same channel to be displayed in 2 places, and you might also want to display a different time range in any of your graphs.

Accordingly, I have tried several ways of allowing each instance of the 'Graphical Display' class to have access to the file containing the points of data. After lots of trial and error, reading around this subject and looking at lots of articles here (some excellent work by Anthony Baraff and Jun Du), I thought memory-mapped files might be a suitable way to proceed.

So as not to block the GUI, I've made a BackgroundWorker that creates a view stream over an MMF of a fixed size equating to 100,000 data-point structures. I've written a function that uses a while loop to keep scrolling through the view, renewing it as necessary. Passed as arguments to this function are the start and end times of the graph to be displayed, and the timestep. The timestep is related to the physical size of the graph on screen - e.g. if it's 510 pixels wide, I'll try to display only 510 points - so effectively it is ((endtime - starttime) / x_graphsize).
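To make that calculation concrete, here it is as a sketch (the names are illustrative, not the actual fields in the app):

```csharp
using System;

//Illustrative decimation helper - one point is kept per horizontal pixel.
public static class Decimation
{
    //Seconds of log time represented by one pixel column of the graph.
    public static float ComputeTimestep(float startTime, float endTime, int graphWidthPixels)
    {
        return (endTime - startTime) / graphWidthPixels;
    }
}
```

For example, a 10-minute (600 s) log drawn on a 510-pixel-wide graph gives a timestep of about 1.18 s, so only roughly one point per pixel column needs to be read from the file.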

UPDATE:

Thanks to hints from SA Kryukov, barneyman, and Roman Lerman, I've effectively rolled my own variable-sized memory buffer:

C#
while (_filepos < _endpos)
{
    //_filepos is the stream position in the file on disk; _endpos is the stream position
    //of the last point that we want to plot.
    //_increment is the sizeof() one data-point structure.
    //_chunksize is the number of bytes to read this time
    //(clamped so it doesn't try to read past _endpos).

    using (var fstr = new FileStream(fname, FileMode.Open, FileAccess.Read, FileShare.Read, (int)_chunksize, FileOptions.SequentialScan))
    {
        //Put the file pointer in the right place
        fstr.Seek(_filepos, SeekOrigin.Begin);

        //Create a temporary memory buffer of fixed size
        byte[] buf = new byte[(int)_chunksize];

        //Fill the buffer - Read() may return fewer bytes than requested, so loop
        int read = 0;
        while (read < buf.Length)
        {
            int n = fstr.Read(buf, read, buf.Length - read);
            if (n == 0) break; //end of file
            read += n;
        }

        //Create a stream over the memory buffer, and a reader for the stream
        using (var mstr = new MemoryStream(buf))
        using (var mr = new BinaryReader(mstr))
        {
            //Get the last time value in this chunk - seek one point back from the end
            mstr.Seek(-_increment, SeekOrigin.End);
            _bufendtime = mr.ReadSingle();

            for (cnt = 0; cnt < NumList.Count; cnt++)
            {
                //Set the memory stream position back to the start
                mstr.Seek(0, SeekOrigin.Begin);

                //Initialise the first time value
                timeval = mr.ReadSingle();

                //Put the reader back at the start
                mstr.Seek(-4, SeekOrigin.Current);

                _chunkpos = 0;
                while (_chunkpos <= (_chunksize - _increment))
                {
                    //Read the data point at the current stream position
                    _xtime = mr.ReadSingle();
                    _busid = mr.ReadByte();
                    _id = mr.ReadUInt32();
                    _chan = mr.ReadByte();
                    _ydata = mr.ReadSingle();

                    //Check whether this point belongs to this channel
                    if ((_xtime >= timeval) && (_busid == (NumList[cnt].varCANin - 1)) && (_id == (uint)NumList[cnt].ItemNo) && (_chan == NumList[cnt].ChanID))
                    {
                        //Point must go into the list
                        NumList[cnt].Points.Add(new GraphPoints
                        {
                            xdata = _xtime,
                            ydata = _ydata
                        });

                        //Finding the next point - advance the target time
                        timeval += timestep;

                        //Short loop to skip ahead quickly based on the value of timestep
                        found = false;
                        while (!found && (timeval < _bufendtime))
                        {
                            _xtime = mr.ReadSingle();
                            //Stop once we reach the target time, or when there isn't
                            //a whole point left in the buffer
                            if ((_xtime >= timeval) || (mstr.Position > (_chunksize - _increment)))
                            {
                                found = true;
                            }
                            else
                            {
                                mstr.Seek(_increment - 4, SeekOrigin.Current);
                                _chunkpos = mstr.Position;
                            }
                        }

                        if ((timeval < _bufendtime) && (mstr.Position < (_chunksize - _increment)))
                        {
                            //Put the reader back at the start of this point
                            mstr.Seek(-4, SeekOrigin.Current);
                        }
                        else if (timeval > _bufendtime)
                        {
                            //Forces the next chunk to be loaded, if there is one
                            _chunkpos = _chunksize;
                        }
                    }
                    else
                    {
                        //Reader position has moved - _chunkpos must now be incremented
                        _chunkpos += _increment;
                    }
                }
            }
        }

        //Got all points from this chunk of the file - advance the real file pointer
        _filepos += _chunksize;
    }
}


Note the use of the 'internal' while loop to find the next data point more quickly once you've found the first. In the vast majority of cases the number of data points far exceeds the number of pixels available to plot them, so the value of timestep lets you skip a large number of points in the file.

This takes about 0.5s to extract 2 data channels from a 10 minute log, but it gets quite slow when showing more channels.

Does anyone have any further ideas - should I try sorting the raw file next?

Thanks in advance for any replies.

Kind regards,

Aero
Posted
Updated 10-Aug-12 3:13am
Comments
Sergey Alexandrovich Kryukov 6-Aug-12 20:09pm    
First of all, I would question whether you really need to pull all the data from the file. You could keep the data in the file and fetch a small piece each time it is required. File buffering can help you transparently, especially if successive requests to the file are likely to be close together.
--SA
Aero72 7-Aug-12 4:22am    
Thanks SA. As you suggest, I should be able to seek somewhere near the right place because of the chronological order of the file. I'll read a sizeable portion of the file at a time and report back - I'll also take a look at BufferedFileReader.
barneyman 6-Aug-12 21:44pm    
Agree with SA - the MMF is redundant: it's reading the whole thing into memory just for you to seek to an offset and read. Do that directly on the disk file - read in large blocks and do your own chunking.
Aero72 7-Aug-12 4:17am    
Thanks, Barneyman - I'd hoped that's what the MMF was doing for me: reserving a fixed amount of memory and having the OS read chunks of the file into that space. I'll try to roll my own and report back.
Roman Lerman 7-Aug-12 10:03am    
What I'm using is a modified version of this approach: FileByteArray[^]. My file can be accessed and modified on the fly, the length of the "FileArray" is dynamic, and the read speed is good enough for reading large amounts of data quickly (~50 MB/sec, depending on the operating system). I hope it will help you.

1 solution

Sounds like you are dealing with a lot of data, so here is an underutilized Windows feature:
ManagedEsent[^]. I've used this functionality in a number of projects, and so far the performance has been excellent. Most of my projects use the API directly from C++, but I've also used ManagedEsent, which is very easy to use, in a few projects.

The ManagedEsent developer reports the following performance:

Sequential inserts : 32,000 entries/second
Random inserts : 17,000 entries/second
Random Updates : 36,000 entries/second
Random lookups : 137,000 entries/second
Linq queries : 14,000 queries/second
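For a quick feel of what that looks like in code, ManagedEsent ships a PersistentDictionary class that hides the raw JET API. The sketch below is only an illustration - the key scheme, the "LogDb" directory name and the sample values are assumptions, not the author's actual format:

```csharp
using Microsoft.Isam.Esent.Collections.Generic;

class EsentSketch
{
    static void Main()
    {
        //Stores samples keyed by channel and time, so that per-channel
        //range reads become cheap lookups instead of a full file scan.
        //Key format "<busId>/<ident>/<itemNo>/<time>" is illustrative only.
        using (var dict = new PersistentDictionary<string, double>("LogDb"))
        {
            dict["1/291/0/0.000"] = 1.25;      //insert one sample
            double y = dict["1/291/0/0.000"];  //random lookup by key
        }
    }
}
```

PersistentDictionary persists its contents to an ESE database in the given directory, so the data survives restarts; for bulk inserts the lower-level Api class with explicit transactions is the usual route.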


Best regards
Espen Harlinn
 
 
Comments
Christoph Keller 10-Aug-12 10:05am    
Thanks for the ManagedEsent hint! Never heard about it and sounds really interesting!

my 5* :)

Thanks again and happy coding,
Chris
Espen Harlinn 10-Aug-12 10:14am    
Thanks Chris :-D
Aero72 11-Aug-12 6:24am    
That looks good.....I have some reading to do! Many Thanks Espen, I'll report back when I've tried it. My 5.
Espen Harlinn 11-Aug-12 6:27am    
Glad you liked it. MS Exchange is implemented on top of this technology - and so, to my knowledge, is Active Directory.
Aero72 15-Aug-12 11:23am    
Thanks for trying, Espen - I'm sure getting data out of the tables would be fine, but unfortunately it looks like adding records to the database in the first place kills the app, even if one transaction consists of many inserts before committing. I've also tried embedded Firebird with the .NET provider - it's faster at writing to the db than Esent, but still not quick enough :-(

Thanks again,

Aero

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


