I am currently splitting the file into small blocks of data, which are then processed sequentially. Since each block already contains all the data it needs, the order in which the blocks are processed should not matter.

How can I use multicore processing to send each chunk to a free CPU to be processed?

Private Sub OpenGEDCOMFile()
    Dim openFileDialog As New OpenFileDialog()
    openFileDialog.Filter = "GEDCOM files (*.ged)|*.ged"

    If openFileDialog.ShowDialog() = DialogResult.OK Then
        Dim fileName As String = openFileDialog.FileName

        ' Open the GEDCOM file using a StreamReader
        Using reader As New StreamReader(fileName)
            ' Divide the data into chunks
            Dim chunk As New StringBuilder()
            Dim chunkLines As New List(Of String)()

            'Try
            While Not reader.EndOfStream
                Dim line As String = reader.ReadLine()

                If line.Contains("INDI") Then
                    If chunk.Length > 0 Then
                        ' Process each chunk
                        ProcessChunk(chunkLines)
                        chunk.Clear()
                        chunkLines.Clear()
                    End If
                End If

                chunk.AppendLine(line)
                chunkLines.Add(line)
            End While

            ' Check if there are any remaining lines in the chunk
            If chunk.Length > 0 Then
                ' Process the remaining chunk
                ProcessChunk(chunkLines)
            End If
            'Catch ex As Exception
            '    MessageBox.Show(ex.Message)
            'End Try
        End Using
    End If
End Sub


What I have tried:

I have tried Parallel.For, but the result is slower than the sequential version, which I did not expect. Once the chunks have been built, the parallelization should be quite effective. I am processing a GEDCOM file with 1.5 million individuals, so the file is large enough to test this. Note that I have since changed ProcessChunk from the version shown here so that it takes just one parameter instead of two.

Private Sub OpenGEDCOMFile()
    Dim openFileDialog As New OpenFileDialog()
    openFileDialog.Filter = "GEDCOM files (*.ged)|*.ged"

    If openFileDialog.ShowDialog() = DialogResult.OK Then
        Dim fileName As String = openFileDialog.FileName

        ' Read the file line by line
        Dim lines As String() = File.ReadAllLines(fileName)

        ' Divide the data into chunks
        Dim chunks As List(Of List(Of String)) = New List(Of List(Of String))
        Dim chunk As List(Of String) = New List(Of String)

        For Each line As String In lines
            If line.Contains("INDI") Then
                If chunk.Count > 0 Then
                    chunks.Add(chunk)
                    chunk = New List(Of String)
                End If
            End If
            chunk.Add(line)
        Next

        ' Check if there are any remaining lines in the chunk
        If chunk.Count > 0 Then
            chunks.Add(chunk)
        End If

        ' Process the chunks in parallel
        Parallel.For(0, chunks.Count, Sub(i)
                                          Dim chunkToProcess = chunks(i)
                                          ProcessChunk(New ArraySegment(Of String)(chunkToProcess.ToArray()), chunkToProcess)
                                      End Sub)
    End If
End Sub
Updated 19-Feb-23 23:46pm
Comments
[no name] 16-Feb-23 16:29pm    
Your "chunking" adds overhead (new objects and lists) that simply "reading the file" wouldn't incur. One might say your approach doesn't scale; particularly when chunks are small.

1 solution

Start by extracting the code to "chunk" the file into a separate iterator method. You can also get rid of the StringBuilder, since you never read its contents.

NB: If you want to process the lines in parallel, you'll need a new list for each chunk.
VB.NET
Private Shared Iterator Function ChunkFile(ByVal reader As StreamReader) As IEnumerable(Of List(Of String))
    Dim chunkLines As New List(Of String)()
    While Not reader.EndOfStream
        Dim line As String = reader.ReadLine()
        If line.Contains("INDI") Then
            If chunkLines.Count <> 0 Then
                Yield chunkLines
                chunkLines = New List(Of String)()
            End If
        End If
        
        chunkLines.Add(line)
    End While
    
    If chunkLines.Count <> 0 Then
        Yield chunkLines
    End If
End Function
Then use a Parallel.ForEach loop to process the chunks:
VB.NET
Using reader As New StreamReader(fileName)
    Dim chunks As IEnumerable(Of List(Of String)) = ChunkFile(reader)
    Parallel.ForEach(chunks, Sub(chunkLines) ProcessChunk(chunkLines))
End Using
That should significantly reduce the number of allocations compared to your second code block.
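One thing to watch for: Parallel.ForEach calls ProcessChunk from several threads at once, so anything ProcessChunk writes to must be safe for concurrent use. The question does not show ProcessChunk, so the following is only a minimal sketch under the assumption that it can return its result instead of writing to a shared list or control. The ProcessChunk body, CountLinesInParallel, and the sample data are all hypothetical.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.Threading.Tasks

Module ResultCollectionSketch
    ' Hypothetical ProcessChunk: returns a value rather than touching shared state.
    Function ProcessChunk(chunkLines As List(Of String)) As Integer
        Return chunkLines.Count
    End Function

    ' Collects one result per chunk in a thread-safe bag, then sums them on the calling thread.
    Function CountLinesInParallel(chunks As IEnumerable(Of List(Of String))) As Integer
        Dim results As New ConcurrentBag(Of Integer)()
        Parallel.ForEach(chunks, Sub(chunkLines) results.Add(ProcessChunk(chunkLines)))

        Dim total As Integer = 0
        For Each n As Integer In results
            total += n
        Next
        Return total
    End Function

    Sub Main()
        ' Made-up sample records, standing in for the output of ChunkFile.
        Dim chunks As New List(Of List(Of String)) From {
            New List(Of String) From {"0 @I1@ INDI", "1 NAME John /Doe/"},
            New List(Of String) From {"0 @I2@ INDI", "1 NAME Jane /Doe/", "1 SEX F"}
        }
        Console.WriteLine(CountLinesInParallel(chunks)) ' prints 5
    End Sub
End Module
```

If ProcessChunk has to update a UI control or a plain List(Of T), that update would need to be marshalled or locked, which can easily cancel out the gain from running in parallel.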

You'll then need to profile your code to see where the bottleneck is. If the ProcessChunk method is fairly quick, then the overhead of making the code multi-threaded could outweigh any potential benefits.
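Before reaching for a full profiler, a rough first check is to time the same work sequentially and in parallel with a Stopwatch. The sketch below does that on synthetic data, and also tries handing the chunks to Parallel.ForEach in batches, which may help when each chunk is only a few lines long. The ProcessChunk body, the Batch and TimeIt helpers, and the batch size of 1000 are all assumptions for illustration, so substitute your real method and data before drawing conclusions.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.Threading.Tasks

Module TimingSketch
    ' Hypothetical stand-in for the real ProcessChunk.
    Sub ProcessChunk(chunkLines As List(Of String))
        Dim fieldCount As Integer = 0
        For Each line As String In chunkLines
            fieldCount += line.Split(" "c).Length
        Next
    End Sub

    ' Groups chunks into batches so that each parallel work item does more work.
    Iterator Function Batch(chunks As IEnumerable(Of List(Of String)), batchSize As Integer) As IEnumerable(Of List(Of List(Of String)))
        Dim current As New List(Of List(Of String))(batchSize)
        For Each chunk As List(Of String) In chunks
            current.Add(chunk)
            If current.Count = batchSize Then
                Yield current
                current = New List(Of List(Of String))(batchSize)
            End If
        Next
        If current.Count <> 0 Then
            Yield current
        End If
    End Function

    ' Runs the supplied work once and returns the elapsed milliseconds.
    Function TimeIt(work As Action) As Long
        Dim sw As Stopwatch = Stopwatch.StartNew()
        work()
        sw.Stop()
        Return sw.ElapsedMilliseconds
    End Function

    Sub Main()
        ' Synthetic two-line records; a real test should use the actual GEDCOM chunks.
        Dim chunks As New List(Of List(Of String))()
        For i As Integer = 0 To 199999
            chunks.Add(New List(Of String) From {"0 @I" & i.ToString() & "@ INDI", "1 NAME Test /Person/"})
        Next

        Dim sequentialMs As Long = TimeIt(
            Sub()
                For Each c As List(Of String) In chunks
                    ProcessChunk(c)
                Next
            End Sub)

        Dim parallelMs As Long = TimeIt(Sub() Parallel.ForEach(chunks, Sub(c) ProcessChunk(c)))

        Dim batchedMs As Long = TimeIt(
            Sub()
                Parallel.ForEach(Batch(chunks, 1000),
                                 Sub(batchItems)
                                     For Each c As List(Of String) In batchItems
                                         ProcessChunk(c)
                                     Next
                                 End Sub)
            End Sub)

        Console.WriteLine("Sequential:         " & sequentialMs.ToString() & " ms")
        Console.WriteLine("Parallel per chunk: " & parallelMs.ToString() & " ms")
        Console.WriteLine("Parallel batched:   " & batchedMs.ToString() & " ms")
    End Sub
End Module
```

Run it in a Release build and more than once, since the first pass includes JIT and thread pool warm-up. If the three numbers are close, the time is probably going into reading the file rather than into ProcessChunk, and more threads will not help much.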
 
Comments
Member 11856456 23-Feb-23 21:02pm    
Thanks. I can't tell any difference in speed; however, this did reduce memory consumption. So, still a win in my book.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


