Click here to Skip to main content
15,883,901 members
Articles / Programming Languages / Visual Basic
Tip/Trick

CSV Data Management: Unleashing the VBA Arrays Power

Rate me:
Please Sign up or sign in to vote.
2.60/5 (2 votes)
5 Jan 2023GPL36 min read 6.7K   129   4  
Turn Microsoft Office applications into a masterpiece for cleansing, reshaping, managing and analyzing data from CSV files. No more intermediary spreadsheets, no more headaches due to DLL updates that make your implementations stop working, the pure VBA solution is here.
In this article, we will approach CSV files from the perspective of the analysis of their content, urging within the different situations that require a management of them, while proposing a solution coded in VBA for a variety of common situations in the day to day of a data analyst, without calls to external DLLs or APIs, without intermediary objects other than those of Microsoft's macro language.

Introduction

The CSV format is the king when it comes to shared data in plain text format, to such an extent that it is the format used by official institutions in countries such as the United Kingdom, as well as by the Library of Congress in the United States, whose page states that "the Library of Congress Recommended Formats Statement (RFS) includes CSV as a preferred format for datasets". This format is widely used in the industry, so that most applications that work with data (including spreadsheets and the most relevant database engines) support import and export of CSV files or some variant of it, furthermore, the Python and Ruby programming languages have a library called "csv" that allows you to work directly with CSV files.

The idea of all this is to achieve that, as Python and Ruby do, VBA can work with CSV files directly without relying on objects outside the programming language or external libraries (COM, DLLs). However, most VBA solutions for working with CSV files focus on the flat reading of these files, acting as intermediaries used exclusively to read the information contained in the files. In other words, most of the tools focus on reading the data from the CSV files in memory, lacking the ability to analyze the information obtained and then serve the results to technologies such as Power Query or DaxStudio for further analysis. These two tools are highly appreciated and valued by data analysts, as they allow them to make sense of a data set with a few clicks, in the case of Power Query, or with a few lines of code, in the case of DaxStudio, from an attractive and intuitive graphical interface.

The Problem

Currently, there are no tools that allow manipulating variants of CSV files, like DSV or CSV with distinct dialect that those specified at RFC-4180, from VBA, so the programmers of this programming language face difficulties when working with these files since the flow of information must be interrupted for users to use external tools from the precise moment in which the slightest modification of the data read is required. This is a situation that puts developers in a difficult situation, since many of the basic operations such as sorting, rearranging fields are extremely necessary.

The Solution

Feature CSV Interface CSVKit XSV Miller
Manage fields/records: insert, delete, shift, merge, split, rearranging... YES YES* YES* YES*
Dedupe records YES NO* NO* NO*
Data sorting YES YES YES YES
Data grouping YES YES NO YES
Descriptive statistics NO YES YES YES
Data select/filtering YES YES YES YES
Calculated fields YES NO NO YES
Advanced CSV Join YES NO** NO** NO**
Split-up CSV files YES YES NO YES
Sort CSV files on-disk YES NO NO NO
Language VBA Python Rust Go
* There is no direct method to do most of the fields/records editing tasks, instead there are commands like SQL SELECT to work the fields and records at load time.
** Join operations are performed using simple or no conditions. It does not support complex expressions.

All the tools analyzed have extensive data management capabilities, almost all of them offer advanced statistical analysis, so it should be noted that they all have their weaknesses and strengths. What is a palpable fact is that CSV Interface offers the most common features in the CSV file management environment and is the only similar tool available for VBA.

Solving CSV Problems with VBA

In all the examples shown, the asterisk (*) represents a CSVArrayList object, which is a component of the CSV Interface library. Let's start.

In data analysis, there is a very common task when working with files and it is nothing more and nothing less than the removal of duplicates. The goal is that the records/rows contain unique information in the specified fields. Let's see how we can achieve this goal.

VB.NET
*.Dedupe("1,5,6")  'Deduplicate records/rows using fields/columns 
                   'indexes 1, 5 and 6. (Zero based indexes)
*.Dedupe("5-8,11") 'Deduplicate records using fields indexes 5 through 8 and 11.

Another very frequent task is filtering, in which the elimination of records that do not meet a specific criterion is sought. The filtering method, from VBA, accepts four parameters, two of which are mandatory. The first parameter defines the filtering criterion, which is evaluated to decide if a specific record passes the filter (condition that coincides with the return of a True value).  The second parameter indicates the position of the first record to be filtered. The third parameter, of optional character, decides whether records that comply with the filtering rule should be excluded, when True is returned. The last parameter indicates whether the data has a header field.

VB.NET
*.Filter("Region = 'Asia' & Units Sold > 20 & Units Sold <= 50 & _
  Ship Date $ '10/*/2014'", 2) 'Filter criteria using expression with 
                               'fields names as variables
*.Filter("f1 = 'Asia' & f9 > 20 & f9 <= 50 & f8 
  $ '10/*/2014'", 2) 'Same criteria using indexes (One based indexes)

As you can see, the filtering criteria can be a complex expression, and can also use more than 50 VBA language functions (text, numerical, financial, date and time). It is important to note that the literal text strings contained in the filter pattern are delimited by apostrophes (').

Among the tasks of data analysis is the editing of records, usually requiring the insertion of calculated fields from other existing fields in the data to be analyzed. In these cases, it is also possible, from VBA, to use expressions to obtain values and insert them in each of the records.

VB.NET
*.InsertField 12, "Taxes" , Formula:="FORMAT(Total Revenue * _
 Percent(18);'Currency')")   'Insert a field named "Taxes" at the 13th position 
                             'and use a custom fomula for compute it.

The above example will define the new field as 18% of Total Revenue field, the result will be formated as VBA Currency.

On many occasions, it is required to return tables derived from the analysis of data scattered in two or more files, this is where the join functions (left, right and inner) play an important role since they allow to join data using matching fields in the different records, when a optional given condition/predicate is fulfilled.

VB.NET
*.LeftJoin Table1, Table2, "{1,Country};{Total_Revenue}", 
"Order_ID;Order_ID", "t2.Total_Revenue>3000000 & 
 t1.Region='Central America and the Caribbean'"

The above method performs a Left join returning the 1st and Country fields of the left table (Table1) and the Total_Revenue field of the right table (Table2) , joined in the Order_ID field of both tables, of those records that satisfy that the total revenue exceeds $3,000,000 with Central America and the Caribbean as the Region. RightJoin and InnerJoin commands can also be invoked. Again, the condition/predicate used to select the records can be very complex.

Data sorting is also one of the essential tasks when performing data analysis, so tools that allow you to perform this task are highly valued. Using CSV Interface, it is possible to sort CSV files. Having said that, it is worth mentioning that in the VBA application ecosystem, at the time of writing this article, there were few alternatives that allowed advanced CSV data sorting.

VB.NET
ReDim SortKeys(0 To 2): SortKeys(0) = -1: SortKeys(1) = 5: SortKeys(2) = -11
*.Sort SortingKeys:=SortKeys, SortAlgorithm:=SA_Quicksort

The above code fragment will sort the data in descending order in column 1, then sort in ascending order in column 5 and sort in descending order in column 11. This multilevel sorting is "stable".

Closing Thoughts

Although VBA is a limited programming language, it puts amazing computing power in the hands of developers. CSV Interface is the example that the "macro" language of office applications has quite strong muscles.

If you liked the article, do not hesitate to share it, you can also give a star on GitHub by visiting the following link: https://github.com/ws-garcia/VBA-CSV-interface

History

  • 5th January, 2023: First release

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


Written By
Dominican Republic Dominican Republic
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
-- There are no messages in this forum --