Click here to Skip to main content
15,885,953 members
Articles / Web Development / HTML

Unformatted Data Processing using Vim

Rate me:
Please Sign up or sign in to vote.
1.38/5 (4 votes)
18 Jul 2018CPOL16 min read 7.1K   47   2   3
Use safe Destructive Editing techniques to load a SQLITE table

This is a demonstration on Unformatted Data Processing. Although Vim is used for this example, the choice of editor is a matter of personal preference.

Platform

This was developed on a Linux machine running LinuxMint. SQLITE is the database engine for this demonstration. My personal PC is a Linux machine, but I work on a Windows 10 machine.

The suggested manner for running this example on Windows 10 is to substitute the Cygwin Bash command line, however some configuration will be required. SQL Server can be substituted for SQLITE. I have yet to use the Linux Subsystem for Windows, but I hope to it in the future. If the Cygwin Vim build is used, the bash commands should be the same as for Linux. If the reader should be using the standard Windows GVim build, it would be helpful to prefix the Vim external commands with C:\cygwin64\bin\. The Windows build of GVim works quite well with Cygwin Bash, but it is most helpful to setup the environment path variable to point to the various Cygwin bin directories. Please note, the Appendix has instructions on how to convert the example insertions for running the produced SQL on SQL Server.

The vast majority of my daily work in GVim is on a Microsoft Windows machine programming Visual Studio C# as well as TSQL for SQL Server. This demonstation will suffer from being a stereotypical classroom example on account the work stuff having to stay work stuff. I do far more Bash on Windows than Linux though.

Vimrc setting recommendations

If you are a Vim expert, may the force be with you! Otherwise, might I humbly suggest these settings for your Vimrc file so that the demonstration recipe will be successful on your machine.

C++
:set ignorecase
:set linebreak
:set incsearch
:set showmatch
:set nowrapscan

Background

Those who program will often will be called upon to configure data for various tasks. A specific task might come from a website, a spreadsheet, an email, or any of a number of places.

Although your personal technical experience is no doubt far different than mine, A significant chunk of time can be spent doing these other data processing tasks. The data needing to be formatted is not necessarily in a format either your system or your program can understand. Nonetheless, the round peg must fit into the square hole, and you are the one that has to make it happen.

This example scrapes a webpage and places the data into a SQLite table, but these unformatted text processing techniques both cross specific programming languages as well as the various non-language tasks that must be done to get the project out the door.

Disclaimer: This demonstration will use data from the Bureau of Labor and Statistics (U.S. Government). A data file will be provided for -- in the event that the webpage should change in the future so the reader will be able to complete the demonstration.

Using the code

Create a scratch directory where the Destructive Editing can take place. It is best to create one afresh. All files within this directory should be considered expendable. Open the following webpage in a browser: Consumer Price Index (https://www.bls.gov/cpi/tables/relative-importance/home.htm). View the page source and save the file as links__A.lisp.

Pulling the website data

Open up links__A.lisp in GVim or Vim. Source helper.vim for cosmetics. Register A is also loaded with the website root.

C++
:source helper.vim

The important line in helper.vim is the loading of the register. This will be needed for later.

C++
:let @a = 'https://www.bls.gov'

If the reader wishes to construct the wget statements without doing the line by line, simply source lesson_a_.vim. Please note, helper.vim should be sourced first before the sourcing of lesson_a_.vim.

C++
:source lesson_a_.vim

Let us traverse lesson_a_.vim line by line for further explanation. If the reader will peruse through the html source, it should be noted that there are a number of text files with the vital government data we would like to extract. Inspecting the code shows that the files ending in .txt are the files to be downloaded to a scratch directory (should be the current working directory). The other lines are not needed.

C++
:%g!/txt/d

Delete all but the text lines

You will note that this is a dangerous step, so you will always want to work with a copy of production instead of production itself. We are looking for some hrefs with the txt files we want. Observation: If at all possible, find a non-regex expression to reduce to only the lines we want. Although a regex character may be used in a search, it must be escaped. Looking for a pattern, the pattern 'a href=' can be seen in all the relative lines. We only want the address fragment. Get rid of the unwanted garbage characters.

C++
:%s/.*a href="//g

Get rid of href text

and

C++
:%s/".*//g

Git rid of quote star

The website needs to be prefixed on all lines. Since the website address contains special characters, we wish to prefix in such a way that the string itself needs no escapes for the regex characters. The website root is https://www.bls.gov/cpi/tables/relative-importance. The string itself was loaded via helper.vim into register A. The normal command can be used to prefix each line as-is, without having to escape characters.

C++
:%normal 0"aP

Prefix with web address

Again perusing our rendered webpage. it can be observed that we only have web addresses. There are no actions for these web addresses. The files need to be downloaded so that they might be put together. Let us first make these solid web addresses, and then we can render these lines to pull the data down. End the prefixed text with a single tick, so that the web address will not get mistranslated by Bash.

C++
:%s/^/wget '/g

Prefix with wget

Now finish the wget command with a single tick to finish out the command.

C++
:%s/$/'/g

Finish out wget

Let us get rid of the annoying header text. The webpage has changed since I first composed this script.

C++
:%g/<!--/d

Add an empty line at the end to act as a carriage return. The reader should note that the only line in scope, for the command, is the last line.

C++
:$s/$/\r/g

Make empty last line

The wget commands have been built

The reader should note that Vim buffer has not been saved, and it will be left up to the user to decide whether or not to save the buffer. Select all the contents of the buffer, copy, and open up a Bash command window. Paste the buffer contents into the command window. The reader should note that the commands will execute in sequence. Below is an action shot.

Action shot Konsole paste

The data files are now downloaded, and the extraction of the data can commence. Below shows the downloaded .txt files shown in isolation from within Emacs.

Snapshot of txt data files in Emacs

Part 2: Preparing the downloaded files for the gathering

It is assumed the .txt files have been downloaded. Open a new Vim instance with the PWD pointing to the scratch folder. Source the helper.vim file once again.

C++
:source helper.vim

Default view after pwd

Part 2 can be bypassed by sourcing the lesson_b_.vim in a fresh Vim session. Make sure the working directory is the scratch directory. The data files from Part 1 also need to be resident in the scratch directory.

C++
:source lesson_b_.vim

This is part two of the demo. This demo configures the downloaded files for the gathering. It is a good idea that if you commonly change all files in a directory, you have a common setup script. Danger! this is a very dangerous operation, so always make sure you are working with a copy of production instead of production itself. These kinds of operations should only be done in a throwaway directory. My setup script is called aw.vim, and although I am not including this script amongst the zipped work files, the contents will be shown below. These should be executed by the reader.

Please note that within the script, you will note some registers being setup for future use. The I register is inialized in order to pull in text from the files on the args list. Register C holds the Vim command to pull in text from the various files and gather this text into register I. Since this will be Destructive Editing, the autowriteall is set as well as the nomore commands. The nomore saves user prompting until the very end. By default, the scope of the args list is all files in the current directory. Register B contains comment text. The text of the script is shown below.

C++
" set up for argdo. all files by default.
:let @i = ''
:let @b = 'register C has the command to load pasteboard from args'
:let @c = ':argdo :normal ggVG"Iy'
:set autowriteall
:set nomore
:args *

The script operates on all files in the current directory -- no subdirectories (:args *). We need to confine the operations to only the .txt files. Although the operations are being done in a throwaway directory, there is no need in being sloppy. We need a dummy file to act as a backdoor to the operations applied to all files on the list. On Windows, a dummy file is needed, because frequently all operations will be performed on all files except the very last. It is unknown whether this Vim behavior extends to Linux Vim builds. However, on Microsoft Windows, Vim needs an empty dummy file to act as a backdoor to the args list -- both for the Windows GVim build itself as well as the Cygwin versions.

C++
:silent :!touch zz.zzz

and

C++
:args *.txt zz.zzz

The view of the buffer. The :args command shows some of the various files to be operated on before doing any :argdo commands.

Setup args list before doing the argdo commands

This is the time to peruse some of the text files. there seems to be a pattern common to every text files. We only care about the lines with the price data. The non-price lines should be discarded, because the ultimate destination is a table in the database. A common regex for the lines we want is the following. Although there are other regexes that might exclusively match these lines, we only need one that will work. In our case:

C++
\d\+\.\d\+\s\+\d\+\.\d\+

Regular expression decimal results

The command for how to use this regular expression is shown below. Please note that the scope of the argdo list is all text files as well as the empty zz.zzz file. In each file, delete all lines except the ones with the price data. In the picture below, a random file off the args list is shown.

C++
:argdo :%g!/\d\+\.\d\+\s\+\d\+\.\d\+/d

Random file shown off the args list

Destructive Editing itself is at best a methodical process, where the data should be transformed incrementally, using tiny steps, and leaving a breadcrumb trail so that one can readily backtrack. Data reconnaissance is to be followed by destructive edits. It should not be uncommon to fork the directory into a derivative scratch directory -- the latest directory pointing back to an earlier directory, and that one in turn pointing to an earlier one, and so on. It is only when one discovers that they have ventured down a blind alley that the payoff to doing things incrementally pays off.

We are again perusing through the various text files. One of the observations is that there is very detailed information we want, and then there are summary items that contain totals. It appears that the non-summary lines have at least 4 spaces at the beginning of the line. Keep these lines, and discard the rest. Below is shown the results from a :lvim command looking for lines starting with 4 spaces.

Lvim command results for beginning four spaces

C++
:argdo :%g!/^    /d

If we look through the text files, years 1987 through 1996 have no lines. In real world examples, these outliers end up having to be worked separately than the other txt files. For this demonstration, we'll disregard these. Let us continue cleaning up these files. let us trim the spaces from the beginning and end of the lines. first from the left:

C++
:argdo :%s/^ *//ge

And then from the right. A random file is again shown from the args list. Please note, When running the argdo commands, you should end of with zz.zzz, because it is the last file to run the command on. This is why I chose a random file to show.

C++
:argdo :%s/ *$//ge

Random file from args list trimmed fore and aft

Let us scoop up these files, but first we must inject the year into each file. The remaining records in each file will ultimately be insert statements into a SQLITE table. Take the year from the filename itself, and inject this year into the start of each and every record. Below is shown the actual view when running an :argdo on the args list.

C++
:argdo :%s/^/\t/ge | :%normal 0"%P

The true view when doing an argdo on a file list

We don't need the .txt of the filename. Let's remove it. Below is a sample view of the 1987.txt file.

C++
:argdo :%s/\(^\d\|\)\.txt/\1/ge

Sample view of the 1987 dot txt file from the args list.

Let us combine all text into one file. Register I was initialized and will now be used to combine the text.

C++
:argdo :normal ggVG"Iy

Save this text to file master_records_file__dat.dat

C++
:tabe master_records_file__dat.dat

Place the text from Register I into the new buffer master_records_file__dat.dat. Below is shown the combined gathered records, but with empty lines.

C++
:normal "iP

Combined gathered records with empty lines

Remove any blank lines and stub lines before saving.

C++
:%g/^$/d
:%g/\t$/d

Final view from lesson B

I will leave it to the reader whether to save this file. This concludes part 2 of the demonstration recipe.

Part 3: Shaping the SQL Statements

Part 3 can be bypassed by sourcing the lesson_c_.vim from within the master_records_file__dat.dat file.

C++
:source lesson_c_.vim

Part three takes the master_records_file__dat.dat file, and makes SQL insert statements. This demonstration recipe configures the gathered data and makes insert statements. Please note that there is no inherent order in the following steps. Data cleanup is a necessary chore in any type of data processing. In a perfect world, there would be no single ticks inside sql statements. Let us address that. Please note that the Vim buffer is set to wrap.

C++
:%s/'/&&/ge

Single to double ticks

At this point, let us use an insertion template for fashioning the SQL statements. This template is called insert_template__C.sql and should be included in the zipped work files for this demonstration recipe. I believe it to be good practice to have all potential data destination templates written before the Vim composition begins. This prevents future distraction so you can focus solely on twisting and turning your data into viable code. The contents of the template are shown below.

C++
insert into year_price_info
(
    year,
    expenditure_category,
    cpiu,
    cpiw
)
values
(
    CAST('year_value' as text),
    CAST('expenditure_category_value' as text),
    CAST('cpiu_value' as real),
    CAST('cpiw_value' as real)
)

Most often templates are joined to be a single line for injection into the data. After the injection is done, the lines can later be split for cosmetic appearances. The command to join the lines is shown below.

C++
:normal ggVGJ

rendering

C++
insert into year_price_info ( year, expenditure_category, cpiu, cpiw) values ( CAST('year_value' as text), CAST('expenditure_category_value' as text), CAST('cpiu_value' as real), CAST('cpiw_value' as real))

When fashioning data, these common types of operations will be repeated over and over. These steps can vary in order, but it would behoove the user to learn how to do each of these types of tasks. Although Vim is capable of quite a bit more than what I will show below, the below are amongst the most common unformatted data processing tasks to be done for building and consuming lists.

  1. Prefix chosen lines
  2. Suffixing chosen lines
  3. Inject the string into the chosen lines. Injecting this string before or after some string is the preferred method. It is not uncommon to embed template text into the chosen lines for further operations.
  4. Spitting lines upon string
  5. Joining lines upon string
  6. Injecting a sequence where the count varies upon each chosen line.

Now back to the master_records_file__dat.dat file. The single tick marks were each doubled so that this data might insert correctly. Let us start by prefixing a portion of the SQL insertion template to the lines.

C++
:%s/^/insert or ignore into year_price_info ( year, expenditure_category, cpiu, cpiw) values ( CAST('/ge

Insert first pastings

Do you remember that tab we created on each row of our data? Now is the time to use that. This will finish out the first column and begin the second.

C++
:%s/\t/' as text), CAST('/ge

Second castings

Now we move onto the third column. You will note, by examining what will be the third column, that a unique string exists -- two or more periods and ending the column with zero to multiple spaces. Since a period is a regular expression character, it must be escaped before use in a search and replace operation.

C++
:%s/\.\.\+ */' as text), CAST('/ge

Third herd

For the final column, please notice that there are at least two spaces and often many more spaces. A regular expression can be used for this.

C++
:%s/  \+/' as real), CAST('/ge

Final Column

Last but not least is the suffix for each record.

C++
:%s/$/' as real));/ge

The suffix

Now a complete insert statement has been completed for each row. It is foolhardy to assume each row will readily go into the database. It is time to number and differentiate the different rows. If there should be a problem with one of the insert statements, it will be readily apparent which row is having the issue. While we are at it, let us make sure that individual errors do not spoil the whole batch. These problem rows can be worked separately.

The suffixing of each line will create a new line with our desired print statements. These are not finished rows, so please note the xxxxxx string that is used as templated text. The xxxxxx will be rendered in a future step.

C++
:%s/.*/select 'this is statement xxxxxx';\r&/ge

Unfinished select statements

Let us initialize our counter. For this demonstration, Register I will be used.

C++
:let @i = 0

Let us use the power of g to affect the numbering. Time to give a shoutout to the Vim Tips Wiki for instruction on how to use this wonderful technique. Here is the link for the Power of g. I think of this form of g as a two stage command. The g part chooses the lines, and the execute part specifies what is to be done at that line. When the g command matches a line, the cursor is effectively placed on the matched line. The column of the line might need to be specified if doing a column operation on the line. This g command cannot be properly displayed inside html markup, so I will show an image of the final g command.

final g command to configure the SQL Insert statements

The two control characters that do not render correctly are the CTRL-R and the second is the Return character. Both these characters can be correctly rendered by pressing CTRL-Q and then typing the specific control character. Being that the g statement contains control statements that do not survive html rendering, I offer this as a workaround in order to complete the demonstration. If the reader is a Vim novice, I recommend sourcing the Vim script gcommand_for_lesson_c_.vim.

C++
:source gcommand_for_lesson_c_.vim

If you have experience in Vim, I recommend the following.

C++
:tabe gcommand_for_lesson_c_.vim
:normal 0v$"ay

The g command is now loaded to your A register. When you go to command mode, type CTRL-R and then a to paste the proper command from register A onto the command line.

Numbered select statements

An explaination of the g command is in order. On all lines containing the xxxxxx, execute two commands -- separated with the pipe character.

  1. Increment the counter in Register I
  2. Replace the xxxxxx string with the count contained in Register I.
  3. Please note that the scope of the substitute command is the current line.
  4. When a line is matched, the g command places the cursor on the line containing the match.

Running the insert statements

Appendix: Running the Example SQL on SQL Server

With the finished output of master_records_file__dat.dat, run the command show below. The text destination will be changed to a char(100). Please note that if the user looks for the file create_price_index__sql_server_version.sql, the table definition for SQL Server can be found. After the insert statements have been changed, the example should work on SQL Server.

C++
:%s/ as text/, char(100)/g

Of course, the user is encouraged to source the file convert_sql_to_SQL_Server.vim if they so desire.

C++
:source convert_sql_to_SQL_Server.vim

Attachments

Please refer to the README file in the zipped attachment for a manifest of the included files.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralMy vote of 2 Pin
Member 1236439022-Jul-18 22:51
Member 1236439022-Jul-18 22:51 
GeneralRe: My vote of 2 Pin
Robert Crockett23-Jul-18 3:23
Robert Crockett23-Jul-18 3:23 
GeneralRe: My vote of 2 Pin
Robert Crockett25-Jul-18 13:12
Robert Crockett25-Jul-18 13:12 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.