Click here to Skip to main content
15,867,453 members
Articles / Database Development / SQL Server / SQL Server 2008

Dynamic Excel file loading with SSIS

Rate me:
Please Sign up or sign in to vote.
5.00/5 (14 votes)
23 Apr 2012CPOL5 min read 227.4K   8.9K   39   40
SSIS and Excel Automation

Introduction

Loading Excel files with same schema but different file name or sheet name is a challenge for many SSIS developers. Microsoft does not have any Dynamic Excel loading component that search Excel files using a pattern and indentify the sheet contains data and load it.

The article helps you to challenge such scenarios. The package described here is able to search Excel files using a pattern and load it without any user intervention.  The Sheet contains data does not have start at first row or first column.

Features

  •        Filename can be anything(ie it should match the pattern used for searching)
  •        Format of excel can be any format (.xls – Office 2003, .xlsx – Office 2007 or later)
  •        Data to be loaded can be any sheet. Zero dependency on sheet name.
  •        Data can start at any different column or row, ie it does not have to be at A1 or A2 or B10 etc.
  •        Report out if any additional fields are added to the excel data after the design.  
  •        Archive processed files to Processed folder.    

Building the package  

  1.     Create Folder structure(please feel free to create anywhere, for simplicity I use C:\ drive).

a.       Create folder SSISLoad in  C:\

b.      Create subfolder in Data on C:\SSISLoad

c.       Create subfolder in Country on C:\SSISLoad\Data

d.      Create subfolder in Test  on C:\SSISLoad\Data\Country

e.      Create subfolder in Processed  on C:\SSISLoad\Data\Country<o:p>    

2. Package Design Preparations

a.       Identify the Sheet that contains data. Rename it to Data

b.      Delete empty rows and columns such that data starts at cell A1

c.       Save it as Country Details - XX.xlsx & Copy to C:\SSISLoad\Data\Country\Test\

d.      Create a new SSIS Package Load Dynamic Excel.dtsx

e.      Create and Set package variable (string type) Input_Share_Root as C:\SSISLoad\Data\

f.        Create and Set package variable (string type) DataFile_Input_Path as Country\

g.       Create and Set package variable (string type) DataFile_Search_Pattern as Country;.xlsx

h.      Create package variable (string type) NewFieldsSummary

i.         Create package variable (string type) NewFields

j.        Create package variable (boolean type) FileFound

k.       Create package variable (string type) SourceFile 

l.         Create package variable (string type) Extraction_Statement

m.    Create package variable (string type) Connection_String

Save the package.  

3. Set  package variable Connection_String 

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\SSISLoad\Data\Country\Test\Country Details - XX.xlsx;Extended Properties="EXCEL 12.0 XML;HDR=YES;IMEX=1";

   4.  Preparing package variable Extraction_Statement

  • Save the package.  

a.       Add a DataFlow Component to package(Drag DataFlow from Tools), name it as Load Excel

b.      Add a Excel connection to package(Right Click Connection manager area >> New Connection >> Excel >> Browse and Point to C:\SSISLoad\Data\Country\Test\ Country Details - XX.xlsx Name the connection as Excel_Source

c.       Right click Excel_Source connection >>Properties >> Expression >> Click Ellipse button

                                                                 i.      Pick Connection String for Property from drop down , click Ellipse on Expression

                                                            ii.      Set expression as @[User::Connection_String] or Drag it from Variables on the expression      builder window.

d.      Go to Load Excel data flow and add Excel Source by dragging from Tool Box, name it as  Country Details

e.      Edit Country Details

f.        Set OLEDB Connection manager  as Excel_Source (from drop down)

g.       Set Data Access Mode as SQL Command (from drop down)

h.      Click Build Query

i.         Click Add Table symbol and Select Data$ and click Add, Click Close

j.        Manually check each and every column, do not check *(All Columns)

k.       Copy the SQL Generated from the query pane:

SELECT        ID, StateProvinceCode, CountryRegionCode, IsOnlyStateProvinceFlag, Name, TerritoryID

FROM            [Data$] 

l.         Cancel the Build Query process. Cancel Excel Source Editor window.

m.    Make the SQL in single line and set as value for Extraction_Statement

n.      Save the package.  

5Set Country Details  excel source component 

a.       Edit Country Details  and set Connection manager  as Excel_Source (from drop down)

b.      Set Data Access Mode as SQL Command from variable(from drop down)

c.       Set Variable name  as User:: Extraction_Statement (from drop down)

d.      Click Columns from Left pane.

e.      See all columns are populated (you may uncheck any column you do not want). Click OK 

6. Set Destination as you like and map columns.

a.       Create OLEDB Destination Connection manager OLEDB_Destination_Conn (sample used local server, Test as DB 

b.    Create table CountryDetails and map columns  

 

Write Automation Code  

1. Search excel file using the pattern specified in variable DataFile_Search_Pattern.

This pattern separated by extension using semicolon format  

2. Enumerate Excel sheets for the file obtained by Search
a.Use OLEDB driver to create connection to Excel and read schema.


3. Identify Sheet contains data and data start address.
a.Search each sheet for fields used in Extraction_Statement and determine sheetname and data start address.
b.Modify connection string as per the excel file. 

Using the Code 

How it works(Concept)

The package search for the file using the specification supplied, determines the sheet contains data and its start address(like A1 OR C10 etc), and finally modifies the connection string to the Excel file to point the file obtained and modifies SQL Query to read the excel file from the sheet identified. This level of dynamic loading is achieved with the help of user defined variables in the package, and thus the run-time setting of Connection Managers, SQL Command to read/pull data are updated with the help of variables.   

 Let's now analyze concept (code perspective) as below: 

The following method searches file in the path supplied. It uses matching pattern as parameter searchstring  and use parameter extension as filter 

C#
public string SearchFile(string path, string extension, string searchstring)
 {
     DirectoryInfo di = null;
     if (Directory.Exists(path))
     {
         di = new DirectoryInfo(path);
     }
     else
         return "Directory Does not Exist";
     string newestFile;
     IEnumerable<System.IO.FileInfo> fileList = di.GetFiles("*" + searchstring.ToLower() + "*");
     //Create the query
     IEnumerable<System.IO.FileInfo> fileQuery =
         from file in fileList
         where (extension.ToLower().Contains(file.Extension.ToLower()))
         orderby file.LastWriteTime
         select file;

     try
     {

         var FileSearchedResult = (from file in fileQuery orderby file.LastWriteTime select new { file.FullName, file.Name, file.CreationTime }).Last();
         newestFile = FileSearchedResult.FullName;
         FileSearchedResult = null;
         fileList = null;
         di = null;
         return newestFile;
     }
     catch
     {
         fileList = null;
         di = null;
         return null;

     }


 } 

The following method GetExcleSheetNames enumerates all available sheets in the excelFile supplied as parameter. It uses OLEDDB driver to connect the Excel file and reads the Schema using    GetOleDbSchemaTable. 

C#
private List<String> GetExcelSheetNames(string excelFile)
        {
            OleDbConnection objConn = null;
            System.Data.DataTable dt = null;

            try
            {
                // Connection String. Change the excel file to the file you
                // will search.
                String connString = this.GetConnectionString(excelFile);
                // Create connection object by using the preceding connection string.
                objConn = new OleDbConnection(connString);
                // Open connection with the database.
                objConn.Open();
                // Get the data table containg the schema guid.
                dt = objConn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);

                if (dt == null)
                {
                    return null;
                }

                List<String> excelSheets = new List<string>();

                // Add the sheet name to the collection.
                foreach (DataRow row in dt.Rows)
                {
                    string WorkSheet = row["TABLE_NAME"].ToString();
                    WorkSheet = WorkSheet.StartsWith("'") ? "[" + WorkSheet.Substring(1) : WorkSheet;
                    WorkSheet = WorkSheet.StartsWith("[") ? WorkSheet : "[" + WorkSheet;
                    WorkSheet = WorkSheet.EndsWith("'") ? WorkSheet.Substring(0, WorkSheet.Length - 1) + "]" : WorkSheet;

                    WorkSheet = WorkSheet.Substring(0, WorkSheet.Length - 1).EndsWith("$") ? WorkSheet : WorkSheet.Substring(0, WorkSheet.Length - 1) + "$]";


                    if (!excelSheets.Exists(delegate(string k) { return (k.ToLower() == WorkSheet.ToLower() || k.ToLower() + "$" == WorkSheet.ToLower()); }))
                        excelSheets.Add(WorkSheet);
                }
                return excelSheets;
            }
            catch (Exception ex)
            {
                return null;
            }
            finally
            {
                // Clean up.
                if (objConn != null)
                {
                    objConn.Close();
                    objConn.Dispose();
                }
                if (dt != null)
                {
                    dt.Dispose();
                }
            }
        } 

The following method ScanWorkSheet builds DataSet for first 1000 rows from WorkSheet(parameter)  

C#
private SourceFileDetails ScanWorkSheet(string excelFile, string WorkSheet, string ExtractionStatement, string FindColumn, out bool Success)
      {
          Success = false;
          System.Data.DataSet excelDataSet = new DataSet();
          string connectionString = this.GetConnectionString(excelFile);


          using (OleDbConnection objConn = new System.Data.OleDb.OleDbConnection(connectionString))
          {
              try
              {
                  objConn.Open();
                  OleDbDataAdapter cmd = new OleDbDataAdapter("select top 1000 * from " + WorkSheet, objConn);
                  cmd.Fill(excelDataSet, WorkSheet);
                  cmd.Dispose();
              }
              catch { }
              finally
              {
                  if (objConn != null)
                  {
                      objConn.Close();
                      objConn.Dispose();
                  }

              }
          }

          SourceFileDetails sd = ScanRows(excelDataSet, ExtractionStatement, FindColumn);
          if (sd != null)
          {
              Success = true;
              sd.FileNamePath = excelFile;
              sd.ConnectionString = this.GetConnectionString(excelFile, true);
              return sd;
          }
          return (SourceFileDetails)null;
      } 

The following method ScanRows scans each row find our interested columns. This is the way it identifies sheet contains data, and data start address. They key step in this method is build the list of columns/fields we are interested and search for it. Here we assume those fields/columns occur within 1000 rows. 1000 rows are selected here for performance and nobody create an excel file contains data after leaving first 1000 rows as blank. 

C#
         public SourceFileDetails ScanRows(DataSet excelDataSet, string ExtractionStatement, string FindColumn)
        {
            if (excelDataSet.Tables.Count < 1)
                return (SourceFileDetails)null;

            string ExtractFields = this.ReplaceString(this.ReplaceString(ExtractionStatement, "SELECT", ""), "From [Data$]", "");
            List<string> FindStrings = ExtractFields.Split(',').Select(s => s.Trim().Replace("[", "").Replace("]", "").Replace("#", ".").Replace("(", "").Replace(")", "")).ToList();

            foreach (DataTable dt in excelDataSet.Tables)
            {
                List<ItemValueHolder> FoundAddress = new List<ItemValueHolder>();
                int iDuplicates = 0;

                foreach (DataRow dr in dt.Rows)
                {
                    int iItemColumnIndex = 0;
                    foreach (var fieldValue in dr.ItemArray)
                    {
                        object cellData = fieldValue;
                        string sCellData = cellData.ToString().Replace("#", ".").Replace("[", "").Replace("]", "").Replace("(", "").Replace(")", "");

                        if (cellData != null)

                            if (FindStrings.Exists(delegate(string k) { return k.ToLower() == sCellData.ToString().ToLower(); }))
                            {
                                if (!FoundAddress.Exists(delegate(ItemValueHolder t) { return t.Item.ToLower() == sCellData.ToLower(); }))
                                    FoundAddress.Add(new ItemValueHolder(sCellData, dt.Rows.IndexOf(dr), iItemColumnIndex));
                                else
                                    iDuplicates++;
                            }
                        iItemColumnIndex++;
                    }
                    int iTotalFields = FindStrings.Count - iDuplicates;

                    if (100 * FoundAddress.Count / (float)iTotalFields >= 90.00)
                    {
                        var query = FoundAddress.GroupBy(
                    item => item.ItemValue,
                    (itemvalue, items) => new
                    {
                        Key = (int)itemvalue,
                        Count = items.Count(),
                    });


                        int dataStartAddress = (from p in query
                                                where p.Count == (query.Max(it => it.Count))
                                                select p.Key).Max();
                        dataStartAddress += 1;
                        string SheetName = "[" + dt.TableName + "$A" + dataStartAddress.ToString() + ":IV]";
                        string _selectCommand = this.ReplaceString(ExtractionStatement, "[Data$]", SheetName);
                        string reportDate = "";
                        if (!string.IsNullOrEmpty(FindColumn))
                            try
                            {

                                int ColIndex = (from p in FoundAddress
                                                where p.Item.ToLower() == FindColumn.ToLower()
                                                select p.ColumnIndex).First();
                                DataRow drRowReport = dt.Rows[dataStartAddress];
                                reportDate = drRowReport[ColIndex].ToString();

                            }
                            catch (Exception e)
                            {
                                //log 
                            }
                        string NewFields = GetNewFieldsAtSource(FindStrings, dr);
                        return new SourceFileDetails("", "", "", _selectCommand, reportDate, NewFields);
                    }
                }
            }
            return (SourceFileDetails)null;
        } 

The following class is used to store Excel file details that we can use to modify connection strings dynamically . The class is nothing but a information bundle; no other important methods that we use inside this class.    

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ST_b050e37a03e84ca7b15b853949d29aa6.csproj
{
    public class SourceFileDetails
    {
        private string _fileNamePath = "";
        public string FileNamePath
        {
            get { return _fileNamePath; }
            set { _fileNamePath = value; }
        }

        private string _fileExtension = "";
        public string FileExtension
        {
            get { return _fileExtension; }
            set { _fileExtension = value; }
        }

        private string _connectionString = "";
        public string ConnectionString
        {
            get { return _connectionString; }
            set { _connectionString = value; }
        }
        private string _extractQuery = "";
        public string ExtractQuery
        {
            get { return _extractQuery; }
            set { _extractQuery = value; }
        }
        private string _reportDate = "";
        public string ReportDate
        {
            get { return _reportDate; }
            set { _reportDate = value; }
        }

        private string _newFields = "";
        public string NewFields
        {
            get { return _newFields; }
            set { _newFields = value; }
        }
        public SourceFileDetails() { }

        public SourceFileDetails(string fileNamePath, string fileExtension, string connectionString, string extractQuery, string reportDate, string newFields)
        {
            this.FileNamePath = fileNamePath;
            this.FileExtension = fileExtension;
            this.ConnectionString = connectionString;
            this.ExtractQuery = extractQuery;
            this.ReportDate = reportDate;
            this.NewFields = newFields;
        }
    }
} 

The following class is again an intermediate information storage and no other functionality.  

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ST_b050e37a03e84ca7b15b853949d29aa6.csproj
{
    public class ItemValueHolder
    {
        private string _item = "";
        public string Item
        {
            get { return _item; }
            set { _item = value; }
        }
        private object _itemValue = "";
        public object ItemValue
        {
            get { return _itemValue; }
            set { _itemValue = value; }
        }

        private int _columnIndex;
        public int ColumnIndex
        {
            get { return _columnIndex; }
            set { _columnIndex = value; }
        }

        public ItemValueHolder() { }
        public ItemValueHolder(string item, object itemValue, int columnIndex)
        {
            this.Item = item;
            this.ItemValue = itemValue;
            this.ColumnIndex = columnIndex;
        }
    }
} 

The following method builds the connection string for the particular excel file supplied. It can build connection string irrespective of data has Header row not. Variable HasHeader determines if we have Header row or not. 

C#
public string GetConnectionString(string FileNamePath, bool HasHeader)
        {
            string ConnectionString = "";
            string Extension = Path.GetExtension(FileNamePath).ToLower();
            
            string BinaryExcelProvider = "Microsoft.Jet.OLEDB.4.0";
            string XmlExcelProvider = "Microsoft.ACE.OLEDB.12.0";
            string BinaryExcelExtProperties = "Excel 8.0";
            string XmlExcelExtProperties = "Excel 12.0";
            string XmlMacroExcelExtProperties = "EXCEL 12.0 Macro";
                       
            string Provider = "";
            string ExtendedProperties = "";

            switch (Extension)
            {
                case ".xls":
                    Provider = BinaryExcelProvider;
                    ExtendedProperties = BinaryExcelExtProperties;
                    break;

                case ".xlsx":
                    Provider = XmlExcelProvider;
                    ExtendedProperties = XmlExcelExtProperties;
                    break;

                case ".xlsm":
                    Provider = XmlExcelProvider;
                    ExtendedProperties = XmlMacroExcelExtProperties;
                    break;
            }

            string Header = ";HDR=NO;IMEX=1";
            if (HasHeader)
                Header = ";HDR=YES;IMEX=1";
            string ConnectionStringFormat = "Provider={0};Data Source={1};Extended Properties=\"{2}{3}\";";

            ConnectionString = string.Format(ConnectionStringFormat, Provider, FileNamePath, ExtendedProperties, Header);
            return ConnectionString;
        }

All codes, packages(SSIS Project Solution file), sql scripts to generate destination table and sample excel files that can used for testing are attached.

Points of Interest 

Multiple files loading with for each file enumerator instead using Search method is also uploaded. Please come back with questions or doubts, I will be glad to help you. 

History 

v3 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Business Analyst
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionHow do I read multiple sheets Pin
Member 1327849527-Jun-17 21:07
Member 1327849527-Jun-17 21:07 
QuestionRunning the Code Pin
TatMys7-Mar-16 6:37
TatMys7-Mar-16 6:37 
AnswerRe: Running the Code Pin
TatMys8-Mar-16 2:47
TatMys8-Mar-16 2:47 
QuestionReferences & 64 bit operation Pin
Nique_m3-Jan-13 2:16
Nique_m3-Jan-13 2:16 
SuggestionRe: References & 64 bit operation Pin
Nique_m3-Jan-13 5:45
Nique_m3-Jan-13 5:45 
GeneralRe: References & 64 bit operation Pin
Kuthuparakkal3-Jan-13 18:31
Kuthuparakkal3-Jan-13 18:31 
GeneralRe: References & 64 bit operation Pin
teh.kok.leong25-Nov-14 22:26
teh.kok.leong25-Nov-14 22:26 
GeneralRe: References & 64 bit operation Pin
Kuthuparakkal26-Nov-14 9:32
Kuthuparakkal26-Nov-14 9:32 
QuestionLoading all multiple worksheets from excel files Pin
Rio Khn28-Oct-12 23:41
Rio Khn28-Oct-12 23:41 
AnswerRe: Loading all multiple worksheets from excel files Pin
Kuthuparakkal31-Oct-12 17:14
Kuthuparakkal31-Oct-12 17:14 
You can do this by a ForLoop inside another Forloop.

1. Outer ForEach : Loop through available Files.
2. Create a Script Task(inside outer Foreach) and Find the number of Sheets in the Excel
3. Inner For(inside outer Foreach) : Loop through 0 to NumberofSheets.
4. Create Script Task (inside inner Foreach) : Build Select Statement for which "FROM [SheetA$]" will be based on the Inner For variable.

You can fetch Sheets(SheetNames) from Excel using OLEDB driver. Please Look at the article source codes.
GeneralRe: Loading all multiple worksheets from excel files Pin
Rio Khn2-Nov-12 2:36
Rio Khn2-Nov-12 2:36 
GeneralMy vote of 5 Pin
Kanasz Robert24-Sep-12 5:48
professionalKanasz Robert24-Sep-12 5:48 
GeneralRe: My vote of 5 Pin
Kuthuparakkal24-Sep-12 16:11
Kuthuparakkal24-Sep-12 16:11 
GeneralSome questions Pin
svenb7730-Jul-12 0:07
svenb7730-Jul-12 0:07 
AnswerRe: Some questions Pin
Kuthuparakkal30-Jul-12 1:46
Kuthuparakkal30-Jul-12 1:46 
GeneralRe: Some questions Pin
svenb7731-Jul-12 2:02
svenb7731-Jul-12 2:02 
AnswerRe: Some questions Pin
Kuthuparakkal2-Aug-12 8:23
Kuthuparakkal2-Aug-12 8:23 
QuestionRe: Some questions Pin
svenb772-Aug-12 19:31
svenb772-Aug-12 19:31 
AnswerRe: Some questions Pin
Kuthuparakkal3-Aug-12 21:08
Kuthuparakkal3-Aug-12 21:08 
QuestionRe: Some questions Pin
svenb775-Aug-12 21:47
svenb775-Aug-12 21:47 
AnswerRe: Some questions Pin
Kuthuparakkal5-Aug-12 23:03
Kuthuparakkal5-Aug-12 23:03 
QuestionRe: Some questions Pin
svenb776-Aug-12 1:04
svenb776-Aug-12 1:04 
AnswerRe: Some questions Pin
Kuthuparakkal6-Aug-12 1:22
Kuthuparakkal6-Aug-12 1:22 
QuestionRe: Some questions Pin
svenb776-Aug-12 20:28
svenb776-Aug-12 20:28 
AnswerRe: Some questions Pin
Kuthuparakkal6-Aug-12 22:01
Kuthuparakkal6-Aug-12 22:01 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.