Tuesday, March 12, 2013

Using a custom SSIS script component source to process NetCDF files

In case you were interested...

Recently I was involved in a data warehousing project with a number of SSIS-based ETL processes loading data from a variety of sources into a dimensional model in the data warehouse. One of these sources was weather forecast data (daily minimum & maximum temperatures, average rainfall) supplied daily by a government bureau in NetCDF format.

Prior to this project I'd never even heard of NetCDF, but we nevertheless needed to work with data in this format as part of our ETL processes. We discovered that the Microsoft SDS (Scientific DataSet) libraries are the preferred mechanism for interacting with NetCDF data from the .NET platform. Although these libraries are part of a Microsoft Research initiative whose licence does not permit commercial use, we obtained permission from Microsoft to use them for our purposes.

We considered the following alternatives:
  1. Custom .NET web service leveraging the Microsoft SDS libraries (T-SQL INSERT)
  2. Custom .NET web service leveraging the Microsoft SDS libraries (SqlBulkCopy)
  3. Custom SSIS script component source leveraging the Microsoft SDS libraries
Originally it was proposed to utilise the Custom .NET web service leveraging the Microsoft SDS libraries (T-SQL INSERT) alternative, with an SSIS package downloading and extracting each forecast NetCDF file prior to invoking the web service for subsequent processing.

A prototype ASP.NET (ASMX) web service was constructed with a single LoadNetCDF service operation. The service used the Microsoft SDS libraries to read the source NetCDF file and dynamically constructed T-SQL INSERT statements to insert the data into a specified table within a specified database on a specified SQL Server instance.
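
For illustration, a minimal sketch of the prototype's core logic is shown below. The NetCDF variable names (lat, lon, maxtemp), the dbo.Forecast target table and the method signature are assumptions for the example rather than the project's actual schema, and the SDS calls use the library's imperative API (DataSet.Open / GetData<T>).

    using System.Data.SqlClient;
    using System.Globalization;
    using Microsoft.Research.Science.Data;
    using Microsoft.Research.Science.Data.Imperative;

    // Sketch of the per-record INSERT approach. Variable and table names are
    // illustrative; the real service accepted the target instance/database/table.
    public void LoadNetCDF(string netCdfPath, string connectionString)
    {
        using (var dataSet = DataSet.Open("msds:nc?file=" + netCdfPath + "&openMode=readOnly"))
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            double[] lat = dataSet.GetData<double[]>("lat");
            double[] lon = dataSet.GetData<double[]>("lon");
            float[,] maxTemp = dataSet.GetData<float[,]>("maxtemp");

            for (int i = 0; i < lat.Length; i++)
            {
                for (int j = 0; j < lon.Length; j++)
                {
                    // One statement and one round trip per record: the root
                    // cause of the poor performance observed during testing.
                    string sql = string.Format(CultureInfo.InvariantCulture,
                        "INSERT INTO dbo.Forecast (Latitude, Longitude, MaxTemp) " +
                        "VALUES ({0}, {1}, {2})", lat[i], lon[j], maxTemp[i, j]);

                    using (var command = new SqlCommand(sql, connection))
                    {
                        command.ExecuteNonQuery();
                    }
                }
            }
        }
    }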

While this alternative provided a potentially reusable web service that could fit within a service-oriented architecture, it introduced an additional dependency on the infrastructure required to host the web service. Additionally, during testing the web service performed poorly: the SSIS package could not be configured with a timeout large enough to await the web service response (> 300 seconds).

As a result, a number of additional alternative approaches were considered.

The Custom .NET web service leveraging the Microsoft SDS libraries (SqlBulkCopy) alternative retained the custom web service called from SSIS, but replaced the dynamic T-SQL INSERT statements with the .NET SqlBulkCopy class. Rather than connecting to the database and executing an INSERT statement for each candidate record extracted from the source NetCDF file, this approach progressively constructed an in-memory .NET DataTable containing all of the extracted data, then used SqlBulkCopy to bulk insert it into the database. This alternative exhibited significantly improved performance (from > 300 seconds to < 30 seconds).
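
Under the same illustrative assumptions as the sketch above, the SqlBulkCopy variant might look like this: identical extraction logic, but a single bulk insert at the end.

    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.Research.Science.Data;
    using Microsoft.Research.Science.Data.Imperative;

    // Sketch of the SqlBulkCopy variant: buffer all rows in a DataTable,
    // then perform one bulk insert. Names are illustrative, as above.
    public void LoadNetCDFBulk(string netCdfPath, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("Latitude", typeof(double));
        table.Columns.Add("Longitude", typeof(double));
        table.Columns.Add("MaxTemp", typeof(float));

        using (var dataSet = DataSet.Open("msds:nc?file=" + netCdfPath + "&openMode=readOnly"))
        {
            double[] lat = dataSet.GetData<double[]>("lat");
            double[] lon = dataSet.GetData<double[]>("lon");
            float[,] maxTemp = dataSet.GetData<float[,]>("maxtemp");

            for (int i = 0; i < lat.Length; i++)
                for (int j = 0; j < lon.Length; j++)
                    table.Rows.Add(lat[i], lon[j], maxTemp[i, j]);
        }

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Forecast";
            bulkCopy.WriteToServer(table); // one bulk operation instead of one INSERT per record
        }
    }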

The Custom SSIS script component source leveraging the Microsoft SDS libraries alternative replaced the SSIS web service call with a standard SSIS data flow task, responsible for extracting the forecast data from its source NetCDF file and inserting it into the target database. The key to this alternative was the use of an SSIS script component as the data source for the data flow task. The script component accepted a single variable parameter containing the path to the source NetCDF file. It then utilised .NET code extracted from the prototype web service to read data from the NetCDF file using the Microsoft SDS libraries and write it to an SSIS output buffer, making it available for stream-based consumption within the data flow task. This alternative also exhibited significantly improved performance (from > 300 seconds to ≈ 10 seconds).
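
A simplified sketch of the script component source follows. It assumes a read-only SSIS variable named NetCDFFilePath and output columns Latitude, Longitude and MaxTemp configured on the component's default output ("Output 0"); the NetCDF variable names remain illustrative.

    using Microsoft.Research.Science.Data;
    using Microsoft.Research.Science.Data.Imperative;

    // SSIS script component source (simplified sketch). SSIS generates the
    // UserComponent base class and the Output0Buffer from the component's
    // metadata; the column and variable names here are assumptions.
    public class ScriptMain : UserComponent
    {
        public override void CreateNewOutputRows()
        {
            string path = Variables.NetCDFFilePath;

            using (var dataSet = DataSet.Open("msds:nc?file=" + path + "&openMode=readOnly"))
            {
                double[] lat = dataSet.GetData<double[]>("lat");
                double[] lon = dataSet.GetData<double[]>("lon");
                float[,] maxTemp = dataSet.GetData<float[,]>("maxtemp");

                for (int i = 0; i < lat.Length; i++)
                {
                    for (int j = 0; j < lon.Length; j++)
                    {
                        // Each AddRow pushes a record into the data flow buffer,
                        // so downstream components consume the data as a stream.
                        Output0Buffer.AddRow();
                        Output0Buffer.Latitude = lat[i];
                        Output0Buffer.Longitude = lon[j];
                        Output0Buffer.MaxTemp = maxTemp[i, j];
                    }
                }
            }
        }
    }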

The Custom SSIS script component source leveraging the Microsoft SDS libraries alternative was selected as it removed the infrastructure dependency for hosting the custom web service and significantly improved performance. In addition, although the web service approach could have formed a reusable component within a service-oriented architecture, the design of the prototype web service did not meet the level of reusability such a goal requires, and building a production-quality reusable service was not within the scope of the project.

2 comments:

  1. I'm working on a project facing the same challenge. Would it be possible for you to share any of the code or an example? Is there a better way to do this? Any advice or suggestions you may have would be greatly appreciated.

  2. I'm also trying to store a large number of NetCDF files (global data) in SQL Server, and I'm not sure what the optimal table design for storing the data is. Would a table with columns like 'time', 'lat', 'lon', 'data' make sense?
    Any suggestion or code sharing would be greatly appreciated.
    Many thanks.
