SSIS Package: Performance Optimization

1. Avoid Asynchronous Transformations (such as the Sort transformation) wherever possible.
The Sort transformation has to receive all incoming rows before it can start processing them. Instead of using a Sort transformation, retrieve the rows already sorted from the data source using an ORDER BY clause, as in the example below.
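For example, a source query along these lines (table and column names are purely illustrative) pushes the sort down to the database engine instead of the data flow:

  -- Illustrative OLE DB Source query: let SQL Server sort the rows
  -- instead of a Sort transformation downstream.
  SELECT  CustomerID,
          OrderDate,
          TotalDue
  FROM    dbo.SalesOrderHeader
  ORDER BY CustomerID;

Remember to set IsSorted = True and the SortKeyPosition values on the source output (Advanced Editor) so that downstream components such as Merge Join know the rows arrive sorted.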
2. Pulling High Volumes of Data
Drop all non-clustered indexes and the clustered index on the destination table if they exist, then transfer and load the data into the destination table. Recreate the clustered index and the non-clustered indexes after the data load completes, as in the sketch below.
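A minimal T-SQL sketch of the pre- and post-load steps (index and table names are hypothetical; a clustered index that backs a primary key would need ALTER TABLE instead of DROP INDEX):

  -- Before the data flow: drop indexes on the destination (hypothetical names).
  DROP INDEX IX_SalesStage_CustomerID ON dbo.SalesStage;  -- non-clustered
  DROP INDEX CIX_SalesStage ON dbo.SalesStage;            -- clustered

  -- ... run the SSIS data flow that loads dbo.SalesStage ...

  -- After the data flow: recreate the clustered index first, then the non-clustered ones.
  CREATE CLUSTERED INDEX CIX_SalesStage
      ON dbo.SalesStage (SalesOrderID);
  CREATE NONCLUSTERED INDEX IX_SalesStage_CustomerID
      ON dbo.SalesStage (CustomerID);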
 
3. Avoid SELECT *
The Data Flow Task uses a buffer-oriented architecture for data transfer and transformation. When data moves from source to destination, it first lands in a buffer, the required transformations are performed in the buffer itself, and the rows are then written to the destination.
The size of a buffer depends on the estimated row size, which is the sum of the maximum sizes of all columns in the row. So the more columns in a row, the fewer rows fit in a buffer. Hence, select only the columns that are actually required at the destination.
Even if we need all the columns from the source, name the columns explicitly in the SELECT statement; otherwise the source adapter takes another round trip to gather metadata about the columns when you use SELECT *.
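A small illustration (table and column names are hypothetical):

  -- Prefer an explicit column list in the OLE DB Source...
  SELECT  OrderID,
          CustomerID,
          OrderDate      -- only the columns the destination actually needs
  FROM    dbo.Orders;

  -- ...rather than:
  -- SELECT * FROM dbo.Orders;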

4. Effect of OLE DB Destination Settings
There are a few settings on the OLE DB Destination that can impact the performance of the data transfer, as listed below.
Data Access Mode – This setting provides the 'fast load' option, which internally uses a BULK INSERT statement to load data into the destination table instead of a simple INSERT statement per row, as the other options do. So unless you have a reason to change it, keep this default of fast load. If you select the 'fast load' option, there are several additional settings you can use, as discussed below.

Keep Identity – By default this setting is unchecked, which means the destination table (if it has an identity column) will generate identity values on its own. If you check this setting, the data flow engine ensures that the identity values coming from the source are preserved and inserted into the destination table.
Keep Nulls – Again, by default this setting is unchecked, which means that when a NULL arrives from the source for a column that has a default constraint on the destination, the default value is inserted. If you check this option, the default constraint on the destination column is ignored and the NULL from the source is preserved and inserted into the destination.

Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table-level lock will be acquired on the destination table instead of many row-level locks, which could otherwise turn into lock escalation problems.
Check Constraints – Again, by default this setting is checked, and the recommendation is to uncheck it if you are sure the incoming data will not violate the constraints of the destination table. The setting specifies that the data flow pipeline engine validates incoming data against the constraints of the target table; unchecking it improves the performance of the data load.
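For readers more familiar with T-SQL, these fast-load options correspond roughly to the options of a BULK INSERT statement; a hedged sketch, with a hypothetical file path and table name:

  BULK INSERT dbo.SalesStage
  FROM 'C:\loads\sales.dat'
  WITH (
      KEEPIDENTITY,        -- ~ Keep Identity
      KEEPNULLS,           -- ~ Keep Nulls
      TABLOCK,             -- ~ Table Lock
      CHECK_CONSTRAINTS    -- ~ Check Constraints (omit it to skip constraint checking)
  );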

5. Effect of Rows Per Batch and Maximum Insert Commit Size Settings
5.1 Rows per batch:
The default value for this setting is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches; the only allowed value is a positive integer specifying the maximum number of rows per batch.
5.2 Maximum insert commit size:
The default value for this setting is 2147483647 (the largest value for a 4-byte signed integer), which means all incoming rows are committed once, on successful completion. You can specify a positive value to have a commit issued after that many rows. Committing more often does put extra overhead on the data flow engine, but at the same time it relieves the pressure on the transaction log and tempdb, particularly during high-volume data transfers.

Understanding these two settings is important for controlling the growth of tempdb and the transaction log. For example, if you leave 'Maximum insert commit size' at its default, the transaction log and tempdb will keep growing during the extraction process, and if you are transferring a high volume of data, tempdb can run out of space and the extraction will fail. So it is recommended to set these values to an optimum level based on your environment; a rough T-SQL analogy follows.
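Continuing the BULK INSERT analogy (file path, table name and batch size are hypothetical), committing every N rows instead of in one huge transaction is what keeps the log and tempdb in check:

  BULK INSERT dbo.SalesStage
  FROM 'C:\loads\sales.dat'
  WITH (
      TABLOCK,
      BATCHSIZE = 100000   -- ~ Maximum insert commit size: commit every 100,000 rows
  );
  -- The separate ROWS_PER_BATCH option plays a role similar to 'Rows per batch':
  -- it is a hint about the expected number of incoming rows, not a commit interval.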

6. DefaultBufferSize and DefaultBufferMaxRows:
The execution tree creates buffers for storing incoming rows and performing transformations.

The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few factors. The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns in the incoming records. The second is the DefaultBufferSize property of the Data Flow Task, which specifies the default maximum size of a buffer. Its default value is 10 MB, and its upper and lower boundaries are constrained by two internal SSIS properties, MaxBufferSize (100 MB) and MinBufferSize (64 KB); in other words, a buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, again a property of the Data Flow Task, which specifies the default maximum number of rows in a buffer. Its default value is 10,000.

If the estimated row size multiplied by DefaultBufferMaxRows exceeds DefaultBufferSize, the engine reduces the number of rows in the buffer. For better buffer performance you can do two things.
First, remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file. This lets you fit as many rows as possible into each buffer.
Second, if your system has sufficient memory available, you can tune these properties to use a small number of large buffers, which can improve performance. Beware: if you push these values to the point where buffers start spooling to disk (see the 'Buffers spooled' counter in section 10), performance suffers. So before settling on values for these properties, test thoroughly in your environment; a worked example of the buffer math follows.
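To make the buffer math concrete, a hypothetical example: suppose the estimated row size is 1,200 bytes. DefaultBufferMaxRows (10,000 rows) would then require about 12 MB, which exceeds the 10 MB DefaultBufferSize, so the engine trims the buffer to roughly 10 MB / 1,200 bytes ≈ 8,700 rows. If the row were only 500 bytes, 10,000 rows would need about 5 MB, and DefaultBufferMaxRows would be the limiting factor instead.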

7. How DelayValidation property can help you
SSIS uses two types of validation.
First is package validation (early validation) which validates the package and all its components before starting the execution of the package.
Second is component validation (late validation), which validates the components of the package once execution has started, just before each component runs.

Let's consider a scenario where the first component of the package creates an object, e.g. a temporary table, which is referenced by the second component of the package. During package validation, the first component has not yet executed, so the object does not exist, and validation of the second component fails. SSIS throws a validation exception and never starts executing the package. So how do you get this package running in this common scenario?

To help in this scenario, every component has a DelayValidation property (default = False). If you set it to True, early validation is skipped and the component is validated only at the component level (late validation), which happens during package execution. A sketch of the scenario follows.
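A minimal sketch, assuming the first step is an Execute SQL Task that creates a staging object (names are hypothetical):

  -- Step 1: Execute SQL Task creates the table that a later Data Flow Task reads or loads.
  -- Because the table does not exist at package-validation time, the downstream
  -- task must have DelayValidation = True, or early validation will fail.
  IF OBJECT_ID('tempdb..##StagingOrders') IS NOT NULL
      DROP TABLE ##StagingOrders;

  CREATE TABLE ##StagingOrders (
      OrderID    INT      NOT NULL,
      CustomerID INT      NOT NULL,
      OrderDate  DATETIME NOT NULL
  );

If you use temporary tables like this across tasks, it is also common to set RetainSameConnection = True on the connection manager so the table survives between tasks.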

8. Better performance with parallel execution
9. When to use event logging and when to avoid it.

10. Monitoring the SSIS Performance with Performance Counters
Launch Performance Monitor:
1. Start -> All Programs -> Administrative Tools -> Performance
2. Load the SSIS-related counters:
In the Performance object list, select SQL Server:SSIS Pipeline and SQL Server:SSIS Service.

SSIS provides a set of performance counters. Among them, the following are the most helpful when you tune or debug a package:
Buffers in use
Flat buffers in use
Private buffers in use
Buffers spooled
Rows read
Rows written

“Buffers in use”, “Flat buffers in use” and “Private buffers in use” are useful for discovering buffer leaks. During package execution you will see these counters fluctuate, but once the package finishes, their values should return to what they were before the execution. Otherwise, buffers have been leaked.

“Buffers spooled” has an initial value of 0. When it goes above 0, it indicates that the engine has started swapping buffers to disk: if the data flow engine runs low on physical memory, buffers not currently in use are written to disk and reloaded when needed. In a case like this, set the Data Flow Task properties BLOBTempStoragePath and BufferTempStoragePath to locations with the best available I/O bandwidth.

“Rows read” and “Rows written” show how many rows the entire Data Flow has processed. 

11. FastParse property
The Fast Parse option in SSIS can be used for very fast loading of flat file data. It speeds up parsing of integer, date, and time types when the conversion does not have to be locale-sensitive. The option is set on a per-column basis using the Advanced Editor for the Flat File source.
 
12. Checkpoint feature helps in restarting packages
