Creating High Performance Dataflows in Centerprise
Centerprise is designed to deliver the performance and scalability needed to efficiently process high data volumes. Centerprise’s parallel processing architecture exploits multicore and multiprocessor hardware and delivers performance that usually scales linearly with the available CPU power. When working with high-volume data, it is important to understand how Centerprise processes data as it moves through the dataflow pipeline; this understanding will help you create and refine high-performance dataflows.
You can use a number of built-in features to optimize database reading, including partitioning, change data capture, and multi-table queries.
Read Only What is Necessary. When reading from a database, use a WHERE clause to filter out the records you do not need. Using a Filter Transformation after the source still reads the unneeded records from the database. If subsequent transformations expect sorted data, use an ORDER BY clause to sort the records in the database, as in the sketch below.
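As a minimal sketch, a source query along these lines (the table and column names are illustrative) pushes both the filtering and the sorting down to the database:

```sql
-- Filter and sort in the database instead of in the dataflow.
SELECT OrderId, CustomerId, OrderDate, TotalAmount
FROM Orders
WHERE OrderDate >= '2014-01-01'   -- read only the records you need
ORDER BY CustomerId;              -- presort for downstream transformations
```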
Do Not Map All Fields. For a Database Table Source, Centerprise reads only the fields that are used in maps, which can substantially improve database read performance. Therefore, avoid mapping the entire table and map only the fields you need.
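The effect is comparable to the difference between the two queries below (names are illustrative); the narrower select list reduces I/O and network traffic:

```sql
-- Reads every column, including wide ones that are never used:
SELECT * FROM Customers;

-- Reads only the fields that are actually mapped:
SELECT CustomerId, Name, Email FROM Customers;
```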
Use Multi-Table Query. If you read data from multiple tables in the same database and then use the Centerprise Join Transformation to join them, consider using the Centerprise Multi-Table Query source instead. This component features a sophisticated query engine for building queries that span multiple tables.
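A multi-table query effectively lets the database perform the join, roughly as in this sketch (table and column names are illustrative):

```sql
-- Joining inside the database avoids streaming both tables
-- into the dataflow and joining them there.
SELECT o.OrderId, o.OrderDate, c.Name, c.Region
FROM Orders o
INNER JOIN Customers c ON c.CustomerId = o.CustomerId;
```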
Use SQL Query Source. If you have an existing stored procedure, or would like to write a query that is optimized for your specific scenario, use the SQL Query Source.
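For example, assuming SQL Server syntax and a hypothetical procedure name, the source query could simply be:

```sql
-- An existing, pre-optimized stored procedure as the data source.
EXEC dbo.GetOpenOrders @Region = 'West';
```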
Partition Data. You can use the data partitioning feature of the Database Table Source. This feature parallelizes database reading by creating multiple readers, each reading a separate slice of the table.
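Conceptually, partitioned reading issues several range queries in parallel, one per reader; the key ranges below are purely illustrative:

```sql
-- Reader 1
SELECT OrderId, CustomerId, TotalAmount FROM Orders WHERE OrderId <= 1000000;
-- Reader 2
SELECT OrderId, CustomerId, TotalAmount FROM Orders WHERE OrderId BETWEEN 1000001 AND 2000000;
-- Reader 3
SELECT OrderId, CustomerId, TotalAmount FROM Orders WHERE OrderId > 2000000;
```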
Use Change Data Capture. Centerprise includes multiple change data capture strategies to minimize data reads, including Audit Field and Incremental Read strategies, as well as corresponding database writing strategies. If you transfer data at regular intervals and only a small fraction of the data changes between transfers, CDC can substantially speed up your dataflows.
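An audit-field strategy boils down to reading only the rows modified since the previous run, roughly as follows (the column name and the high-water-mark parameter are illustrative):

```sql
-- Read only rows changed since the last successful transfer.
SELECT OrderId, CustomerId, TotalAmount, ModifiedDate
FROM Orders
WHERE ModifiedDate > @LastRunTimestamp;  -- high-water mark saved from the previous run
```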
Centerprise supports native bulk insert for popular databases including SQL Server, Oracle, DB2, Sybase, MySQL, Access, Netezza, and Teradata, enabling you to efficiently process high data volumes. Where applicable, use the bulk insert option when writing to a database. Centerprise also batches updates and deletes to optimize writes, and it includes several write strategies that can speed up database writes, including the Data-Driven strategy, Slowly Changing Dimensions, and the Diff Processor.
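For comparison, this is roughly what a native bulk load looks like in SQL Server (the staging table name, file path, and options are illustrative); bulk APIs like this load data in large batches instead of issuing row-by-row INSERT statements:

```sql
-- One bulk operation instead of millions of individual INSERTs.
BULK INSERT dbo.Orders_Staging
FROM 'C:\loads\orders.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 100000);
```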
It is important to understand the performance and memory characteristics of transformations. Transformations are either blocking or non-blocking. Blocking transformations hold data until all records have been received and processed, while non-blocking transformations continuously process and release records. Blocking transformations often impede parallel processing because downstream transformations must wait for the entire dataset to be received and processed. Sort is an example of a blocking transformation. Join, Distinct, Aggregate, Denormalize, and several other transformations are non-blocking if the data is sorted and blocking if it is not. Avoid repeated sorting of large datasets; where applicable, use an ORDER BY clause to sort the data in the database, as shown below. Remember, checking the ‘Sort Incoming Data’ option has the same effect as inserting a Sort Transformation.
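For instance, presorting both inputs of a join on the join key in the database lets the join run in streaming, non-blocking mode (names are illustrative):

```sql
-- Both source queries deliver data already sorted on the join key,
-- so the downstream join can stream instead of buffering.
SELECT OrderId, CustomerId, TotalAmount FROM Orders ORDER BY CustomerId;
SELECT CustomerId, Name, Region FROM Customers ORDER BY CustomerId;
```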
Distinct and Aggregate Transformations: Distinct and Aggregate Transformations use little memory when the incoming data is sorted on the key or group-by fields. If the incoming data is not sorted, these transformations must store values in memory, which can consume an enormous amount of memory for large datasets and slow down the rest of the server.
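In other words, feeding the transformation data that is already sorted on the group-by key keeps its memory footprint flat (names are illustrative):

```sql
-- Sorted on the group-by key, so Aggregate can emit each group
-- as soon as the key changes rather than buffering all groups.
SELECT CustomerId, TotalAmount
FROM Orders
ORDER BY CustomerId;
```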
Field Profile: The Field Profile Transformation can consume significant memory if the number of distinct values is very large. Use this transformation with caution on large datasets.
Lookups: Lookups can slow down your dataflows considerably. If a lookup table is large, ensure that it is indexed on the lookup fields. Use the appropriate caching option for the size of the lookup table and the number of values you are looking up. Where appropriate, consider using a Join Transformation instead of a lookup.
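If the lookup table lives in a database, an index on the lookup fields turns each probe into an index seek instead of a table scan (table and field names are illustrative):

```sql
-- Index the field(s) the lookup matches on.
CREATE INDEX IX_Customers_Email ON Customers (Email);
```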
Excel files are memory hogs. While reading, the entire file is loaded into memory, which slows down the dataflow and consumes substantial memory resources. Similarly, the Excel Destination keeps the entire file in memory and writes it to disk at the end of the dataflow. Avoid using Excel workbooks for large data volumes.