Centerprise Best Practices: Performance Tweaking in Dataflow Design

In Part I of this two-part blog, we talked about how dataflows are the cornerstone of any data integration project in Centerprise and discussed some modularity and reusability best practices for getting the most out of your Centerprise integration projects. This week we’ll look at some performance-tweaking best practices. The Centerprise parallel-processing platform has been built for superior speed and performance, so designing your dataflows to take advantage of the software’s abilities can significantly affect your data integration performance.

Data Sources
Frequently, a dataflow can be optimized by some fine-tuning at the data source. Here are some useful optimization techniques:

Filtering in Database: When loading data from a database, enter a where clause to filter data at the source. Loading data in Centerprise and then filtering using the Centerprise Filter Transformation can significantly degrade performance.

Avoid Mapping Extra Fields: The Database Table Source automatically creates a query to load only the fields that are mapped. To take advantage of this optimization, map only the fields that are used in subsequent actions.

Sorting for Succeeding Joins: The performance of Centerprise joins improves by orders of magnitude when working with previously-sorted data. Where possible, avoid sorting data in Centerprise and sort instead in your database query by adding order by clause.

Partitioning: Centerprise Database and File Sources enable data partitioning, which speeds up reading by breaking a dataset into chunks and reading these chunks in parallel. Use partitioning if you are moving a large data table

Change Data Capture: If you periodically transfer incremental data changes, consider using one of Centerprise’s change data capture (CDC) patterns to ensure your data is as up to the minute as you need it to be. Centerprise supports a variety of CDC strategies enabling you to select the appropriate strategy to fit your environment and requirements. Refer to Additional Resources at the end of this document for more information on Centerprise change data capture.

Joins
Centerprise Join Transformations enable you to join multiple data sources. Joining often involves sorting incoming data streams, making it the most common reason for performance issues. Here are the practices to keep in mind when using joins:

  • Joining data in Centerprise is time consuming. Use Database Joins wherever possible.
  • Sort data in the source where appropriate. Joining sorted data streams is much faster.
  • When joining data from the same database, use the Database Join option in the Join Transformation. When the Database Join option is specified, Centerprise builds and runs a single query joining multiple tables. This can enhance performance in most situations.
  • Reducing the number of fields in the Join Transformation also improves performance, so be sure to remove any unnecessary fields.

Lookups

Lookups are another source of performance issues in integration jobs. Centerprise offers several caching techniques to improve the performance of lookups. Experiment with lookup caching options and select the options that work best for specific situations. Some tips are:

  • If you work with a large dataset that does not change often, consider using the new Centerprise Persistent Lookup Cache, which stores a snapshot of the lookup table on the server’s local drive and uses it in subsequent runs. In situations where the lookup table is updated daily, a snapshot can be taken on the first run after update and can be used throughout the day to process incremental data.
  • If you work with a large dataset but use only a small fraction of items in a single run, consider using the Cache on First Use option
  •  If a lookup is used in multiple places within the same flow, consider using a detached lookup
  • Where appropriate, use a database join instead of a lookup function

Destinations

Centerprise supports bulk load for popular databases. Where possible, use bulk load for database destinations and experiment with bulk load batch sizes to fine-tune performance. Centerprise offers a Diff Processor Transformation that can be used to compare an incoming data stream with existing data in the table and apply differences. In certain situations, this transformation can substantially speed up dataflows.

General Guidelines

  • Avoid unnecessary steps. Some steps such as Record Level Log incur major performance overhead and should only be used during development to isolate data issues.
  • The job monitoring window shows execution time for each step. These are not precise numbers but provide a rough approximation of time taken by each step and can be used to optimize specific steps.