Managing Big Data for Audit Compliance and Business Intelligence

By Ryan Goldman, Director of Product Marketing, Cloudera

Recordkeeping and reporting requirements have long been a challenge for the financial services industry and are the original definition of the sector’s big data problem. The dual objectives of managing historical data to comply with federal requirements and retrieving and querying more data on an ad hoc basis can be both disruptive to the business and prohibitively expensive. The diversity of data makes reporting expensive because of the variety of workloads required: extract-transform-load (ETL), warehousing, and reporting. Meanwhile, Structured Query Language (SQL), which is primarily used for business intelligence and analysis, is not an adequate tool for order linkage.

Audits to comply with the Order Audit Trail System (OATS) regulation of the Securities and Exchange Commission (SEC) are complex and costly. They require data to be found, collected, transformed, stored, and reported on demand from a variety of sources and data formats, on relatively short timelines, in order to avoid fines (or worse). Once the data is brought together, it typically sits in storage and is no longer easily available to the business. Soon, the Consolidated Audit Trail (CAT) will require finer-grained order, cancelation, modification, and execution details in a system governed by the Financial Industry Regulatory Authority (FINRA).

Expanding reporting requirements, for both industry firms and regulatory agencies, are overwhelming systems that were originally built on traditional data warehouses, then duplicated and archived on tape or redundant array of independent disks (RAID) storage to meet Write-Once/Read-Many (WORM) requirements. On the reporting side, the relational database management system (RDBMS) breaks down under the increasing volume and variety of data required for OATS (and, eventually, CAT) compliance.

Build a Hadoop Active Archive

As the data required for compliance with an increasing variety of risk, conduct, transparency, and technology standards grows to exabyte scale, financial services firms and regulatory agencies are building enterprise data hubs with Apache Hadoop at the core. With Hadoop, the IT department works across the different business units to build an active archive that multiple users, administrators, and applications can access simultaneously in real time, with full fidelity and with governance based on role and profile.

With an active archive built on Hadoop, the data required for reporting becomes less disparate and requires less movement to staging and compute. HDFS (the distributed file system and primary storage layer for Hadoop) and MapReduce (the batch processing engine in Hadoop) offer significant cost savings over the vast majority of (perhaps all) other online WORM-compliant storage technologies and are far more format-tolerant and business-amenable than tape storage. The industry-standard servers on which Hadoop clusters are built also provide the benefit of latent compute alongside storage, which can easily be applied to ETL jobs to speed transformation and cut reporting timelines. Natural-language-based query tools built on Apache Solr provide full-text, interactive search on the data in Hadoop and serve as the scalable, flexible indexing component of an enterprise data hub. Impala, Hadoop’s massively parallel processing SQL engine, provides in-cluster reporting and investigation capabilities that keep the data required for auditing accessible in its original format and fidelity for business intelligence and other workloads. Apache Spark, the next-generation processing engine combining batch, streaming, and interactive analytics via in-memory capabilities, provides significantly faster and more robust order linkage.
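
To make "order linkage" more concrete, here is a minimal sketch in plain Python (not Spark, and not taken from any OATS or CAT specification). It assumes a simplified, hypothetical record layout in which each modified order carries a `prior_order_id` pointing at the order it replaced. Linking then means following those pointers back to the original order, however many hops that takes. The field names, event types, and sample data are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical, heavily simplified order events. Real audit-trail records
# carry many more fields; these names are assumptions for illustration only.
EVENTS = [
    {"order_id": "A1", "prior_order_id": None, "event": "NEW"},
    {"order_id": "A2", "prior_order_id": "A1", "event": "MODIFY"},
    {"order_id": "A3", "prior_order_id": "A2", "event": "MODIFY"},
    {"order_id": "A3", "prior_order_id": None, "event": "EXECUTE"},
    {"order_id": "B1", "prior_order_id": None, "event": "NEW"},
    {"order_id": "B1", "prior_order_id": None, "event": "CANCEL"},
]


def link_orders(events):
    """Group events into lifecycles keyed by the original (root) order ID."""
    # parent[x] is the order that x replaced, or x itself if it replaced nothing.
    parent = {}
    for e in events:
        parent.setdefault(e["order_id"], e["order_id"])
        if e["prior_order_id"] is not None:
            parent[e["order_id"]] = e["prior_order_id"]
            parent.setdefault(e["prior_order_id"], e["prior_order_id"])

    def root(order_id):
        # Walk the replacement chain back to the first order.
        # The 'seen' set guards against malformed, cyclic data.
        seen = set()
        while parent[order_id] != order_id and order_id not in seen:
            seen.add(order_id)
            order_id = parent[order_id]
        return order_id

    lifecycles = defaultdict(list)
    for e in events:
        lifecycles[root(e["order_id"])].append((e["order_id"], e["event"]))
    return dict(lifecycles)


if __name__ == "__main__":
    for root_id, chain in link_orders(EVENTS).items():
        print(root_id, chain)
```

Because the length of each chain is not known in advance, a fixed set of SQL self-joins fits this kind of traversal poorly, which may be part of why the article favors an iterative processing engine for the job. On a cluster, the same idea would be expressed over distributed data rather than an in-memory dictionary.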

Extend Value with a Data Hub

When used in conjunction with traditional storage and data warehousing, an enterprise data hub is a solution for both the banks building reports and the agencies, such as FINRA, that receive, store, and scrutinize them, due to Hadoop’s relatively low cost, scalability, and ease of integration. In fact, Hadoop users who have built enterprise data hubs in the broker-dealer and retail banking industries—including many of the biggest names on Wall Street—have reported completing natural-language-processing jobs that are required for SEC record-keeping in only two hours, compared to at least two weeks to run the same jobs on specialized systems with much larger hardware footprints.

Cloudera's Enterprise Data Hub

Cloudera Enterprise offers the first complete hub for Big Data built on Apache Hadoop, with deployments at three of the top five banks, as well as at the world’s leading insurers, credit card and payment companies, and self-regulatory organizations (SROs). With Cloudera, financial services firms can analyze custom scenarios affordably and at scale, on an ad hoc basis, prior to trade execution by extending the capabilities of existing tools within the data center, rather than requiring expensive, new, dedicated systems.


Copyright © 2017, WSTA®, All Rights Reserved.