Big Data Management and Security
Audit Concerns and Business Risks
Tami Frankenfield, Sr. Director, Analytics and Enterprise Data, Mercury Insurance

What is Big Data? Velocity + Volume + Variety = Value

The Big Data Journey
The journey typically matures through these stages:
- Reporting: report on standard business processes
- Business Intelligence: focus on what happened and, more importantly, why it happened
- Data Management: establish initial processes and standards
- Modeling and Predicting: leverage information for forecasting and predictive purposes
- "Fast Data": analyze streams of real-time data, identify significant events, and alert other systems
- "Big Data": leverage large volumes of multi-structured data for advanced data mining and predictive purposes
Big Data is the next step in the evolution of analytics to answer critical and often highly complex business questions. However, that journey seldom starts with technology and requires a broad approach to realize the desired value.
Drivers behind non-relational ("NoSQL") approaches include:
- popular websites with millions of concurrent users and thousands of queries per second
- a different database model: for example, swapping ACID (Atomicity, Consistency, Isolation, Durability) for BASE (Basically Available, Soft state, Eventually consistent)
- workloads that are difficult or impossible to express using SQL
CIO Study
Enterprises face the challenge and opportunity of storing and analyzing Big Data.
Big Data is supported and moved forward by a number of capabilities throughout the ecosystem. In many cases, vendors and resources play multiple roles and continue to evolve their technologies and talent to meet changing market demands.
A Hadoop-based solution is designed to leverage distributed storage and a parallel processing framework (MapReduce) to address the Big Data problem. Hadoop is an Apache Software Foundation open-source project.
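The MapReduce idea mentioned above — map over input splits, shuffle by key, then reduce per key — can be sketched in miniature in plain Python. This is an illustrative sketch of the programming model, not a Hadoop API; all function names here are made up for the example:

```python
# Miniature MapReduce word count: map -> shuffle -> reduce,
# mirroring what Hadoop does across many nodes in parallel.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the grouped values (here, sum the counts).
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big security", "data audit"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])    # 2
print(counts["audit"])  # 1
```

At cluster scale, Hadoop runs many mapper and reducer instances in parallel and moves the intermediate (key, value) pairs over the network; the logic per phase is exactly this simple.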
Core components of the Hadoop ecosystem:
- HDFS (storage layer)
- Flume, Sqoop (data integration)
- Oozie (workflow & scheduling)
- Hive (SQL)
- ZooKeeper (coordination)
- MapReduce (job scheduling and processing engine)
- Pig (data flow)
- HBase (distributed DB)
- UI framework / SDK
Hadoop sits alongside the traditional data warehouse (DW), traditional databases, advanced analytics tools, and MPP or in-memory solutions.
The need for Big Data storage and management has resulted in a wide array of solutions spanning from advanced relational databases to non-relational databases and file systems. The choice of the solution is primarily dictated by the use case and the underlying data type.
Non-relational databases have been developed to address the need for semi-structured and unstructured data storage and management. Relational databases are evolving to address the need for structured Big Data storage and management. Hadoop HDFS is a widely used distributed file system designed for Big Data processing.
Big Data security should address four main requirements: perimeter security and authentication, authorization and access, data protection, and audit and reporting. Centralized administration and coordinated enforcement of security policies should also be considered.
Big Data Security rests on four pillars: Perimeter Security & Authentication, Authorization & Access, Data Protection, and Audit & Reporting.
Perimeter Security & Authentication: required for guarding access to the system, its data and services. Authentication makes sure the user is who they claim to be. Two levels of authentication need to be in place: perimeter and intra-cluster. Enablers: Kerberos; LDAP/Active Directory and BU security tool integration.
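From a user's shell, the perimeter check on a Kerberized cluster looks like this; the principal and realm below are hypothetical:

```shell
# Obtain a Kerberos ticket-granting ticket before touching the cluster.
kinit analyst@EXAMPLE.COM
# Inspect the ticket cache to confirm authentication succeeded.
klist
# Cluster commands now carry the authenticated Kerberos identity.
hdfs dfs -ls /data
```

Without a valid ticket, a properly secured cluster rejects the request at the perimeter rather than trusting the caller's claimed identity.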
Authorization & Access: required to manage access and control over data, resources and services. Authorization can be enforced at varying levels of granularity and in compliance with existing enterprise security standards. Enablers: file & directory permissions and ACLs; role-based access controls.
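As a sketch of folder-level authorization, HDFS ACLs (available when `dfs.namenode.acls.enabled` is set to true) extend the basic POSIX permission bits with named-user and named-group entries; the group and path below are hypothetical:

```shell
# Grant a named group read/traverse access without changing owner or group.
hdfs dfs -setfacl -m group:analysts:r-x /data/claims
# Add a default ACL so files created later inherit the grant.
hdfs dfs -setfacl -m default:group:analysts:r-x /data/claims
# Review the effective ACL entries on the folder.
hdfs dfs -getfacl /data/claims
```

This keeps the base owner/group model intact while layering role-style grants on top, which is usually easier to audit than ad hoc group reshuffling.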
Data Protection: required to control unauthorized access to sensitive data, either at rest or in motion. Data protection should be considered at the field, file and network level, and appropriate methods should be adopted. Enablers: encryption at rest and in motion; data masking & tokenization.
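The difference between masking and tokenization can be sketched in a few lines of Python. This is an illustration only — the field, key and helper names are assumptions, not part of any product named in this deck:

```python
# Masking vs. tokenization of a sensitive field (illustrative sketch).
import hashlib
import hmac

# Assumed key for the example; in practice this lives in a KMS/vault.
SECRET_KEY = b"rotate-me-in-a-real-vault"

def mask_ssn(ssn: str) -> str:
    # Masking: irreversibly hide all but the last four digits for display.
    return "***-**-" + ssn[-4:]

def tokenize(value: str) -> str:
    # Tokenization (keyed-hash variant): a deterministic surrogate that
    # preserves joins and lookups without exposing the raw value.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

print(mask_ssn("123-45-6789"))                            # ***-**-6789
print(tokenize("123-45-6789") == tokenize("123-45-6789")) # True
```

Masking is for human-facing output; tokenization keeps analytic joins working because the same input always yields the same surrogate, while the secret key prevents reversal by anyone outside the tokenization service.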
Audit & Reporting: required to maintain and report on activity within the system, supporting security compliance and other requirements such as security forensics. Enablers: audit data; audit reporting.
Applicability across environments: the same controls apply in Production, Research and Development. Restricted access and encryption can be enforced at multiple granularities: database, table, column, column family (CF), folder and file.
A security/access control UI, integrated with LDAP/Active Directory, should coordinate security, access control and encryption/anonymization across the major components of the Big Data landscape.
Key considerations when selecting a security tool:
- levels of granularity in relation to data access and security
- security implementation protocol
- manageability / scalability
In Hadoop, Kerberos currently provides two aspects of security, including the mapping of UNIX-level Kerberos IDs to those of Hadoop. In a mature environment, Kerberos is linked/mapped to an organization's Active Directory or LDAP system; maintenance of that mapping is typically complicated. The Kerberos identity can be leveraged by the authorization layer, and users can be authorized to access data at the HDFS folder level. Large organizations should adopt a more scalable solution with finer-grained access control and encryption/anonymization of data, selecting a tool that is architecturally highly scalable.
Multi-tenancy in Hadoop primarily needs to address two things:
1. Resource sharing between multiple applications, making sure no single application impacts the cluster through heavy usage
2. Data security/auditing: users and applications of one tenant should not be able to access the HDFS data of other apps
POSIX-compliant add-on packages (General Parallel File System and Isilon, respectively)
distribution decreases as the load approaches steady state
With YARN, resources can be shared as a percentage of RAM available (sharing by % of CPU will be possible in the future). This is a more efficient way of sharing resources between different groups within an organization. Before YARN, resources in Hadoop were available only as a number of MapReduce slots, so although multi-tenancy was possible, it was not very efficient.
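With YARN's Capacity Scheduler, for example, the per-tenant split can be expressed in capacity-scheduler.xml; the queue names and percentages below are hypothetical:

```xml
<!-- capacity-scheduler.xml: two tenant queues sharing the cluster. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>claims,actuarial</value>
</property>
<property>
  <!-- Guaranteed share for the claims tenant. -->
  <name>yarn.scheduler.capacity.root.claims.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- Guaranteed share for the actuarial tenant. -->
  <name>yarn.scheduler.capacity.root.actuarial.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Let a tenant borrow idle capacity, but cap the borrowing. -->
  <name>yarn.scheduler.capacity.root.claims.maximum-capacity</name>
  <value>80</value>
</property>
```

Each tenant gets a guaranteed share, can elastically borrow idle capacity up to its maximum, and cannot starve the other queue — the resource-sharing half of the multi-tenancy requirement above.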
Distributing NameNode metadata removes that single point of failure efficiently and enables HDFS federation. A volume typically maps to a single business unit, application or user group; policies may be applied to a volume to enforce security or ensure availability, simplifying management and enabling multi-tenancy.
In a clustered computing environment, multi-tenancy spans three levels: data (file system), workload (jobs and tasks), and systems (node and host).
Such add-on packages are closing the feature gap between proprietary file systems and HDFS.
Kerberos authentication considerations:
- the authentication mechanism must cover internal services (Data Node, Task Tracker, Oozie, etc.) and end-user services (Hue, file browser, CLI, etc.)
- authentication is needed within the realm (the Hadoop cluster) and inter-realm as well
- services and clients rely on the Kerberos server to authenticate each other
- consistent mechanisms should be applied throughout the cluster
- sourcing users/groups from LDAP involves a corporate IT dependency
Recommended sequence:
1. Configure Kerberos for Hadoop
2. Provision the initial set of users/user groups in the POSIX layer
3. Integrate LDAP/single sign-on (Kerberos, Hue, Ambari)
4. Prepare the final list of users/user groups and provision them on the POSIX layer, matching LDAP principals
5. Optionally configure a Knox gateway for perimeter authentication for external systems (with LDAP or SSO)
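The "Configure Kerberos for Hadoop" step amounts to a handful of core-site.xml properties; the realm in the mapping rule below is illustrative:

```xml
<!-- core-site.xml: switch the cluster from simple to Kerberos authentication. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <!-- Enforce service-level authorization checks as well. -->
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <!-- Map principals like user@EXAMPLE.COM to local POSIX user names;
       the realm shown is an example, not a recommended value. -->
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@EXAMPLE\.COM)s/@.*//
    DEFAULT
  </value>
</property>
```

The auth_to_local rules are exactly the Kerberos-to-POSIX mapping discussed earlier, which is why the POSIX-layer user provisioning must match the LDAP principals.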
restrictions for users of HBase/Hive
Encryption in motion options: HTTPS (web clients, REST), RPC encryption (API, Java frameworks), and the data transfer protocol (Name Node-Data Node).
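These options map to well-known Hadoop properties; a minimal "encrypt in motion" configuration sketch:

```xml
<!-- core-site.xml: protect RPC traffic
     (privacy = authentication + integrity + encryption). -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
<!-- hdfs-site.xml: encrypt the block data transfer protocol
     between clients and DataNodes. -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<!-- For the web/REST tier, dfs.http.policy=HTTPS_ONLY plus SSL
     keystore settings in ssl-server.xml cover HTTPS. -->
```

Each channel is configured independently, so an audit should verify all three: RPC, block transfer, and HTTP.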
At the moment, data exported outside the lake remains within the Zurich network (considered secure).