
Hitachi NEXT 2018: Data Discovery and Profiling with Hitachi Content Intelligence



  1. Hitachi NEXT 2018
     Data Discovery and Profiling with Hitachi Content Intelligence

     Contents
     • Guided Demonstration
     • Create Content Class
     • Create a Data Connection
     • Modify Existing Pipeline
     • Create a Data Index
     • Create a Workflow
     • Review Search Results

  2. Guided Demonstration

     Introduction
     In this guided demonstration, you will use Hitachi Content Intelligence to discover data that contains sensitive information such as US Social Security Numbers (SSNs), Canadian Social Insurance Numbers, and credit card numbers. You will then apply custom processing actions that automatically redact the sensitive information, so that target end users can search the data securely.

     Objectives
     In this guided demonstration, you will:
     • Create a Content Class with a pattern for SSNs
     • Create a data connection
     • Modify an existing pipeline
     • Create a data index
     • Run a workflow
     • Review and search results

     Create Content Class
     A. Before you begin: Review the environment:
        1. On the desktop, open the file HCI Lab Diagram.
        2. Verify that you are logged into the Windows Console at IP address 192.168.102.103.
        3. Notice that there are 3 other virtual machines in this environment:
           • Hitachi Content Platform
           • Hitachi Content Intelligence Master Node
           • Hitachi Content Intelligence Worker Node
        4. Also, note the login details for each machine.
        Note: All the software in this lab is preinstalled and configured.
     B. To create the content class:
        1. Access the Hitachi Content Intelligence Administration Console. To access:
           a) Open Firefox and then open Bookmarks.
           b) Click HCI Administration Console in the bookmarks.
           c) Click Add Exception to advance past the insecure connection warning.
           d) Click Confirm Security Exception.
           e) Log in with username admin and password 1234.
        2. Once logged in, you will see 3 panels: Workflows, Monitoring, and System Configuration. You can hover over each panel to see more information.

     HITACHI is a trademark or registered trademark of Hitachi, Ltd.

  3. Create Content Class (continued)
        3. Click the Workflows panel.
        4. On the left side, click Content Classes and then click Create Content Class.
        5. Name the content class [YourName]SSNs.
        6. Provide a brief description such as “A content class for identifying Social Security Numbers.” Then click Create.
        7. A message will briefly appear stating: “Successfully created Content Class.”
        8. To specify the pattern that the content class matches:
           a) Click Edit properties.
           b) Click Add Content Property.
           c) In the type dropdown, select PATTERN.
           d) Type the name as ssn.
           e) Carefully enter the Expression: [0-9]{3}-[0-9]{2}-[0-9]{4}(\s|\n)
              • [0-9] Matches any single digit from 0 to 9.
              • {n} Matches exactly n occurrences of the preceding expression. n must be a positive integer.
              • \s Matches a single white space character, including space, tab, form feed, line feed.
              • \n Matches a line feed.
              • x|y Matches 'x', or 'y' (if there is no match for 'x').
              • Note: To learn possible combinations of Regular Expressions, visit: https://developer.mozilla.org/en/docs/Web/JavaScript/Guide/Regular_Expressions
           f) Click Save. Note: You must save before testing the content class.
           g) Click Test Content Class.
           h) Select Text.
           i) Type “My Social Security Number is 123-45-6789” and then press Enter to bring up a second blank line. The pattern requires a white space character after the number, and the line break supplies it.
           j) Click Test Content Class.
              • You will briefly see “Successfully extracted values for all matching expressions.”
           k) Extracted Value should show “123-45-6789”. If this fails, there is no need to worry: a prebuilt, correctly configured content class called “ssncontentclass” is available.
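     The behavior of the expression from step 8e can be checked outside the product. The following is a minimal sketch using Python's re module, which is an assumption: the lab does not state which regular expression engine Content Intelligence uses, and the helper name find_ssns is hypothetical. It illustrates why the test text in step 8i needs a line break after the number.

```python
import re

# Expression copied from step 8e. The trailing group requires one white space
# character after the number (in Python, \s already covers \n).
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}(\s|\n)")


def find_ssns(text):
    """Hypothetical helper: return each match without its trailing white space."""
    return [match.group(0).strip() for match in SSN_PATTERN.finditer(text)]


# With a line break after the number, as in step 8i, the pattern matches.
print(find_ssns("My Social Security Number is 123-45-6789\n"))  # ['123-45-6789']

# Without any trailing white space, the same text produces no match.
print(find_ssns("My Social Security Number is 123-45-6789"))  # []
```

     In this sketch the trailing white space is part of the match, so it is stripped before the value is reported. Whether the product trims the extracted value in the same way is not stated in the lab.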

  4. Create a Data Connection
     It is easy to create data connections to existing HCP nodes, S3 buckets, local file systems, Hadoop Distributed File Systems, databases, and many more. Content Intelligence ships with dozens of preconfigured connection types, or you can code custom connections for your specific needs.

     To create a data connection:
     1. Click Data Connections on the left side.
     2. Click Add Data Connection.
        Note: You may need to navigate back to the Workflow Designer page via steps 1, 2 and 3 of Section B: To create content class, in Lab 1.
     3. In the dropdown, select HCP.
        a) Enter the name for the connection as [YourName]Virtual HCP.
        b) (Optional) Type a brief description, for example, the IP address 192.168.102.180.
        c) Enter the HCP System Name as hcp.hcidemo.com.
        d) Type the HCP Tenant Name as hci.
        e) Type the HCP Namespace Name as ns1.
        f) Enter the HCP Root Directory as /SSN-Data.
        g) Verify that Use SSL is set to Yes.
        h) Verify that Use Proxy Server is set to No.
        i) Type the User Name as hci.
        j) Type the Password as hds123.
        k) Click Test. A message will briefly appear stating “Successfully connect to data source.”
     4. Click Create.
     If this fails, there is no need to worry. An existing connection, ssnfiles, has been created and correctly configured.

  5. Modify Existing Pipeline
     You can modify an existing pipeline by performing the following steps.
     A. Before you begin: Know that,
        • MIME Type Detection detects the type of documents incoming from the data source.
        • Zip Expansion expands zip archives into contained documents.
        • Text and Metadata Extraction discovers standard metadata fields.
        • Snippet Extraction extracts a snippet of text from a text stream.
        • Content Class Extraction uses predefined content classes to extract fields from documents.
        • If ssn field exists adds metadata to the document identifying PII.
     B. To modify:
        1. Click Processing Pipelines on the left side.
        2. Then scroll down and click tagwithssn.
           Note: You may need to navigate back to the Workflow Designer page via steps 1, 2 and 3 of Section B: To create content class, in Lab 1.
        3. (Optional) If you successfully created a content class:
           a) Click the pencil to edit the Content Class Extraction stage.
           b) Select your content class ([YourName]SSNs) from the dropdown box, and then click the + (plus) sign.
           c) Click the garbage can and then confirm the removal of ssncontentclass.
           d) Click Update at the bottom of the page.
        4. To add a Document Security stage,
           a) Within the tagwithssn pipeline, click Add Stages.
           b) In the search box on the left, search for Document Security.
           c) Click the + (plus) sign to add a stage to the pipeline.
           d) Configure Document Security Settings:
              • Check Enable ACLs to apply access control list (ACL) fields to documents. This will prohibit unauthorized users from accessing documents with SSNs. System Administrators can decide who has access to which documents from the search console.
        5. To add a Replace stage (redaction),
           a. Within the tagwithssn pipeline, click Add Stages.
           b. In the search box on the left, search for Replace.

  6. Modify Existing Pipeline (continued)
           c. Configure Replace:
              • Under Fields to Process, click Add item and enter the SSN.
              • Under Values to Replace, click Add item and enter the following:
                1. Source Expression is [0-9]{3}-[0-9]{2}-[0-9]{4}(\s|\n)
                2. Replacement is XXX-XX-XXXX
        6. Click Update.
     You have now modified an existing pipeline.

     Create a Data Index
     1. Click Index Collections on the left side.
     2. Next, click Create Index.
        Note: You may need to navigate back to the Workflow Designer page via steps 1, 2 and 3 of Section B: To create content class, in Lab 1.
     3. Verify that HCI Index is selected.
     4. Enter the name as [YourName]SSNRedacted.
     5. Optionally, enter a brief description such as “An index for redacted SSNs.”
     6. Verify that:
        a. Initial visibility is set to Public.
        b. Initial Schema is set to Schemaless.
        c. Initial shard count is set to 3.
     7. Click Create.
     You have now created the data index.
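     The Source Expression and Replacement configured for the Replace stage in this section amount to a regular expression substitution. The sketch below reproduces that idea with Python's re.sub, which is an assumption: the lab does not document how the Replace stage is implemented, and the function names redact and redact_keep_whitespace are hypothetical. It also shows one property of the expression worth knowing: because the trailing white space is part of the match, a plain substitution in Python removes it together with the number.

```python
import re

# Values copied from the Replace stage configuration in this section.
SOURCE_EXPRESSION = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}(\s|\n)")
REPLACEMENT = "XXX-XX-XXXX"

# Hypothetical variant (not from the lab): a lookahead checks for the trailing
# white space without consuming it, so the white space survives the substitution.
LOOKAHEAD_EXPRESSION = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}(?=\s)")


def redact(text):
    """Replace every match of the lab's expression with the replacement text."""
    return SOURCE_EXPRESSION.sub(REPLACEMENT, text)


def redact_keep_whitespace(text):
    """Same redaction, but the white space after the number is preserved."""
    return LOOKAHEAD_EXPRESSION.sub(REPLACEMENT, text)


sample = "SSN 123-45-6789 on file\n"
print(repr(redact(sample)))                  # 'SSN XXX-XX-XXXXon file\n'
print(repr(redact_keep_whitespace(sample)))  # 'SSN XXX-XX-XXXX on file\n'
```

     Whether the Replace stage in Content Intelligence handles the matched white space the same way as Python does is not confirmed by the lab, so treat the first output as a behavior to check in the search results and not as a known defect.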

  7. Create a Workflow
     1. Click Workflow Designer on the left side.
     2. Next, click Create Workflow.
        Note: You may need to navigate back to the Workflow Designer page via steps 1, 2 and 3 of Section B: To create content class, in Lab 1.
     3. Name your workflow [YourName]SSNRedacted, and optionally enter a brief description.
     4. Click Next.
     5. Click Select Data Connection and navigate to the data connection you created in the previous lab: [YourName]Virtual HCP or ssnfiles.
        Note: If you configured the data connection correctly, they will be identical. If you incorrectly configured your data connection, you can use “ssnfiles”.
     6. Click Add to Workflow, and then click Close.
     7. Click Next to advance to the Add Pipeline screen.
     8. Next, click Select Pipeline and navigate to the pipeline you modified, tagwithssn, or to RedactionSSN.
        Note: If you correctly configured the pipeline, you can use the one you modified (tagwithssn). Otherwise, you can use the prebuilt RedactionSSN pipeline.
     9. Confirm General Settings for Workflow-Agent.
     10. Click Add to Workflow, and then click Close.
     11. Click Next to advance to the Add Output screen.
     12. Click Select Index and navigate to the index you created, [YourName]SSNRedacted, or to SSNredacted.
         Note: If you correctly configured the index, you can use the one you created. Otherwise, you can use the prebuilt “SSNredacted” index.
     13. Click Add to Workflow, and then click Next.
     14. Click Create.
     15. Click Yes to run the workflow now. The workflow should complete in less than 3 minutes.
