data management for business intelligence
play

DATA MANAGEMENT FOR BUSINESS INTELLIGENCE Data Access: Files - PowerPoint PPT Presentation

DATA MANAGEMENT FOR BUSINESS INTELLIGENCE Data Access: Files Salvatore Ruggieri Computer Science Department, University of Pisa Master in Big Data Analytics and Social Mining BI Architecture 2 Business Intelligence Two issues 3 Where are


  1. DATA MANAGEMENT FOR BUSINESS INTELLIGENCE Data Access: Files Salvatore Ruggieri Computer Science Department, University of Pisa Master in Big Data Analytics and Social Mining

  2. BI Architecture 2 Business Intelligence

  3. Two issues 3 ¨ Where are my files? ¤ Local file systems ¤ Distributed file systems ¤ Network protocols ¨ Which format is file data in? ¤ Text q CSV, JSON Business Intelligence Lab

  4. Local file system 4 Path of a resource � n Windows: n C:\Program Files\Office\sample.doc n Linux: n /usr/home/r/ruggieri/sample.txt Business Intelligence Lab

  5. Local file system 5 A logical abstraction of persistent mass memory ¤ hierarchical view (tree of directories and files) ¤ types of resources (file, directory, pipe, link, special) ¤ resource attributes (owner, rights, hard links) ¤ services (indexing, journaling) Sample file system: ¤ Windows n NTFS, FAT32 ¤ Linux n EXT2, EXT3, JFS, XFS, REISERFS, FAT32 Business Intelligence Lab

  6. Distributed file system 6 PC-smithj PC-smithj PC-you PC-you Business Intelligence Lab

  7. Distributed file system 7 Acts as a client for a remote file access protocol ¤ logical abstraction of remote persistent mass memory Sample file system: ¤ Samba (SMB) or Common Internet File System (CIFS) ¤ Network File System (NFS) ¤ Hadoop Distributed File System (HDFS) Mount/unmount Business Intelligence Lab

  8. Network protocols 8 ¨ Files accessed through explicit request/reply ¨ A local copy has to be made before accessing data ¨ Resource naming: ¤ Uniform Resource Locator (URL) n scheme://user:password@host:port/path n http://bob:bye@www.host.it:80/home/idx.html n scheme = protocol name (http, https, ftp, file, jdbc, …) n port = TCP/IP port number Business Intelligence Lab

  9. HTTP Protocol 9 ¨ HyperText Transfer Protocol n URL: http://user:pwd@www.di.unipi.it n State-less connections n Crypted variant: Secure HTTP (HTTPs) ¨ Windows clients ¤ Any browser ¤ > wget n GNU http://www.gnu.org/software/wget/ n W3C http://www.w3.org/Library ¨ Linux clients ¤ Any browser ¤ > wget Business Intelligence Lab

  10. SCP Protocol 10 ¨ Secure Copy n > scp data.zip user@mylinux.unip.it:datacopy.zip n File copy from/to a remote account n File paths must be known in advance ¨ Client ¤ command line: n > scp/pscp > scp2 ¤ Windows GUI n WinSCP http://winscp.sourceforge.net n SSH Secure Shell ¤ Linux GUI n SCP: default Business Intelligence Lab

  11. Two issues 11 ¨ Where are my files? ¤ Local file systems ¤ Distributed file systems ¤ Network protocols ¨ Which format is file data in? ¤ Text q CSV, ARFF, JSON Business Intelligence Lab

  12. What is a file? 12 ¨ File = sequence of bytes 67 67 73 73 65 65 79 79 10 10 83 83 10 10 … … Business Intelligence Lab

  13. How bytes are mapped to chars? 13 ¨ Character set = alphabet of characters ¨ Coding bytes by means of a character set ¤ ASCII, EBCDIC (1 byte per char) ¤ UNICODE (1/2/4 bytes per char) Business Intelligence Lab

  14. 14 American Standard Code for Information Interchange Business Intelligence Lab

  15. Text file = file+character set 15 ¨ Text file = sequence di characters C C I I A A O O \n S S \n … … Business Intelligence Lab

  16. Viewing text files 16 ¨ By a text editor ¤ Emacs, Nodepad ++ ,TextPad, GEdit, Vi, etc. ¨ “Carriage return” character ¤ Start a new line ¤ Coding n Unix: 1 char ASCII(0A) (‘\n’ in Java) n Windows: 2 chars ASCII(0D 0A) (“\r\n” in Java) n Mac: 1 char ASCII(0D) (‘\r’ in Java) ¤ Conversions n > dos2unix n > unix2dos Business Intelligence Lab

  17. Text file = file+character set 17 ¨ Text file = sequence di lines C C I I A A O O S S … … Business Intelligence Lab

  18. Tabular data format 18 Column Mario Bianchi 23 Student Row Luigi Rossi 30 Workman Anna Verdi 50 Teacher Rosa Neri 20 Student Business Intelligence Lab

  19. Representing tabular data in text files 19 ¨ Comma Separated Values ( CSV ) ¤ A row per line ¤ Column values in a line separated by a special character ¤ Delimiters: comma, tab, space Mario,Bianchi,23,Student Luigi,Rossi,30,Workman Anna,Verdi,50,Teacher Rosa,Neri,20,Student Business Intelligence Lab

  20. Representing tabular data in text files 20 ¨ Fixed Length Values ( FLV ) ¤ A row per line ¤ Column values occupy a fixed number of chars n Allow for random access to elements n Higher disk space requirements Mario Bianchi 23 Student Luigi Rossi 30 Workman Anna Verdi 50 Teacher Rosa Neri 20 Student Business Intelligence Lab

  21. Quoting 21 ¨ What happens in CSV if a delimiter is part of a value? ¤ Format error ¨ Solution: quoting � ¤ Special delimiters for start and end of a value (ex. “ … “) Mario Bianchi 23 Student Luigi Rossi 30 Workman “Mario Bianchi” 23 Student Anna Verdi 50 Teacher “Luigi Rossi” 30 Workman Rosa Neri 20 Student “Anna Verdi” 50 Teacher “Rosa Neri” 20 Student Business Intelligence Lab

  22. Missing values 22 ¨ How to represent missing values in CSV or FLV? ¤ A reserved string: “ ? ”, “null”, “” “Mario Bianchi” 23 Student “Luigi Rossi” 30 ? “Anna Verdi” 50 Teacher “Rosa Neri” ? Student Business Intelligence Lab

  23. CSV in Python 23 ¨ https://docs.python.org/3/library/csv.html Business Intelligence Lab

  24. Meta-data 24 ¨ Describe properties of data ¤ Table name, column name, column type, … name surname age occupation string string int string Mario Bianchi 23 Student Luigi Rossi 30 Workman Anna Verdi 50 Teacher Rosa Neri 20 Student Business Intelligence Lab

  25. How to represent meta-data in text files? 25 ¨ One or two rows: names and types name,surname,age,occupation name surname age occupation string,string,int,string string string int string Business Intelligence Lab

  26. Meta-data and data in text files 26 ¨ In the same file ¤ Meta-data first (header), then data name surname age occupation name,surname,age,occupation string string int string string,string,int,string Mario Bianchi 23 Student Mario,Bianchi,23,Studente Luigi,Rossi,30,Operaio Luigi Rossi 30 Workman Anna,Verdi,50,Insegnante Rosa,Neri,20,Studente Anna Verdi 50 Insegnante Rosa Neri 20 Studente Business Intelligence Lab

  27. Two issues 27 ¨ Where are my files? ¤ Local file systems ¤ Distributed file systems ¤ Network protocols ¨ Which format is file data in? ¤ Text q CSV, JSON Business Intelligence Lab

  28. Data interchange issue 28 ¨ Problem: data interchange between applications ¤ Proprietary data format do not allow for easy interchange n CSV with different delimiters, or column orders n Similar limitations of FLV, ARFF, binary data, etc. ¨ Solution: ¤ definition of an interchange format… ¤ … marking data elements with their meaning … ¤ … so that any other party can easily interpret them. Business Intelligence Lab

  29. JSON http://www.json.org/ 29 ¨ Objects: ¤ comma-separated list of pairs in the form n name : value { "name":"John", "surname":"Doe", "age":25 } ¨ Name is a string ¨ Value data types: ¤ strings ("John") ¤ integer, real (25)

  30. JSON 30 ¨ Value data types: ¤ Arrays: comma-separated list of values { "name":"John", "surname":"Doe", "age":25, "courses": ["BD", "DM", "AI"] }

  31. JSON 31 ¨ Value data types: ¤ Objects { "name":"John", "surname":"Doe", "age":25, "courses": ["BD", "DM", "AI"], "address": {"street":"5th Av.", "city":"NY"}, "friends": [ {"name":"Ed", "surname":"May"}, {"name":"Al", "surname":"Black"} ] }

  32. How to map CSV in JSON? 32 Business Intelligence Lab

  33. JSON in Python 33 ¨ https://docs.python.org/3.5/library/json.html Business Intelligence Lab

  34. DATA MANAGEMENT FOR BUSINESS INTELLIGENCE Data Access: Relational Data Bases Salvatore Ruggieri Computer Science Department, University of Pisa Master in Big Data Analytics and Social Mining

  35. BI Architecture 35 Business Intelligence

  36. Connecting to a RDBMS 36 ¨ Connection protocol Client Server ¤ locate the RDBMS server ConnectionString ¤ open a connection OK ¤ user autentication ¨ Querying SQL query ¤ query SQL n SELECT n UPDATE/INSERT/CREATE ¤ stored procedures ¤ prepared query SQL Result set ¨ Scan Result set ¤ scan row by row ¤ access result meta-data Business Intelligence Lab

  37. Connection Standards 37 ¨ ODBC - Open DataBase Connectivity ¤ Windows: odbc Linux: unixodbc, iodbc ¤ Tabular Data ¨ JDBC ¤ Java APIs for tabular data ¨ OLE DB (Microsoft) ¤ Tabular data, XML, multi-dimensional data ¨ ADO (Microsoft) ¤ Object-oriented API on top of OLE DB ¨ ADO.NET ¤ Evolution of ADO in the .NET framework

  38. ODBC Open DataBase Connectivity 38 Business Intelligence Lab

  39. ODBC Demo 39 ¨ Registering an ODBC data source ¨ Data access ¤ accessing Access data from Excel ¨ Linked tables ¤ accessing Excel data from Access Business Intelligence Lab

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend