Comma Police: The Design and Implementation of a CSV Library


slide-1
SLIDE 1

Comma Police: The Design and Implementation of a CSV Library

George Wilson

Data61/CSIRO george.wilson@data61.csiro.au

23rd May 2018

slide-2
SLIDE 2

JSON YAML XML CSV PSV

slide-3
SLIDE 3

sv

{CSV, PSV, . . . } library for Haskell

slide-4
SLIDE 4

CSV

  • Very popular format for data science
  • Described, but not standardised, by RFC 4180

example.csv

"id","species","count"
1,"kangaroo",30
2,"kookaburra",460
3,"platypus",5

slide-5
SLIDE 5

Text --Parse--> CSV data structure --Decode--> User-defined data types

slide-6
SLIDE 6

Text --Parse--> CSV data structure --Decode--> User-defined data types
Text <--Print-- CSV data structure <--Encode-- User-defined data types

slide-7
SLIDE 7

Text --Parse/Decode--> User-defined data types
Text <--Print/Encode-- User-defined data types

slide-8
SLIDE 8

parse :: ByteString -> Either ByteString (Sv ByteString)
decode :: Decode s a -> Sv s -> DecodeValidation a
encodeSv :: Encode a -> [a] -> Sv ByteString
printSv :: Sv ByteString -> ByteString

slide-9
SLIDE 9

Direct

  • less memory allocated
  • faster
  • streaming made easier

Intermediate structure

  • potential for better errors (often)
  • make decisions based on the structure
  • manipulate the tree to alter documents
slide-10
SLIDE 10

Text --Parse--> CSV data structure
CSV data structure --Manipulate--> CSV data structure
Text <--Print-- CSV data structure

slide-11
SLIDE 11

needs-fixing.csv

'name',"age"
"Frank",30
George, '25'
"Harry","32"

slide-12
SLIDE 12

fixQuotes :: Sv s -> Sv s
fixQuotes =
  over headerFields fixQuote . over recordFields fixQuote
  where
    headerFields = traverseHeader . fields
    recordFields = traverseRecords . fields

fixQuote :: Field a -> Field a
fixQuote f = case f of
  Unquoted a -> Quoted DoubleQuote (noEscape a)
  Quoted _ v -> Quoted DoubleQuote v
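The quote-fixing idea can be tried without sv installed. The following is a minimal sketch using only base: `Quote`, `Field`, and `render` are simplified stand-ins invented for illustration, not sv's real definitions, and rows are modelled as plain lists rather than sv's `Sv` structure with lenses.

```haskell
-- Simplified stand-ins for sv's types. The constructor names mirror the
-- slide, but these definitions are assumptions made for illustration.
data Quote = SingleQuote | DoubleQuote
  deriving (Eq, Show)

data Field
  = Unquoted String
  | Quoted Quote String
  deriving (Eq, Show)

-- Normalise one field so that it is always double-quoted.
fixQuote :: Field -> Field
fixQuote f = case f of
  Unquoted a -> Quoted DoubleQuote a
  Quoted _ v -> Quoted DoubleQuote v

-- Apply the fix to every field of every row.
fixQuotes :: [[Field]] -> [[Field]]
fixQuotes = map (map fixQuote)

-- Render a field back to text. Escaping of embedded quotes is omitted
-- to keep the sketch short.
render :: Field -> String
render (Unquoted a)           = a
render (Quoted SingleQuote v) = "'" ++ v ++ "'"
render (Quoted DoubleQuote v) = "\"" ++ v ++ "\""
```

Because the fix is an ordinary function over a data structure, the same shape works for other linters: swap `fixQuote` for any other `Field -> Field` rewrite.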

slide-13
SLIDE 13

needs-fixing.csv

'name',"age"
"Frank",30
George, '25'
"Harry","32"

slide-14
SLIDE 14

fixed.csv

"name","age"
"Frank","30"
"George", "25"
"Harry","32"

slide-15
SLIDE 15

Use sv to define custom linters and sanitisers

Text --Parse--> CSV data structure
CSV data structure --Manipulate--> CSV data structure
Text <--Print-- CSV data structure

slide-16
SLIDE 16

CSV data structure --Decode--> User-defined data types

slide-17
SLIDE 17

data Decode s a = ...

slide-18
SLIDE 18

data Decode s a = ...

raw :: Decode a a
ignore :: Decode a ()
int :: Decode ByteString Int
ascii :: Decode ByteString String
text :: Decode ByteString Text

slide-19
SLIDE 19

data Decode s a = ...

raw :: Decode a a
ignore :: Decode a ()
int :: Decode ByteString Int
ascii :: Decode ByteString String
text :: Decode ByteString Text

instance Functor (Decode s)
instance Applicative (Decode s)
instance Alt (Decode s) where
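To make these instances concrete, here is a toy row decoder written against base only. It is a sketch, not sv's implementation: the `Decode` newtype, `decodeRow`, and the locally defined `<!>` operator are assumptions for illustration (in sv, `<!>` comes from the `Alt` class). Applicative sequencing consumes fields left to right, and `<!>` retries the same row with a second decoder when the first fails.

```haskell
import Text.Read (readMaybe)

-- A toy row decoder: consume fields from the front of a row, or fail.
newtype Decode a = Decode { runDecode :: [String] -> Maybe (a, [String]) }

instance Functor Decode where
  fmap f (Decode g) = Decode $ \fs -> case g fs of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Decode where
  pure a = Decode $ \fs -> Just (a, fs)
  Decode f <*> Decode g = Decode $ \fs -> case f fs of
    Nothing        -> Nothing
    Just (h, rest) -> case g rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (h a, rest')

-- Choice: try the left decoder; on failure, retry the row with the right.
infixl 3 <!>
(<!>) :: Decode a -> Decode a -> Decode a
Decode f <!> Decode g = Decode $ \fs -> maybe (g fs) Just (f fs)

text :: Decode String
text = Decode $ \fs -> case fs of
  (x : rest) -> Just (x, rest)
  []         -> Nothing

int :: Decode Int
int = Decode $ \fs -> case fs of
  (x : rest) -> (\n -> (n, rest)) <$> readMaybe x
  []         -> Nothing

data Person = OneName String Int | TwoNames String String Int
  deriving (Eq, Show)

personDecoder :: Decode Person
personDecoder =
  OneName <$> text <*> int <!> TwoNames <$> text <*> text <*> int

-- Decode a whole row, requiring that every field is consumed.
decodeRow :: Decode a -> [String] -> Maybe a
decodeRow d fs = case runDecode d fs of
  Just (a, []) -> Just a
  _            -> Nothing
```

Since `<!>` binds more loosely than `<$>` and `<*>`, the two alternatives in `personDecoder` need no parentheses.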

slide-20
SLIDE 20

person.csv

"name","age"
"Frank","30"
"George", "25"
"Harry","32"

slide-21
SLIDE 21

person.csv

"name","age"
"Frank","30"
"George", "25"
"Harry","32"

data Person = Person Text Int

slide-22
SLIDE 22

person.csv

"name","age"
"Frank","30"
"George", "25"
"Harry","32"

data Person = Person Text Int

personD :: Decode ByteString Person
personD = Person <$> text <*> int

slide-23
SLIDE 23

ragged.csv

"George","Wilson",25
"Frank",33
"Tim",18
"John","Smith",45

slide-24
SLIDE 24

ragged.csv

"George","Wilson",25
"Frank",33
"Tim",18
"John","Smith",45

data Person = OneName Text Int | TwoNames Text Text Int

slide-25
SLIDE 25

ragged.csv

"George","Wilson",25
"Frank",33
"Tim",18
"John","Smith",45

data Person = OneName Text Int | TwoNames Text Text Int

personDecoder :: Decode ByteString Person
personDecoder =
  OneName <$> text <*> int <!> TwoNames <$> text <*> text <*> int

slide-26
SLIDE 26

class Profunctor p where
  dimap :: (a -> b) -> (c -> d) -> p b c -> p a d

instance Profunctor Decode

slide-27
SLIDE 27

class Profunctor p where
  dimap :: (a -> b) -> (c -> d) -> p b c -> p a d

instance Profunctor Decode

-- make a Decode work on a different string type

decoder :: Decode ByteString A
input :: Text

slide-28
SLIDE 28

class Profunctor p where
  dimap :: (a -> b) -> (c -> d) -> p b c -> p a d

instance Profunctor Decode

-- make a Decode work on a different string type

decoder :: Decode ByteString A
input :: Text
encodeUtf8 :: Text -> ByteString

slide-29
SLIDE 29

class Profunctor p where
  dimap :: (a -> b) -> (c -> d) -> p b c -> p a d

instance Profunctor Decode

-- make a Decode work on a different string type

decoder :: Decode ByteString A
input :: Text
encodeUtf8 :: Text -> ByteString

dimap encodeUtf8 id decoder :: Decode Text A
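The same adaptation can be sketched with base alone. Everything below is a stand-in chosen for illustration: `Bytes` plays the role of `ByteString`, `encodeAscii` the role of `encodeUtf8`, and `dimapDecode` is a plain function rather than a method of the `Profunctor` class from the profunctors package. The toy `Decode` handles a single field, which is enough to show that the first argument of `dimap` pre-processes the input while the second post-processes the result.

```haskell
import Data.Char (chr, isDigit, ord)
import Data.Word (Word8)

-- A toy single-field decoder, parameterised over its input string type s.
newtype Decode s a = Decode { runDecode :: s -> Maybe a }

-- Pre-process the input with f and post-process the output with g.
dimapDecode :: (t -> s) -> (a -> b) -> Decode s a -> Decode t b
dimapDecode f g (Decode h) = Decode (fmap g . h . f)

-- Stand-in for ByteString.
newtype Bytes = Bytes [Word8]

-- Stand-in for encodeUtf8; only correct for ASCII input.
encodeAscii :: String -> Bytes
encodeAscii = Bytes . map (fromIntegral . ord)

-- A decoder that only understands Bytes.
digits :: Decode Bytes Int
digits = Decode $ \(Bytes ws) ->
  let cs = map (chr . fromIntegral) ws
  in if not (null cs) && all isDigit cs
       then Just (read cs)
       else Nothing

-- The same decoder, adapted to accept String input instead.
digitsFromString :: Decode String Int
digitsFromString = dimapDecode encodeAscii id digits
```

The input function runs "backwards" relative to the type change: to get a decoder for `String`, we supply a conversion from `String` to the type the existing decoder already accepts.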

slide-30
SLIDE 30

Why not a type class?

  • A decoder is something I want to manipulate
  • There are often many different ways to decode the same type
slide-31
SLIDE 31

ignoreFailure :: Decode s a -> Decode s (Maybe a)
ignoreFailure a = Just <$> a <!> Nothing <* ignore

slide-32
SLIDE 32

ignoreFailure :: Decode s a -> Decode s (Maybe a)
ignoreFailure a = Just <$> a <!> Nothing <* ignore

ints.csv

3
4
8.8
1
null

slide-33
SLIDE 33

ignoreFailure :: Decode s a -> Decode s (Maybe a)
ignoreFailure a = Just <$> a <!> Nothing <* ignore

ints.csv

3
4
8.8
1
null

parseDecodeFromFile (ignoreFailure int) "ints.csv"
-- [Just 3, Just 4, Nothing, Just 1, Nothing]
slide-34
SLIDE 34

-- succeeds with Nothing when
-- the underlying decoder fails
ignoreFailure :: Decode s a -> Decode s (Maybe a)

-- succeeds with Nothing only when
-- the field is completely empty
orEmpty :: Decode s a -> Decode s (Maybe a)

-- succeeds with Nothing only when
-- there is no field at all
optionalField :: Decode s a -> Decode s (Maybe a)
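The difference between the three combinators is easiest to see side by side. The sketch below models them over a toy row decoder built from base only; the `Decode` newtype and the helper `found` are assumptions made for illustration, and the behaviour encoded here follows the comments on the slide rather than sv's source.

```haskell
import Text.Read (readMaybe)

-- A toy row decoder: consume fields from the front of a row, or fail.
newtype Decode a = Decode { runDecode :: [String] -> Maybe (a, [String]) }

int :: Decode Int
int = Decode $ \fs -> case fs of
  (x : rest) -> (\n -> (n, rest)) <$> readMaybe x
  []         -> Nothing

-- Helper (hypothetical): wrap a successful result in Just.
found :: Maybe (a, [String]) -> Maybe (Maybe a, [String])
found = fmap (\(a, rest) -> (Just a, rest))

-- Nothing whenever the underlying decoder fails; the field is skipped.
ignoreFailure :: Decode a -> Decode (Maybe a)
ignoreFailure (Decode f) = Decode $ \fs -> case f fs of
  Nothing -> Just (Nothing, drop 1 fs)
  ok      -> found ok

-- Nothing only when the field is present but empty.
-- Any other failure still fails.
orEmpty :: Decode a -> Decode (Maybe a)
orEmpty (Decode f) = Decode $ \fs -> case fs of
  ("" : rest) -> Just (Nothing, rest)
  _           -> found (f fs)

-- Nothing only when the row has run out of fields.
-- Any other failure still fails.
optionalField :: Decode a -> Decode (Maybe a)
optionalField (Decode f) = Decode $ \fs -> case fs of
  [] -> Just (Nothing, [])
  _  -> found (f fs)
```

Choosing the strictest combinator that fits the data keeps genuine errors visible: `ignoreFailure int` would silently accept `8.8`, while `orEmpty int` would report it.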
slide-35
SLIDE 35

conferences.csv

"name","date"
"Compose Conf",20170828
"Compose Conf",20180827
"Lambda Jam",20170508
"Lambda Jam",20180521

slide-36
SLIDE 36

import Data.Thyme

data Conference = Conf Text YearMonthDay

slide-37
SLIDE 37

import Data.Thyme

data Conference = Conf Text YearMonthDay

ymdParser :: A.Parser YearMonthDay
ymdParser = buildTime <$> timeParser defaultTimeLocale "%Y%m%d"

slide-38
SLIDE 38

import Data.Thyme

data Conference = Conf Text YearMonthDay

ymdParser :: A.Parser YearMonthDay
ymdParser = buildTime <$> timeParser defaultTimeLocale "%Y%m%d"

trifecta :: T.Parser a -> Decode ByteString a
attoparsec :: A.Parser a -> Decode ByteString a

slide-39
SLIDE 39

import Data.Thyme

data Conference = Conf Text YearMonthDay

ymdParser :: A.Parser YearMonthDay
ymdParser = buildTime <$> timeParser defaultTimeLocale "%Y%m%d"

trifecta :: T.Parser a -> Decode ByteString a
attoparsec :: A.Parser a -> Decode ByteString a

ymd :: Decode ByteString YearMonthDay
ymd = attoparsec ymdParser

slide-40
SLIDE 40

import Data.Thyme

data Conference = Conf Text YearMonthDay

ymdParser :: A.Parser YearMonthDay
ymdParser = buildTime <$> timeParser defaultTimeLocale "%Y%m%d"

trifecta :: T.Parser a -> Decode ByteString a
attoparsec :: A.Parser a -> Decode ByteString a

ymd :: Decode ByteString YearMonthDay
ymd = attoparsec ymdParser

confD :: Decode ByteString Conference
confD = Conf <$> text <*> ymd

slide-41
SLIDE 41

sv uses error values

data DecodeError s
  = UnexpectedEndOfRow
  | ExpectedEndOfRow [Field s]
  | BadParse s
  | BadDecode s
  ...

slide-42
SLIDE 42

onError :: Decode s a
  -> (DecodeErrors s -> Decode s a)
  -> Decode s a
slide-43
SLIDE 43

Rather than Either for errors, sv uses the Validation data type

data Validation e a = Failure e | Success a

slide-44
SLIDE 44

Rather than Either for errors, sv uses the Validation data type

data Validation e a = Failure e | Success a

instance Semigroup e => Applicative (Validation e)

slide-45
SLIDE 45

Rather than Either for errors, sv uses the Validation data type

data Validation e a = Failure e | Success a

instance Semigroup e => Applicative (Validation e)

newtype DecodeErrors s = DecodeErrors (NonEmpty (DecodeError s))
  deriving Semigroup
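What the `Semigroup e` constraint buys is error accumulation: when both sides of `<*>` fail, the errors are combined rather than the second being discarded. The following sketch defines `Validation` locally using base only, so it runs without the validation package. `Errors`, `intField`, and `decodeTwo` are hypothetical names for illustration, and plain `String` messages stand in for sv's `DecodeError`.

```haskell
import Data.List.NonEmpty (NonEmpty (..))
import Text.Read (readMaybe)

data Validation e a = Failure e | Success a
  deriving (Eq, Show)

instance Functor (Validation e) where
  fmap _ (Failure e) = Failure e
  fmap f (Success a) = Success (f a)

-- Unlike Either, both sides are always inspected and failures are combined.
instance Semigroup e => Applicative (Validation e) where
  pure = Success
  Failure e1 <*> Failure e2 = Failure (e1 <> e2)
  Failure e1 <*> Success _  = Failure e1
  Success _  <*> Failure e2 = Failure e2
  Success f  <*> Success a  = Success (f a)

-- A non-empty list of messages: a failure always carries at least one error.
type Errors = NonEmpty String

intField :: String -> Validation Errors Int
intField s = case readMaybe s of
  Just n  -> Success n
  Nothing -> Failure (("Couldn't parse " ++ show s ++ " as an int") :| [])

data Two = Two Int Int
  deriving (Eq, Show)

decodeTwo :: String -> String -> Validation Errors Two
decodeTwo a b = Two <$> intField a <*> intField b
```

With `Either`, `decodeTwo "a" "b"` could only ever report the first bad field. Here both messages are kept, which is why this type has an `Applicative` instance but no lawful `Monad` instance.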

slide-46
SLIDE 46

example.csv

"a","b","c"

slide-47
SLIDE 47

example.csv

"a","b","c"

data Two = Two Int Int

slide-48
SLIDE 48

example.csv

"a","b","c"

data Two = Two Int Int

twoD :: Decode ByteString Two
twoD = Two <$> int <*> int

slide-49
SLIDE 49

example.csv

"a","b","c"

data Two = Two Int Int

twoD :: Decode ByteString Two
twoD = Two <$> int <*> int

parseDecodeFromFile twoD "example.csv"

slide-50
SLIDE 50

example.csv

"a","b","c"

data Two = Two Int Int

twoD :: Decode ByteString Two
twoD = Two <$> int <*> int

parseDecodeFromFile twoD "example.csv"

Failure (DecodeErrors (
  BadDecode "Couldn't parse \"a\" as an int" :|
    [ BadDecode "Couldn't parse \"b\" as an int"
    , ExpectedEndOfRow ["c"]
    ]
))

slide-51
SLIDE 51

What about encoding?

User-defined data types --Encode--> CSV data structure

slide-52
SLIDE 52

data Encode a = ...

slide-53
SLIDE 53

data Encode a = ...

int :: Encode Int
double :: Encode Double
string :: Encode String
const :: ByteString -> Encode a
encodeOf :: Prism' s a -> Encode a -> Encode s

slide-54
SLIDE 54

data Encode a = ...

int :: Encode Int
double :: Encode Double
string :: Encode String
const :: ByteString -> Encode a
encodeOf :: Prism' s a -> Encode a -> Encode s

instance Semigroup (Encode a)
instance Contravariant Encode
instance Divisible Encode
instance Decidable Encode
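Encoders consume values rather than produce them, which is why they are contravariant. A rough sketch of how the `Semigroup` and `Contravariant` instances fit together, using base only: the `Encode` newtype here is a toy that emits a list of fields, `contramapE` is a plain function standing in for `contramap`, and `Person`, `personE`, and `encodeRow` are hypothetical names for illustration.

```haskell
import Data.List (intercalate)

-- A toy encoder: turn a value into a list of CSV fields.
newtype Encode a = Encode { runEncode :: a -> [String] }

-- Combining two encoders for the same type concatenates their fields.
instance Semigroup (Encode a) where
  Encode f <> Encode g = Encode (\a -> f a ++ g a)

-- Contravariant mapping: adapt an encoder by pre-processing its input.
contramapE :: (b -> a) -> Encode a -> Encode b
contramapE f (Encode g) = Encode (g . f)

int :: Encode Int
int = Encode (\n -> [show n])

-- show wraps the string in double quotes. Its escaping follows Haskell
-- rules, not RFC 4180, so this is only adequate for simple strings.
string :: Encode String
string = Encode (\s -> [show s])

data Person = Person String Int

-- Project out each component, encode it, and append the results.
personE :: Encode Person
personE = contramapE name string <> contramapE age int
  where
    name (Person n _) = n
    age (Person _ a)  = a

encodeRow :: Encode a -> a -> String
encodeRow e = intercalate "," . runEncode e
```

This mirrors the decoding side: where a decoder for a product is built with `<$>` and `<*>`, an encoder for a product is built by projecting out each component and appending the resulting encoders.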

slide-55
SLIDE 55

Is it fast?

slide-56
SLIDE 56

Is it fast? No

slide-57
SLIDE 57

Benchmarks

  • Benchmarked with a 100,000-line file
  • Text, ints, doubles, products, sums
  • cassava vs. sv (instantiated to attoparsec)
slide-58
SLIDE 58
slide-59
SLIDE 59

Use sv-cassava for now

slide-60
SLIDE 60

Noteworthy limitations as at 2018-05-23

  • No column-name-based decoding
  • Errors don’t report source-file positions
  • No streaming
  • Performance needs work (particularly in parsing)
slide-61
SLIDE 61

Contributions to sv are welcome. Do you have a crazy CSV file to challenge sv? Contact me at george.wilson@data61.csiro.au

slide-62
SLIDE 62

References

  • sv library

https://github.com/qfpl/sv
https://github.com/qfpl/sv-cassava

  • validation data type

https://hackage.haskell.org/package/validation
https://hackage.haskell.org/package/either

  • CSV RFC

https://tools.ietf.org/html/rfc4180

  • Hedgehog

https://hackage.haskell.org/package/hedgehog

slide-63
SLIDE 63

Thanks for listening!