Ingesting 35M images with Python In the cloud. Àlex Vinyals


slide-1
SLIDE 1

Ingesting 35M images with Python In the cloud.

Àlex Vinyals Software Engineer @ Hotels Data

1

slide-2
SLIDE 2

Unify all the data

Challenges of a metasearch

2

slide-5
SLIDE 5

Partner A
  Hotel ID: 123
  Name: Euskalduna Center
  Street address: Avenida Abandoibarra 3
  Coordinates: 1.23, 2.43

Partner B
  Hotel ID: $abc
  Name: Euskalduna Conference Center
  Street address: Av. Abandoibarra 3
  Coordinates: 1.23754, 2.43123

Partner C
  Hotel ID: bilbao-hot1
  Name: Euskalduna CC
  Street address: Avda. Abandoibarra3, 48009
  Coordinates: 1.238, 2.431

Magic Happens

Skyscanner
  Hotel ID: 123456
  Name: Euskalduna Conference Center
  Street address: Av. Abandoibarra 3
  Coordinates: 1.23754, 2.43123

5

slide-6
SLIDE 6

Magic Happens

Skyscanner
  Hotel ID: 123456
  Name: Euskalduna Conference Center
  Street address: Av. Abandoibarra 3
  Coordinates: 1.23754, 2.43123

Data Release

Partner A
  Hotel ID: 123
  Name: Euskalduna Center
  Street address: Avenida Abandoibarra 3
  Coordinates: 1.23, 2.43

Partner B
  Hotel ID: $abc
  Name: Euskalduna Conference Center
  Street address: Av. Abandoibarra 3
  Coordinates: 1.23754, 2.43123

Partner C
  Hotel ID: bilbao-hot1
  Name: Euskalduna CC
  Street address: Avda. Abandoibarra3, 48009
  Coordinates: 1.238, 2.431

6

slide-7
SLIDE 7

So what about the images?

7

slide-8
SLIDE 8

Partner A

Hotel ID 123

Partner B

Hotel ID $abc

Partner C

Hotel ID bilbao-hot1

Magic Happens Skyscanner

Hotel ID 123456

8

slide-13
SLIDE 13

With more than 200 partners

800,000 hotels reach production

13

slide-14
SLIDE 14

Images to process = K * M * N ~ 35M images

K = number of partners
M = avg number of hotels per partner
N = avg number of images per partner hotel
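
As a rough illustration of how the product reaches that order of magnitude, here is a minimal sketch. Only the partner count comes from the talk (which mentions more than 200 partners); the values of M and N below are hypothetical averages chosen only so that the product lands on the 35M figure from the slide.

```python
# K is taken from the talk ("more than 200 partners").
# M and N are made-up averages, used purely for illustration.
K = 200      # number of partners
M = 17500    # avg number of hotels per partner (assumed)
N = 10       # avg number of images per partner hotel (assumed)

images_to_process = K * M * N
print(images_to_process)  # 35000000
```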

14

slide-15
SLIDE 15

Resizing is a thing

And we have 14 different configurations

15

slide-16
SLIDE 16

Tale of an image processing pipeline

16

slide-17
SLIDE 17

Tech Stack

Riding on AWS

17

slide-18
SLIDE 18

Tech Stack

Riding on AWS

SQS

Simple Queue Service

18

slide-19
SLIDE 19

Tech Stack

Riding on AWS

SQS

Simple Queue Service

Compute resources

19

slide-20
SLIDE 20

*with DjangoRestFramework *without Django ORM

Libraries

20

slide-22
SLIDE 22

Kombu

Messaging / queues / amqp *with DjangoRestFramework *without Django ORM

Libraries

22

slide-23
SLIDE 23

Kombu

Messaging / queues / amqp

Boto

Amazon stuff *with DjangoRestFramework *without Django ORM

Libraries

23

slide-24
SLIDE 24

Kombu

Messaging / queues / amqp

Boto

Amazon stuff

Pillow

Image Processing *with DjangoRestFramework *without Django ORM

Libraries

24

slide-25
SLIDE 25

Libraries

Kombu: messaging / queues / amqp

Boto: Amazon stuff

Pillow: image processing

*with DjangoRestFramework *without Django ORM

Python2.7

25

slide-26
SLIDE 26

Tale of an image processing pipeline

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

26

slide-27
SLIDE 27

Tale of an image processing pipeline

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

Asynchronous (Always Running)

Triggered by the Data Release

27

slide-28
SLIDE 28

Triggering

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

28

slide-29
SLIDE 29

Partner A
  Hotel ID: 123
  Images: http:/.../image.png http://… http://…

Partner B
  Hotel ID: $abc
  Images: http://… http://… http://… http://… http://…

Partner C
  Hotel ID: bilbao-hot-1
  Images: http://… http://…

Catalogues

DB

API

Computes Diff
  These urls are new
  These urls are updated
  Those urls are deleted

Image Release

29
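
The diff step on this slide (new, updated and deleted URLs) could be sketched roughly as follows. This is an assumption-laden illustration, not the talk's implementation: `compute_diff`, the `{url: version}` mapping and the `version` marker (for example a checksum or last-modified value) are all hypothetical, since the slides do not say how an updated URL is detected.

```python
def compute_diff(previous, current):
    """Compare two {url: version} mappings for one partner hotel.

    `version` is a hypothetical change marker; the talk does not say
    how updates are actually detected.
    """
    new = sorted(url for url in current if url not in previous)
    deleted = sorted(url for url in previous if url not in current)
    updated = sorted(
        url for url in current
        if url in previous and current[url] != previous[url]
    )
    return {"new": new, "updated": updated, "deleted": deleted}


previous = {"http://example.com/a.png": "v1", "http://example.com/b.png": "v1"}
current = {"http://example.com/b.png": "v2", "http://example.com/c.png": "v1"}
diff = compute_diff(previous, current)
print(diff)
```

Only the URLs in the "new" and "updated" buckets would need to be queued for downloading; the "deleted" ones only need cleanup.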

slide-33
SLIDE 33

Downloading

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

33

slide-35
SLIDE 35

import io

import boto
import requests
from PIL import Image

s3 = boto.connect_s3()
bucket = s3.get_bucket('available-images')

# minimum_short, minimum_long and max_pixels are configuration values
# that are not shown on the slide.


@reliable_callback()
def downloader_callback(queued_image):
    """
    Overly simplified downloading callback
    without error handling logic
    """
    response = requests.get(queued_image.url)
    blob = response.content

    # Store the raw bytes in S3 before any filtering.
    key = bucket.new_key(queued_image.basename)
    key.set_contents_from_string(blob)

    image = Image.open(io.BytesIO(blob))
    if should_filter(image):
        return

    fingerprinting_producer.publish(queued_image)


def should_filter(image):
    # Pillow reports size as (width, height).
    width, height = image.size

    short_size = min(width, height)
    if short_size < minimum_short:
        return True

    long_size = max(width, height)
    if long_size < minimum_long:
        return True

    total_pixels = width * height
    if total_pixels > max_pixels:
        return True

    return False

35

slide-38
SLIDE 38

import functools
import logging
import warnings

from PIL import Image

logger = logging.getLogger(__name__)


def reliable_callback():
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Turn Pillow's decompression-bomb warning into an exception.
            warnings.simplefilter('error', Image.DecompressionBombWarning)
            try:
                return func(*args, **kwargs)
            except BaseException:
                # Log and swallow everything so the worker keeps running.
                logger.error("Critical worker error", exc_info=True)
        return wrapper
    return decorator
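
The effect of `warnings.simplefilter('error', ...)` in the decorator can be seen in isolation with the standard library alone. In this sketch `BombWarning` is a hypothetical stand-in for Pillow's `Image.DecompressionBombWarning`, and `open_huge_image` and `run` are invented names; the point is only that an escalated warning becomes an exception the worker can catch.

```python
import warnings


class BombWarning(RuntimeWarning):
    """Stand-in for PIL.Image.DecompressionBombWarning (assumption)."""


def open_huge_image():
    # Simulates a library that warns about an oversized image
    # instead of raising.
    warnings.warn("image exceeds pixel limit", BombWarning)
    return "decoded"


def run(escalate):
    with warnings.catch_warnings():
        if escalate:
            # Same call shape as in reliable_callback.
            warnings.simplefilter("error", BombWarning)
        else:
            warnings.simplefilter("ignore", BombWarning)
        try:
            return open_huge_image()
        except BombWarning:
            return "rejected"


print(run(escalate=False))  # decoded
print(run(escalate=True))   # rejected
```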

38

slide-42
SLIDE 42

from kombu import Connection, Consumer, Exchange, Queue, eventloop


class KombuConsumer(common.BaseConsumer):
    # ... bla bla

    def callback(self, body, message):
        self.handler(body)
        message.ack()

    def listen(self):
        with Connection(self.backend.broker,
                        transport_options={'region': self.backend.region}) as connection:
            with Consumer(connection, self.queue,
                          callbacks=[self.callback],
                          accept=[self.backend.serializer]):
                for _ in eventloop(connection):
                    pass


# What a simplified worker looks like
# Broker URI stored on Backend object, looks like:
# sqs://{s3_key}:{s3_secret}@
consumer = KombuConsumer(backend, handler=downloader.downloader_callback)
consumer.listen()
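
One consequence of pairing this consumer with `reliable_callback` is worth spelling out: the handler never raises, so `message.ack()` always runs and a failed image is not redelivered by the queue. The sketch below reproduces that behaviour with the standard library only; `FakeMessage`, `swallow_errors` and `failing_handler` are hypothetical stand-ins, not Kombu or project code.

```python
import logging

logger = logging.getLogger("worker")


class FakeMessage:
    """Hypothetical stand-in for a queue message; only tracks acks."""

    def __init__(self):
        self.acked = False

    def ack(self):
        self.acked = True


def swallow_errors(func):
    # Same idea as reliable_callback: log the failure, return None.
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.error("Critical worker error", exc_info=True)
    return wrapper


@swallow_errors
def failing_handler(body):
    raise ValueError("cannot process %r" % (body,))


def callback(handler, body, message):
    # Mirrors KombuConsumer.callback: handle first, then acknowledge.
    handler(body)
    message.ack()


message = FakeMessage()
callback(failing_handler, {"url": "http://example.com/a.png"}, message)
print(message.acked)  # True
```

If redelivery of failed messages were wanted, the ack would have to depend on the handler's outcome; the slides do not show such logic.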

42

slide-46
SLIDE 46

Fingerprinting

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

46

slide-47
SLIDE 47

SQS

Fingerprinters Queue

Fingerprinter RDS

unique identification for the image content

S3

47

slide-48
SLIDE 48

Are those images the same?

48

slide-51
SLIDE 51

Are those images the same?

Yes, they are

51

slide-52
SLIDE 52

from io import BytesIO

import imagehash
from PIL import Image


def fingerprint_callback(queued_image):
    blob = download_image_blob(queued_image.basename)
    image = Image.open(BytesIO(blob))
    # Perceptual hash (pHash) variant
    result = cropped_hash(image, imagehash.phash)
    store_hashes(queued_image.image_id, result)

52

slide-53
SLIDE 53

from io import BytesIO

import imagehash
from PIL import Image


def fingerprint_callback(queued_image):
    blob = download_image_blob(queued_image.basename)
    image = Image.open(BytesIO(blob))
    # Difference hash (dHash) variant
    result = cropped_hash(image, imagehash.dhash)
    store_hashes(queued_image.image_id, result)
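
To make the idea behind a difference hash concrete, here is a simplified pure-Python sketch. It is not the `imagehash` implementation: the real `imagehash.dhash` first converts the image to grayscale and shrinks it to a small grid, whereas this sketch assumes that grid is already given as rows of brightness values. `dhash_bits` and `bits_to_int` are hypothetical helper names.

```python
def dhash_bits(gray_rows):
    """One bit per horizontal neighbour pair: 1 if brightness increases."""
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_int(bits):
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


grid = [[0, 1, 2],
        [2, 1, 0]]
# Same picture, uniformly brighter: the gradients are unchanged.
brighter = [[10, 11, 12],
            [12, 11, 10]]

print(bits_to_int(dhash_bits(grid)))      # 12
print(bits_to_int(dhash_bits(brighter)))  # 12
```

Because only the direction of brightness changes is encoded, a uniform brightness shift leaves the hash untouched, which is the property that makes such hashes useful for spotting the same photo served by different partners.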

53

slide-54
SLIDE 54

def cropped_hash(image, algorithm, steps=range(0, 51, 10)):
    result = []
    w, h = image.size

    # We want to cut by steps % of the image (default 0%, 10%...50%), which
    # means we need to cut half of that from each side:
    #    | N%/2 |          | N%/2 |
    #    +------+----------+------+--
    #    |      :          :      | N%/2
    #    +- - - +----------+ - - -+--
    #    |      |          |      |
    #    |      |          |      |
    #    +- - - +----------+ - - -+--
    #    |      :          :      | N%/2
    #    +------+----------+------+--
    for x in steps:
        # Floor division keeps the crop box in integers
        # (same result as "/" on ints under Python 2.7).
        x_band = x * w // 200
        for y in steps:
            y_band = y * h // 200
            with image.crop((x_band, y_band, w - x_band, h - y_band)) as sub_image:
                sub_hash = algorithm(sub_image)
                result.append(hash_to_int(sub_hash))

    return result
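
The crop geometry can be checked without Pillow. This sketch only reproduces the box arithmetic from `cropped_hash`; `crop_boxes` is a hypothetical helper and the 200x100 image size is an arbitrary example.

```python
def crop_boxes(w, h, steps=range(0, 51, 10)):
    """Return the (left, upper, right, lower) boxes cropped_hash would use."""
    boxes = []
    for x in steps:
        x_band = x * w // 200  # half of x% of the width, per side
        for y in steps:
            y_band = y * h // 200  # half of y% of the height, per side
            boxes.append((x_band, y_band, w - x_band, h - y_band))
    return boxes


boxes = crop_boxes(200, 100)
print(len(boxes))  # 36
print(boxes[0])    # (0, 0, 200, 100)
print(boxes[-1])   # (50, 25, 150, 75)
```

With the default steps each image therefore yields 36 hashes (6 horizontal crops times 6 vertical crops), from the full frame down to the central half in each dimension.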

54

slide-56
SLIDE 56

Deduplication Time

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

56

slide-57
SLIDE 57

Deduplicator SQS

Prioritisers Queue

RDS SQS

Deduplicators Queue

API Data Release

CSV with 1M groups of hotel ids Group Payloads If needed *

57

slide-58
SLIDE 58

58

What is a “group”?

slide-60
SLIDE 60

“If needed”

60

slide-62
SLIDE 62

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn) ]

62

slide-64
SLIDE 64

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn) ] Image Group 456 Image Group 203

64

slide-65
SLIDE 65

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn) ] Hotel Group 123 has two image groups: [456, 203] Image Group 456 Image Group 203

65

slide-66
SLIDE 66

def deduplicate_group(all_hotel_images):
    all_images = set(all_hotel_images)
    groups = []
    while all_images:
        # Start a new group from any remaining image.
        seed_image = all_images.pop()
        group = {seed_image}
        new_additions = {seed_image}
        while new_additions:
            me = new_additions.pop()
            for other in all_images:
                if is_same_picture(me.hashes, other.hashes):
                    group.add(other)
                    new_additions.add(other)
            # Drop grouped images so they are not matched again.
            all_images = all_images - group
        groups.append(group)
    return groups
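
A property of this loop that is easy to miss is that grouping is transitive: if A matches B and B matches C, all three end up in one group even when A and C do not match directly. The self-contained sketch below shows this with a toy predicate; `ToyImage`, `is_similar` and `group_images` are hypothetical names, and the "values at most 1 apart" rule is a stand-in for `is_same_picture`, not the real comparison.

```python
from collections import namedtuple

ToyImage = namedtuple("ToyImage", ["name", "value"])


def is_similar(a, b):
    # Toy stand-in for is_same_picture: values at most 1 apart.
    return abs(a.value - b.value) <= 1


def group_images(images):
    remaining = set(images)
    groups = []
    while remaining:
        seed = remaining.pop()
        group = {seed}
        frontier = {seed}
        while frontier:
            current = frontier.pop()
            matches = {o for o in remaining if is_similar(current, o)}
            group |= matches
            frontier |= matches
            remaining -= matches
        groups.append(group)
    return groups


images = [ToyImage("a", 1), ToyImage("b", 2),
          ToyImage("c", 3), ToyImage("d", 10)]
groups = group_images(images)
names = sorted(sorted(img.name for img in g) for g in groups)
print(names)  # [['a', 'b', 'c'], ['d']]
```

Here "a" and "c" differ by 2 and would not match on their own, yet they share a group through "b". That chaining is one reason the comparison threshold needs careful tuning.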

66

slide-71
SLIDE 71

def is_same_picture(left_hashes, right_hashes):
    for left_hash in left_hashes:
        for right_hash in right_hashes:
            ham_dist = hamdist(left_hash, right_hash)
            if ham_dist < threshold:
                return True
    return False
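
The slide leaves `hamdist` and `threshold` undefined. Assuming the hashes are stored as plain integers (as `hash_to_int` in the fingerprinting step suggests), a Hamming distance could be sketched like this; the `threshold` value of 3 is made up for the example and is not the value used in the talk.

```python
def hamdist(left_hash, right_hash):
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(left_hash ^ right_hash).count("1")


threshold = 3  # hypothetical value, for illustration only


def is_same_picture(left_hashes, right_hashes):
    # True as soon as any pair of crops is closer than the threshold.
    return any(
        hamdist(left, right) < threshold
        for left in left_hashes
        for right in right_hashes
    )


print(hamdist(0b1011, 0b0010))                      # 2
print(is_same_picture([0b1111, 0b0000], [0b0001]))  # True
print(is_same_picture([0b1111], [0b0000]))          # False
```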

71

slide-73
SLIDE 73

How do you tune this step?

Guarantees are needed

73

slide-74
SLIDE 74

You build a corpus.

74

slide-77
SLIDE 77

Prioritisation

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

77

slide-78
SLIDE 78

SQS

Prioritizers Queue

Prioritizer SQS

Generators Queue

RDS

78

slide-79
SLIDE 79

Hotel Group 123 [ImageGroup406, ImageGroup203]

79

slide-81
SLIDE 81

“Best image” “Best Image” Hotel Group 123 [ImageGroup406, ImageGroup203]

81

slide-82
SLIDE 82

“Best image” “Best Image” Hotel Group 123 [ImageGroup406, ImageGroup203] “Best order” 1 2 reaches production

82

slide-83
SLIDE 83

What could go wrong?

83

slide-85
SLIDE 85

What could go wrong?

* Detect features, prioritise based on that.
* Tools to manually fix data.

85

slide-86
SLIDE 86

Generation

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

86

slide-87
SLIDE 87

SQS

Generators Queue

Generator RDS S3

87

slide-88
SLIDE 88

from PIL import Image, ImageEnhance


def scalefit(image):
    sz = image.size
    bs = best_size(image)

    # A bit of math:
    # - if we scale to fit width, the scaling factor is: scale_width = bs[0]/sz[0]
    # - if we scale to fit height, the scaling factor is: scale_height = bs[1]/sz[1]
    # We want to scale to the smaller of them (so that the image fits in both),
    # so we scale to width if:
    #   scale_width < scale_height => bs[0]/sz[0] < bs[1]/sz[1] => bs[0]*sz[1] < bs[1]*sz[0]
    # Having it in this form means no floats are needed; all in integers.
    if bs[0] * sz[1] < bs[1] * sz[0]:
        # Scale to width
        w = bs[0]
        h = sz[1] * bs[0] // sz[0]
    else:
        # Scale to height
        w = sz[0] * bs[1] // sz[1]
        h = bs[1]

    # resize() takes a resampling filter constant, not a string.
    return image.resize((w, h), Image.BILINEAR)


def contrast(image, value):
    enhancer = ImageEnhance.Contrast(image)
    return enhancer.enhance(float(value))
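
The integer-only fitting rule from `scalefit` can be exercised on its own. In this sketch `fit_size` is a hypothetical helper that takes the target box explicitly, because `best_size` is not shown on the slides; the 800x800 box is an arbitrary example.

```python
def fit_size(sz, bs):
    """Largest (w, h) with sz's aspect ratio that fits inside box bs."""
    # Cross-multiplied comparison of bs[0]/sz[0] and bs[1]/sz[1]:
    # no floats needed.
    if bs[0] * sz[1] < bs[1] * sz[0]:
        # Width is the limiting dimension.
        return bs[0], sz[1] * bs[0] // sz[0]
    # Height is the limiting dimension.
    return sz[0] * bs[1] // sz[1], bs[1]


print(fit_size((4000, 3000), (800, 800)))  # (800, 600)
print(fit_size((3000, 4000), (800, 800)))  # (600, 800)
```

A landscape source is limited by the box width and a portrait source by the box height, so the result always fits inside the box while keeping the aspect ratio.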

88

slide-91
SLIDE 91

And that’s it for the pipeline

91

slide-100
SLIDE 100

Thanks for listening

Any questions?

100