Ingesting 35M images with Python in the cloud. Àlex Vinyals

SLIDE 1

Ingesting 35M images with Python in the cloud

Àlex Vinyals
Software Engineer @ Hotels Data

1

SLIDE 2

Unify all the data

Challenges of a metasearch

2

SLIDE 3

3

SLIDE 4

4

SLIDE 5

Partner A
    Hotel ID: 123
    Name: Euskalduna Center
    Street address: Avenida Abandoibarra 3
    Coordinates: 1.23, 2.43

Partner B
    Hotel ID: $abc
    Name: Euskalduna Conference Center
    Street address: Av. Abandoibarra 3
    Coordinates: 1.23754, 2.43123

Partner C
    Hotel ID: bilbao-hot1
    Name: Euskalduna CC
    Street address: Avda. Abandoibarra3, 48009
    Coordinates: 1.238, 2.431

Magic Happens

Skyscanner
    Hotel ID: 123456
    Name: Euskalduna Conference Center
    Street address: Av. Abandoibarra 3
    Coordinates: 1.23754, 2.43123

5

SLIDE 6

Magic Happens

Skyscanner
    Hotel ID: 123456
    Name: Euskalduna Conference Center
    Street address: Av. Abandoibarra 3
    Coordinates: 1.23754, 2.43123

Data Release

Partner A
    Hotel ID: 123
    Name: Euskalduna Center
    Street address: Avenida Abandoibarra 3
    Coordinates: 1.23, 2.43

Partner B
    Hotel ID: $abc
    Name: Euskalduna Conference Center
    Street address: Av. Abandoibarra 3
    Coordinates: 1.23754, 2.43123

Partner C
    Hotel ID: bilbao-hot1
    Name: Euskalduna CC
    Street address: Avda. Abandoibarra3, 48009
    Coordinates: 1.238, 2.431

6

SLIDE 7

So what about the images?

7

SLIDE 8

Partner A
    Hotel ID: 123

Partner B
    Hotel ID: $abc

Partner C
    Hotel ID: bilbao-hot1

Magic Happens

Skyscanner
    Hotel ID: 123456

8

SLIDE 9

9

SLIDE 10

10

SLIDE 11

11

SLIDE 12

12

SLIDE 13

With more than 200 partners

800,000 hotels reach production

13

SLIDE 14

Images to process = K * M * N ~ 35M images

K = number of partners
M = avg number of hotels per partner
N = avg number of images per partner hotel

14
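As a back-of-the-envelope check of the formula on this slide, the sketch below multiplies the three factors. The values chosen for K, M and N are purely hypothetical placeholders picked so the product lands near 35M; the talk only gives the partner count (more than 200) and the total.

```python
def images_to_process(partners, hotels_per_partner, images_per_hotel):
    """Estimate the total image count as K * M * N."""
    return partners * hotels_per_partner * images_per_hotel


# Hypothetical values, NOT taken from the talk.
K = 200      # number of partners
M = 25000    # avg number of hotels per partner (assumed)
N = 7        # avg number of images per partner hotel (assumed)

total = images_to_process(K, M, N)
print("{:,} images".format(total))
```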

SLIDE 15

Resizing is a thing

And we have 14 different configurations

15

SLIDE 16

Tale of an image processing pipeline

16

SLIDE 17

Tech Stack

Riding on AWS

17

SLIDE 18

Tech Stack

Riding on AWS

SQS

Simple Queue Service

18

SLIDE 19

Tech Stack

Riding on AWS

SQS

Simple Queue Service

Compute resources

19

SLIDE 20

*with DjangoRestFramework *without Django ORM

Libraries

20

SLIDE 21

*with DjangoRestFramework *without Django ORM

Libraries

21

SLIDE 22

Kombu

Messaging / queues / amqp

*with DjangoRestFramework *without Django ORM

Libraries

22

SLIDE 23

Kombu

Messaging / queues / amqp

Boto

Amazon stuff

*with DjangoRestFramework *without Django ORM

Libraries

23

SLIDE 24

Kombu

Messaging / queues / amqp

Boto

Amazon stuff

Pillow

Image Processing

*with DjangoRestFramework *without Django ORM

Libraries

24

SLIDE 25

Kombu

Messaging / queues / amqp

Boto

Amazon stuff

Pillow

Image Processing

*with DjangoRestFramework *without Django ORM

Libraries

Python 2.7

25

SLIDE 26

Tale of an image processing pipeline

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

26

SLIDE 27

Tale of an image processing pipeline

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

Asynchronous (Always Running)

Triggered by the Data Release

27

SLIDE 28

Triggering

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

28

SLIDE 29

These urls are new
These urls are updated
Those urls are deleted

Partner A
    Hotel ID: 123
    Images: http://.../image.png http://… http://…

Partner B
    Hotel ID: $abc
    Images: http://… http://… http://… http://… http://…

Computes Diff

Partner C
    Hotel ID: bilbao-hot-1
    Images: http://… http://…

DB

Catalogues

API Image Release

29
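The "Computes Diff" step sorts each partner's image URLs into new, updated and deleted. A minimal sketch of that idea with set arithmetic follows. The talk does not say how an update is detected, so the per-URL version token and the function name `diff_image_urls` are assumptions made for illustration.

```python
def diff_image_urls(previous, current):
    """Compare two {url: version_token} snapshots for one partner hotel.

    Returns three sets: new, updated and deleted URLs.
    """
    previous_urls = set(previous)
    current_urls = set(current)

    new = current_urls - previous_urls
    deleted = previous_urls - current_urls
    # Present in both snapshots but with a different (hypothetical) token.
    updated = {
        url for url in previous_urls & current_urls
        if previous[url] != current[url]
    }
    return new, updated, deleted


# Hypothetical snapshots of one hotel's catalogue entry.
previous = {
    "http://partner.example/a.png": "v1",
    "http://partner.example/b.png": "v1",
    "http://partner.example/c.png": "v1",
}
current = {
    "http://partner.example/a.png": "v1",   # unchanged
    "http://partner.example/b.png": "v2",   # updated
    "http://partner.example/d.png": "v1",   # new
}

new, updated, deleted = diff_image_urls(previous, current)
```

Only the new and updated URLs would need to be queued for download; deleted ones only need bookkeeping.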

SLIDE 30

30

SLIDE 31

31

SLIDE 32

32

SLIDE 33

Downloading

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

33

SLIDE 34

34

SLIDE 35

import io

import boto
import requests
from PIL import Image

# reliable_callback, fingerprinting_producer and the minimum_short,
# minimum_long and max_pixels thresholds are defined elsewhere.

s3 = boto.connect_s3()
bucket = s3.get_bucket('available-images')


@reliable_callback()
def downloader_callback(queued_image):
    """
    Overly simplified downloading callback
    without error handling logic
    """
    response = requests.get(queued_image.url)
    blob = response.content

    # Store the raw download in S3 before any filtering.
    key = bucket.new_key(queued_image.basename)
    key.set_contents_from_string(blob)

    image = Image.open(io.BytesIO(blob))
    if should_filter(image):
        return

    fingerprinting_producer.publish(queued_image)


def should_filter(image):
    # PIL reports size as (width, height).
    width, height = image.size

    short_size = min(width, height)
    if short_size < minimum_short:
        return True

    long_size = max(width, height)
    if long_size < minimum_long:
        return True

    total_pixels = width * height
    if total_pixels > max_pixels:
        return True

    return False

35
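To see what the size filter on this slide accepts and rejects, here is a standalone sketch that works on plain (width, height) tuples instead of PIL images. The three threshold values are hypothetical; the talk does not give the real ones.

```python
# Hypothetical thresholds; the talk does not give the real values.
MINIMUM_SHORT = 200        # shortest side, in pixels
MINIMUM_LONG = 300         # longest side, in pixels
MAX_PIXELS = 50 * 10**6    # upper bound on width * height


def should_filter_size(size):
    """Return True when a (width, height) tuple should be discarded."""
    width, height = size
    if min(width, height) < MINIMUM_SHORT:
        return True    # too thin or too flat
    if max(width, height) < MINIMUM_LONG:
        return True    # too small overall
    if width * height > MAX_PIXELS:
        return True    # suspiciously large
    return False


for size in [(1024, 768), (150, 800), (250, 250), (10000, 10000)]:
    print(size, should_filter_size(size))
```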

SLIDE 38

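import functools
import logging
import warnings

from PIL import Image

logger = logging.getLogger(__name__)


def reliable_callback():
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Turn Pillow's decompression bomb warning into an exception
            # so oversized images fail instead of being decoded.
            warnings.simplefilter('error', Image.DecompressionBombWarning)
            try:
                return func(*args, **kwargs)
            except BaseException:
                # Log and swallow everything so the worker keeps running.
                logger.error("Critical worker error", exc_info=True)
        return wrapper
    return decorator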

38
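The decorator on this slide relies on two standard-library behaviours: a warning filter set to 'error' makes `warnings.warn` raise, and the surrounding `try` turns that into a log line. The sketch below shows the same mechanism without Pillow. `OversizedImageWarning` and `handle` are hypothetical stand-ins, and this variant catches `Exception` rather than `BaseException` and scopes the filter with `catch_warnings`, which are choices of this example rather than the talk's.

```python
import functools
import logging
import warnings

logger = logging.getLogger("worker")


class OversizedImageWarning(UserWarning):
    """Hypothetical stand-in for PIL's Image.DecompressionBombWarning."""


def reliable_callback(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with warnings.catch_warnings():
            # Inside this block the warning is raised as an exception.
            warnings.simplefilter("error", OversizedImageWarning)
            try:
                return func(*args, **kwargs)
            except Exception:
                logger.error("Critical worker error", exc_info=True)
                return None
    return wrapper


@reliable_callback
def handle(pixels):
    """Hypothetical handler that warns on very large inputs."""
    if pixels > 1000:
        warnings.warn("image too large", OversizedImageWarning)
    return pixels


print(handle(10))      # normal path: the value comes back
print(handle(5000))    # warning became an error, was logged, result is None
```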

SLIDE 42

from kombu import Connection, Consumer, Exchange, Queue, eventloop


class KombuConsumer(common.BaseConsumer):
    # ... bla bla

    def callback(self, body, message):
        self.handler(body)
        message.ack()

    def listen(self):
        with Connection(self.backend.broker,
                        transport_options={'region': self.backend.region}) as connection:
            with Consumer(connection, self.queue,
                          callbacks=[self.callback],
                          accept=[self.backend.serializer]):
                for _ in eventloop(connection):
                    pass


# What a simplified worker looks like
# Broker URI stored on Backend object, looks like:
#   sqs://{s3_key}:{s3_secret}@
consumer = KombuConsumer(backend, handler=downloader.downloader_callback)
consumer.listen()

42
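The callback on this slide acknowledges a message only after the handler has returned. A broker-free sketch of that ordering follows; `FakeMessage` and `consume` are hypothetical in-memory stand-ins, not Kombu classes.

```python
from collections import deque


class FakeMessage(object):
    """Hypothetical in-memory stand-in for a broker message."""

    def __init__(self, body):
        self.body = body
        self.acked = False

    def ack(self):
        self.acked = True


def consume(messages, handler):
    """Handle each message, acknowledging it only after the handler returns."""
    pending = deque(messages)
    while pending:
        message = pending.popleft()
        handler(message.body)    # if this raises, ack() below is never reached
        message.ack()


handled = []
inbox = [FakeMessage("image-1"), FakeMessage("image-2")]
consume(inbox, handled.append)
```

Because the handler in the talk is wrapped in `reliable_callback`, which logs and swallows exceptions, the handler always returns and so the message is acknowledged even when processing failed.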

SLIDE 46

Fingerprinting

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

46

SLIDE 47

SQS

Fingerprinters Queue

Fingerprinter

RDS

unique identification for the image content

S3

47

SLIDE 48

Are those images the same?

48

SLIDE 49

Are those images the same?

49

SLIDE 50

50

SLIDE 51

Are those images the same?

Yes, they are

51

SLIDE 52

from io import BytesIO

import imagehash
from PIL import Image


def fingerprint_callback(queued_image):
    blob = download_image_blob(queued_image.basename)
    image = Image.open(BytesIO(blob))
    # Perceptual hash (pHash) of the image and of several centred crops.
    result = cropped_hash(image, imagehash.phash)
    store_hashes(queued_image.image_id, result)

52

SLIDE 53

from io import BytesIO

import imagehash
from PIL import Image


def fingerprint_callback(queued_image):
    blob = download_image_blob(queued_image.basename)
    image = Image.open(BytesIO(blob))
    # Difference hash (dHash) of the image and of several centred crops.
    result = cropped_hash(image, imagehash.dhash)
    store_hashes(queued_image.image_id, result)

53
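To give a feel for what a difference hash encodes, here is a pure-Python sketch in the spirit of dHash: each bit records whether a pixel is brighter than its left neighbour. It assumes the image has already been shrunk and converted to grayscale, and it is an illustration only, not the `imagehash` implementation. `dhash_bits` and `bits_to_int` are names invented for this example.

```python
def dhash_bits(gray_rows):
    """Difference hash over rows of grayscale values.

    Each row yields one bit per adjacent pixel pair: 1 when the right
    pixel is brighter than the left one, else 0.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_int(bits):
    """Pack a list of 0/1 bits into one integer, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# A tiny 2x3 "image", assumed to be already resized and grayscale.
tiny = [
    [10, 20, 15],
    [30, 30, 40],
]
fingerprint = bits_to_int(dhash_bits(tiny))
```

Because only brightness gradients are kept, re-encoding or mild resizing tends to flip few bits, which is what makes a Hamming-distance comparison meaningful later in the pipeline.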

SLIDE 54

def cropped_hash(image, algorithm, steps=range(0, 51, 10)):
    result = []
    w, h = image.size
    # We want to cut by steps % of the image (default 0%, 10%...50%), which
    # means we need to cut half of that from each side:
    #   | N%/2 |          | N%/2 |
    #   +------+----------+------+--
    #   |      :          :      | N%/2
    #   +- - - +----------+ - - -+--
    #   |      |          |      |
    #   |      |          |      |
    #   +- - - +----------+ - - -+--
    #   |      :          :      | N%/2
    #   +------+----------+------+--
    for x in steps:
        x_band = x * w // 200  # explicit integer division
        for y in steps:
            y_band = y * h // 200
            with image.crop((x_band, y_band, w-x_band, h-y_band)) as sub_image:
                sub_hash = algorithm(sub_image)
                result.append(hash_to_int(sub_hash))
    # One integer hash per (x, y) crop combination.
    return result

54
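The crop arithmetic on this slide can be checked in isolation. The sketch below lists the boxes that would be hashed for a given image size, using the same `N * side // 200` rule; `crop_boxes` is a helper name invented for this example.

```python
def crop_boxes(width, height, steps=range(0, 51, 10)):
    """List the (left, upper, right, lower) boxes hashed for one image."""
    boxes = []
    for x in steps:
        x_band = x * width // 200        # half of x% of the width
        for y in steps:
            y_band = y * height // 200   # half of y% of the height
            boxes.append((x_band, y_band, width - x_band, height - y_band))
    return boxes


boxes = crop_boxes(1000, 800)
print(len(boxes))    # 6 horizontal steps x 6 vertical steps
print(boxes[0])      # the uncropped image
print(boxes[-1])     # 50% cut on both axes
```

With the default steps every image gets 36 hashes, so two copies of a picture that differ only by a border or a modest crop have a chance of matching on at least one pair.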

SLIDE 56

Deduplication Time

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

56

SLIDE 57

Deduplicator

SQS
Prioritisers Queue

RDS

SQS
Deduplicators Queue

API Data Release

CSV with 1M groups of hotel IDs

Group Payloads

If needed *

57

SLIDE 58

58

What is a “group”?

SLIDE 59

59

What is a “group”?

SLIDE 60

“If needed”

60

SLIDE 61

“If needed”

61

SLIDE 62

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn) ]

62

SLIDE 64

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn)]

Image Group 456
Image Group 203

64

SLIDE 65

Hotel Group 123 [(partner_id1, accomodation_id1), …, (partner_idn, accomodation_idn)]

Hotel Group 123 has two image groups: [456, 203]

Image Group 456
Image Group 203

65

SLIDE 66

def deduplicate_group(all_hotel_images):
    all_images = set(all_hotel_images)
    groups = []
    while all_images:
        # Start a new group from an arbitrary remaining image.
        seed_image = all_images.pop()
        group = {seed_image}
        new_additions = {seed_image}
        # Grow the group transitively: anything matching a member joins it.
        while new_additions:
            me = new_additions.pop()
            for other in all_images:
                if is_same_picture(me.hashes, other.hashes):
                    group.add(other)
                    new_additions.add(other)
            # Drop grouped images so they are not compared again.
            all_images = all_images - group
        groups.append(group)
    return groups

66
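The grouping on this slide is transitive: an image joins a group if it matches any member, not necessarily the seed. The sketch below restates that idea on plain integers so it can be run directly. `group_transitively` and the `close` predicate are hypothetical stand-ins for `deduplicate_group` and `is_same_picture`.

```python
def group_transitively(items, is_same):
    """Group items so that anything linked by a chain of matches ends up together."""
    remaining = set(items)
    groups = []
    while remaining:
        seed = remaining.pop()
        group = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            matches = {other for other in remaining if is_same(current, other)}
            remaining -= matches      # never compare grouped items again
            group |= matches
            frontier.extend(matches)  # their neighbours still need checking
        groups.append(group)
    return groups


def close(a, b):
    """Hypothetical similarity test: numbers match if they differ by at most 1."""
    return abs(a - b) <= 1


groups = group_transitively([1, 2, 3, 10, 11, 20], close)
print(sorted(sorted(group) for group in groups))
```

Note that 1 and 3 do not match each other directly, yet they share a group because both match 2. The same chaining applies to image hashes, which is one reason the similarity threshold needs careful tuning.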

SLIDE 71

def is_same_picture(left_hashes, right_hashes):
    # hamdist and threshold are defined elsewhere.
    for left_hash in left_hashes:
        for right_hash in right_hashes:
            ham_dist = hamdist(left_hash, right_hash)
            if ham_dist < threshold:
                return True
    return False

71
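The slide calls `hamdist` and reads `threshold` without showing either. A minimal sketch of both, assuming the hashes are stored as integers (as `hash_to_int` suggests), could look like this. The threshold value is hypothetical.

```python
THRESHOLD = 5   # hypothetical value; the talk does not state the real one


def hamdist(left_hash, right_hash):
    """Number of differing bits between two integer hashes."""
    return bin(left_hash ^ right_hash).count("1")


def is_same_picture(left_hashes, right_hashes, threshold=THRESHOLD):
    """True when any pair of hashes is closer than the threshold."""
    return any(
        hamdist(left, right) < threshold
        for left in left_hashes
        for right in right_hashes
    )


print(hamdist(0b1010, 0b1001))
print(is_same_picture([0b11110000], [0b11110001]))
print(is_same_picture([0x00], [0xFF]))
```

XOR leaves a 1 wherever the two hashes disagree, so counting the ones gives the Hamming distance.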

SLIDE 73

How do you tune this step?

Guarantees are needed

73

SLIDE 74

You build a corpus.

74
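One way a labelled corpus can be used for tuning is to score candidate thresholds by precision and recall. The sketch below assumes a corpus of image pairs that someone has labelled by hand as duplicate or not; the data, the function name `evaluate_threshold` and the candidate thresholds are all hypothetical, and the talk does not say which metric was used.

```python
def evaluate_threshold(labelled_pairs, threshold):
    """Return (precision, recall) for 'distance < threshold means duplicate'.

    labelled_pairs: iterable of (hamming_distance, is_duplicate) tuples.
    """
    true_pos = false_pos = false_neg = 0
    for distance, is_duplicate in labelled_pairs:
        predicted = distance < threshold
        if predicted and is_duplicate:
            true_pos += 1
        elif predicted and not is_duplicate:
            false_pos += 1
        elif is_duplicate:
            false_neg += 1
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / float(flagged) if flagged else 1.0
    recall = true_pos / float(actual) if actual else 1.0
    return precision, recall


# Hypothetical hand-labelled corpus: (distance between hashes, same picture?)
corpus = [(0, True), (2, True), (4, True), (6, False), (7, True), (12, False)]

for threshold in (3, 5, 8):
    precision, recall = evaluate_threshold(corpus, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

A low threshold misses real duplicates, a high one merges different pictures. Re-running such a check on every change is one way to get the guarantees the previous slide asks for.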

SLIDE 75

75

SLIDE 76

76

SLIDE 77

Prioritisation

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

77

SLIDE 78

SQS

Prioritisers Queue

Prioritiser

SQS

Generators Queue

RDS

78

SLIDE 79

Hotel Group 123 [ImageGroup406, ImageGroup203]

79

SLIDE 81

“Best image”
“Best image”

Hotel Group 123 [ImageGroup406, ImageGroup203]

81

SLIDE 82

“Best image”
“Best image”

Hotel Group 123 [ImageGroup406, ImageGroup203]

“Best order”
1
2

reaches production

82
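The two decisions on this slide, a "best image" per image group and a "best order" across groups, can be sketched as a pick followed by a sort. The talk does not say how "best" is scored, so the pixel-count rule below is purely an assumption for illustration, as are the data and the names `pick_best`, `prioritise` and `pixel_count`.

```python
def pick_best(image_group, score):
    """Return the highest-scoring image of one image group."""
    return max(image_group, key=score)


def prioritise(image_groups, score):
    """Pick one representative per image group, then order them best first."""
    best_images = [pick_best(group, score) for group in image_groups]
    return sorted(best_images, key=score, reverse=True)


def pixel_count(image):
    """Assumed scoring rule (not from the talk): more pixels is better."""
    width, height = image["size"]
    return width * height


# Hypothetical data: one hotel group with two image groups.
group_a = [{"id": "a1", "size": (640, 480)}, {"id": "a2", "size": (1920, 1080)}]
group_b = [{"id": "b1", "size": (2560, 1440)}, {"id": "b2", "size": (800, 600)}]

ordered = prioritise([group_a, group_b], pixel_count)
print([image["id"] for image in ordered])
```

Keeping the scoring function as a parameter leaves room for swapping in a feature-based score later.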

SLIDE 83

What could go wrong?

83

SLIDE 84

What could go wrong?

84

SLIDE 85

What could go wrong?

* Detect features, prioritise based on that.
* Tools to manually fix data.

85

SLIDE 86

Generation

Triggering Downloading Fingerprinting Deduplicating Prioritising Generating

86

SLIDE 87

SQS

Generators Queue

Generator

RDS

S3

87

SLIDE 88

from PIL import Image, ImageEnhance


def scalefit(image):
    sz = image.size
    bs = best_size(image)
    # A bit of math:
    # - if we scale to fit width, the scaling factor is:
    #       scale_width = bs[0]/sz[0]
    # - if we scale to fit height, the scaling factor is:
    #       scale_height = bs[1]/sz[1]
    # We want to scale to the smaller of them (so that the image fits in
    # both), so we scale to width if:
    #       scale_width < scale_height
    #    => bs[0]/sz[0] < bs[1]/sz[1]
    #    => bs[0]*sz[1] < bs[1]*sz[0]
    # Having it in this form means no floats are needed; all in integers.
    if bs[0] * sz[1] < bs[1] * sz[0]:
        # Scale to width
        w = bs[0]
        h = sz[1] * bs[0] // sz[0]
    else:
        # Scale to height
        w = sz[0] * bs[1] // sz[1]
        h = bs[1]
    # resize() expects a resampling filter constant, not a string.
    return image.resize((w, h), Image.BILINEAR)


def contrast(image, value):
    enhancer = ImageEnhance.Contrast(image)
    return enhancer.enhance(float(value))

88
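The integer-only comparison in `scalefit` can be tried without Pillow. The sketch below applies the same cross-multiplied test to plain size tuples; `fit_size` and its `box_size` argument are names invented here, standing in for whatever `best_size` returns in the real pipeline.

```python
def fit_size(source_size, box_size):
    """Largest (w, h) with the source aspect ratio that fits inside box_size."""
    src_w, src_h = source_size
    box_w, box_h = box_size
    # Cross-multiplied form of box_w/src_w < box_h/src_h, so no floats needed.
    if box_w * src_h < box_h * src_w:
        # Width is the limiting side.
        return box_w, src_h * box_w // src_w
    # Height is the limiting side.
    return src_w * box_h // src_h, box_h


print(fit_size((4000, 3000), (800, 800)))   # landscape source
print(fit_size((3000, 4000), (800, 800)))   # portrait source
```

A landscape source is limited by the box width and a portrait one by the box height. In both cases the result keeps the original aspect ratio, up to integer rounding.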

SLIDE 91

And that’s it for the pipeline

91

SLIDE 92

92

SLIDE 93

93

SLIDE 94

94

SLIDE 95

95

SLIDE 96

96

SLIDE 97

97

SLIDE 98

98

SLIDE 99

99

SLIDE 100

Thanks for listening

Any questions?

100