GoodReads Book Recommendation Service - PowerPoint presentation by Yijun Tian, Vicky Bai, Zeynep Doganata
SLIDE 1

GoodReads Book Recommendation Service

Yijun Tian, Vicky Bai, Zeynep Doganata

SLIDE 2

Introduction/Related Work

  • “A subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item” -- Wikipedia
  • Recommendation systems drive significant engagement and revenue for companies such as Amazon, Netflix, and Goodreads.
  • Approaches: Collaborative filtering, Content-based filtering, Contextual filtering, Social and demographic filtering
  • Techniques: Supervised Learning, Clustering/Unsupervised Learning, Transfer Learning, Text Classification, Text Embedding

SLIDE 3

Data - GoodReads

Datasets:

  • Meta-Data of Books (2.36M books);
  • User-Book Interactions (229M user-book interactions);
  • Book Review Texts (15M records).

[Architecture diagram: Book Information (book id, title, description), User Behavior (rating, is_read), and User Information (user id, user’s shelf) feed into the Recommendation Engine, which returns a set of books.]
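Given the 229M-row interactions file, streaming is the practical way to touch the data. A minimal sketch, assuming the gzipped JSON-lines layout (one record per line) of the public GoodReads dumps; the function name and `limit` parameter are ours:

```python
import gzip
import json

def iter_records(path, limit=None):
    """Stream records from a gzipped JSON-lines dump without loading it all."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)
```

Because it yields one record at a time, the 229M-interaction file can be filtered or aggregated in constant memory.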

SLIDE 4

Pipeline

  • Input: a book ID
  • Output: the most similar books, including IDs, titles, and descriptions

[Pipeline diagram: Extract Similar Books yields Ground Truth Similar Books and Reader-based Similar Books as pairs (A, Similar Book B1) … (A, Similar Book Bn); InferSent embeds Book A and each similar book, a similarity score is calculated between the embeddings, and the model returns the most similar books.]
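The “Calculate Similarity Score” step reduces to cosine similarity between embedding vectors. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def rank_similar(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the query, best first.

    Normalizing both sides turns cosine similarity into a dot product.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)          # descending similarity
    return order, scores[order]
```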

SLIDE 5

Extract Similar Books

  • Ground Truth Similar Books
    ○ Provided in the GoodReads dataset. However, we don’t know how the similar books were generated (e.g. same series, topic, or author?).
  • Reader-based Similar Books
    ○ Share the same readers
    ○ Share the same high ratings (4/5 stars)
    ○ Randomly select 200 similar books

[Diagram: (1) Ground Truth Similar Books; (2) Reader-based Similar Books]
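The reader-based extraction above can be sketched as a few lines over the interaction records. The field names (`user_id`, `book_id`, `rating`) follow the dataset schema; the function name and fixed seed are ours, and the 200-book sample matches the slide:

```python
import random
from collections import defaultdict

def reader_based_similar(interactions, target_book, k=200, min_rating=4, seed=0):
    """Books sharing at least one high-rating (4/5 star) reader with the target."""
    by_user = defaultdict(set)
    for rec in interactions:
        if rec["rating"] >= min_rating:
            by_user[rec["user_id"]].add(rec["book_id"])

    # Union the high-rating shelves of every reader who liked the target.
    candidates = set()
    for books in by_user.values():
        if target_book in books:
            candidates |= books - {target_book}

    pool = sorted(candidates)
    random.Random(seed).shuffle(pool)     # deterministic random sample
    return pool[:k]
```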

SLIDE 6

Model Exploration

  1. Word embedding: Word2vec vs. fastText
  2. Transfer learning: ULMFiT
  3. Sentence embedding: InferSent
SLIDE 7

Word Embedding

[Diagram: word-representation progression from One-Hot Encoding to a Co-Occurrence Matrix to Word2Vec; Word2Vec treats “apple” and “apples” as unrelated tokens, while fastText decomposes a word into character n-grams: <ap app ppl ple le>.]
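fastText's key difference from Word2Vec is the subword decomposition in the diagram. A sketch of the character n-gram extraction, with the `<`/`>` boundary markers the fastText paper uses (function name is ours):

```python
def char_ngrams(word, n=3):
    """fastText-style subword units: wrap the word in boundary markers,
    then slide an n-character window across it."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
```

`char_ngrams("apple")` reproduces the slide's example, and it is why fastText can embed rare or unseen words such as "apples" by summing shared subword vectors.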

SLIDE 8

ULMFiT: Universal Language Model Fine-tuning for Text Classification

  • Transfer learning in NLP
  • Consists of 3 main phases:
    ○ Language model trained on a general-domain corpus
    ○ Fine-tuning begins on target-task data, using slanted triangular learning rates to learn features
    ○ Further fine-tuning using gradual unfreezing and slanted triangular learning rates - to preserve low-level learnings and adapt high-level representations
  • Not yet used very much for unsupervised tasks such as Semantic Text Similarity - many tasks implemented with ULMFiT involve classification

Paper: “Universal Language Model Fine-tuning for Text Classification”
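The slanted triangular learning rate schedule mentioned above has a closed form in the ULMFiT paper: a short linear warm-up to the peak rate, then a long linear decay. A sketch, with the default hyperparameters taken from the paper:

```python
import math

def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t of T total iterations.

    Rises linearly for the first cut_frac of training, then decays linearly;
    ratio bounds how much smaller the lowest rate is than lr_max.
    """
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```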

SLIDE 9

InferSent: sentence embedding

  • To obtain general-purpose sentence embeddings that capture generic information
  • Pre-trained on the Stanford Natural Language Inference (SNLI) dataset
    ○ 570k human-generated English sentence pairs
  • u: premise representation
  • v: hypothesis representation
  • 3-class classifier: entailment, contradiction, and neutral
  • Example: “A soccer game with multiple males playing” & “Some men are playing a sport.” (entailment)

Paper: “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data”
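The 3-class classifier does not see u and v directly; the InferSent paper feeds it the concatenation of u, v, their absolute element-wise difference, and their element-wise product. A sketch of that feature construction (function name is ours):

```python
import numpy as np

def pair_features(u, v):
    """Combine premise u and hypothesis v into the classifier's input:
    [u; v; |u - v|; u * v], so the classifier sees both vectors plus
    explicit comparison signals."""
    return np.concatenate([u, v, np.abs(u - v), u * v])
```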

SLIDE 10

InferSent: sentence embedding

  • Our accuracy: 0.73 over all similar books, 0.77 in the top 5 (test set: 3,000 books and their similar books)

Paper: “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data”
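One plausible reading of the accuracy numbers above, sketched as code: a query book counts as a hit when at least one of its ground-truth similar books appears in the top-k of the ranked list. The function name and dictionary shapes are our own:

```python
def top_k_accuracy(ranked_by_book, ground_truth, k=5):
    """Fraction of query books whose top-k ranked list contains at least one
    ground-truth similar book.

    ranked_by_book: {book_id: [candidate ids, best first]}
    ground_truth:   {book_id: set of truly-similar ids}
    """
    hits = 0
    for book, ranked in ranked_by_book.items():
        truth = ground_truth.get(book, set())
        if any(b in truth for b in ranked[:k]):
            hits += 1
    return hits / len(ranked_by_book)
```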

SLIDE 11

Example Results

Original Book: Anita Diamant's international bestseller "The Red Tent" brilliantly re-created the ancient world of womanhood. Diamant brings her remarkable storytelling skills to "Good Harbor" -- offering insight to the precarious balance of marriage and career, motherhood and friendship in the world of modern women. The seaside town of Gloucester, Massachusetts is a place where the smell of the ocean lingers in the air and the rocky coast glistens in the Atlantic sunshine. When longtime Gloucester resident Kathleen Levine is diagnosed with breast cancer, her life is thrown into turmoil. Frightened and burdened by secrets, she meets Joyce Tabachnik -- a freelance writer with literary aspirations -- and a once-in-a-lifetime friendship is born. Joyce has just bought a small house in Gloucester, where she hopes to write as well as vacation with her family. Like Kathleen, Joyce is at a fragile place in her life. A mutual love for books, humor, and the beauty of the natural world brings the two women together. They share their personal histories, and help each other to confront scars left by old emotional wounds. With her own trademark wisdom and humor, Diamant considers the nature, strength, and necessity of adult female friendship. "Good Harbor" examines the tragedy of loss, the insidious nature of family secrets, as well as the redemptive power of friendship.

Similar Book: In A Little Love Story, Roland Merullo -- winner of the Massachusetts Book Award and the Maria Thomas Fiction Award -- has created a sometimes poignant, sometimes hilarious tale of attraction and loyalty, jealousy and grief. It is a classic love story -- with some modern twists. Janet Rossi is very smart and unusually attractive, an aide to the governor of Massachusetts, but she suffers from an illness that makes her, as she puts it, "not exactly a good long-term investment." Jake Entwhistle is a few years older, a carpenter and portrait painter, smart and good-looking too, but with a shadow over his romantic history. After meeting by accident -- literally -- when Janet backs into Jake's antique truck, they begin a love affair marked by courage, humor, a deep and erotic intimacy . . . and modern complications. Working with the basic architecture of the love story genre, Merullo -- a former carpenter known for his novels about family life -- breaks new ground with a fresh look at modern romance, taking liberties with the classic design, adding original lines of friendship, spirituality, and laughter, and, of course, probing the mystery of love. ... (Score: 0.8631)

SLIDE 12

API Demo

SLIDE 13
SLIDE 14
SLIDE 15

Service Hosting

Specs/Details:

  • Flask application with model and data preloading
  • GET endpoint with book_id and “top n” parameter
  • Docker image ~ 10.4 GB
  • InferSent model file size ~ 4.5 GB
  • GoodReads data ~ 2.5 GB
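A minimal sketch of the Flask service described above. The route name and the stub lookup table are assumptions; the real application preloads the InferSent model and GoodReads data at startup rather than serving from a hard-coded dict:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the preloaded model + data: in the real service this is
# replaced by InferSent inference over the GoodReads descriptions.
RECOMMENDATIONS = {"123": ["456", "789", "101"]}

@app.route("/similar_books", methods=["GET"])
def similar_books():
    """GET endpoint taking a book_id and a "top n" parameter."""
    book_id = request.args.get("book_id")
    top_n = int(request.args.get("top_n", 5))
    books = RECOMMENDATIONS.get(book_id, [])[:top_n]
    return jsonify({"book_id": book_id, "similar_books": books})
```

A request such as `GET /similar_books?book_id=123&top_n=2` then returns a JSON body with the top-2 matches.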
SLIDE 16

AWS Fargate

  • AWS Serverless compute engine for containers
  • Works with ECS - elastic container service
  • UI configuration - not very intuitive:
    ○ Task definitions
    ○ Container definitions
    ○ Soft and hard limits on resources at both layers
  • 10 GB memory limit! :(
SLIDE 17

Kubernetes on DigitalOcean

  • “The node had condition: [MemoryPressure]”
SLIDE 18

AWS SageMaker

  • Targeted towards Data Scientists and ML engineers, providing serverless capabilities for:
    ○ Labeling
    ○ Building
    ○ Training
    ○ Sharing notebooks
    ○ Deploying models
    ○ Managing inference endpoints
    ○ Supports “custom” containers

SLIDE 19

More on ...

  • Containers must be deployed to AWS ECR
  • Must be organized in a compatible way:
    ○ An infer POST endpoint following the SageMaker spec
    ○ A model directory that gets packaged and uploaded to S3 as part of the deployment
    ○ A data directory

SLIDE 20

Short-term solution: EC2 Instance

  • SageMaker looked promising, but after deploying our container we saw it would require a non-trivial amount of refactoring to make it work
  • To ensure we had our service deployed somewhere, we provisioned an EC2 instance
  • Trade-offs:
    ○ Availability - intermittent crashing
    ○ Scaling requires:
      ■ fleet management
      ■ a load balancer

SLIDE 21

Hosting Enhancements

If we had more time…

  • Dockerize better - layering analysis and pruning unnecessary base-image packages
  • Host the model file externally (S3)
  • Upload GoodReads data externally or pickle data structures (S3)
  • Possibly use a small key-value DB for GoodReads data storage
  • Try SageMaker, which has been optimized to do this for us
  • Latency optimization:
    ○ GPU inference
    ○ Experiment with precomputing embeddings for our dataset
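The precomputation idea in the last bullet can be sketched as: embed every book description once offline, L2-normalize, then serve a request as one matrix product instead of per-request inference. Here `encode` stands in for the InferSent encoder and all names are ours:

```python
import numpy as np

def precompute(encode, descriptions):
    """Embed every description once, offline; normalize so cosine similarity
    becomes a dot product at serving time."""
    ids = list(descriptions)
    matrix = np.stack([encode(descriptions[i]) for i in ids])
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return ids, matrix

def top_n(ids, matrix, query_id, n=5):
    """Serve a request as a lookup plus one matrix product."""
    q = matrix[ids.index(query_id)]
    scores = matrix @ q
    order = np.argsort(-scores)
    return [ids[i] for i in order if ids[i] != query_id][:n]
```

The normalized matrix could itself be pickled to S3, addressing the data-hosting bullets above at the same time.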

SLIDE 22

Other Enhancements

  • Recommendation engine enhancements
    ○ Combine with user reviews and other information
    ○ Use other metrics instead of accuracy
  • Embedding enhancements
    ○ Incorporate the trained fastText model in InferSent
    ○ Combine the word embeddings and text embeddings
SLIDE 23

Reference

Datasets:

  • Mengting Wan, Julian McAuley. “Item Recommendation on Monotonic Behavior Chains.” RecSys ’18.
  • Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley. “Fine-Grained Spoiler Detection from Large-Scale Review Corpora.” ACL ’19.
  • Common Crawl: https://commoncrawl.org/

Models:

  • Howard, Jeremy and Sebastian Ruder. “Universal Language Model Fine-tuning for Text Classification.” ACL 2018.
  • Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loïc Barrault and Antoine Bordes. “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.” EMNLP 2017.

SLIDE 24

Questions

SLIDE 25

Appendix