AUTOMATIC BUSINESS ATTRIBUTE LABELING FROM YELP REVIEWS : A MACHINE LEARNING APPLICATION by Michael Iannelli and Ephraim Rosenfeld Search Engine Architecture(CSCI-GA 3033-006 - Spring 2017) Courant Institute of Mathematical Sciences


SLIDE 1

AUTOMATIC BUSINESS ATTRIBUTE LABELING FROM YELP REVIEWS: A MACHINE LEARNING APPLICATION

  • by Michael Iannelli and Ephraim Rosenfeld
  • Search Engine Architecture (CSCI-GA 3033-006 - Spring 2017)
  • Courant Institute of Mathematical Sciences – New York University

SLIDE 2

WHAT IS YELP?

  • “Yelp combines traditional business listings in a directory like Yellow Pages with social elements. Customers can leave feedback on their experiences with that business which … informs future customers of what they might expect and … keeps standards high, or forces an improvement of those standards to prevent negative feedback.” [1]

  • 1. Source: https://www.techjunkie.com/what-is-yelp/

What is the Yelp Dataset Challenge?

  • Since 2009, Yelp has published samples of its business and reviewer data so that students can compete in research projects based on these datasets
  • Data consists of business profiles, reviewer account information, and business reviews for 8 cities in the U.S., 2 in Canada, 1 in the U.K., and 1 in Germany

What is the purpose of our analytic?

  • To predict business profile attributes using review data

Why is this valuable?

  • Manual curation of business account information can be time-consuming, error-prone, and inaccurate
  • Leveraging crowd-generated data can provide objective insight into a business or venue

SLIDE 3

HOW DO WE PERFORM THE PREDICTIONS?

  • 1. REVIEW DATA AS TERM FEATURE VECTORS

  • Reviews are represented as term feature vectors
  • TF-IDF scores can be used to give a weight to terms in the vector
  • Binary classifiers suitable for sparse, noisy, high-dimensional vectors (Naïve Bayes and Linear SVC)
  • Advantage: “open box” where top-performing features can be analyzed
  • 2. WORD2VEC
  • Reviews are represented as summaries of distributed word or document representations
  • Binary classifiers suitable for dense, lower-dimensional vectors, e.g. SVC, logistic regression, and gradient-boosted machine (GBM)

  • Advantage: greater accuracy by utilizing context of words
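
The two representations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the toy reviews, the unsmoothed `log(N / df)` IDF variant, the 2-dimensional `WORD_VECTORS`, and the choice of a plain average as the "summary" are all assumptions made for the example. Libraries such as scikit-learn use smoothed IDF variants, so their weights would differ.

```python
import math
import re
from collections import Counter

# Toy reviews, invented for illustration only.
REVIEWS = [
    "cash only but the potatoes are great",
    "they take credit card and the patio is great",
    "great patio and they allow my dog on the patio",
]

# Hypothetical 2-dimensional word vectors standing in for Word2Vec output.
WORD_VECTORS = {"patio": [1.0, 0.0], "dog": [0.0, 1.0], "great": [1.0, 1.0]}


def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Raw term count times log(N / df); a term found in every
            # document gets weight 0 under this unsmoothed variant.
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def mean_vector(tokens, word_vectors):
    """Average the vectors of known tokens into one dense review vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


sparse_vectors = tfidf_vectors(REVIEWS)
dense_vector = mean_vector(tokenize(REVIEWS[2]), WORD_VECTORS)
```

The sparse dictionaries keep one entry per term, which is what makes the "open box" inspection of individual features possible; the dense average has a fixed, small dimension but no longer maps back to single words.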
SLIDE 4

LOOKING AT THE DATA

The Yelp Challenge consists of the following JSON files:

  • 1. Businesses: name, id, and profile attributes

{"business_id":"eYzm1jUK0GsI2KOLTt2PbQ", "name":"Bâton Rouge Steakhouse & Bar", "neighborhood":"Downtown Core", "address":"218 Yonge Street", "city":"Toronto", "state":"ON", "stars":3.0, "review_count":107, "is_open":1, "attributes":["Alcohol: full_bar","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", "BikeParking: False", "BusinessAcceptsCreditCards: True", "Caters: False","GoodForKids: True"]}

  • 2. Business tips: short, informational posts that Yelpers provide to each other and to potential customers:

“Cash-only! Their Lyonnaise potatoes are very well seasoned.”

  • 3. Business Reviews: lengthier than tips and usually contain a narrative of a customer’s experience:

“Like any other Zoe's, this location has great sandwiches, salads, kabobs, and more … I like the location -- Birkdale is super convenient -- but parking can sometimes be a challenge because of the popularity of this shopping center. Things are tight inside the restaurant, too, but they're very kid-friendly and don't mind if you have a stroller with you. Friendly staff, great food, great location.”

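In the business record above, each attribute arrives as a single "Key: value" string, so it has to be unpacked before it can serve as a class label. The following is one possible way to do that, assuming the list-of-strings layout shown in the example; `parse_attributes` is a hypothetical helper name and the record is abbreviated.

```python
import ast
import json

# Abbreviated record in the same shape as the business example above.
RAW_BUSINESS = '''{"business_id": "eYzm1jUK0GsI2KOLTt2PbQ",
 "name": "Bâton Rouge Steakhouse & Bar",
 "attributes": ["Alcohol: full_bar",
                "Ambience: {'romantic': False, 'casual': True}",
                "BikeParking: False",
                "BusinessAcceptsCreditCards: True",
                "GoodForKids: True"]}'''


def parse_attributes(attribute_strings):
    """Turn ["Key: value", ...] into a dict with Python-typed values."""
    parsed = {}
    for item in attribute_strings:
        key, _, raw_value = item.partition(": ")
        try:
            # Handles True/False, numbers, and nested dict literals.
            parsed[key] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Bare words such as full_bar stay as plain strings.
            parsed[key] = raw_value
    return parsed


business = json.loads(RAW_BUSINESS)
attributes = parse_attributes(business["attributes"])
```

After parsing, a lookup such as `attributes["GoodForKids"]` yields a boolean that can be used directly as a binary label.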
SLIDE 5

DATA PREPARATION AND MODELING

  • 1. Review JSON files are aggregated by business to make one large concatenated review
  • 2. Business profile information is joined to the review data to form a single JSON record
  • 3. An equal number of positive and negative labels are selected to create balanced datasets
  • 4. Review data is turned into word feature vectors and the business attribute acts as the class label
  • 5. Feature selection
  • 5a. Top-performing features are extracted for further analysis
  • 6. Labelled datasets are fed to a classifier
  • 7. Cross-validation is used to assess the results
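
The data-preparation steps above (aggregation, joining, balancing) and the cross-validation split can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the records are invented, balancing is done by random downsampling of the larger class, and the fold assignment is a simple round-robin that presumes the data has already been shuffled.

```python
import random
from collections import defaultdict

# Toy inputs, invented for illustration; only the field names follow
# the Yelp JSON format.
REVIEWS = [
    {"business_id": "b1", "text": "Great patio, dog friendly."},
    {"business_id": "b1", "text": "Brought my dog again."},
    {"business_id": "b2", "text": "No pets, but nice lounge."},
    {"business_id": "b3", "text": "Lobster was fine."},
    {"business_id": "b4", "text": "Quiet salon."},
]
LABELS = {"b1": True, "b2": False, "b3": False, "b4": False}


def aggregate_reviews(reviews):
    """Concatenate all review text per business."""
    by_business = defaultdict(list)
    for review in reviews:
        by_business[review["business_id"]].append(review["text"])
    return {bid: " ".join(texts) for bid, texts in by_business.items()}


def join_labels(documents, labels):
    """Attach the attribute label to each concatenated review."""
    return [
        {"business_id": bid, "text": text, "label": labels[bid]}
        for bid, text in documents.items()
        if bid in labels
    ]


def balance(records, seed=0):
    """Downsample the larger class so both classes have equal size."""
    positives = [r for r in records if r["label"]]
    negatives = [r for r in records if not r["label"]]
    size = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, size) + rng.sample(negatives, size)


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


records = join_labels(aggregate_reviews(REVIEWS), LABELS)
balanced = balance(records)
splits = list(k_fold_indices(6, 3))
```

Each `(train, test)` pair from `k_fold_indices` covers every index exactly once, so every record is used for testing in exactly one fold.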

SLIDE 6

RESULTS

  • Accuracy improved when using larger datasets
  • Accuracy differed by attribute, with classification of more obscure attributes performing more poorly
  • TF-IDF term-vector modeling performed as well as, or better than, classification using Word2Vec feature vectors (SVM performed the best)
  • Chi-square feature selection did not substantially improve performance
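
For reference, chi-square feature scoring for a single term against a binary label can be written out by hand. The sketch below uses term presence per document in a 2x2 contingency table, which is one common formulation and an assumption here; the slides do not say which variant was used. The `polarity` value is an addition for illustration, since the chi-square statistic itself carries no sign.

```python
def chi_square(term, documents, labels):
    """Chi-square statistic for term presence versus a binary label.

    Returns (score, polarity): polarity is +1 when the term is more common
    in positive documents, -1 when more common in negative ones, else 0.
    """
    a = b = c = d = 0
    for tokens, label in zip(documents, labels):
        present = term in tokens
        if present and label:
            a += 1  # term present, positive class
        elif present:
            b += 1  # term present, negative class
        elif label:
            c += 1  # term absent, positive class
        else:
            d += 1  # term absent, negative class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 0
    diff = a * d - b * c
    score = n * diff ** 2 / denominator
    polarity = (diff > 0) - (diff < 0)
    return score, polarity


# Toy tokenized reviews and labels, invented for illustration.
DOCS = [
    {"patio", "dog"},
    {"dog", "outside"},
    {"lounge", "salon"},
    {"lobster", "lounge"},
]
LABELS = [True, True, False, False]

ranked = sorted(
    {term for doc in DOCS for term in doc},
    key=lambda term: chi_square(term, DOCS, LABELS)[0],
    reverse=True,
)
```

In this toy data "dog" and "lounge" receive the same score but opposite polarity, which shows why a chi-square ranking alone does not say which class a term indicates.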

[Figure: TF-IDF Performance Against Full Datasets. Accuracy (axis 0.50 to 0.90) for two classifiers, nb and svc, across six attributes: Accepts Bitcoin, Accepts Credit Card, Dogs Allowed, Restaurant Delivers, Restaurant Does Take-out, Wheelchair Accessible. Data labels as extracted: 0.59, 0.72, 0.69, 0.76, 0.81, 0.66, 0.60, 0.83, 0.75, 0.80, 0.85, 0.72.]

[Figure: Word2Vec Classification Results. Accuracy (axis 0.5 to 0.9) for SVC, Logistic Regression, and Gradient Boosted Trees across eight attributes: BikeParking, Accepts Bitcoin, Accepts Credit, Dogs Allowed, Restaurant Delivers, Good For Kids, Takeout, Wheelchair Access.]

SLIDE 7

ADDITIONAL EXPERIMENTS AND OBSERVATIONS

[Figure: Performance with different business-review thresholds. Scores (axis 0.50 to 0.85) for thresholds min=3, min=5, and min=10 across six attributes: Accepts Bitcoin, Accepts Credit Card, Dogs Allowed, Restaurant Delivers, Restaurant Does Take-out, Wheelchair Accessible. Data labels as extracted: 0.59, 0.71, 0.69, 0.76, 0.81, 0.65, 0.59, 0.72, 0.69, 0.76, 0.81, 0.66, 0.74, 0.69, 0.76, 0.83, 0.67.]

[Figure: Accuracy Scores and Feature Size. Axes run from 0.55 to 0.9 and up to 25,000, across the same six attributes. Data labels as extracted: 300; 12,864; 5,398; 18,532; 9,010; 9,092.]

  • By imposing a minimum document-frequency threshold on uncommon terms, the feature size levels off as the sample size increases
  • Increasing the review-count-per-business threshold increased accuracy
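
The minimum document-frequency threshold mentioned above can be illustrated with a small vocabulary builder. This is a sketch with invented data; `build_vocabulary` and the `min_df` parameter name are assumptions for the example, and the actual threshold used in the experiments is not stated in the slides.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_df=1):
    """Keep terms that occur in at least min_df documents."""
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy tokenized reviews, invented for illustration; the misspelled tokens
# appear once each, as rare terms typically do.
DOCS = [
    ["great", "patio", "dog"],
    ["great", "delivery", "pizzza"],
    ["patio", "dog", "frendly"],
    ["great", "delivery", "lobster"],
]

full_vocab = build_vocabulary(DOCS, min_df=1)
pruned_vocab = build_vocabulary(DOCS, min_df=2)
```

Without a threshold, every new misspelling or rare word adds a feature, so the vocabulary keeps growing with the sample; with `min_df=2` only terms shared across documents survive, which is why the feature count flattens out.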

SLIDE 8

ANALYSIS OF TOP-PERFORMING FEATURES

Attribute: Positively-correlated SVC Features / Negatively-correlated SVC Features

  • Accepts Credit Card: card, pricey, online, ordered, hotel / cash, atm, debit, plastic, cards, cart
  • Dogs Allowed: patio, outside, pet, dog / marriot, lobster, lounge, salon
  • Restaurant Delivers: delivery, delivers, phone, deliveries, ordered, (“pizza” w/ chi2 and tf-idf) / smoking, casino, register, cost, seated
  • Good for Kids: kids, family, families, friendly, daughter, slushies / bar, reservation, crowd, hip, downtown, soju, drunk, trendy, casino, cocktail, dj
  • Wheelchair Accessible: mall, elevator, hotel, plaza / stairs, upstairs

SLIDE 9

What Top-Performing Features Tell Us About Location and Culture

Attribute: Positively-correlated SVC Features / Negatively-correlated SVC Features

  • Accepts Credit Card: Pittsburgh, PA
  • Dogs Allowed: Scottsdale, AZ and Stuttgart, Germany / Las Vegas, NV
  • Wheelchair Accessible: Scottsdale, AZ / Toronto, ON and Montreal, QC

Attribute: Top TF-IDF Terms / Bottom TF-IDF Terms

  • Restaurant Delivers: pizza, chinese, rice, sushi, chicken, lunch / hefeweizen (a kind of beer), abendessen (dinner), essens (food)

Attribute: Top Chi-Squared Features

  • Dogs Allowed: sushi, rice, thai, korean, noodles, Japanese, pho, ramen

SLIDE 10

FUTURE WORK

  • Improve performance
  • Incorporate n-grams into our modeling
  • Calibrate the inverse document frequency (IDF) to give less weight to ubiquitous terms
  • Use a higher confidence threshold or review-count-per-business threshold to classify fewer businesses with greater accuracy
  • Top-performing feature analysis
  • Assess the polarity of chi-squared correlated features
  • Use vector arithmetic for business attribute research:
  • Example: I like all qualities of Business-X except for one, e.g. “Smoking Prohibited”:
  • Vector arithmetic: [Business-X] + [Smoking Permitted] = [Business-Y where smoking is permitted]
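
The vector-arithmetic idea above can be made concrete with a toy nearest-neighbour lookup. Everything in this sketch is hypothetical: the 3-dimensional vectors are invented so that the arithmetic is easy to follow, and in practice both the business vectors and the attribute vector would have to come from a trained embedding model.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration.
BUSINESS_VECTORS = {
    "Business-X": [0.9, 0.8, 0.0],
    "Business-Y": [0.9, 0.8, 1.0],
    "Business-Z": [0.1, 0.2, 1.0],
}
SMOKING_PERMITTED = [0.0, 0.0, 1.0]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates, exclude=()):
    """Name of the candidate whose vector is closest to the query."""
    return max(
        (name for name in candidates if name not in exclude),
        key=lambda name: cosine(query, candidates[name]),
    )


# [Business-X] + [Smoking Permitted], then look for the closest business.
query = add(BUSINESS_VECTORS["Business-X"], SMOKING_PERMITTED)
match = most_similar(query, BUSINESS_VECTORS, exclude={"Business-X"})
```

Here the query lands on Business-Y, which shares Business-X's first two components and differs only in the component standing in for smoking.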
SLIDE 11

QUESTIONS?