 
AUTOMATIC BUSINESS ATTRIBUTE LABELING FROM YELP REVIEWS: A MACHINE LEARNING APPLICATION
• by Michael Iannelli and Ephraim Rosenfeld
• Search Engine Architecture (CSCI-GA 3033-006 - Spring 2017)
• Courant Institute of Mathematical Sciences – New York University
WHAT IS YELP?
• “Yelp combines traditional business listings in a directory like Yellow Pages with social elements. Customers can leave feedback on their experiences with that business which … informs future customers of what they might expect and … keeps standards high, or forces an improvement of those standards to prevent negative feedback.” (1)
What is the Yelp Dataset Challenge?
• Since 2009, Yelp has published samples of its business and reviewer data so that students can compete in research projects based off of these datasets
• Data consists of business profiles, reviewer account information, and business reviews for: 8 cities in the U.S., 2 in Canada, 1 in the U.K., and 1 in Germany
What is the purpose of our analytic?
• To predict business profile attributes using review data
Why is this valuable?
• Manual curation of business account information can be time-consuming, error-prone, and inaccurate
• Leveraging crowd-generated data can provide objective insight into a business or venue
(1) Source: https://www.techjunkie.com/what-is-yelp/
HOW DO WE PERFORM THE PREDICTIONS?
1. REVIEW DATA AS TERM FEATURE VECTORS
• Reviews are represented as term feature vectors
• TF-IDF scores can be used to give a weight to terms in the vector
• Binary classifiers suitable for sparse, noisy, high-dimensional vectors (Naïve Bayes and Linear SVC)
• Advantage: “open box” where top-performing features can be analyzed
2. WORD2VEC
• Reviews are represented as summaries of distributed word or document representations
• Binary classifiers suitable for dense, lower-dimensional vectors, e.g. SVC, logistic regression, and gradient-boosted machine (GBM)
• Advantage: greater accuracy by utilizing context of words
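The term-vector approach can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, and the four training texts, their labels (1 = accepts credit cards), and the variable names are hypothetical toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: one concatenated review text per business,
# labeled 1 if the business accepts credit cards, else 0.
train_texts = [
    "cash only bring cash",
    "cash only atm nearby",
    "credit card accepted here",
    "paid by credit card",
]
train_labels = [0, 0, 1, 1]

# Each pipeline turns text into sparse TF-IDF vectors, then fits a
# binary classifier suited to sparse, high-dimensional input.
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
svc_model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
nb_model.fit(train_texts, train_labels)
svc_model.fit(train_texts, train_labels)

# Predict the attribute for unseen review text.
new_texts = ["cash only", "credit card"]
nb_predictions = nb_model.predict(new_texts).tolist()
svc_predictions = svc_model.predict(new_texts).tolist()
print(nb_predictions, svc_predictions)
```

Both models stay "open box": the fitted vectorizer exposes the vocabulary, so per-term weights can be inspected afterwards.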
LOOKING AT THE DATA
The Yelp Challenge consists of the following JSON files:
1. Businesses: name, id, and profile attributes
{"business_id":"eYzm1jUK0GsI2KOLTt2PbQ", "name":"Bâton Rouge Steakhouse & Bar", "neighborhood":"Downtown Core", "address":"218 Yonge Street", "city":"Toronto", "state":"ON", "stars":3.0, "review_count":107, "is_open":1, "attributes":["Alcohol: full_bar", "Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", "BikeParking: False", "BusinessAcceptsCreditCards: True", "Caters: False", "GoodForKids: True"]}
2. Business tips: short, informational posts that Yelpers provide to each other and to potential customers:
“Cash-only! Their Lyonnaise potatoes are very well seasoned.”
3. Business reviews: lengthier than tips and usually contain a narrative of a customer’s experience:
“Like any other Zoe's, this location has great sandwiches, salads, kabobs, and more … I like the location -- Birkdale is super convenient -- but parking can sometimes be a challenge because of the popularity of this shopping center. Things are tight inside the restaurant, too, but they're very kid-friendly and don't mind if you have a stroller with you. Friendly staff, great food, great location.”
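The "attributes" field in the business record above is a list of "Name: value" strings, so it needs parsing before an attribute can serve as a class label. A minimal sketch, assuming the values are either Python-style literals (True, False, a dict) or bare words such as full_bar; the helper name parse_attributes is hypothetical:

```python
import ast

def parse_attributes(attribute_strings):
    """Turn ["Name: value", ...] strings into a {name: value} dict."""
    parsed = {}
    for entry in attribute_strings:
        # Split on the first colon only; nested dict values contain more colons.
        name, _, raw_value = entry.partition(":")
        raw_value = raw_value.strip()
        try:
            # Handles True/False and dict literals such as the Ambience value.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Bare words such as full_bar are kept as plain strings.
            value = raw_value
        parsed[name.strip()] = value
    return parsed

# Values copied from the sample business record (Ambience shortened).
attrs = parse_attributes([
    "Alcohol: full_bar",
    "Ambience: {'romantic': False, 'casual': True}",
    "BikeParking: False",
    "BusinessAcceptsCreditCards: True",
])
print(attrs["BusinessAcceptsCreditCards"])
```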
DATA PREPARATION AND MODELING
1. Review JSON files are aggregated by business to make one large concatenated review
2. Business profile information is joined to the review data to form a single JSON record
3. An equal number of positive and negative labels are selected to create balanced datasets
4. Review data is turned into word feature vectors and the business attribute acts as the class label
5. Feature selection
5a. Top-performing features are extracted for further analysis
6. Labelled datasets are fed to a classifier
7. Cross validation is used to assess the results
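Steps 1 to 3 can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the record fields "business_id", "text", and "attributes" follow the sample data, attributes are assumed to be already parsed into a dict of booleans, and the function names and toy records are hypothetical.

```python
import random
from collections import defaultdict

def aggregate_reviews(reviews):
    """Step 1: concatenate all review texts of each business."""
    texts = defaultdict(list)
    for review in reviews:
        texts[review["business_id"]].append(review["text"])
    return {business_id: " ".join(parts) for business_id, parts in texts.items()}

def build_balanced_dataset(reviews, businesses, attribute, seed=0):
    """Steps 2-3: join labels to review text, then balance the classes."""
    documents = aggregate_reviews(reviews)
    positives, negatives = [], []
    for business in businesses:
        label = business["attributes"].get(attribute)
        text = documents.get(business["business_id"])
        if label is None or text is None:
            continue  # skip businesses without a label or without reviews
        (positives if label else negatives).append((text, int(bool(label))))
    # Keep an equal number of positive and negative examples.
    size = min(len(positives), len(negatives))
    rng = random.Random(seed)
    dataset = rng.sample(positives, size) + rng.sample(negatives, size)
    rng.shuffle(dataset)
    return dataset

# Hypothetical toy records.
reviews = [
    {"business_id": "a", "text": "great patio"},
    {"business_id": "a", "text": "dogs welcome"},
    {"business_id": "b", "text": "pet friendly"},
    {"business_id": "c", "text": "bring your dog"},
    {"business_id": "d", "text": "quiet hotel lounge"},
]
businesses = [
    {"business_id": "a", "attributes": {"DogsAllowed": True}},
    {"business_id": "b", "attributes": {"DogsAllowed": True}},
    {"business_id": "c", "attributes": {"DogsAllowed": True}},
    {"business_id": "d", "attributes": {"DogsAllowed": False}},
]
dataset = build_balanced_dataset(reviews, businesses, "DogsAllowed")
print(len(dataset))
```

Downsampling the larger class is one simple way to balance; the slides do not say which balancing method was used.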
RESULTS
• Accuracy improved when using larger datasets
• Accuracy differed based upon attribute, with classification of more obscure attributes performing more poorly
• TF-IDF term-vector modeling performed as well as, or better than, classification using Word2Vec feature vectors (SVM performed the best)
• Chi-square feature selection did not substantially improve performance
[Figure: TF-IDF Performance Against Full Datasets. Accuracy of nb and svc for Accepts Bitcoin, Accepts Credit Card, Dogs Allowed, Restaurant Delivers, Restaurant Does Take-out, and Wheelchair Accessible]
[Figure: Word2Vec Classification Results. Accuracy of SVC, Logistic Regression, and Gradient Boosted Trees for BikeParking, Accepts Bitcoin, Accepts Credit, Dogs Allowed, Restaurant Delivers, Good For Kids, Takeout, and Wheelchair Access]
ADDITIONAL EXPERIMENTS AND OBSERVATIONS
• By imposing a minimum-document-frequency threshold on uncommon terms, the feature size “levels off” as the sample size increases
[Figure: Feature Size plotted against Accuracy Scores, one series per attribute: Accepts Bitcoin, Accepts Credit Card, Dogs Allowed, Restaurant Delivers, Restaurant Does Take-out, Wheelchair Accessible]
• Increasing the review-count-per-business threshold increased accuracy
[Figure: Performance with different business-review thresholds. Accuracy at min=3, min=5, and min=10 for Accepts Bitcoin, Accepts Credit Card, Dogs Allowed, Restaurant Delivers, Restaurant Does Take-out, and Wheelchair Accessible]
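The effect of a minimum-document-frequency threshold on feature size can be shown with a few lines of plain Python. This is a sketch on hypothetical toy documents with whitespace tokenization; the function name vocabulary_size is an assumption, not part of the original work.

```python
from collections import Counter

def vocabulary_size(documents, min_df):
    """Count terms that occur in at least min_df documents."""
    document_frequency = Counter()
    for document in documents:
        # set() so a term counts once per document, however often it repeats.
        document_frequency.update(set(document.lower().split()))
    return sum(1 for count in document_frequency.values() if count >= min_df)

# Hypothetical toy documents (one concatenated review per business).
documents = ["great patio dog", "great patio", "great food"]
sizes = {min_df: vocabulary_size(documents, min_df) for min_df in (1, 2, 3)}
print(sizes)
```

Rare terms (typos, names, one-off words) keep arriving as more reviews are added, so without a threshold the vocabulary keeps growing; the threshold drops them and leaves the recurring terms.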
ANALYSIS OF TOP-PERFORMING FEATURES
| Attribute | Positively-correlated SVC features | Negatively-correlated SVC features |
| --- | --- | --- |
| Accepts Credit Card | card, pricey, online, ordered, hotel | cash, atm, debit, plastic, cards, cart |
| Dogs Allowed | patio, outside, pet, dog | marriot, lobster, lounge, salon |
| Restaurant Delivers | delivery, delivers, phone, deliveries, ordered (“pizza” w/ chi2 and tf-idf) | smoking, casino, register, cost, seated |
| Good for Kids | kids, family, families, friendly, daughter, slushies | bar, reservation, crowd, hip, downtown, soju, drunk, trendy, casino, cocktail, dj |
| Wheelchair Accessible | mall, elevator, hotel, plaza | stairs, upstairs |
What Top-Performing Features Tell Us About Location and Culture
Location terms among the positively- and negatively-correlated SVC features:
• Accepts Credit Card: Pittsburgh, PA
• Dogs Allowed: positively correlated: Scottsdale, AZ and Stuttgart, Germany; negatively correlated: Las Vegas, NV
• Wheelchair Accessible: positively correlated: Scottsdale, AZ; negatively correlated: Toronto, ON and Montreal, QC
Top and bottom TF-IDF terms:
• Restaurant Delivers: top terms: pizza, chinese, rice, sushi, chicken, lunch; bottom terms: hefeweizen (a kind of beer), abendessen (dinner), essens (food)
Top chi-squared features:
• Dogs Allowed: sushi, rice, thai, korean, noodles, Japanese, pho, ramen
FUTURE WORK
• Improve performance
  • Incorporate n-grams into our modeling
  • Calibrate the inverse document frequency (IDF) to give less weight to ubiquitous terms
  • Use a higher confidence threshold or review-count-per-business threshold to classify fewer businesses with greater accuracy
• Top-performing feature analysis
  • Assess the polarity of chi-squared correlated features
• Use vector arithmetic for business attribute research:
  • Example: I like all qualities of Business-X except for one, e.g. “Smoking Prohibited”
  • Vector arithmetic: [Business-X] + [Smoking Permitted] = [Business-Y where smoking is permitted]
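The vector-arithmetic idea can be illustrated with a small sketch. Everything here is hypothetical: the three-dimensional vectors are hand-made stand-ins for learned business and attribute embeddings, and nearest-neighbor lookup by cosine similarity is one plausible way to resolve the sum to a business.

```python
import math

def add_vectors(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_business(query, candidates):
    """Return the candidate whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Hypothetical embeddings; real ones would come from a trained model.
business_x = [1.0, 0.0, 0.2]
smoking_permitted = [0.0, 1.0, 0.0]
candidates = {
    "Business-Y": [0.9, 0.9, 0.2],
    "Business-Z": [1.0, 0.0, 0.3],
    "Business-W": [0.0, 1.0, 0.9],
}

# [Business-X] + [Smoking Permitted] -> closest known business
query = add_vectors(business_x, smoking_permitted)
match = nearest_business(query, candidates)
print(match)
```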
QUESTIONS?