Data Structures: Vocab, Lexemes and StringStore


  1. Data Structures: Vocab, Lexemes and StringStore
     ADVANCED NLP WITH SPACY
     Ines Montani, spaCy core developer

  2. Shared vocab and string store (1)

     Vocab: stores data shared across multiple documents. To save memory,
     spaCy encodes all strings to hash values. Strings are only stored once
     in the StringStore via nlp.vocab.strings.

     The string store is a lookup table in both directions:

     coffee_hash = nlp.vocab.strings['coffee']
     coffee_string = nlp.vocab.strings[coffee_hash]

     Hashes can't be reversed - that's why we need to provide the shared vocab:

     # Raises an error if we haven't seen the string before
     string = nlp.vocab.strings[3197928453018144401]
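
The two-way lookup described above can be sketched in plain Python. This is a toy illustration of the idea, not spaCy's actual implementation (which uses 64-bit MurmurHash values under the hood):

```python
class ToyStringStore:
    """Toy two-way lookup table: a string always hashes to the same
    value, but a hash can only be reversed if the string was added."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        h = self[string]
        self._by_hash[h] = string
        return h

    def __getitem__(self, key):
        if isinstance(key, str):
            return hash(key)        # string -> hash always works
        return self._by_hash[key]   # hash -> string: KeyError if unseen

store = ToyStringStore()
coffee_hash = store.add('coffee')
coffee_string = store[coffee_hash]
print(coffee_hash, coffee_string)
```

Just as in spaCy, reversing a hash that was never added raises a KeyError, because hashing is one-way.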

  3. Shared vocab and string store (2)

     Look up the string and hash in nlp.vocab.strings:

     doc = nlp("I love coffee")
     print('hash value:', nlp.vocab.strings['coffee'])
     print('string value:', nlp.vocab.strings[3197928453018144401])

     hash value: 3197928453018144401
     string value: coffee

     The doc also exposes the vocab and strings:

     doc = nlp("I love coffee")
     print('hash value:', doc.vocab.strings['coffee'])

     hash value: 3197928453018144401

  4. Lexemes: entries in the vocabulary

     A Lexeme object is an entry in the vocabulary:

     doc = nlp("I love coffee")
     lexeme = nlp.vocab['coffee']

     # Print the lexical attributes
     print(lexeme.text, lexeme.orth, lexeme.is_alpha)

     coffee 3197928453018144401 True

     A lexeme contains the context-independent information about a word:
     - Word text: lexeme.text and lexeme.orth (the hash)
     - Lexical attributes like lexeme.is_alpha
     - No context-dependent part-of-speech tags, dependencies or entity labels
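
Because lexical attributes are context-independent, they can be inspected with a blank pipeline and no trained model. A small sketch assuming spaCy is installed, using the attributes from the slide plus a couple of other built-in lexical flags:

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline - no trained model needed
lexeme = nlp.vocab['coffee']

# Context-independent lexical attributes
print(lexeme.text)      # the word text
print(lexeme.is_alpha)  # consists of alphabetic characters?
print(lexeme.is_digit)  # consists of digits?
print(lexeme.lower_)    # lowercase form
```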

  5. Vocab, hashes and lexemes

  6. Let's practice!

  7. Data Structures: Doc, Span and Token

  8. The Doc object

     # Create an nlp object
     from spacy.lang.en import English
     nlp = English()

     # Import the Doc class
     from spacy.tokens import Doc

     # The words and spaces to create the doc from
     words = ['Hello', 'world', '!']
     spaces = [True, False, False]

     # Create a doc manually
     doc = Doc(nlp.vocab, words=words, spaces=spaces)
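
The spaces flags control how the tokens are joined back into text. A quick way to check the doc built above (a sketch assuming spaCy is installed):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
words = ['Hello', 'world', '!']
spaces = [True, False, False]  # is each token followed by whitespace?
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The spaces flags reconstruct the original text exactly
print(doc.text)  # Hello world!
print(len(doc))  # 3
```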

  9. The Span object (1)

  10. The Span object (2)

      # Import the Doc and Span classes
      from spacy.tokens import Doc, Span

      # The words and spaces to create the doc from
      words = ['Hello', 'world', '!']
      spaces = [True, False, False]

      # Create a doc manually
      doc = Doc(nlp.vocab, words=words, spaces=spaces)

      # Create a span manually
      span = Span(doc, 0, 2)

      # Create a span with a label
      span_with_label = Span(doc, 0, 2, label="GREETING")

      # Add the span to the doc.ents
      doc.ents = [span_with_label]
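
Spans added to doc.ents behave just like entities predicted by a model. A quick way to verify the example above (sketch, assuming spaCy is installed):

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=['Hello', 'world', '!'], spaces=[True, False, False])

# Create a labelled span and register it as an entity
span_with_label = Span(doc, 0, 2, label="GREETING")
doc.ents = [span_with_label]

# Manually added entities look just like predicted ones
for ent in doc.ents:
    print(ent.text, ent.label_)  # Hello world GREETING
```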

  11. Best practices
      - Doc and Span are very powerful and hold references and relationships of words and sentences
      - Convert results to strings as late as possible
      - Use token attributes if available, for example token.i for the token index
      - Don't forget to pass in the shared vocab
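
The advice about token attributes can be illustrated with a blank pipeline: keep working with Token objects and use token.i for positions rather than tracking indices in a separate list of strings. The sentence here is made up for illustration:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Berlin looks like a nice city")

# Stay with Token objects: token.i gives the index, and indexing the
# doc gives real Token objects - convert to strings only at the end
result = None
for token in doc:
    if token.text == 'nice' and token.i + 1 < len(doc):
        result = (token.i, doc[token.i + 1].text)

print(result)  # the index of 'nice' and the text of the following token
```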

  12. Let's practice!

  13. Word vectors and semantic similarity

  14. Comparing semantic similarity
      - spaCy can compare two objects and predict similarity
      - Doc.similarity(), Span.similarity() and Token.similarity()
      - Take another object and return a similarity score (0 to 1)
      - Important: needs a model that has word vectors included, for example:
        YES: en_core_web_md (medium model)
        YES: en_core_web_lg (large model)
        NO: en_core_web_sm (small model)

  15. Similarity examples (1)

      # Load a larger model with vectors
      nlp = spacy.load('en_core_web_md')

      # Compare two documents
      doc1 = nlp("I like fast food")
      doc2 = nlp("I like pizza")
      print(doc1.similarity(doc2))

      0.8627204117787385

      # Compare two tokens
      doc = nlp("I like pizza and pasta")
      token1 = doc[2]
      token2 = doc[4]
      print(token1.similarity(token2))

      0.7369546

  16. Similarity examples (2)

      # Compare a document with a token
      doc = nlp("I like pizza")
      token = nlp("soap")[0]
      print(doc.similarity(token))

      0.32531983166759537

      # Compare a span with a document
      span = nlp("I like pizza and pasta")[2:5]
      doc = nlp("McDonalds sells burgers")
      print(span.similarity(doc))

      0.619909235817623

  17. How does spaCy predict similarity?
      - Similarity is determined using word vectors
      - Multi-dimensional meaning representations of words
      - Generated using an algorithm like Word2Vec and lots of text
      - Can be added to spaCy's statistical models
      - Default: cosine similarity, but can be adjusted
      - Doc and Span vectors default to the average of their token vectors
      - Short phrases are better than long documents with many irrelevant words
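
The last three bullets can be sketched with plain NumPy: average the token vectors, then take the cosine between the averages. The four-dimensional "word vectors" below are made up for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, not real embeddings
vectors = {
    'i':     np.array([0.1, 0.3, 0.0, 0.5]),
    'like':  np.array([0.4, 0.1, 0.2, 0.3]),
    'pizza': np.array([0.9, 0.0, 0.7, 0.1]),
    'pasta': np.array([0.8, 0.1, 0.6, 0.2]),
}

def doc_vector(words):
    # Doc and Span vectors default to the average of the token vectors
    return np.mean([vectors[w] for w in words], axis=0)

doc1 = doc_vector(['i', 'like', 'pizza'])
doc2 = doc_vector(['i', 'like', 'pasta'])
print(cosine(doc1, doc2))  # close to 1.0 - the toy docs share most words
```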

  18. Word vectors in spaCy

      # Load a larger model with vectors
      nlp = spacy.load('en_core_web_md')

      doc = nlp("I have a banana")
      # Access the vector via the token.vector attribute
      print(doc[3].vector)

      [2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02,
       -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02,
       -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01,
       -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01,
       1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01,
       3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01,
       1.25210002e-01, -6.75960004e-01, 3.58420014e-01, -4.00279984e-02,
       9.59490016e-02, -5.06900012e-01, -8.53179991e-02, 1.79800004e-01,
       3.38669986e-01, ...

  19. Similarity depends on the application context
      - Useful for many applications: recommendation systems, flagging duplicates etc.
      - There's no objective definition of "similarity"
      - It depends on the context and what the application needs to do

      doc1 = nlp("I like cats")
      doc2 = nlp("I hate cats")
      print(doc1.similarity(doc2))

      0.9501447503553421

  20. Let's practice!

  21. Combining models and rules

  22. Statistical predictions vs. rules

      Statistical models:
      - Use cases: application needs to generalize based on examples
      - Real-world examples: product names, person names, subject/object relationships
      - spaCy features: entity recognizer, dependency parser, part-of-speech tagger

  23. Statistical predictions vs. rules

      Statistical models:
      - Use cases: application needs to generalize based on examples
      - Real-world examples: product names, person names, subject/object relationships
      - spaCy features: entity recognizer, dependency parser, part-of-speech tagger

      Rule-based systems:
      - Use cases: dictionary with a finite number of examples
      - Real-world examples: countries of the world, cities, drug names, dog breeds
      - spaCy features: tokenizer, Matcher, PhraseMatcher

  24. Recap: Rule-based Matching

      # Initialize with the shared vocab
      from spacy.matcher import Matcher
      matcher = Matcher(nlp.vocab)

      # Patterns are lists of dictionaries describing the tokens
      pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
      matcher.add('LOVE_CATS', None, pattern)

      # Operators can specify how often a token should be matched
      pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

      # Calling the matcher on a doc returns a list of (match_id, start, end) tuples
      doc = nlp("I love cats and I'm very very happy")
      matches = matcher(doc)

  25. Adding statistical predictions

      matcher = Matcher(nlp.vocab)
      matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
      doc = nlp("I have a Golden Retriever")

      for match_id, start, end in matcher(doc):
          span = doc[start:end]
          print('Matched span:', span.text)
          # Get the span's root token and root head token
          print('Root token:', span.root.text)
          print('Root head token:', span.root.head.text)
          # Get the previous token and its POS tag
          print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

      Matched span: Golden Retriever
      Root token: Retriever
      Root head token: have
      Previous token: a DET

  26. Efficient phrase matching (1)
      - The PhraseMatcher is like regular expressions or keyword search, but with access to the tokens!
      - Takes Doc objects as patterns
      - More efficient and faster than the Matcher
      - Great for matching large word lists

  27. Efficient phrase matching (2)

      from spacy.matcher import PhraseMatcher

      matcher = PhraseMatcher(nlp.vocab)

      pattern = nlp("Golden Retriever")
      matcher.add('DOG', None, pattern)
      doc = nlp("I have a Golden Retriever")

      # Iterate over the matches
      for match_id, start, end in matcher(doc):
          # Get the matched span
          span = doc[start:end]
          print('Matched span:', span.text)

      Matched span: Golden Retriever
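
For large word lists, patterns are usually created in a loop. A sketch assuming spaCy is installed; note that it uses nlp.make_doc and the list-of-patterns add() signature available in spaCy v2.3+ and v3, rather than the older signature shown on the slide:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)

# nlp.make_doc only runs the tokenizer, which is faster than the
# full pipeline when converting a large term list into patterns
terms = ['Golden Retriever', 'Labrador']
patterns = [nlp.make_doc(term) for term in terms]
matcher.add('DOG', patterns)

doc = nlp("I have a Golden Retriever and a Labrador")
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```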

  28. Let's practice!
