
Migrating Terrible Content to Drupal 8



  1. Migrating Terrible Content to Drupal 8 https://events.drupal.org/seattle2019/sessions/migrating-terrible-static-content-drupal-8 Kristian Ducharme

  2. About Me
     ❏ Kristian Ducharme - Technical Lead / Product Manager / DevOps Engineer @ CivicActions
     ❏ Drupal Projects - NSF.gov (current), DEA.gov, DKAN, USDA.gov, Georgia.gov, DigitalDemocracy.org, Whitehouse.gov, City of Los Angeles
     ❏ Past Presentations - DrupalCon Los Angeles 2015, BADCamp 2016
     ❏ What else do I do? Musician, Dad, Electronics DIYer

  3. The Problem
     Almost all websites have “terrible” yet necessary content to migrate.
     ■ A lot has changed since the ’90s.
     ■ In most cases, static HTML has only a very loose “structure”.
     ■ Most government sites are required to preserve content.
     ■ Mobile? Responsive? Accessibility? What’s an iPhone?
     ■ Dynamic content was more difficult to make.

  4. Difficulties With Static Content Migration
     ■ Source content: Variance in formats, HTML markup, and authoring tools
     ■ Varying migration needs: As simple as basic text, as complicated as media with paragraphs plus file attachments
     ■ Content buried inside of content: Tables, deeper links, surrounded by other extraneous information
     ■ Changing static content before go-live: Needs the ability to re-run migrations

  5. Available Drupal Migration Tools
     ■ Core Migrate API: https://www.drupal.org/docs/8/api/migrate-api
     ■ Migrate Plus: https://www.drupal.org/project/migrate_plus (Mike Ryan)
     ■ Migrate Tools: https://www.drupal.org/project/migrate_tools (Mike Ryan)
     ■ Migrate File: https://www.drupal.org/project/migrate_file (Chris Eastwood)
     ■ Migration Tools: https://www.drupal.org/project/migration_tools (CivicActions)
     ■ QueryPath: http://querypath.org

  6. Preparing for Migration (Less “Terrible” Content)
     ■ “Content Cleanup During Migration”, Florida DrupalCamp 2019 - Steve Wirt
       https://www.fldrupal.camp/sessions/development-performance/content-cleanup-during-migration
     ■ Browser/Spidering Tools - Chrome Add-ons: Pesticide, HTML DOM Navigator, Site Spider. Screaming Frog
     ■ Auditing Content - Spreadsheets for auditing, CSV exporting

  7. Core Migration + Migration Tools

  8. Migration Workflow

  9. Configuring Migration Tools
     ■ Migration Tools integrates via PrepareRow, part of the “source” configuration.
     ■ Each “row” can be a URL or HTML data.
     ■ Added to the migration YAML as a “migration_tools” key under the “source” key.
     ■ Migration YAML
       ○ Source - Whether the input field is a URL to fetch or HTML content.
       ○ Source Operations - Performed on the HTML, in the order specified, before QueryPath is initialized.
       ○ Fields - Defines jobs for extracting content using Obtainers (may be renamed in a future release).
       ○ DOM Operations - Performed on the QueryPath object in the order specified.

  10. Source Operations
     SourceModifierHTML Class
     ■ replaceString
     ■ basicCleanup
     ■ runStringTools
       ○ fixEncoding
       ○ convertFatalCharstoASCII
       ○ convertNonASCIItoASCII
       ○ stripFunkyChars
       ○ superTrim
       ○ stripWindowsCRChars
       ○ stripCmsLegacyMarkup
       ○ makeWordsFirstCapital
       ○ reduceDuplicateBr
       ○ removePhp
       ○ decodeHtmlEntityNumeric
       ○ cleanTitle
       ○ fixHtmlTag
       ○ fixHeadTag
       ○ fixBodyTag
       ○ fixWindowSpecificChars

  11. Fields Definition
     ■ Name - Used by DOM Operations to run this job set
     ■ Obtainer - Class to use for obtaining content
     ■ Jobs - List of jobs to run in order, proceeds until found
       ○ Job: “addSearch” is currently the only job type
       ○ Method: Obtainer method to run
       ○ Arguments: Passed to the method

     fields:
       body:
         # Finds the body by plucking the .field-name-body field.
         obtainer: ObtainBody
         jobs:
           -
             job: 'addSearch'
             method: 'pluckSelector'
             arguments:
               - '#main-content'
               - '1'
               - innerHTML

     /**
      * Plucker for nth selector on the page.
      *
      * @param string $selector
      *   The selector to find.
      * @param int $n
      *   (optional) The depth to find. Default: first item n=1.
      * @param string $method
      *   (optional) The method to use on the element, text or html. Default: text.
      *
      * @return string
      *   The text found.
      */
     protected function pluckSelector($selector, $n = 1, $method = 'text') {
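The "proceeds until found" behaviour of a field's job list can be sketched outside Drupal. The Python below is a minimal illustration and not the Migration Tools implementation: `run_jobs`, `pluck_by_marker`, and the sample HTML are hypothetical names invented here, and the finder is a deliberately naive stand-in for a real selector-based Obtainer method.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# A job pairs a finder function with its arguments, loosely mirroring the
# "method" and "arguments" keys of a job in the migration YAML.
Job = Tuple[Callable[..., Optional[str]], Sequence[str]]


def pluck_by_marker(html: str, element_id: str) -> Optional[str]:
    """Hypothetical finder: return the inner text of the first
    <div id="..."> with the given id, or None if it is absent or empty."""
    pattern = r'<div id="%s">(.*?)</div>' % re.escape(element_id)
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


def run_jobs(html: str, jobs: Sequence[Job]) -> Optional[str]:
    """Run each job in order and stop at the first one that finds content."""
    for finder, arguments in jobs:
        found = finder(html, *arguments)
        if found:
            return found
    return None


# Invented sample page: it has no "main-content" div, only a "content" div.
page = '<div id="sidebar">Menu</div><div id="content">Press release text</div>'
body_jobs = [
    (pluck_by_marker, ["main-content"]),  # Finds nothing, so the next job runs.
    (pluck_by_marker, ["content"]),       # First job that finds something.
]
body = run_jobs(page, body_jobs)
print(body)
```

Ordering jobs from most specific to most general lets one field definition cope with pages that were authored with different markup.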

  12. Obtainer Workflow

  13. Obtainers
     ■ ObtainHtml
     ■ ObtainArray
     ■ ObtainBody
     ■ ObtainCity
     ■ ObtainContentType
     ■ ObtainCountry
     ■ ObtainDate
     ■ ObtainDateSpanish
     ■ ObtainID
     ■ ObtainImage
     ■ ObtainImageFile
     ■ ObtainLink
     ■ ObtainLinkFile
     ■ ObtainLocation
     ■ ObtainState
     ■ ObtainSubTitle
     ■ ObtainTable
     ■ ObtainTitle

  14. DOM Operations
     ■ Operation:
       ○ Get Field - Runs jobs defined in the “fields” section
       ○ Modifier - Apply a DOM Modifier with arguments

     # DOM Operations performs the field jobs and applied modifiers in order.
     dom_operations:
       -
         operation: get_field  # 'get_field' or 'modifier'
         field: title  # Field from above to get (run jobs)
       -
         operation: modifier
         modifier: removeSelectorAll
         arguments:
           - '#topbar'
       -
         operation: modifier
         modifier: removeEmptyTables
       -
         operation: modifier
         modifier: removeSelectorAll
         arguments:
           - 'strong'
       -
         # Get the body field after above modifiers have run.
         operation: get_field
         field: body

  15. Data Parser Plugin: DOM Parser
     ■ Included with Migration Tools
     ■ What is it? A Migrate Plus module “data parser” plugin (JSON/XML/SOAP)
     ■ What does it do? Allows you to extract URLs from a webpage (“chunking”) and process each URL as a “row”
     ■ How do I use it? Combined with Migration Tools, can extract URLs from the DOM
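The "chunking" idea, one listing page turned into many rows, can be illustrated without Drupal. This is a rough Python sketch using only the standard library, not the DOM parser plugin itself; the `LinkCollector` class, the sample markup, and the second link are all invented for illustration.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags inside <div class="container_class">."""

    def __init__(self, container_class: str) -> None:
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # Open <div> levels inside the container; 0 means outside.
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            classes = (attributes.get("class") or "").split()
            if self.depth > 0:
                self.depth += 1  # Nested div inside the container.
            elif self.container_class in classes:
                self.depth = 1   # Entering the container.
        elif tag == "a" and self.depth > 0 and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Invented listing page: only links inside ".PLNews-Article" should be kept.
listing_html = """
<div class="nav"><a href="/index.shtml">Home</a></div>
<div class="PLNews-Article">
  <div><a href="2011/bos111611.shtml">Release one</a></div>
  <a href="2011/bos111711.shtml">Release two</a>
</div>
<div class="footer"><a href="/contact.shtml">Contact</a></div>
"""

collector = LinkCollector("PLNews-Article")
collector.feed(listing_html)

# Each extracted URL becomes one source "row" keyed by its url.
rows = [{"url": href} for href in collector.links]
print(rows)
```

Navigation and footer links fall outside the container and are ignored, which is the reason for scoping the extraction to a single div.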

  16. Example Migration Strategy
     ■ Source Content:
       ○ HTML Page with list of links to content - Determine how to extract links from the DOM
       ○ HTML Content Page - Determine how to extract elements from a page into Drupal content type fields for migration
     ■ Defining Drupal Content Structure - fields (including data only needed for migrating), taxonomies, paragraphs, media, etc.
     ■ Mapping/Extracting content to fields (Migration YAML config)
     ■ Processing leveraging core/contrib migration process plugins

  17. Press Release Migration Example

  18. Example: DEA.gov Press Release Archives Listing https://web.archive.org/web/20151229193128/http://www.dea.gov/divisions/atl/atl_2015.shtml

  19. Strategy: Press Release Listing Page
     ■ Goal: Capture PR URLs from the “.PLNews-Article” div area
     ■ Use an Obtainer to grab all the URLs from that div: ObtainLinkFile, method findFileLinksHref
     ■ Base URL or relative URL links? Use a DOM Operation modifier prior to running the Obtainer job.

     source:
       plugin: url
       data_fetcher_plugin: http
       data_parser_plugin: dom
       urls:
         - 'https://web.archive.org/web/20150907034317/http://www.dea.gov/divisions/bos/bos_2011.shtml'
       ids:
         url:
           type: string
       item_selector: url
       dom_config:
         migration_tools:
           -
             source_operations:
               -
                 operation: modifier
                 modifier: basicCleanup
             fields:
               url:
                 obtainer: ObtainLinkFile
                 jobs:
                   -
                     job: addSearch
                     method: findFileLinksHref
                     arguments:
                       - '.PLNews-Article'
                       - []
                       - [ 'web.archive.org' ]
             dom_operations:
               -
                 operation: modifier
                 modifier: convertBaseHrefLinks
               -
                 operation: get_field
                 field: url
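The question of base versus relative links matters because each extracted link has to become a fetchable URL before it can serve as a row. The deck does not show how the convertBaseHrefLinks modifier works internally, so the Python below only sketches the general idea of resolving links against the page they came from. The un-archived listing address and the three link styles are assumptions for illustration.

```python
from urllib.parse import urljoin

# Assumed values: the listing page's original address and three styles of
# link that might appear inside the ".PLNews-Article" div.
listing_url = "http://www.dea.gov/divisions/bos/bos_2011.shtml"
hrefs = [
    "2011/bos111611.shtml",                                   # Relative to the listing page.
    "/divisions/bos/2011/bos111611.shtml",                    # Relative to the site root.
    "http://www.dea.gov/divisions/bos/2011/bos111611.shtml",  # Already absolute.
]

# Resolve every link against the listing page so all rows use absolute URLs.
absolute = [urljoin(listing_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

All three styles resolve to the same absolute address, so whichever form the legacy markup used, the later per-page fetch sees one consistent URL.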

  20. Example: DEA.gov Press Release Page https://web.archive.org/web/20150915220656/http://www.dea.gov/divisions/bos/2011/bos111611.shtml

  21. Example: DEA.gov PR Content Type

  22. Strategy: Press Release Page
     Identify what content needs extraction to fields:
     ■ Title, Subtitle, Date, Contact, Phone Number, Division, Body, PDF Attachments
     Structure of the content:
     ■ Everything is inside of a “PLNews-Article” div class.
       ○ Date, Contact, Division, Phone number inside of “PLNews-Byline” div class, separated by <br> tags
       ○ Title is contained in “PLNews-Title” div class, Subtitle is in “PLNews-Sub-Title” div class - finally an easy one!
       ○ Body text begins after the Subtitle, contains PDF attachment links
     ■ Jobs:
       ○ Date - From “.PLNews-Byline”? How about from URL via regex??
         http://www.dea.gov/divisions/bos/2011/bos111611.shtml = /[a-z]{3}([0-9]+)\.shtml/
       ○ Phone Number - Pluck from “.PLNews-Byline”, regex: /([0-9]{3}-[0-9]{3}-[0-9]{4})/
       ○ Division - From “.PLNews-Byline”? How about from URL via regex??
         http://www.dea.gov/divisions/bos/2011/bos111611.shtml = /divisions\/([a-z]*)\/[0-9]*/
       ○ Title - Pluck from “.PLNews-Title”
       ○ Subtitle - Pluck from “.PLNews-Sub-Title”
       ○ Body - Needs everything above removed before processing so “.PLNews-Article” contains only the body.
       ○ Attachments - Pluck files in “.PLNews-Article”
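The three regular expressions on this slide can be tried directly against the example URL. The Python sketch below uses the patterns as written on the slide; the byline text is invented, and reading the six captured digits as MMDDYY is an assumption based on the "2011" path segment in the URL.

```python
import re
from datetime import datetime

url = "http://www.dea.gov/divisions/bos/2011/bos111611.shtml"
# Hypothetical byline text; the real markup separates its parts with <br> tags.
byline = "Contact: Public Information Office<br>Number: 617-555-0100"

# Date digits: three lowercase letters, then digits, then ".shtml".
date_digits = re.search(r"[a-z]{3}([0-9]+)\.shtml", url).group(1)

# Division: the path segment that follows "divisions/".
division = re.search(r"divisions/([a-z]*)/[0-9]*", url).group(1)

# Phone number: three digits, three digits, four digits, joined by hyphens.
phone = re.search(r"([0-9]{3}-[0-9]{3}-[0-9]{4})", byline).group(1)

# Assumption: the digits encode month, day, and two-digit year (MMDDYY).
release_date = datetime.strptime(date_digits, "%m%d%y").date()

print(date_digits, division, phone, release_date.isoformat())
```

Pulling the date and division from the URL avoids depending on the byline, whose line order and wording can vary from page to page.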

  23. “Subtractive” Content Extraction
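A "subtractive" approach can be pictured as plucking each known field out of the article container and treating whatever remains as the body. The Python below is a sketch under that reading, using the standard library XML parser on invented, well-formed sample markup. It is not QueryPath or the Migration Tools code; only the class names come from the slides.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, kept well-formed so the stdlib XML parser accepts it.
article_html = """
<div class="PLNews-Article">
  <div class="PLNews-Byline">Number: 617-555-0100</div>
  <div class="PLNews-Title">Example title</div>
  <div class="PLNews-Sub-Title">Example subtitle</div>
  <p>First body paragraph.</p>
  <p>Second body paragraph with a <a href="release.pdf">PDF</a>.</p>
</div>
"""

article = ET.fromstring(article_html.strip())
extracted = {}

# Step 1: pluck each known field, then remove its element from the article.
for field, css_class in [
    ("byline", "PLNews-Byline"),
    ("title", "PLNews-Title"),
    ("subtitle", "PLNews-Sub-Title"),
]:
    element = article.find("./div[@class='%s']" % css_class)
    if element is not None:
        extracted[field] = "".join(element.itertext()).strip()
        article.remove(element)

# Step 2: whatever is still inside the article is treated as the body.
extracted["body"] = "".join(
    ET.tostring(child, encoding="unicode") for child in article
).strip()

for name, value in extracted.items():
    print(name, "=>", value)
```

Order matters here: the body is read last, after every other field has been removed, which mirrors placing the body's get_field operation after the modifiers in the DOM operations list.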

  24. PR Migration YAML

     source:
       plugin: url
       data_fetcher_plugin: http
       data_parser_plugin: dom
       urls:
         - 'https://web.archive.org/web/20150907034317/http://www.dea.gov/divisions/bos/bos_2011.shtml'
       ids:
         url:
           type: string
       item_selector: url
       dom_config:
         migration_tools:
           -
             source_operations:
               -
                 operation: modifier
                 modifier: basicCleanup
             fields:
               url:
                 obtainer: ObtainLinkFile
                 jobs:
                   -
                     job: addSearch
                     method: findFileLinksHref
                     arguments:
                       - '.PLNews-Article'
                       - []
                       - [ 'web.archive.org' ]
             dom_operations:
               -
                 operation: modifier
                 modifier: convertBaseHrefLinks
               -
                 operation: get_field
                 field: url
       migration_tools:
         -
           source: url
           source_type: url
           source_operations:
             -
               operation: modifier
               modifier: basicCleanup
           fields:
             pdf_files:
               obtainer: ObtainLinkFile
               jobs:
                 -
                   job: addSearch
                   method: pluckFileLinksHref
                   arguments:
                     - '.PLNews-Article'
                     - [ 'pdf' ]
             byline:
               obtainer: ObtainHTML
               jobs:
                 -
                   job: addSearch
                   method: pluckSelector
                   arguments:
                     - .PLNews-Byline
                     - ''
                     - 'innerHTML'
             title:
               obtainer: ObtainTitle
               jobs:
                 -
                   job: addSearch
                   method: pluckSelector
                   arguments:
                     - .PLNews-Title
