Big Data and Internet Thinking
Chentao Wu Associate Professor
- Dept. of Computer Science and Engineering
wuct@cs.sjtu.edu.cn
Download lectures: ftp://public.sjtu.edu.cn (User: wuct, Password: wuct123456)
Contents
Machine Learning/Deep Learning
IoT (Internet of Things) & Sensor Analytics
Modeling Willingness-to-Pay
Natural Language Processing
Analyzing Data @ Scale
Creating a Lake
Streaming Consumer Behavior Data
being explicitly programmed to solve business problems
growing network of connected physical objects and the communication between them
to-pay for product features
through application of computer science, AI, and computational linguistics
learning tools to analyze hundreds of gigabytes
when and where consumers are making choices
Methods of using Big Data to generate insight
Refers to the DATA only
Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.
Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.
Data: Big Data Analytics is about [...] and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.
Technology: To store, manage, and use Big Data [...] investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.
Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.
Traditional Emerging Structured Unstructured
A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
Sentiment Analysis: Extract consumer reactions based on social media behavior
Complex Event Processing: Combine data sources to recognize events
Predictive Modeling: Use data to forecast or infer behavior
Regression: Discover relationships between variables
Time Series Analysis: Discover relationships
Classification: Organize data points into known categories
Simulation Modeling: Experiment with a system virtually
Spatial Analysis: Extract geographic or topological information
Cluster Analysis: Discover meaningful groupings of data points
Signal Analysis: Distinguish between noise and meaningful information
Visualization: Use visual representations of data to find and communicate info
Network Analysis: Discover meaningful nodes and relationships on networks
Optimization: Improve a process or function based on criteria
Deep QA: Find answers to human questions using artificial intelligence
Natural Language Processing: Extract meaning from human speech or writing
understand the past, and opens up entirely new avenues for preparing for and adapting to the future.
What happened?
Describe, summarize and analyze historical data
What should be done?
Recommend ‘right’ or
decisions
How do we adapt to change?
Monitor, decide, and act autonomously or semi-autonomously
What could happen?
Predict future
past
events
sources such as social listening and web crawling
value
Natural Language Processing to identify hidden relationships and themes
service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
multiple ‘what-if’ scenarios
and actions
continuous basis
strategies based on changing environment and improved predictions
dynamic simulation models, time-series analysis Descriptive Analytics
Predictive Analytics Prescriptive Analytics Continuous Analytics
Increasing Business Value
Why did it happen?
Identify causes of trends and outcomes
events
sources such as social listening and web crawling
regression analysis
Diagnostic Analytics
Increasing Sophistication of Data & Analytics Rear-view For
Business Need: Greater tailoring of credit card offers to fit customer needs
Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers
Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics

Business Need: Data-enabled engine prognostics, monitoring, maintenance and repair
Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
Impact: Over 70% annual revenue from the aircraft engine division attributable to this service

Business Need: Search-to-purchase conversion by anticipating intent of a shopper’s search and delivering relevant results
Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue

Business Need: Transformation from subscription streaming service to original content production
Data and Analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013

Business Need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years
Business Need: More transparent, reliable, and low-cost method to track inflation in Argentina
Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.

Business Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior

Business Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
Data and Analytics: The city combines data from 30 city agencies (including weather, satellite, video, GPS, historic rainfall, and topographic survey data) in a central Operations Center
Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis

Business Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ credit worthiness, and incubate new mobile businesses with greater predictors of success
Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information
Big Data Analytics to address can help guide where and which data solutions are deployed.
Value enablement Value enhancement
Strategic Tactical Operational
Day-to-day operations
management of daily operations
across program units
intelligence
Enabling strategy and improving performance
drive consensus
events
Delivering future value
programs/opportunities
intuitive decision-making
Organization-specific Information Knowledge: Organization-specific knowledge about data assets, including enterprise “metadata”, their location and appropriate business context for use in advanced analytics
Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights
Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights to effectively determine and communicate actionable insights
Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography
horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared – can influence efficiency and impact.
[Figure: Distributed Analytics, Centralized Analytics, and Federated Analytics models. Components shown at the LOCAL and CENTRAL levels: Analytics Competency Center, ETL, Data Warehouse, BI Applications, Metadata Repository, Data Mart]
Objectives
wide strategy
Data Warehouses, Marts, etc.
across groups
Analytics Tools
group framework
having access to shared resources
Analytics Staff/ Competencies
subject matter
available as needed to support individual units
[Figure: Sample Hub-Spoke Interaction Model, spanning Local Business Operations to Global Business Strategy. Numbered elements: 1 Central Decision Hub; 2 Competency Center (‘Standards’, ‘Standardization’); 3 Centers of Excellence (Regional); 4 Local ‘Spokes’ (Local Adoption)]
constraints, and new data structures that traditional data infrastructure is not equipped to support
Objective, Considerations, Impact

Analytics Capabilities
Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
Analysis Type: Dictates performance needs along with data structures and processing architecture
Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

Data Variety
Objective: Define the data set that will be used for the analysis including its sources, size, and structure
Size: Size of data sets introduces need for scalable infrastructure and performance
Structure: Variability of source data models and data set structure requires data model flexibility
Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

Application
Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems
Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
Speed: The timeliness of the analysis will impact the need for scalability and performance
Interfaces: In and out bound interfaces are defined by the use
Contents
Distributed Processing
Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware
NoSQL
Embedded and persisted storage that implements data models through document, graph, and dictionary structures
Cloud Computing
Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization
queries require large volumes of processing cycles that can quickly scale
flexible data models to better ingest unstructured and semi structured data
sources Traditional challenges being addressed…
Distributed Processing
Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware
Introduction to Hadoop
early 2000s (combination of Google File System (GFS) and MapReduce)
complex data across multiple data sources
system
whenever possible
system for increased availability and reliability
Faster and Lower Cost Analysis
Linear Scalability
Greater Flexibility
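The MapReduce model behind Hadoop can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits; in Hadoop these would be blocks of a distributed file.
splits = ["big data big insight", "data lake"]
counts = word_count(splits)
```

Because each split is mapped independently and each key is reduced independently, both steps can be spread over more machines as the data grows, which is where the linear scalability claim comes from.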
NoSQL
Embedded and persisted storage that implements data models through document, graph, and dictionary structures
NoSQL - Storage Types
Storage types: Document Store, Key-Value Store, Graph Store, Columnar Store
Solution Examples
Increasing Data Complexity
Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
Pros: Scalability & Flexibility. Cons: Complexity
Pros: Easy to Use. Cons: Scalability
Pros: Graph Joins. Cons: Flexibility
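To make the four storage types concrete, the sketch below models each one with plain Python structures. It is only an illustration of the data shapes, not the API of any NoSQL product; the customer records, field names, and helper functions are all hypothetical.

```python
import json

# Key-value store: an opaque value retrieved by exactly one key.
key_value = {"customer:42": json.dumps({"name": "Ada", "city": "Shanghai"})}

# Document store: a nested record whose inner fields can be queried.
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "Shanghai"},
    "orders": [{"sku": "A1", "qty": 2}],
}

# Columnar store: values are grouped by column, then by row key.
columnar = {
    "name": {42: "Ada", 43: "Bo"},
    "city": {42: "Shanghai", 43: "Beijing"},
}

# Graph store: nodes connected by typed edges (source, relation, target).
edges = [(42, "FOLLOWS", 43), (43, "FOLLOWS", 44)]


def customers_in(city):
    """Columnar scan: read only the 'city' column to find matching row keys."""
    return sorted(key for key, value in columnar["city"].items() if value == city)


def two_hop(start, relation):
    """Graph traversal: nodes reachable in exactly two hops over one relation."""
    direct = {dst for src, rel, dst in edges if src == start and rel == relation}
    second = {dst for src, rel, dst in edges if src in direct and rel == relation}
    return second - direct - {start}
```

The key-value form is the simplest to scale but can only be read back whole, while the graph form answers relationship queries such as `two_hop(42, "FOLLOWS")` directly.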
Cloud Computing
The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy as cloud computing can transform your entire organization — people, processes, and systems
Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”
Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs. The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.
Contents
Adding Weighting.WEIGHTED as the second parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
// number of parameters, location of non-zero indices, and non-zero values
// number of parameters, sequence of non-zero values as (index, value) pairs
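The two comments above describe two equivalent ways to specify a sparse vector: parallel arrays of indices and values, or a sequence of (index, value) pairs. The following is a plain-Python stand-in written for illustration, not the Spark API; the function names are assumptions.

```python
def sparse_from_arrays(size, indices, values):
    """Build a sparse vector from parallel index and value arrays."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return size, dict(zip(indices, values))


def sparse_from_pairs(size, pairs):
    """Build a sparse vector from a sequence of (index, value) pairs."""
    return size, dict(pairs)


def to_dense(sparse):
    """Expand a sparse vector into a dense list, filling gaps with 0.0."""
    size, entries = sparse
    return [entries.get(i, 0.0) for i in range(size)]


# Both forms describe the same three-element vector with two non-zero entries.
from_arrays = sparse_from_arrays(3, [0, 2], [1.0, 3.0])
from_pairs = sparse_from_pairs(3, [(0, 1.0), (2, 3.0)])
dense = to_dense(from_arrays)
```

Storing only the non-zero entries pays off when most features are zero, as is common for text or one-hot encoded data.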
powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ2) tests for independence.
independence test for every feature against the label.
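As a rough illustration of what a Pearson Chi-squared independence test computes, the sketch below derives the test statistic and degrees of freedom from a contingency table of observed counts. It is a hand-written example, not the Spark implementation; the function name and the sample tables are assumptions, and it omits the p-value calculation.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Rows are values of one feature, columns are values of the label.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the label were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: feature value (rows) against label (columns).
dependent = [[10, 20], [20, 10]]
independent = [[10, 20], [20, 40]]

stat_dep, dof_dep = chi_squared_statistic(dependent)
stat_ind, dof_ind = chi_squared_statistic(independent)
```

With one degree of freedom, a statistic above roughly 3.84 rejects independence at the 5% level, so the first table would be flagged as dependent while the second, whose rows are proportional, would not.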
Specify the size of clusters to look for; the algorithm finds the number of clusters that have approximately that size. The algorithm uses two distance thresholds, which prevents all points close to an already existing canopy from becoming the center of a new canopy.
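The two-threshold idea can be sketched as follows, assuming the usual canopy clustering convention of a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`. The parameter names, the Euclidean distance, and the sample points are assumptions for illustration; this is not the code of any particular library.

```python
import math


def canopy_clusters(points, t1, t2):
    """Group points into canopies using a loose (t1) and a tight (t2) threshold.

    Points within t1 of a center join its canopy. Points within t2 are also
    removed from the candidate list, so they can never start a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")

    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)
            if distance >= t2:
                still_candidates.append(point)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


# Hypothetical 2-D points forming two loose groups.
points = [(0, 0), (0.5, 0), (2, 0), (10, 0), (10.5, 0)]
canopies = canopy_clusters(points, t1=3.0, t2=1.0)
centers = [center for center, _ in canopies]
```

In this run the point `(0.5, 0)` lies within `t2` of the first center and so never becomes a center itself, while `(2, 0)` lies between the thresholds, joins the first canopy, and still goes on to start its own.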
https://github.com/HewlettPackard/cacti
Contents
Extraction of implicit, previously unknown, and potentially useful information from data
Data Mining
Analysis of large quantities of natural language text and detecting lexical or linguistic usage patterns to extract probably useful information
Text Mining Natural Language Processing
NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.
Tool, Description, Analysis Type
OpenNLP: A machine learning based toolkit for the processing of natural language text.
GATE: A Java suite of tools that can perform natural language processing tasks for multiple languages.
NLTK: A suite of libraries and programs for symbolic and statistical natural language processing for Python.
Stanford NLP: Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs.
LingPipe: A tool kit for processing text using computational linguistics.
Mon[...]: A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java.
Rosetta Linguistic Platform: A suite of linguistic analysis components that integrate into applications for mining unstructured data.
concept extraction
Tool, Description, Analysis Type
RapidMiner: An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
SAS Text Miner: A suite of text processing and analysis tools.
VisualText: Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers.
Search
SAS Sentiment Analysis: Commercial tool that is dedicated to customer sentiment analysis.
monitoring
Textifier: Tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT).
Infinite Insight: System for automatically preparing and transforming unstructured text attributes into a structured representation.
words
Clustify: Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization.
Tool, Description, Analysis Type
Attensity Analyze: Customer analytics applications that help analyze high volumes of customer conversations across multiple channels.
communication analysis
ReVerb: A program that automatically identifies and extracts binary relationships from English
extraction
Open Text Summarizer: Open source tool for summarizing texts.
summarization
OpenCalais: Web based API that is used to analyze content and extract topics or information.
extraction
Knowledge Search: Family of techniques and tools for searching and
KH Coder: Free software for Quantitative Content Analysis or Text Mining
Overview
information from an image or sets of images for advanced classification and traditional analysis
processing, and machine learning techniques to extract, quantify, and structure image information
Advantages
stored within images
be applied to understanding consumer behavior, automating business processes, and discovering knowledge enterprise content
Tool, Overview, Image Processing, Computer Vision, Machine Learning
OpenCV: Open source library of computer vision functions that is accessible via C, Java, and Python
X X X
PAXit Image Analysis: Integrated image analysis platform that provides basic feature identification functions
X X
ImageJ: Java based image processing platform that can be accessed via an API and expanded with custom plugins
X
PIL: Python image processing library
X
PyBrain: A modular machine learning library for Python
X
Overview
analyzing its features as to extract content and context of an event
processing principles to structure audio information for analysis via NLP or traditional analytics techniques
Advantages
events or common patterns within sound bytes
the content and topics within a conversation, but also the emotions and context
Tool, Overview, Audio Processing, Information Retrieval
CLAM: A C++ library that provides varying level
retrieval capabilities
X X
CallMiner: A tool that is capable of translating calls to a more structured text data set and combining with other communication forms
X
Nuance: Logs calls and structures audio for text based search and retrieval
X
yaafe: Audio feature extraction toolkit with wrappers for several languages
X
PRAAT: Multiple platform audio analysis toolkit
X
Analysis Objectives
Collaboration Analysis
Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures
influential to collaborative work environments
Content/Knowledge Management
Evaluate how knowledge or content is diffused and accessed within an
communication flows, and establish channels
Community Mining
Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks
functions.
functions
communities
Organization Development
Explore formal and informal
how individuals work with one another to improve the design
to better align with the informal practices
and would impact the organization if promoted
Analysis Objectives
Disaster Recovery Planning
Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans
teams
collaboration during recovery plan execution
Data/Information Dissemination
Assess how data points or information sets originate or are distributed across the enterprise to their intended targets
information dissemination
architecture impact the flow of information to its targets
Fraud Detection/Prevention
Assess the organization or external network to identify communication
with known fraudulent activity
fraudulent agents
Process Discovery/Improvement
Analyze the organization structure and communication patterns to uncover process improvements or identify new processes
process steps, communication flows , and actors
hidden within frequent collaboration and communication paths
Supply Chain Analysis
Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies
process or operations
network
Analysis Objectives
Novelty/Sentiment Diffusion Analysis
Observe how a specific topic, news articles or sentiment diffuses through a consumer network
retained within a system and how far it will spread
Market Influencer Identification
Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities
adoption
improve product/service adoption
Consumer Segmentation
Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics
that connect the consumer market
based on identified segmentation characteristics
Product or Service Diffusion Analysis
Analyze the flow of communication
to evaluate how a product may diffuse
adopters
product/service adoption
Recommendation Systems
Analyze consumer network connections and common features among consumers to develop recommendations
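Several of the objectives above, such as market influencer identification, rest on simple network measures. As a minimal sketch, degree centrality ranks each node by the share of other nodes it is directly connected to. The edge list and names below are hypothetical, and the function is a hand-written illustration, not taken from any of the network analysis tools in this deck.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's degree divided by the maximum possible degree."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    node_count = len(neighbors)
    if node_count < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (node_count - 1) for node, adj in neighbors.items()}


# Hypothetical undirected "talks to" relationships between four consumers.
edges = [("ann", "bo"), ("ann", "cy"), ("ann", "di"), ("bo", "cy")]
scores = degree_centrality(edges)
top_influencer = max(scores, key=scores.get)
```

Degree centrality is only one possible proxy for influence; measures that account for indirect reach would rank nodes differently on larger networks.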
Tool, Overview, Network Analysis, Network Visual, Network Manipulation
SNAP: A general purpose network analysis and graph mining library for C++.
Statnet: A package for R that provides capabilities for social network statistical
libSNA, graphTool, networkX: Python libraries for network analysis and manipulation.
JUNG: Java package for network analysis and
NodeXL: Excel plug-in that provides an easy to use and interactive interface to explore and visualize networks
Tool, Overview, Network Analysis, Network Visual, Network Manipulation
GEPHI
Interactive open source platform for network analysis and visualization. Gephi
X X X
Ucinet
Commercial social network analysis tool with separate visualization component. Link
X X
Graphviz
Open source graph visualization package. Link
X
NetMiner
Proprietary package that provides the ability to develop and implement custom algorithms link
X X X
KXEN SNA
Network analysis package that provides predictive analytics and customer MDM integration. Link
X X X
ProM
Open source package for mining business process
X X X
Cytoscape
Open source tool for network modeling, and
X X X
Network Workbench
Large-Scale Network Analysis, Modeling and Visualization Toolkit for Biomedical, Social Science and Physics Research. Link
X X X
Contents
What is DeepQA?
domain question analysis and answering system
search, NLP, learning, and scoring algorithms
infrastructure that leverages Map Reduce and the Unstructured Information Management Architecture
What is the target problem set?
human language
large library of unstructured information
questions that are complex and must be sourced from a large knowledge set
Technology: Links
Unstructured Information Architecture: UIMA
SQL Server: MySQL, Apache Derby
Java Natural Language Toolkit: OpenNLP, Stanford NLP
Map/Reduce: Apache Hadoop
Common Sense Knowledgebase: OpenCYC, Open Mind Common Sense
Triple Store: Apache Jena, OpenAnzo
Text Search: Lucene, Open FTS
Web Server: Apache
Virtualization Host: VMWare, Xen
Distributed File System: Apache Hadoop, OpenAFS
File Management/Archival: rSync
OS: Fedora
Cloud Management: Extreme Cloud Administration, Open Nebula
Overview Objectives
Knowledge Discovery
Search internal and external unstructured/structured information assets to uncover previously unknown knowledge
internal and external information sources
may be difficult to analyze within traditional data sources
E-Discovery
Search documents and communications to uncover relevant information associated with a specific topic
and documents
external data sources
Contract Evaluations
Search through single or multiple contracts to answer specific questions about the nature of the contract
contracts
entities or features
Relationship Management
Provide the ability to interact with consumers providing precise responses to technical and open domain questions
questions about products or services
consumers
Consumer Discovery
Search consumer communications, social media, and sales information to identify opportunities and demographics
Technical Troubleshooting
Find answers to technical and process problems through
solutions or root causes to system and process problems
Infrastructure/Tools and Search Technologies/Concepts
Topic, Research

Tools
Hadoop Map/Reduce: The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other tool kits (OpenNLP, UIMA, Lucene, etc.).
OpenNLP: A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA.
OpenCYC: An open common sense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies.
UIMA: An architecture for managing unstructured data. Further research is needed to understand how to run in parallel and how the SDK can be applied to NLP activities.
Lucene: A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.

Search
Text Search Scoring: Algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.
Triple Store Search: Triple stores maintain data in a subject-predicate-object structure and are used for turning around quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.
Common Sense Reasoning: Research is required to understand this branch of AI, its technologies, and its role within DeepQA.
Document/Information Retrieval: Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
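The subject-predicate-object structure of a triple store can be sketched in a few lines. This is a toy in-memory illustration, not the API of Apache Jena or OpenAnzo; the class name, method names, and sample facts are all hypothetical.

```python
class TripleStore:
    """A minimal in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Record one fact."""
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the given pattern; None is a wildcard."""
        return sorted(
            triple
            for triple in self._triples
            if (subject is None or triple[0] == subject)
            and (predicate is None or triple[1] == predicate)
            and (obj is None or triple[2] == obj)
        )


# Hypothetical facts for a quick-fact lookup.
store = TripleStore()
store.add("Shanghai", "locatedIn", "China")
store.add("Beijing", "locatedIn", "China")
store.add("Beijing", "capitalOf", "China")

capital = store.match(predicate="capitalOf", obj="China")
cities = [subject for subject, _, _ in store.match(predicate="locatedIn")]
```

A question such as "What is the capital of China?" reduces to a single pattern match with the subject left as a wildcard, which is why this structure suits quick factual answers.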
Topic, Description

Machine Learning
MetaLearners: Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results
Question Classification: Identify techniques and models that can be employed to analyze and classify questions
Search Ranking Models: Research which models are available for ranking search results based on the various search and recall techniques that are employed for a question

NLP
Logical Form Analysis: Research how SNA is used to discover logical relationships within text and produce an understanding of the information within the text
Semantic Structure Analysis: Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search
Relationship Analysis: Research techniques and tools for uncovering temporal, geospatial and spatial relationships within a knowledge set
Feature Extraction: Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search
Phrase Analysis: Identify algorithms and tools that can be applied to extract key phrases from text based on a search context