www.inforsense.com
www.inforsense.com Outline Part I Import & Export Import - - PDF document
www.inforsense.com Outline Part I Import & Export Import - - PDF document
www.inforsense.com Outline Part I Import & Export Import data from different data sources and file types Query from a relational database Export data to different locations Part II Pre-processing Description of
Outline
Part I – Import & Export
- Import data from different data sources and file types
- Query from a relational database
- Export data to different locations
Part II – Pre-processing
- Description of pre-processing nodes
- Build using table editor
Part III – Introduction to Interactive Browser
- Filtering
- Visualizing and Deployment of Results
Outline
Part IV – Introduction to Chemscience
- Import and Export SD files
- Clean SD files and Table nodes
- Convert Molecular format nodes
- Calculate Molecular properties
- Create Fingerprint Matrix
- Export to SD file
Part V – Structure and Similarity Searching
- Introduction to Sketch Pad
- Substructure filter and match nodes
- Similarity filter and score nodes
Part VI – Predictive Modelling
- Identifying actives and inactive compounds using similarity
- Predicting activity using classification algorithms using chemical
fingerprints
Project Windows
- Capture theories and analytical
processes and their information relationships
- Organize and plan complex processes
- I ntegrate data and applications in a
single process representing integrative knowledge
- Capture process provenance and
auditable history
- Share and reuse workflows through
templates
- Manage and deploy workflow to share
process knowledge
A project window is the area where you capture knowledge processes by building, editing and running data-mining tasks.
Anatomy of the workspace
Properties editor Navigator Project window Userspace Tool bar, Task manager and Inforsense.net Menu bar
Importing data
Data must be inserted to be used in Kensington models. The compatible formats are:
- Oracle, Access, My SQL databases
- Importing form a text file (flat file or
tall-slim format)
- From the clipboard. Ideal for copy and
pasting data from MS excel or Access.
- Import using a bookmark file
I mporting data can be done by selecting the appropriate option under the File I mport Data menu. Right clicking on a users workspace or folder icon allows you to create a new database bookmark.
Database bookmark manager
- A new bookmark is created by launching the
Database bookmark manager.
- I n this example, an Oracle database is
connected to the KDE. The servers host name and port number, an oracle user name and password and the name of the database is
- required. Please consult with your database
administrator for details.
- Access and My SQL databases are also
supported.
- I f you want to test the database connection
(to check if the parameters supplied are correct) you can press the "Test Connection" button.
- Once the required details have been filled in,
clicking the "Save" button launches a dialog asking where the database bookmark is to be stored in the user space
Node Icons
Database manager
- “Database Manager” window can be
launched by double-clicking on the database bookmark icon or by selection on a database node.
- Data can be selected by using either the
table-browser mode or by performing a manual query.
- Object types that appear in the list are
tables, views, alias, or synonym.
- The column name and type appear allowing
the user a choice of selection.
- The number of rows to load is also an
- ption.
- I mporting data creates a table in the
userspace.
Import Group
- The DB Procedure node, specific to Oracle
databases, provides another method for importing data.
- Data can be imported from a database using the
SQL node. The purpose is to allow the user to
specify the SQL SELECT query, which produces a table to be loaded into the system.
- I n DB Procedure and SQL nodes, execution is
delayed until the data is needed, enabling the mining procedure to be defined prior to any Oracle processing.
- Export to File produces a comma or tab delimited
file to be used in other applications.
- Export to Userspace produces a table in a specific
Userspace.
- Export to Database allows you to create, update or
re-create a table of data in an Oracle database.
Import & Export
Create a simple workflow that explains the process of importing and exporting data Our example illustrates:
- Import data from a database (e.g.
Oracle, Access)
- Perform a SQL query
- Export data to the user space and
file
- Perform a preprocess task
- Export data back to a database
To get started, we will add the previously created Training_Material_DB bookmark into a new project.
SQL Node
SQL node allows the user to perform an SQL SELECT query on the database
- Step 1 Illustrates the connectivity of the
- nodes. The exclamation mark indicates that
the properties of the SQL node need to be defined.
- Step 2: Query can be typed or loaded from
a text file into the properties editor.
- Step 3: The output metadata needs to be
specified, either by retrieving it from the database (by pressing the "Load metadata from database" button in the "Output" panel)
- r by typing it in manually in the "Output"
panel before using.
- Step 4 Illustrates the final connections
Export to Userspace & File
Export to Userspace and Export to File nodes produces a data table.
- Step 1 Illustrates the connectivity of the
nodes.
- Step 2: The table is exported in a user
defined name & location within a userspace.
- Step 3: The table is exported as a tab
delimited file outside of KDE. Files can be exported as comma, space, semi-colon delimited files. The delimiter can be set to none, single or double quotes.
- Step 4 Illustrates the final connections after
both nodes are executed. Tables were created in the appropriate user spaces and file location.
Export to Database Node
The preprocess delete node modifies the queried table then the export to database node creates a new table within the database
- Step 1 Illustrates node connectivity
- Step 2: Columns to be deleted are
selected and moved to the right hand side
- f the properties editor.
- Step 3: Illustrates that the properties
editor for the export to database node needs to be edited. In this case, the wizard will be executed. Screenshots appear on the next two slides.
- Step 4: The export to database properties
entered after running the wizard. The red exclamation point will then disappear in step 3.
Table Export Wizard: Steps 1 – 4
1 2 3 4
Table Export Wizard: Steps 5 & 6
- Step 1 Allows the selection of Create,
Update or Recreate table options.
- Step 2: A table name is given.
- Step 3: Allows the selection of data
columns for export
- Step 4: SQL to create table as illustrated in
step 3. Step 3 can be skipped and SQL could be directly placed into step 4.
- Step 5: Column Mapping for updating
- tables. I n our example, we are creating a
table so the user defined columns are empty.
- Step 6: SQL to update a table as illustrated
in step 5.
5 6
Import Text
Two formats of data can be imported into KDE
- Text (standard flat text file) - a
normalized table format.
- Tall-slim - an unnormalised table
that contains a set of rows, each
- f which contains a repeated
series of cases. The import wizards illustrated on the next two slide takes us through the steps for importing data from a file and the clipboard. I n each case, the text table exported to file in the previous example was used.
Steps for Importing Text
1 3 4 2
Steps for Copy & Paste Text
1 2 4 3
Preprocessing
- Data pre-processing and transformation
- perations enable you to clean and
prepare your data before executing data mining tasks.
- I nforsense KDE supports a wide range of
pre-processing functions that operate either on single or multiple tables. The input to a pre-processing component is
- ne (or more) tables and the output is a
table. Some Examples of single-table pre-processing functions include:
- Deleting columns from a table.
- Deriving and insert new columns.
- Filter data records in a table (based on
some condition).
- Rename or Replace columns of data.
Preprocessing in the Table Editor
- Preprocessing nodes are best
described in examples. We will build upon the previous workflow querying data from the training material database.
- Preprocessing nodes can be
assembled to modify data by drag & drop onto the project workspace or by using the table editor.
- The Table Editor is used to examine
values in a data table. I t also provides an easy interface to perform data pre- processing operations interactively by updating the display to show the effects of performing any of these
- perations.
- I n the example, the SQL can be
executed by viewing the table editor.
Select & Right click
Anatomy of the table editor
Data Table Pre-Processing Toolbar Editing Area Menu Table statistics
Stats Summary
- The Stats Summary pane in the
table editor provides a statistical summary for the sample rows used as an input to the table editor in the active table
- The "Calculate" buttons calculate
the stats for the columns to which they correspond and return the relevant info.
- The following screenshot shows a
typical initial view, containing information that has been generated from the sample
- dataset. Columns with no
duplicates will not show any statistics
Preprocessing Toolbar
- Preprocessing nodes can be selected,
tested and executed easily within the table editor.
- As an example, the NCI HI V dataset will be
arbitrarily modified using the Derive, Filter, Replace, and Delete nodes.
- The derive node adds a newly derived
column to the original input table. The name and type of the output column can be
- chosen. I n addition, you could enter an
expression describing how the values of the new column are derived from existing
- column. (e.g. 5 * Activity_Value)
- The table is automatically updated. The
undo button will remove the newly created
- node. Cancel will cancel editing session &
discard changes. Commit will create nodes
- n the project workspace.
Filter and Replace Nodes
- The filter component removes
(filters out) the rows from the input table that do not match a user-specified condition.
- A filter condition is specified as a
Boolean (logical) combination of mathematical operations on column values. For example: Comment = 1 or (Comment = 1 and NCI _Conclusion = ‘CA’)
- The replace component searches
a column for a given text string and replaces the string with a user-supplied one. This operation can only be applied to discrete- valued columns.
Using the Table Editor
- The delete component removes one or
more columns from an input table.
- A delete node can be created within
the table editor by selecting the desired columns and then clicking on the delete button.
- Each of the process nodes are listed in
the editing area.
- The table view is modified according
to the nodes processed.
- Commit creates the nodes in the
project.
Project Nodes and Grouping
- A group option allows a user to
group several icons together into
- ne icon. This operation is
typically used to organize workflow appearances.
- To apply group-specific
- perations you must first select
the group of nodes by highlighting the background area
- f the desired group in a project
- window. You do this by left-
clicking anywhere near these nodes and dragging a boxed area around them with your mouse
- pointer. Nodes that have been
selected within a group will appear in a darker color. Right- click anywhere in the highlighted area to bring up the group- specific pop-up menu.
Alternative method to Table Editor
- Connection of Drag & Drop processing
nodes in a project produces a similar
- workflow. Properties of each node
selected can be edited. For example, the Pivot Node will be used to create a simple table.
- The pivot node can be used to provide a
summarization of an input table. Pivot tables work similarly to that used in Excel and can quickly summarize report data.
- A key role aggregates in the vertical
- dimension. Other columns can be
ignored or included
- The parameter tab chooses the
categories and values used in the pivot table.
Pivot Node & Table Results
- Categories aggregates the
horizontal dimension for one or more values (the output table columns).
- Values selects how the
categories will be aggregated in the table
- The pivot table results are:
Introduction to Interactive Browser
- I nforsense KDE 1.9 contains a 2D and
3D interactive browsers.
- The I nteractive Browser is a
visualization tool presenting multiple 2-D views of the same data.
- The I nteractive Browser allows you to
visually and interactively select data elements from each of the displayed plots.
- Selected data can be saved as
Kensington tables and used to query
- databases. This allows you to use the
I nteractive Browser to visually define filter conditions over your data tables.
- I n our example, the database will be
queried and viewed in the I nteractive Browse.
Anatomy of Interactive Browser
List of components Display Options Panel Visualization Panel Grids Plot Types Visualization Panel Component and Filtering Panels Table and Graphs
Control Panel: Filtering Tab
- Check boxes show whether a filter is active or inactive.
- Collapsible panels allow the filter information to be
hidden.
- Range allows the user to reduce the amount of data
that appears on the graph. By adjusting the range selection, the range of each column is changed.
- The two boxes below the slider show the actual
numerical values of the range.
- I (I nvert filter) (for string values) inverts the selected
data
- R (Reset) allows users to reset modifications made to
individual filtering options
- Apply changes made in information panel are
visualized in display panel
- Reset (master reset for filter tab) allows the user to
convert to the original unfiltered data state
- Auto update by checking this box, changes made in the
information panel are automatically and interactively updated in the display panel
Control Panel: Components Tab
Components tab: provides a list of all the current objects created by the user.
- Select allows the user to drag an item from the graph list onto a visualization
panel.
- Rename allows the user to rename the graph
- Hide allows the user to hide graphs (i.e. removed from the visualization panel).
- Delete allows the user to delete unwanted graphs completely from the
browser.
- Export (only for tables) allows the user to export the table back to userspace
Display options panel
- Legend is used to create graph legends
by associating color ranges to data points.
- Dimensions provides a list of all the
available dimensions. The user can interactively manipulate the graph. The dimensions options depends upon the type of graph created.
- Options provides access to the different
graph options that determine various aspects of how the graph is displayed. For example, by selecting "Show Legend" the legend for the graph will appear in the graph window on the visualization panel. The user can also choose to "Show nulls" for data points
- n a graph.
Images of the Tabs are Illustrated Side by Side
Types of Plots
The Types of Plots available are:
- Box Plot
- Bar Charts
- Bin Plot
- Pie Chart
- Parallel Plot
- Scatter Plot
- Venn Diagram
- Table View
- Structure View
- Multiplots
View and Deployment Nodes
- Graphs designed in the
interaction browsers can be saved as view or deployment nodes.
- A view node gives you as static
snapshot of the data. For example, the view node illustrating the pie chart is produced.
- The deployment nodes allows the
user dynamic access to results.
- Currently, there are 6 types of
graphical deployment nodes.
Deployment Nodes
- The properties editor options of the
visualization deployment nodes are the same as in the I nteractive Browser.
- An alternative way of creating
dynamic graphical representations of data is Drag & Drop a deployment node in a project, connect it to a table
- f data, choose options in the
properties editor, connect it to a view deployable image node and execute.
- All images must be connected to a
view deployable image node for visualization.
- Multiple images must be connected to
a tile deployable images node before viewing.
Introduction to Chemscience
NCI Dataset
The dataset used was obtained from theNational Cancer I nstitute (NCI ). The NCI has screened thousands of compounds for evidence of anti-HI V activity and is freely available for download at http:/ / dtp.nci.nih.gov/ docs/ aids/ aids_data.html. A diverse subset of the AI DS Screening Results published in March 2002 has been preprocessed and being used in our examples. The results of the screening tests are evaluated by the NCI and placed into one of three categories:
- CA – Confirmed Active,
- CM –Confirmed moderatly active
- CI – confirmed active.
The columns in our dataset can be described as:
- CAS_RN: The chemical abstract service registry number.
- SDF: The 2D representation of the molecular structure.
- SMI LES: The 1D representation of the molecular structure.
- NSC: The primary identifier assigned to the compound by NCI
- NCI _CONCLUSI ON: The result from the screening tests marked as CA, CM or CI .
- ACTI VI TY: The NCI conclusions reduced to either active (= CA and CM) or inactive (= CI ).
- ACTI VI TY_VALUE: The results of the ACTI VI TY column annotated as an integer (CI = 0 and
CA= 1).
Import SD file
- An I mport SD file node allows
the user to drag an SD file, imported to the userspace, into a project as a node. The user can also set/ change the input SD file by modifying the Path parameter.
- I f a SD File is not located with in
a userspace, the user will need to insert the file into a userspace by right clicking on a location and uploading file. The software will automatically associate an extention of * .sdf as an SD File. Then you need only drag &drop the uploaded file from the userspace into a project.
Clean SD file
- The node accepts an SD file as input and
converts it into a table respresentation removing any badly formated molecules.
- A search throught the file will be performed
identifing any additional descriptor information and automatically add them to the table in a spearate column.
- I t is recommended that you clean SD files at the
beginning of any workflow to ensure that errros are not carried forward through the calculations.
- The Column Name in the properties editor
selects the molecular structure (SMI LES, MOL or SD format). Two different types of tables can be created during execution.
- A Cleaned table contains all the correctly
formated molecules.
- A Error table contains all the molecules that
were badly formated.
Calculating Molecular Descriptors
- The purpose of the node is to
calculates descriptor values for molecules.
- The types of molecular descriptors
that can be calculated are logP, logD, pKa, Mass, Hydrogen Bond Acceptors, Hydrogen Bond and Hydrogen Bond Donors.
- Each type of descriptor is added into a
new column.
- Descriptors can be calculated directly
from a SMI LES (1D) string or a MOL/ SD (2D/ 3D) format.
- A window will appear allowing the
choice of linking the “cleaned” or “error” table to the node.
Visualizing the Histograms of the Molecular Descriptors
- The molecular descriptors calculated
will be ploted in a histogram using the interactive browser.
- I n this example, we are going to
filter the data based on: MW < 800, Hydrogen Bond Donors < 5 and Hydrogent Bond Acceptors < 10
- A bar chart can be created that plots
the molecular weight against the count of molecules in the dataset.
- The X-axis is set to Mass and the
type of data point visualization to Bin width. The Y-axis is set to Molecular I D. The default function for a string is count.
- Our plot shows only the filtered
results.
Multiplot
- Multiple graphs can be produced
at the same time.
- For example, the multiplot option
under the Tools menu. Multiple different plot types and Layouts can be choosen.
- Each graph can be saved as a
node or a deployable node in the
- workspace. The “View
Deployable I mage” Node can be used to recreate graphical results
- nce the work in the I nteractive
Browser is completed.
Converting the Molecular Format from 2D to 3D
- The purpose of this node is to convert either
SMI LES or a 2D SD representation to produces a 3D molecular model.
- The new 3D mol data can be appended to the
table in a new column.
- The default setting replaces the column
containing the 1D or 2D representation and is dependent
- The user must select which port to connect
from and to before executing this node.
- The “Convert Molecular Format (2D-3D)” and
“Convert Molecular Format (1D-2D)” are used simlar to our example. You can interchangably convert the 2D strucutre in
- ur sample SD file to a 1D SMI LES
representation and then back to a 2D or even a 3D structure with ease as shown below. You can explore these nodes on your own leasure.
Examples:
Exporting SD file
- This node allows a user to export a
column containing molecules in SMI LES/ SD/ Mol format with associated descriptor information to an SD File.
- Other columns in the input Table can
be included within the SD File as descriptors by changing the column role to "I nclude as Descriptor" within the input tab of the Properties Editor window.
- An option exists to choose whether or
not to to add 3D coordinates (if not already present) in the SD file.
- I t is not recommended if the input
Table has a large number of entries, as the 3D cleaning process can increase processing time.
Identifying Actives & Inactive compounds
For example, we will build a project to identify active and inactive compounds by similarity and substructure matching. The training material database will be queried to extract chemical and biological data and used with Chemscience nodes.
Clean Table Node
- This node is required when the
initial data collection is in the form of an table. The main purpose of the node is to scan data for illogical errors and poor formatted structures.
- The output is a filtered table. The
columns containing molecular property information in the input table are unchanged.
- Table (Cleaned): A table
containing all the correctly formatted molecule
- Table (Errors): A table containing
all the badly formatted molecules.
Create Fingerprint Matrix
- The purpose of this node is to create a
fingerprint summary of a molecular structure.
- Fingerprints encode the connections
between atoms of the chemical graph and allow fast numerical comparisons
- f chemical structures without a
direct structural comparison.
- Topological or pharmacophoric
fingerprints are currently supported.
- The vector representation of the
fingerprint is shown in the table.
- An advanced example predicting
activity of an unknown set of compounds by classification trees using molecular fingerprints as descriptors is forthcoming.
Multiple Substructure Filtering
- This node provides a way to identify or reduce
molecules in the data set by means of a structure based search.
- Multiple substructures are first selected and
filtered out of a table in a single pass. Conversely, the substructures can be used to identify matching compounds.
- Table (Matching): A table of molecular data
where a structure search found a match.
- Table (Not Matching): A table of molecular
data where a structure search did not find a match.
- The substructures, such as Nitro, peroxide,
thiocyanate, aldehyde groups, can be selected.
Descriptor Based Filtering
- The purpose of this node is to provide a
means to specify simple constraints over
- descriptors. I t combines molecular
property calculations and filtering to remove molecular data that does not pass the constraints in one step on a data table.
- I n the properties editor, the Molecular
Column selects the molecular structure.
- For our example, We are going to
append the molecular descriptors to the
- utput table by ticking the value and
filter out using the “Rule of 5”
- The rule of 5 states that poor absorption
- r permeation are likely when there are
more than 5 Hydrogen bond donors (HBD), the molecular weight is over 500, the logP is over 5, and when there are more than 10 hydrogen bond acceptors (HBA).
Anatomy of the Sketch Pad
Drawing Area Common Aromatic Ring System Templates Bond Selector Atom Selector Menu Bar Sample Molecule Editing Tools Template Selector
Sketch a Molecule
- There are two ways to insert a
chemical structure. The molecule can be typed into the Drawn Molecule field as a SMI LES string or drawn using the Molecule Sketch pad. Drawing the molecule using the Molecule Sketch pad is often easier than remembring how to write a SMI LES string.
- The Molecular Sketch pad is used
similar to that of other chemical drawing packages.
- I n this example, we are going to draw
1H-Pyrimidine-2,4-dione a.k.a. Uracil.
- Thymidine containing compounds are
known to have in vitro activity against HI V-1. Uracil is going to be used to quickly identify similar compounds within the NCI data set.
Similarity Filter Node
- This node allows a table of molecules to be
filtered by calculating the dissimilarity value against a query molecule and specifying the bounds for filtering.
- The filtered NCI dataset will be identified as the
source library. The query is the molecular structure of Uracil..
- Two different types of outputs are available: one
containing all the molecules that fell within the bounds of the dissimilarity threshold and the
- ther for all the molecules which fell outside of
these bounds.
- Two types of dissimilarity distance
measurement: Tanimoto or Euclidean
- I n the example, the tanimoto dissimilarity
coefficient is to be calculated given a bound threshold value of less than 0.6. The number of results found with tanimoto dissimilarity coefficient < 0.6 will be 51. I f we decrease the theshold to 0.5, we will receive 8 results.
Analysis of the results using the Interactive Browser
- The dataset and molecule icons on the
browser can show tables and chemial structures.
- Multiple compounds can be shown by holding
down the shift key and selecting rows. SMI LES, SD or MOL formats can be visualized in the browser.
- The CI , CM and CA compounds have
dissimilarity scores of 0.33, 0.47 and 0.52, respectively.
- Each of these molecules pass the ‘Rule of 5’
and contain no toxic moities.
- Our search results identified 51 compounds.
Approximately, 72% of the compounds are confirmed moderatly (CM) active or active (CA) by the NCI . The results can be used to create a preliminary structure activity relationship for the series. Results are readily available to scientists by means of deployment.
Similarity Scores Node
- The purpose of this node is to
calculate dissimilarity scores of a target molecule against each molecule in a library of structures.
- The query molecule and a target
library of molecules are required for input.
- Two types of dissimilarity distance
measurement: Tanimoto or Euclidean
- Execution of the node produced a
table with the appended similarity scores as shown below. The main differences between the Similarity Filter and Score nodes are that the filter node allows entry of bound constraints whereas the Score node does not and that the score node appends the dissimilariy values to that table wheaas the Filter node does not.
Substructures Match Node
- The main purpose of this node is
to determine whether a substructure is present within a collection of molecules.
- The query molecule and a library
- f molecules for searching are
required for input.
- The output appends an extra
column to the library of molecules stating whether or not the substructure is present.
- The flag in the new column is
either Yes or No.
Substructure Filter node
- The purpose of this node allows a library of
molecules to be filtered using a specific substructure as the query.
- The query molecule and a library of molecules for
searching are required for input.
- Two different types of outputs are available: one
containing all the molecules that included the substructure at least once, and the other for all the molecules without any occurrence of the substructure.
- The main difference between the Substructure
Match and Substructure Filter nodes is that the filter node allows the user to quickly filter out the presence or absence of a match.
- The resulting table illustrated in the previous slide
be filtered using a postprocess command to create a table with identical information but with the appended column. The Substructure filter node simply does this in one step.
NOTE
Part VI – Predictive Modelling
- Predicting activity using classification
algorithms using chemical fingerprints
- SOME PROBLEMS CURRENTLY EXIST WITH DOING
THIS AND IT IS SOMETHING WE NEED TO FIX AND THEN ADD TO THE TRAINING MATERIAL
- THIS DATASET WILL GIVE GOOD RESULTS USING