www.inforsense.com Outline Part I Import & Export Import - - PDF document

inforsense com outline
SMART_READER_LITE
LIVE PREVIEW

www.inforsense.com Outline Part I Import & Export Import - - PDF document

www.inforsense.com Outline Part I Import & Export Import data from different data sources and file types Query from a relational database Export data to different locations Part II Pre-processing Description of


slide-1
SLIDE 1

www.inforsense.com

slide-2
SLIDE 2
slide-3
SLIDE 3

Outline

Part I – Import & Export

  • Import data from different data sources and file types
  • Query from a relational database
  • Export data to different locations

Part II – Pre-processing

  • Description of pre-processing nodes
  • Build using table editor

Part III – Introduction to Interactive Browser

  • Filtering
  • Visualizing and Deployment of Results
slide-4
SLIDE 4

Outline

Part IV – Introduction to Chemscience

  • Import and Export SD files
  • Clean SD files and Table nodes
  • Convert Molecular format nodes
  • Calculate Molecular properties
  • Create Fingerprint Matrix
  • Export to SD file

Part V – Structure and Similarity Searching

  • Introduction to Sketch Pad
  • Substructure filter and match nodes
  • Similarity filter and score nodes

Part VI – Predictive Modelling

  • Identifying actives and inactive compounds using similarity
  • Predicting activity using classification algorithms using chemical

fingerprints

slide-5
SLIDE 5

Project Windows

  • Capture theories and analytical

processes and their information relationships

  • Organize and plan complex processes
  • I ntegrate data and applications in a

single process representing integrative knowledge

  • Capture process provenance and

auditable history

  • Share and reuse workflows through

templates

  • Manage and deploy workflow to share

process knowledge

A project window is the area where you capture knowledge processes by building, editing and running data-mining tasks.

slide-6
SLIDE 6

Anatomy of the workspace

Properties editor Navigator Project window Userspace Tool bar, Task manager and Inforsense.net Menu bar

slide-7
SLIDE 7

Importing data

Data must be inserted to be used in Kensington models. The compatible formats are:

  • Oracle, Access, My SQL databases
  • Importing form a text file (flat file or

tall-slim format)

  • From the clipboard. Ideal for copy and

pasting data from MS excel or Access.

  • Import using a bookmark file

I mporting data can be done by selecting the appropriate option under the File I mport Data menu. Right clicking on a users workspace or folder icon allows you to create a new database bookmark.

slide-8
SLIDE 8

Database bookmark manager

  • A new bookmark is created by launching the

Database bookmark manager.

  • I n this example, an Oracle database is

connected to the KDE. The servers host name and port number, an oracle user name and password and the name of the database is

  • required. Please consult with your database

administrator for details.

  • Access and My SQL databases are also

supported.

  • I f you want to test the database connection

(to check if the parameters supplied are correct) you can press the "Test Connection" button.

  • Once the required details have been filled in,

clicking the "Save" button launches a dialog asking where the database bookmark is to be stored in the user space

Node Icons

slide-9
SLIDE 9

Database manager

  • “Database Manager” window can be

launched by double-clicking on the database bookmark icon or by selection on a database node.

  • Data can be selected by using either the

table-browser mode or by performing a manual query.

  • Object types that appear in the list are

tables, views, alias, or synonym.

  • The column name and type appear allowing

the user a choice of selection.

  • The number of rows to load is also an
  • ption.
  • I mporting data creates a table in the

userspace.

slide-10
SLIDE 10

Import Group

  • The DB Procedure node, specific to Oracle

databases, provides another method for importing data.

  • Data can be imported from a database using the

SQL node. The purpose is to allow the user to

specify the SQL SELECT query, which produces a table to be loaded into the system.

  • I n DB Procedure and SQL nodes, execution is

delayed until the data is needed, enabling the mining procedure to be defined prior to any Oracle processing.

  • Export to File produces a comma or tab delimited

file to be used in other applications.

  • Export to Userspace produces a table in a specific

Userspace.

  • Export to Database allows you to create, update or

re-create a table of data in an Oracle database.

slide-11
SLIDE 11

Import & Export

Create a simple workflow that explains the process of importing and exporting data Our example illustrates:

  • Import data from a database (e.g.

Oracle, Access)

  • Perform a SQL query
  • Export data to the user space and

file

  • Perform a preprocess task
  • Export data back to a database

To get started, we will add the previously created Training_Material_DB bookmark into a new project.

slide-12
SLIDE 12

SQL Node

SQL node allows the user to perform an SQL SELECT query on the database

  • Step 1 Illustrates the connectivity of the
  • nodes. The exclamation mark indicates that

the properties of the SQL node need to be defined.

  • Step 2: Query can be typed or loaded from

a text file into the properties editor.

  • Step 3: The output metadata needs to be

specified, either by retrieving it from the database (by pressing the "Load metadata from database" button in the "Output" panel)

  • r by typing it in manually in the "Output"

panel before using.

  • Step 4 Illustrates the final connections
slide-13
SLIDE 13

Export to Userspace & File

Export to Userspace and Export to File nodes produces a data table.

  • Step 1 Illustrates the connectivity of the

nodes.

  • Step 2: The table is exported in a user

defined name & location within a userspace.

  • Step 3: The table is exported as a tab

delimited file outside of KDE. Files can be exported as comma, space, semi-colon delimited files. The delimiter can be set to none, single or double quotes.

  • Step 4 Illustrates the final connections after

both nodes are executed. Tables were created in the appropriate user spaces and file location.

slide-14
SLIDE 14

Export to Database Node

The preprocess delete node modifies the queried table then the export to database node creates a new table within the database

  • Step 1 Illustrates node connectivity
  • Step 2: Columns to be deleted are

selected and moved to the right hand side

  • f the properties editor.
  • Step 3: Illustrates that the properties

editor for the export to database node needs to be edited. In this case, the wizard will be executed. Screenshots appear on the next two slides.

  • Step 4: The export to database properties

entered after running the wizard. The red exclamation point will then disappear in step 3.

slide-15
SLIDE 15

Table Export Wizard: Steps 1 – 4

1 2 3 4

slide-16
SLIDE 16

Table Export Wizard: Steps 5 & 6

  • Step 1 Allows the selection of Create,

Update or Recreate table options.

  • Step 2: A table name is given.
  • Step 3: Allows the selection of data

columns for export

  • Step 4: SQL to create table as illustrated in

step 3. Step 3 can be skipped and SQL could be directly placed into step 4.

  • Step 5: Column Mapping for updating
  • tables. I n our example, we are creating a

table so the user defined columns are empty.

  • Step 6: SQL to update a table as illustrated

in step 5.

5 6

slide-17
SLIDE 17

Import Text

Two formats of data can be imported into KDE

  • Text (standard flat text file) - a

normalized table format.

  • Tall-slim - an unnormalised table

that contains a set of rows, each

  • f which contains a repeated

series of cases. The import wizards illustrated on the next two slide takes us through the steps for importing data from a file and the clipboard. I n each case, the text table exported to file in the previous example was used.

slide-18
SLIDE 18

Steps for Importing Text

1 3 4 2

slide-19
SLIDE 19

Steps for Copy & Paste Text

1 2 4 3

slide-20
SLIDE 20

Preprocessing

  • Data pre-processing and transformation
  • perations enable you to clean and

prepare your data before executing data mining tasks.

  • I nforsense KDE supports a wide range of

pre-processing functions that operate either on single or multiple tables. The input to a pre-processing component is

  • ne (or more) tables and the output is a

table. Some Examples of single-table pre-processing functions include:

  • Deleting columns from a table.
  • Deriving and insert new columns.
  • Filter data records in a table (based on

some condition).

  • Rename or Replace columns of data.
slide-21
SLIDE 21

Preprocessing in the Table Editor

  • Preprocessing nodes are best

described in examples. We will build upon the previous workflow querying data from the training material database.

  • Preprocessing nodes can be

assembled to modify data by drag & drop onto the project workspace or by using the table editor.

  • The Table Editor is used to examine

values in a data table. I t also provides an easy interface to perform data pre- processing operations interactively by updating the display to show the effects of performing any of these

  • perations.
  • I n the example, the SQL can be

executed by viewing the table editor.

Select & Right click

slide-22
SLIDE 22

Anatomy of the table editor

Data Table Pre-Processing Toolbar Editing Area Menu Table statistics

slide-23
SLIDE 23

Stats Summary

  • The Stats Summary pane in the

table editor provides a statistical summary for the sample rows used as an input to the table editor in the active table

  • The "Calculate" buttons calculate

the stats for the columns to which they correspond and return the relevant info.

  • The following screenshot shows a

typical initial view, containing information that has been generated from the sample

  • dataset. Columns with no

duplicates will not show any statistics

slide-24
SLIDE 24

Preprocessing Toolbar

  • Preprocessing nodes can be selected,

tested and executed easily within the table editor.

  • As an example, the NCI HI V dataset will be

arbitrarily modified using the Derive, Filter, Replace, and Delete nodes.

  • The derive node adds a newly derived

column to the original input table. The name and type of the output column can be

  • chosen. I n addition, you could enter an

expression describing how the values of the new column are derived from existing

  • column. (e.g. 5 * Activity_Value)
  • The table is automatically updated. The

undo button will remove the newly created

  • node. Cancel will cancel editing session &

discard changes. Commit will create nodes

  • n the project workspace.
slide-25
SLIDE 25

Filter and Replace Nodes

  • The filter component removes

(filters out) the rows from the input table that do not match a user-specified condition.

  • A filter condition is specified as a

Boolean (logical) combination of mathematical operations on column values. For example: Comment = 1 or (Comment = 1 and NCI _Conclusion = ‘CA’)

  • The replace component searches

a column for a given text string and replaces the string with a user-supplied one. This operation can only be applied to discrete- valued columns.

slide-26
SLIDE 26

Using the Table Editor

  • The delete component removes one or

more columns from an input table.

  • A delete node can be created within

the table editor by selecting the desired columns and then clicking on the delete button.

  • Each of the process nodes are listed in

the editing area.

  • The table view is modified according

to the nodes processed.

  • Commit creates the nodes in the

project.

slide-27
SLIDE 27

Project Nodes and Grouping

  • A group option allows a user to

group several icons together into

  • ne icon. This operation is

typically used to organize workflow appearances.

  • To apply group-specific
  • perations you must first select

the group of nodes by highlighting the background area

  • f the desired group in a project
  • window. You do this by left-

clicking anywhere near these nodes and dragging a boxed area around them with your mouse

  • pointer. Nodes that have been

selected within a group will appear in a darker color. Right- click anywhere in the highlighted area to bring up the group- specific pop-up menu.

slide-28
SLIDE 28

Alternative method to Table Editor

  • Connection of Drag & Drop processing

nodes in a project produces a similar

  • workflow. Properties of each node

selected can be edited. For example, the Pivot Node will be used to create a simple table.

  • The pivot node can be used to provide a

summarization of an input table. Pivot tables work similarly to that used in Excel and can quickly summarize report data.

  • A key role aggregates in the vertical
  • dimension. Other columns can be

ignored or included

  • The parameter tab chooses the

categories and values used in the pivot table.

slide-29
SLIDE 29

Pivot Node & Table Results

  • Categories aggregates the

horizontal dimension for one or more values (the output table columns).

  • Values selects how the

categories will be aggregated in the table

  • The pivot table results are:
slide-30
SLIDE 30

Introduction to Interactive Browser

  • I nforsense KDE 1.9 contains a 2D and

3D interactive browsers.

  • The I nteractive Browser is a

visualization tool presenting multiple 2-D views of the same data.

  • The I nteractive Browser allows you to

visually and interactively select data elements from each of the displayed plots.

  • Selected data can be saved as

Kensington tables and used to query

  • databases. This allows you to use the

I nteractive Browser to visually define filter conditions over your data tables.

  • I n our example, the database will be

queried and viewed in the I nteractive Browse.

slide-31
SLIDE 31

Anatomy of Interactive Browser

List of components Display Options Panel Visualization Panel Grids Plot Types Visualization Panel Component and Filtering Panels Table and Graphs

slide-32
SLIDE 32

Control Panel: Filtering Tab

  • Check boxes show whether a filter is active or inactive.
  • Collapsible panels allow the filter information to be

hidden.

  • Range allows the user to reduce the amount of data

that appears on the graph. By adjusting the range selection, the range of each column is changed.

  • The two boxes below the slider show the actual

numerical values of the range.

  • I (I nvert filter) (for string values) inverts the selected

data

  • R (Reset) allows users to reset modifications made to

individual filtering options

  • Apply changes made in information panel are

visualized in display panel

  • Reset (master reset for filter tab) allows the user to

convert to the original unfiltered data state

  • Auto update by checking this box, changes made in the

information panel are automatically and interactively updated in the display panel

slide-33
SLIDE 33

Control Panel: Components Tab

Components tab: provides a list of all the current objects created by the user.

  • Select allows the user to drag an item from the graph list onto a visualization

panel.

  • Rename allows the user to rename the graph
  • Hide allows the user to hide graphs (i.e. removed from the visualization panel).
  • Delete allows the user to delete unwanted graphs completely from the

browser.

  • Export (only for tables) allows the user to export the table back to userspace
slide-34
SLIDE 34

Display options panel

  • Legend is used to create graph legends

by associating color ranges to data points.

  • Dimensions provides a list of all the

available dimensions. The user can interactively manipulate the graph. The dimensions options depends upon the type of graph created.

  • Options provides access to the different

graph options that determine various aspects of how the graph is displayed. For example, by selecting "Show Legend" the legend for the graph will appear in the graph window on the visualization panel. The user can also choose to "Show nulls" for data points

  • n a graph.

Images of the Tabs are Illustrated Side by Side

slide-35
SLIDE 35

Types of Plots

The Types of Plots available are:

  • Box Plot
  • Bar Charts
  • Bin Plot
  • Pie Chart
  • Parallel Plot
  • Scatter Plot
  • Venn Diagram
  • Table View
  • Structure View
  • Multiplots
slide-36
SLIDE 36

View and Deployment Nodes

  • Graphs designed in the

interaction browsers can be saved as view or deployment nodes.

  • A view node gives you as static

snapshot of the data. For example, the view node illustrating the pie chart is produced.

  • The deployment nodes allows the

user dynamic access to results.

  • Currently, there are 6 types of

graphical deployment nodes.

slide-37
SLIDE 37

Deployment Nodes

  • The properties editor options of the

visualization deployment nodes are the same as in the I nteractive Browser.

  • An alternative way of creating

dynamic graphical representations of data is Drag & Drop a deployment node in a project, connect it to a table

  • f data, choose options in the

properties editor, connect it to a view deployable image node and execute.

  • All images must be connected to a

view deployable image node for visualization.

  • Multiple images must be connected to

a tile deployable images node before viewing.

slide-38
SLIDE 38

Introduction to Chemscience

slide-39
SLIDE 39

NCI Dataset

The dataset used was obtained from theNational Cancer I nstitute (NCI ). The NCI has screened thousands of compounds for evidence of anti-HI V activity and is freely available for download at http:/ / dtp.nci.nih.gov/ docs/ aids/ aids_data.html. A diverse subset of the AI DS Screening Results published in March 2002 has been preprocessed and being used in our examples. The results of the screening tests are evaluated by the NCI and placed into one of three categories:

  • CA – Confirmed Active,
  • CM –Confirmed moderatly active
  • CI – confirmed active.

The columns in our dataset can be described as:

  • CAS_RN: The chemical abstract service registry number.
  • SDF: The 2D representation of the molecular structure.
  • SMI LES: The 1D representation of the molecular structure.
  • NSC: The primary identifier assigned to the compound by NCI
  • NCI _CONCLUSI ON: The result from the screening tests marked as CA, CM or CI .
  • ACTI VI TY: The NCI conclusions reduced to either active (= CA and CM) or inactive (= CI ).
  • ACTI VI TY_VALUE: The results of the ACTI VI TY column annotated as an integer (CI = 0 and

CA= 1).

slide-40
SLIDE 40

Import SD file

  • An I mport SD file node allows

the user to drag an SD file, imported to the userspace, into a project as a node. The user can also set/ change the input SD file by modifying the Path parameter.

  • I f a SD File is not located with in

a userspace, the user will need to insert the file into a userspace by right clicking on a location and uploading file. The software will automatically associate an extention of * .sdf as an SD File. Then you need only drag &drop the uploaded file from the userspace into a project.

slide-41
SLIDE 41

Clean SD file

  • The node accepts an SD file as input and

converts it into a table respresentation removing any badly formated molecules.

  • A search throught the file will be performed

identifing any additional descriptor information and automatically add them to the table in a spearate column.

  • I t is recommended that you clean SD files at the

beginning of any workflow to ensure that errros are not carried forward through the calculations.

  • The Column Name in the properties editor

selects the molecular structure (SMI LES, MOL or SD format). Two different types of tables can be created during execution.

  • A Cleaned table contains all the correctly

formated molecules.

  • A Error table contains all the molecules that

were badly formated.

slide-42
SLIDE 42

Calculating Molecular Descriptors

  • The purpose of the node is to

calculates descriptor values for molecules.

  • The types of molecular descriptors

that can be calculated are logP, logD, pKa, Mass, Hydrogen Bond Acceptors, Hydrogen Bond and Hydrogen Bond Donors.

  • Each type of descriptor is added into a

new column.

  • Descriptors can be calculated directly

from a SMI LES (1D) string or a MOL/ SD (2D/ 3D) format.

  • A window will appear allowing the

choice of linking the “cleaned” or “error” table to the node.

slide-43
SLIDE 43

Visualizing the Histograms of the Molecular Descriptors

  • The molecular descriptors calculated

will be ploted in a histogram using the interactive browser.

  • I n this example, we are going to

filter the data based on: MW < 800, Hydrogen Bond Donors < 5 and Hydrogent Bond Acceptors < 10

  • A bar chart can be created that plots

the molecular weight against the count of molecules in the dataset.

  • The X-axis is set to Mass and the

type of data point visualization to Bin width. The Y-axis is set to Molecular I D. The default function for a string is count.

  • Our plot shows only the filtered

results.

slide-44
SLIDE 44

Multiplot

  • Multiple graphs can be produced

at the same time.

  • For example, the multiplot option

under the Tools menu. Multiple different plot types and Layouts can be choosen.

  • Each graph can be saved as a

node or a deployable node in the

  • workspace. The “View

Deployable I mage” Node can be used to recreate graphical results

  • nce the work in the I nteractive

Browser is completed.

slide-45
SLIDE 45

Converting the Molecular Format from 2D to 3D

  • The purpose of this node is to convert either

SMI LES or a 2D SD representation to produces a 3D molecular model.

  • The new 3D mol data can be appended to the

table in a new column.

  • The default setting replaces the column

containing the 1D or 2D representation and is dependent

  • The user must select which port to connect

from and to before executing this node.

  • The “Convert Molecular Format (2D-3D)” and

“Convert Molecular Format (1D-2D)” are used simlar to our example. You can interchangably convert the 2D strucutre in

  • ur sample SD file to a 1D SMI LES

representation and then back to a 2D or even a 3D structure with ease as shown below. You can explore these nodes on your own leasure.

Examples:

slide-46
SLIDE 46

Exporting SD file

  • This node allows a user to export a

column containing molecules in SMI LES/ SD/ Mol format with associated descriptor information to an SD File.

  • Other columns in the input Table can

be included within the SD File as descriptors by changing the column role to "I nclude as Descriptor" within the input tab of the Properties Editor window.

  • An option exists to choose whether or

not to to add 3D coordinates (if not already present) in the SD file.

  • I t is not recommended if the input

Table has a large number of entries, as the 3D cleaning process can increase processing time.

slide-47
SLIDE 47

Identifying Actives & Inactive compounds

For example, we will build a project to identify active and inactive compounds by similarity and substructure matching. The training material database will be queried to extract chemical and biological data and used with Chemscience nodes.

slide-48
SLIDE 48

Clean Table Node

  • This node is required when the

initial data collection is in the form of an table. The main purpose of the node is to scan data for illogical errors and poor formatted structures.

  • The output is a filtered table. The

columns containing molecular property information in the input table are unchanged.

  • Table (Cleaned): A table

containing all the correctly formatted molecule

  • Table (Errors): A table containing

all the badly formatted molecules.

slide-49
SLIDE 49

Create Fingerprint Matrix

  • The purpose of this node is to create a

fingerprint summary of a molecular structure.

  • Fingerprints encode the connections

between atoms of the chemical graph and allow fast numerical comparisons

  • f chemical structures without a

direct structural comparison.

  • Topological or pharmacophoric

fingerprints are currently supported.

  • The vector representation of the

fingerprint is shown in the table.

  • An advanced example predicting

activity of an unknown set of compounds by classification trees using molecular fingerprints as descriptors is forthcoming.

slide-50
SLIDE 50

Multiple Substructure Filtering

  • This node provides a way to identify or reduce

molecules in the data set by means of a structure based search.

  • Multiple substructures are first selected and

filtered out of a table in a single pass. Conversely, the substructures can be used to identify matching compounds.

  • Table (Matching): A table of molecular data

where a structure search found a match.

  • Table (Not Matching): A table of molecular

data where a structure search did not find a match.

  • The substructures, such as Nitro, peroxide,

thiocyanate, aldehyde groups, can be selected.

slide-51
SLIDE 51

Descriptor Based Filtering

  • The purpose of this node is to provide a

means to specify simple constraints over

  • descriptors. I t combines molecular

property calculations and filtering to remove molecular data that does not pass the constraints in one step on a data table.

  • I n the properties editor, the Molecular

Column selects the molecular structure.

  • For our example, We are going to

append the molecular descriptors to the

  • utput table by ticking the value and

filter out using the “Rule of 5”

  • The rule of 5 states that poor absorption
  • r permeation are likely when there are

more than 5 Hydrogen bond donors (HBD), the molecular weight is over 500, the logP is over 5, and when there are more than 10 hydrogen bond acceptors (HBA).

slide-52
SLIDE 52

Anatomy of the Sketch Pad

Drawing Area Common Aromatic Ring System Templates Bond Selector Atom Selector Menu Bar Sample Molecule Editing Tools Template Selector

slide-53
SLIDE 53

Sketch a Molecule

  • There are two ways to insert a

chemical structure. The molecule can be typed into the Drawn Molecule field as a SMI LES string or drawn using the Molecule Sketch pad. Drawing the molecule using the Molecule Sketch pad is often easier than remembring how to write a SMI LES string.

  • The Molecular Sketch pad is used

similar to that of other chemical drawing packages.

  • I n this example, we are going to draw

1H-Pyrimidine-2,4-dione a.k.a. Uracil.

  • Thymidine containing compounds are

known to have in vitro activity against HI V-1. Uracil is going to be used to quickly identify similar compounds within the NCI data set.

slide-54
SLIDE 54

Similarity Filter Node

  • This node allows a table of molecules to be

filtered by calculating the dissimilarity value against a query molecule and specifying the bounds for filtering.

  • The filtered NCI dataset will be identified as the

source library. The query is the molecular structure of Uracil..

  • Two different types of outputs are available: one

containing all the molecules that fell within the bounds of the dissimilarity threshold and the

  • ther for all the molecules which fell outside of

these bounds.

  • Two types of dissimilarity distance

measurement: Tanimoto or Euclidean

  • I n the example, the tanimoto dissimilarity

coefficient is to be calculated given a bound threshold value of less than 0.6. The number of results found with tanimoto dissimilarity coefficient < 0.6 will be 51. I f we decrease the theshold to 0.5, we will receive 8 results.

slide-55
SLIDE 55

Analysis of the results using the Interactive Browser

  • The dataset and molecule icons on the

browser can show tables and chemial structures.

  • Multiple compounds can be shown by holding

down the shift key and selecting rows. SMI LES, SD or MOL formats can be visualized in the browser.

  • The CI , CM and CA compounds have

dissimilarity scores of 0.33, 0.47 and 0.52, respectively.

  • Each of these molecules pass the ‘Rule of 5’

and contain no toxic moities.

  • Our search results identified 51 compounds.

Approximately, 72% of the compounds are confirmed moderatly (CM) active or active (CA) by the NCI . The results can be used to create a preliminary structure activity relationship for the series. Results are readily available to scientists by means of deployment.

slide-56
SLIDE 56

Similarity Scores Node

  • The purpose of this node is to

calculate dissimilarity scores of a target molecule against each molecule in a library of structures.

  • The query molecule and a target

library of molecules are required for input.

  • Two types of dissimilarity distance

measurement: Tanimoto or Euclidean

  • Execution of the node produced a

table with the appended similarity scores as shown below. The main differences between the Similarity Filter and Score nodes are that the filter node allows entry of bound constraints whereas the Score node does not and that the score node appends the dissimilariy values to that table wheaas the Filter node does not.

slide-57
SLIDE 57

Substructures Match Node

  • The main purpose of this node is

to determine whether a substructure is present within a collection of molecules.

  • The query molecule and a library
  • f molecules for searching are

required for input.

  • The output appends an extra

column to the library of molecules stating whether or not the substructure is present.

  • The flag in the new column is

either Yes or No.

slide-58
SLIDE 58

Substructure Filter node

  • The purpose of this node allows a library of

molecules to be filtered using a specific substructure as the query.

  • The query molecule and a library of molecules for

searching are required for input.

  • Two different types of outputs are available: one

containing all the molecules that included the substructure at least once, and the other for all the molecules without any occurrence of the substructure.

  • The main difference between the Substructure

Match and Substructure Filter nodes is that the filter node allows the user to quickly filter out the presence or absence of a match.

  • The resulting table illustrated in the previous slide

be filtered using a postprocess command to create a table with identical information but with the appended column. The Substructure filter node simply does this in one step.

slide-59
SLIDE 59

NOTE

Part VI – Predictive Modelling

  • Predicting activity using classification

algorithms using chemical fingerprints

  • SOME PROBLEMS CURRENTLY EXIST WITH DOING

THIS AND IT IS SOMETHING WE NEED TO FIX AND THEN ADD TO THE TRAINING MATERIAL

  • THIS DATASET WILL GIVE GOOD RESULTS USING

FINGERPRINTS IN CLASSIFICATION MODELS.