Pavan Poluri, Siddharth Deokar, Varun Sudhakar - PowerPoint presentation

Sep 18, 2023

SLIDE 1

Pavan Poluri, Siddharth Deokar, Varun Sudhakar

SLIDE 2

Generic View of System

I/O Processing & File size management

Workload Distribution

Get the article count and word count

Redistribute files for calculating suffix arrays

Create one final Suffix Array per P

Perform a Binomial Reduction

Retrieve the top R interesting ngrams

Done!!!

LEGEND: Siddharth, Varun, Pavan, All

SLIDE 3

FOSTER’S DESIGN IN OUR PROJECT

Partitioning: Domain Decomposition

Communication: Broadcasting, Point-to-Point Communication and Customized Communication

Agglomeration: Gathering of suffix arrays

Mapping: Cyclic Mapping Strategy

SLIDE 4

Data Structure

Customized suffix array1 to hold the following data

Position of ngram in the file

File index to identify the file

Term Frequency

Document Frequency

Customized suffix array2 to hold the following data

Position of ngram in the file

File index to identify the file

Term Frequency

TF*IDF value
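As an illustration of the two record layouts listed above, here is a minimal Python sketch. The class and field names (`SuffixEntry1`, `SuffixEntry2`, `position`, `file_index`, and so on) are hypothetical; the slides only name the four pieces of data each structure holds, not how they were declared in the project.

```python
from dataclasses import dataclass


@dataclass
class SuffixEntry1:
    """Hypothetical layout for 'suffix array1' (names are assumptions)."""
    position: int            # position of the ngram in the file
    file_index: int          # index identifying the file
    term_frequency: int
    document_frequency: int


@dataclass
class SuffixEntry2:
    """Hypothetical layout for 'suffix array2' (names are assumptions)."""
    position: int            # position of the ngram in the file
    file_index: int          # index identifying the file
    term_frequency: int
    tf_idf: float            # TF*IDF value


entry = SuffixEntry1(position=12, file_index=3,
                     term_frequency=1, document_frequency=1)
```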

SLIDE 5

Algorithm

I/O processing

Reading directory and storing file information

File size Management

Partitioning files

Communication

Workload Distribution

Interleaved Allocation
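Interleaved (cyclic) allocation is usually taken to mean that file i goes to processor i mod p. The sketch below assumes that reading; the function name `interleaved_allocation` and the use of plain file indices are illustrative, not taken from the project.

```python
def interleaved_allocation(num_files, num_procs):
    """Assign file indices to ranks round-robin: file i goes to rank i % num_procs."""
    return {rank: list(range(rank, num_files, num_procs))
            for rank in range(num_procs)}


# Example: 10 files spread over 4 processors.
assignment = interleaved_allocation(10, 4)
for rank, files in assignment.items():
    print(rank, files)
```

With this scheme no processor holds more than one file more than any other, which is why it is a common choice when file sizes are roughly uniform.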

SLIDE 6

Contd…

Alpha Requirement:

Calculating the number of words and articles

Reduction

MPI_Reduce()
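To show what a sum reduction over word and article counts computes, here is a plain-Python simulation. It does not call MPI; the per-rank counts are made-up illustrative numbers, and `reduce_sum` is a hypothetical helper standing in for an `MPI_Reduce()` with a sum operation.

```python
def reduce_sum(per_rank_counts):
    """Combine (word_count, article_count) pairs from every rank into global totals."""
    total_words = sum(words for words, _ in per_rank_counts)
    total_articles = sum(articles for _, articles in per_rank_counts)
    return total_words, total_articles


# Illustrative local counts from 4 ranks: (words, articles).
local_counts = [(1200, 3), (950, 2), (1430, 4), (800, 2)]
totals = reduce_sum(local_counts)
print(totals)
```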

SLIDE 7

Contd…

Suffix Array Calculation

Every word has a suffix array associated with it

Allocating memory to suffix array based on the alpha output

Filling the details of suffix arrays of all words

Getting the position of the word in the file

Getting the file index of the file the word is in

Assigning term frequency

Assigning document frequency
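The filling step can be sketched as follows. This is a simplified illustration that assumes whitespace tokenisation and word offsets as positions; the slides do not say whether positions are byte or word offsets, and `build_entries` is a hypothetical name.

```python
def build_entries(files):
    """Return one (word, position, file_index) tuple per word occurrence."""
    entries = []
    for file_index, text in enumerate(files):
        for position, word in enumerate(text.lower().split()):
            entries.append((word, position, file_index))
    return entries


# Two tiny illustrative "files".
files = ["the cat sat", "the house"]
entries = build_entries(files)
for entry in entries:
    print(entry)
```

Term and document frequencies are left out here because they are only known once identical words have been brought together by sorting.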

SLIDE 8

Contd…

Sorting the Suffix arrays

Based on Quick sort algorithm

Time complexity of quick sort: O(NlogN) (average case)

Memory Requirement: O(NlogN)
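A compact quick sort over suffix-array entries might look like the sketch below. It is a simple out-of-place variant chosen for readability, not necessarily the variant used in the project, and the `(word, position, file_index)` tuples are an assumed entry layout.

```python
def quicksort(items, key=lambda x: x):
    """Sort items by key using quick sort with a middle-element pivot."""
    if len(items) <= 1:
        return list(items)
    pivot = key(items[len(items) // 2])
    smaller = [x for x in items if key(x) < pivot]
    equal = [x for x in items if key(x) == pivot]
    larger = [x for x in items if key(x) > pivot]
    return quicksort(smaller, key) + equal + quicksort(larger, key)


# Sort illustrative entries by their word.
unsorted_entries = [("house", 4, 0), ("cat", 0, 0), ("zebra", 2, 1), ("cat", 7, 1)]
sorted_entries = quicksort(unsorted_entries, key=lambda e: e[0])
print(sorted_entries)
```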

SLIDE 9

Contd…

Finding Distinct terms in same article

Cat(TF=1, DF=1)

House(TF=1, DF=1)

House(TF=2, DF=1)

House(TF=3, DF=1)

House(TF=6, DF=1)

SLIDE 10

Contd…

Finding Distinct terms in different articles

Cat(TF=1, DF=1)

House(TF=1, DF=1)

House(TF=2, DF=1)

House(TF=3, DF=2)
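One reading of the example above is that, after sorting, each occurrence of a term receives a running term frequency, and the document frequency increases only when the occurrence comes from an article not seen before for that term. The sketch below implements that reading; `annotate_tf_df` and the `(term, file_index)` input layout are assumptions.

```python
def annotate_tf_df(sorted_entries):
    """Attach running (TF, DF) counts to sorted (term, file_index) pairs."""
    annotated = []
    prev_term = None
    tf = df = 0
    seen_files = set()
    for term, file_index in sorted_entries:
        if term != prev_term:
            # A new term starts: reset the counters.
            prev_term, tf, df, seen_files = term, 0, 0, set()
        tf += 1
        if file_index not in seen_files:
            seen_files.add(file_index)
            df += 1
        annotated.append((term, tf, df))
    return annotated


# "House" occurs twice in article 0 and once in article 1.
result = annotate_tf_df([("cat", 0), ("house", 0), ("house", 0), ("house", 1)])
print(result)
```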

SLIDE 11

Contd…

Merging Suffix Arrays

Input: Two sorted suffix arrays

Reading ngrams from file

Output: One sorted suffix array

SLIDE 12

MERGE EXAMPLE

Input 1: F I R S

Input 2: C H Z

Output, step by step:

C

C F

C F H

C F H I

C F H I R

C F H I R S

C F H I R S Z
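The merge example can be reproduced with a standard two-way merge. This is a minimal sketch on single letters; the project merges suffix-array entries, and the name `merge_sorted` is hypothetical.

```python
def merge_sorted(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two inputs still has elements left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


merged = merge_sorted(list("FIRS"), list("CHZ"))
print(" ".join(merged))
```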

SLIDE 13

Contd…

Communication Strategies

Reading and Writing files (Strategy 1 - deprecated)

Binomial Tree Reduction and Nomenclature

Use of MPI_Barrier()

Single file corresponding to suffix array

Communicating Structures (Strategy 2)

Binomial Tree Reduction

Use of MPI_Pack(), MPI_Unpack()
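Strategy 2 sends structures as packed buffers. As a rough analogy only, the sketch below uses Python's standard `struct` module to flatten entries into one contiguous byte buffer and restore them on the other side. It does not call MPI, and the four-integer record layout (position, file index, TF, DF) is an assumption.

```python
import struct

# Assumed record layout: four little-endian 32-bit integers.
RECORD = struct.Struct("<iiii")


def pack_entries(entries):
    """Serialise a list of 4-int tuples, prefixed with the entry count."""
    header = struct.pack("<i", len(entries))
    return header + b"".join(RECORD.pack(*entry) for entry in entries)


def unpack_entries(buffer):
    """Inverse of pack_entries: rebuild the list of 4-int tuples."""
    (count,) = struct.unpack_from("<i", buffer, 0)
    offset = 4
    entries = []
    for _ in range(count):
        entries.append(RECORD.unpack_from(buffer, offset))
        offset += RECORD.size
    return entries


original = [(0, 0, 1, 1), (17, 2, 3, 2)]
buffer = pack_entries(original)
restored = unpack_entries(buffer)
```

Sending one packed buffer per message avoids the per-field messages or intermediate files that Strategy 1 relied on.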

SLIDE 14

Contd…

Binomial Tree Reduction

[Figure: binomial tree reduction]
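The communication pattern of a binomial tree reduction can be simulated in plain Python. In this sketch, which makes no MPI calls, the rank at `receiver` merges in the sorted array held by rank `receiver + step` in each round, and the step doubles until rank 0 holds everything. The function name and the single-letter data are illustrative.

```python
import heapq


def binomial_tree_reduce(per_rank_arrays):
    """Merge sorted per-rank arrays pairwise; return (result at rank 0, rounds)."""
    data = {rank: list(arr) for rank, arr in enumerate(per_rank_arrays)}
    p = len(per_rank_arrays)
    step = 1
    rounds = 0
    while step < p:
        for receiver in range(0, p, 2 * step):
            sender = receiver + step
            if sender < p:
                # The receiver merges the sender's sorted array into its own.
                data[receiver] = list(heapq.merge(data[receiver], data.pop(sender)))
        step *= 2
        rounds += 1
    return data[0], rounds


final_array, rounds = binomial_tree_reduce([["F", "R"], ["C", "Z"], ["I", "S"], ["H"]])
print(final_array, rounds)
```

With p processors the reduction finishes in ceil(log2 p) rounds, compared with p - 1 sequential merges if every rank sent directly to rank 0.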

SLIDE 15

Contd…

Finding top R interesting terms

Calculation and Storage

New suffix array structure with TF*IDF measure

Sorting

Merging
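Selecting the top R terms by TF*IDF can be sketched as below. The slides do not give the IDF formula, so this assumes the common definition IDF = log(N / DF), where N is the number of articles; the term statistics are made-up illustrative values and `top_r_terms` is a hypothetical name.

```python
import heapq
import math


def top_r_terms(stats, num_articles, r):
    """Return the r terms with the highest TF*IDF, assuming IDF = log(N / DF)."""
    scored = [(tf * math.log(num_articles / df), term)
              for term, (tf, df) in stats.items()]
    return [term for _, term in heapq.nlargest(r, scored)]


# Illustrative (TF, DF) pairs for a collection of 10 articles.
stats = {"the": (50, 10), "suffix": (8, 2), "cat": (3, 1), "house": (6, 5)}
top_terms = top_r_terms(stats, num_articles=10, r=2)
print(top_terms)
```

A term that appears in every article, such as "the" here, gets an IDF of zero and therefore never ranks as interesting, however frequent it is.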

SLIDE 16

Analysis

[Figure: Alpha. x-axis: Number of Processors; y-axis: Time in seconds]

SLIDE 17

[Figure: series for 0.6mb, 80mb and 120mb. x-axis: Number of Processors (2 to 64); y-axis: Time in seconds]

SLIDE 18

[Figure: ngram = 1. x-axis: Number of Processors (2 to 64); y-axis: Data in mb]

SLIDE 19

Formula

Amdahl’s Law

$\Psi \le \dfrac{1}{f + (1 - f)/p}$

where f is the serial component and p is the number of processors

Ψ is the speedup

Gustafson’s Law

$\Psi \le p + (1 - p)s$

Ψ is the scaled speed up

s is the serial component and p is the number of processors

SLIDE 20

Contd…

Using our results for data of size 120 MB

Speed up = 7680/3156 = 2.4

Considering the run on 4 processors as the serial case and the run on 16 processors as the parallel case

Using the formula for Amdahl’s Law and substituting Ψ as 2.4 we get f = 0.22 (this value is consistent with taking p = 16/4 = 4)

According to Gustafson’s Law using s = 0.22, Ψ (scaled speed up) = 3.34
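The arithmetic on this slide can be checked with a few lines of Python. The sketch assumes an effective processor count of p = 16/4 = 4, which is inferred from the reported values rather than stated on the slide, and it rounds intermediate results the same way the slide does. The function names are hypothetical.

```python
def amdahl_serial_fraction(speedup, p):
    """Solve speedup = 1 / (f + (1 - f) / p) for the serial fraction f."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


def gustafson_scaled_speedup(p, s):
    """Scaled speedup p + (1 - p) * s for serial fraction s."""
    return p + (1 - p) * s


speedup = round(7680 / 3156, 1)                    # rounded to 2.4 as on the slide
p = 16 // 4                                        # assumed effective processor count
f = round(amdahl_serial_fraction(speedup, p), 2)   # serial fraction, two decimals
scaled = gustafson_scaled_speedup(p, f)
print(speedup, f, round(scaled, 2))
```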

SLIDE 21

SLIDE 22

Contact Info

Project web page: giga word corpus

Email

Pavan Poluri: polur007@d.umn.edu

Siddharth Deokar: deoka001@d.umn.edu

Varun Sudhakar: sudha002@d.umn.edu