SLIDE 1
Automatic Synchronization and Distribution of Biological Databases and Software over Low-Bandwidth Networks among Developing Countries
P2P Node Setup Guide
Authored by:
Unitsa Sungket, Prince of Songkla University, Thailand
Darran Nathan, APBioNet
SLIDE 2

Background

Bioinformatics and the need for network bandwidth

Bioinformatics involves the collection, organization and analysis of large amounts of biological data, using networks of computers and databases. Bioinformatics centers around the world have to update their database repositories regularly with the latest releases. This is normally done by file transfer over FTP, but the large and growing size of these databases means that high network bandwidth is needed for new releases to download quickly and without failure. To help with this, a network of database mirror sites was established in several countries in 1997 under the Bio-Mirror project.

Developing countries in the Asia-Pacific region are just moving into bioinformatics, but the computational infrastructure and network bandwidth available in those countries still lag far behind those of more developed countries. Network bandwidth within these countries is still very low, and unreliable connections mean that interrupted or aborted downloads are common. So, despite the availability of the Bio-Mirror nodes, many developing countries still face a major problem in regularly updating these databases. The problem will only get worse in the coming years, because the databases are growing faster than bandwidth is reaching the end user.

A revolution in file sharing technology

In the late 1990s, the Internet community witnessed the start of a major revolution in the way people share files: Peer-to-Peer (P2P) file exchange was introduced with the wildly popular Napster in 1999. Internet users used it to share mp3 music and video files throughout the world. P2P technology does not only exchange files between a central server and the multiple clients that connect to it; it focuses on using the clients to exchange files amongst themselves.

The technology continued to evolve and improve, with the second generation P2P FastTrack / Kazaa network in 2001. In 2002, the BitTorrent protocol was introduced. This third generation P2P technology was a major advance over previous P2P protocols. With BitTorrent, a large file to be distributed is broken up into smaller fragments, typically around a quarter of a megabyte each. These fragments are distributed to each peer, and amongst peers, in a random manner, and are reassembled at the requesting machine. The difference between traditional client/server distribution of files and 3rd generation P2P distribution is illustrated in Figures 1 and 2 below:
SLIDE 3

Figure 1. Traditional Client / Server distribution of files
Figure 2. BitTorrent distribution of files
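The fragmentation and reassembly described above can be sketched in a few lines of Python. This is a minimal illustration, not the Azureus implementation: the function names are hypothetical, the 256 KiB piece size follows the "quarter of a megabyte" figure in the text, and the transfer is simulated in memory by shuffling the piece order. BitTorrent verifies pieces with SHA-1 digests, which the sketch mirrors.

```python
import hashlib
import random

PIECE_SIZE = 256 * 1024  # a quarter of a megabyte, as in the text


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Cut the payload into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """SHA-1 digest of every piece, used to verify what peers send."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


def reassemble(received: dict[int, bytes], hashes: list[bytes]) -> bytes:
    """Verify each piece against its hash, then join the pieces in index order."""
    for index, piece in received.items():
        if hashlib.sha1(piece).digest() != hashes[index]:
            raise ValueError(f"piece {index} failed verification")
    return b"".join(received[i] for i in range(len(hashes)))


def simulate_transfer(data: bytes, piece_size: int = PIECE_SIZE) -> bytes:
    """Hypothetical in-memory transfer: pieces arrive in random order."""
    pieces = split_into_pieces(data, piece_size)
    hashes = piece_hashes(pieces)
    order = list(range(len(pieces)))
    random.shuffle(order)  # stands in for pieces coming from different peers
    received = {index: pieces[index] for index in order}
    return reassemble(received, hashes)
```

Because every piece carries its own index and hash, the requesting machine can accept pieces from any peer in any order and still rebuild the original file.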
These figures illustrate the power of the concept introduced by 3rd generation P2P technology. As the number of downloading clients in the traditional distribution architecture increases, the bandwidth demanded of the server increases with it and leads to a bottleneck. In the 3rd generation P2P architecture, however, the more peers there are, the more nodes are available to distribute fragments of the file. High demand actually leads to greater throughput, as more bandwidth from additional nodes becomes available to the group.

Using P2P technology in distributing biological data

From the comparison above, it can be seen that 3rd generation P2P technology offers to solve simultaneously the two major problems plaguing the distribution of biological data to developing countries:

1) Low international bandwidth
- With a P2P architecture, downloads need not come from a central server in another country. Every peer that connects to synchronize its databases or software, whether from the same institute, state, country or region, provides additional bandwidth that speeds up the overall download rate of all the peers.

2) Unreliable connections
- In the conventional server/client architecture, all downloads come from a single server, and if this connection becomes very slow or unreliable, there can be no ‘failover’ to continue downloading automatically from another source.
- In the 3rd generation P2P architecture, however, downloads are automatically sourced from the peers with the best connections; if a connection experiences a bottleneck, downloads automatically continue from the next best connections.
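The bandwidth argument above can be made concrete with a simplified, idealized model. The sketch below is an assumption for illustration only: it treats bandwidth as a fluid shared equally, ignores protocol overhead and connection failures, and the function names and example figures are hypothetical rather than measured values.

```python
def client_server_rate(server_up: float, n_clients: int, client_down: float) -> float:
    """Per-client rate when one server's uplink is shared equally by all clients."""
    return min(client_down, server_up / n_clients)


def p2p_rate(server_up: float, peer_up: float, n_peers: int, peer_down: float) -> float:
    """Per-peer rate when every peer also contributes its own uplink.

    The rate is capped by the peer's downlink, by the seed's uplink (the seed
    must push out at least one full copy), and by the total upload capacity
    of the group divided among the peers.
    """
    total_upload = server_up + n_peers * peer_up
    return min(peer_down, server_up, total_upload / n_peers)


# Hypothetical figures in kB/s: a 1000 kB/s seed, peers with 50 kB/s uplinks
# and 500 kB/s downlinks.
for n in (10, 100):
    print(n, client_server_rate(1000, n, 500), p2p_rate(1000, 50, n, 500))
```

With these assumed numbers, going from 10 to 100 downloaders cuts the client/server rate from 100 to 10 kB/s per client, while the P2P rate falls only from 150 to 60 kB/s per peer, so the aggregate throughput of the group rises from 1500 to 6000 kB/s.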
SLIDE 4

P2P technology can be applied in three areas: the distribution of biological software, courseware, and databases.

Objectives

1) To develop a client application based on 3rd generation P2P protocols, or to select and extend an existing open-source one, for use in the distribution of biological software, courseware, and databases.
2) To set up and test the performance of this P2P distribution network for biological software, courseware, and databases, with nodes in countries in the Asia-Pacific region, starting with Singapore, Thailand, and Korea and later extending beyond them. These tests will include:
- Benchmarking performance against the more traditional rsync and FTP techniques
- Assessing the effect of bandwidth saturation when using P2P
- Identifying the P2P architecture and topology variations best suited to distributing datasets of different sizes

P2P Software Selected

After extensive analysis and trials of various available P2P software, the Azureus program was selected for this work for the following reasons:
- It is open-source and has a large, active development community
- It runs on Java, allowing it to be deployed on any OS for which a Java runtime is available
- It has a well-documented plug-in interface that makes it easy to develop additional enhancements that may be necessary for this project

Setup of the P2P node

This section describes the steps needed to set up a P2P node. The steps assume that a server has been assigned by your institute and set up with the OS and the necessary supporting software, such as antivirus, firewall, and intrusion detection systems.

1. Installation of Azureus

a) Download and install the Azureus program from http://azureus.sourceforge.net/download.php
Linux users can view the installation details at http://azureus.sourceforge.net/howto_linux.php
Windows users can view the installation details at http://azureus.sourceforge.net/howto_win.php
SLIDE 5

Azureus has two sections that should be set up: ‘client’ settings and ‘server’ settings.

1) Client settings - for download of data from peers

1.1) Go to the Tools menu and choose Options. In the list on the left, click Connection. Pick a number between 49152 and 65534, and enter it in the incoming TCP listen port and UDP listen port boxes, as shown in Figure 3. Then click Save to save this change. Ensure that you have opened this port in your firewall for both download and upload.

Figure 3. Setting the incoming TCP and UDP ports

1.2) To test the download of data from a Seed node, download a torrent from the KOBIC Tracker (http://ftp.kobic.re.kr:6969/) as shown in Figure 4. In the File menu, click Open and choose Add file to add the torrent that you have downloaded. This is shown in Figure 5.

If everything has been set up correctly, the download should begin. Figure 6 shows the PSU node downloading the file go_200608-assocdb.rdf-xml.gz from the KOBIC node (torrent go_200608-assocdb.rdf-xml.gz.torrent) at a download speed of 12.4 kB/s.

SLIDE 6

Note: if the Health indicator on the torrent is red, your server is not connected to any peer. This may be either because the tracker server is down or because no Seed node is present.
Figure 4. KOBIC tracker
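For step 1.1, a port in the suggested range can be chosen and checked before it is typed into Azureus. The following is a minimal sketch with hypothetical helper names. It only checks that nothing on the local machine is already bound to the port; it cannot tell whether the firewall lets traffic through.

```python
import random
import socket

PORT_MIN, PORT_MAX = 49152, 65534  # range suggested in step 1.1


def port_is_free(port: int) -> bool:
    """Return True if both TCP and UDP can bind to the port on this machine."""
    for sock_type in (socket.SOCK_STREAM, socket.SOCK_DGRAM):
        with socket.socket(socket.AF_INET, sock_type) as sock:
            try:
                sock.bind(("", port))
            except OSError:
                return False
    return True


def pick_listen_port(attempts: int = 20) -> int:
    """Try random ports in the suggested range until a free one is found."""
    for _ in range(attempts):
        port = random.randint(PORT_MIN, PORT_MAX)
        if port_is_free(port):
            return port
    raise RuntimeError("no free port found in the suggested range")


if __name__ == "__main__":
    print("Candidate listen port:", pick_listen_port())
```

The number printed is the one to enter in both the TCP and UDP listen port boxes and to open in the firewall.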
SLIDE 7
Figure 5. Opening the Torrent file
Figure 6. PSU node downloading data from KOBIC node

2) Server settings - for setting up a Seed node to host and manage a database

If you want to upload or distribute your data to any peer, you must create a torrent for that data and host the torrent on a ‘tracker’.

2.1) Go to the Tools menu and choose Options. In the list on the left, click Tracker, then click Server. Enter your external IP address or server name. Select the HTTP port check box and enter a port such as 6969, as shown in Figure 7. Ensure that you have opened this port (6969 in this example) on your firewall.

2.2) In the list on the left, click Plugins, then click Tracker Web. Select Publish torrent, enter the title of your tracker web page as shown in Figure 8, and select all the RSS feed options for automatic synchronization, as shown in Figure 9.
SLIDE 8
Figure 7. Tracker server settings
Figure 8. Tracker Web settings
SLIDE 9
Figure 9. RSS feed setting in Tracker Web

2.3) Create a torrent by clicking New Torrent in the File menu and selecting Embedded Tracker, as shown in Figure 10. Before clicking ‘Finish’ to create the torrent, check the boxes to Open the torrent and Host the torrent, as shown in Figure 11.
SLIDE 10
Figure 10. Creating a new torrent
Figure 11. Options selected to create a new torrent
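To give a sense of what the New Torrent wizard produces in step 2.3, the sketch below builds a minimal single-file torrent (metainfo) in Python. It is an illustration and not Azureus code: the helper names and the announce URL are hypothetical, and only the core fields of the BitTorrent metainfo format are included (the announce URL, the file name and length, the piece length, and the concatenated SHA-1 piece hashes), encoded with bencoding.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists and dicts using bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, bytes):
        return b"%d:" % len(value) + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name: str, data: bytes, announce: str,
                   piece_length: int = 256 * 1024) -> bytes:
    """Return the bytes of a minimal single-file .torrent for the given data."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {"name": name, "length": len(data),
            "piece length": piece_length, "pieces": pieces}
    return bencode({"announce": announce, "info": info})


if __name__ == "__main__":
    # Hypothetical tracker address; substitute your own tracker URL.
    torrent = build_metainfo("example.gz", b"example payload",
                             "http://tracker.example.org:6969/announce")
    print(len(torrent), "bytes of metainfo")
```

The sketch reads the whole payload into memory, which is adequate for illustration; for multi-gigabyte database releases the file would be hashed piece by piece from disk.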
SLIDE 11

2.4) The tracker URL will be of the form http://yourexternalIPAddress:6969/ (as shown in Figure 12).

Figure 12. PSU node tracker server URL

The Seed / Tracker node is now set up. The Azureus User Guide gives more general user information that may be useful: http://azureus.sourceforge.net/doc/Azureus%20User%20Guide.htm
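Once the tracker is configured, a quick connectivity check from another machine can confirm that the tracker port is reachable through the firewall. This is a minimal sketch with hypothetical function names and a placeholder address. It only tests that a TCP connection can be opened; it does not verify that a tracker is actually answering on that port.

```python
import socket
from urllib.parse import urlsplit


def tracker_url(host: str, port: int = 6969) -> str:
    """Build a tracker URL of the form described in step 2.4."""
    return f"http://{host}:{port}/"


def tracker_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the tracker's host and port succeeds."""
    parts = urlsplit(url)
    try:
        with socket.create_connection((parts.hostname, parts.port or 80), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = tracker_url("192.0.2.10")  # placeholder address; use your external IP or server name
    print(url, "reachable" if tracker_reachable(url) else "not reachable")
```

If the check fails from outside your network but succeeds locally, the firewall rule for the tracker port is the first thing to review.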
3. RSSFeed Scanner plugin settings for automatic synchronization of data
1) Download the RSSFeed Scanner plugin from http://azureus.sourceforge.net/plugin_list.php (select version 1.3.1).
2) Install the plugin by unzipping rssfeed_1.3.1.zip and placing the resulting folder in the “plugins” folder of the Azureus program path. For example, if you installed Azureus to C:\Azureus, place the “rssfeed_1.3.1” folder in C:\Azureus\plugins. Then restart Azureus.
3) Click ‘open RSSFeed Scanner plugin’ in the Plugins menu and select RSSFeed, then select the Options tab.
3.1) Create an RSSFeed URL for the KOBIC node by clicking the “+” label and setting the options as shown in Figure 13. If you enter 1800 in the Delay time box, rss_KOBIC is refreshed every 1,800 seconds (30 minutes), and any new torrent it lists is downloaded automatically.
SLIDE 12

Figure 13. RSS Feed URL setting

3.2) Create Filters for the KOBIC node by clicking the “+” label and setting the options as shown in Figure 14.

Figure 14. Filters setting
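The combination of a feed URL and a filter can be pictured with the following sketch. It is not the plugin's code: it assumes a standard RSS 2.0 feed in which each item has a title and either an enclosure URL or a link pointing at the torrent, and it assumes the filter is a regular expression applied to the item title. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def torrent_links(rss_xml: str, title_pattern: str) -> list[tuple[str, str]]:
    """Return (title, url) for every feed item whose title matches the filter."""
    pattern = re.compile(title_pattern, re.IGNORECASE)
    matches = []
    for item in ET.fromstring(rss_xml).iter("item"):
        title = item.findtext("title", default="")
        enclosure = item.find("enclosure")
        # Prefer the enclosure URL; fall back to the item's link element.
        url = enclosure.get("url") if enclosure is not None else item.findtext("link")
        if url and pattern.search(title):
            matches.append((title, url))
    return matches


def new_links(rss_xml: str, title_pattern: str, seen: set[str]) -> list[tuple[str, str]]:
    """Return only the matching items not seen on earlier refreshes."""
    fresh = [(title, url) for title, url in torrent_links(rss_xml, title_pattern)
             if url not in seen]
    seen.update(url for _, url in fresh)
    return fresh
```

A scheduler would fetch the feed and call new_links once per delay period (every 1,800 seconds in the example above), passing the same seen set each time so that each torrent is queued only once.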
SLIDE 13
3.3) When you click the Status tab, you can see that some torrents from the KOBIC node have been downloaded automatically, as shown in Figure 15 and Figure 16.

Figure 15. Status of RSSFeed Scanner
Figure 16. The Torrent files in Azureus after running RSSFeed Scanner
SLIDE 14

3.4) You can read more details in the Help tab.

4. Advanced Statistics plugin settings for performance tests

1) Download the Advanced Statistics plugin from http://azureus.sourceforge.net/plugin_list.php
2) To install this plugin, follow the same instructions as given above for installing the RSSFeed Scanner plugin.
3) Go to the Plugins menu and choose Advanced Statistics. When you select the Progress tab, you can see the percentage or size of the data that you have downloaded or uploaded, as shown in Figure 17. You can also see the current download or upload speed in the Activity tab, as shown in Figure 18.

Figure 17. Progress of each torrent
SLIDE 15
Figure 18. Activity

Initial trial results and next steps

Four trial nodes have been set up for the first phase of testing this P2P network: 1) Prince of Songkla University (PSU, Thailand), 2) Korean Bioinformation Center (KOBIC, Korea), 3) National University of Singapore (NUS, Singapore), and 4) National Center for Genetic Engineering and Biotechnology (BIOTEC, Thailand). We have set up three tracker sites to publish torrents, as shown in Table 1. The RSSFeed Scanner plugin was used to trigger automatic synchronization of data at regular intervals; this allows the Azureus client program to download data automatically from the seed nodes without user intervention.

Table 1. Tracker sites to publish torrents

Node     Tracker URL
PSU      http://biotracker.psu.ac.th:6969
KOBIC    http://ftp.kobic.re.kr:6969
BIOTEC   http://protcluster.biotec.or.th:6969

Trace results from the PSU node are shown in Figure 19. We used the FileZilla program to benchmark downloads using FTP, and the Azureus application for P2P downloads.

SLIDE 16

With only four nodes, the results already demonstrated an improvement in download performance using P2P. After seven days, 23.2 GB of data had been successfully downloaded using FTP, and about 70 GB using P2P. P2P therefore delivered approximately three times the overall download throughput of conventional FTP.

In conclusion, in this initial trial the P2P protocol was more effective than traditional FTP downloads for synchronizing large databases. In the next phase, more nodes from various Asia-Pacific countries will be included for larger scale tests of the performance of this P2P network.
Figure 19. Total data size downloaded
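The factor of three quoted above follows directly from the two totals, and the same totals give the average throughput over the seven-day trial. The short calculation below is a worked check, assuming 1 GB = 10^9 bytes and 1 kB = 10^3 bytes (the source does not state which convention was used); the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_kb_s(total_gb: float, days: float) -> float:
    """Average throughput in kB/s, assuming 1 GB = 10**9 bytes and 1 kB = 10**3 bytes."""
    return total_gb * 1e9 / (days * SECONDS_PER_DAY) / 1e3


ftp_kb_s = average_rate_kb_s(23.2, 7)  # about 38.4 kB/s
p2p_kb_s = average_rate_kb_s(70.0, 7)  # about 115.7 kB/s
speedup = 70.0 / 23.2                  # about 3.02

print(f"FTP {ftp_kb_s:.1f} kB/s, P2P {p2p_kb_s:.1f} kB/s, ratio {speedup:.2f}")
```

Under these assumptions the averages are roughly 38 kB/s for FTP and 116 kB/s for P2P.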
Credits

Software and P2P network trials:
Unitsa Sungket, Prince of Songkla University, Thailand
Sungsam Gong and Kim Woo Yeon, Korea Bioinformation Centre
Lin Honghuang, National University of Singapore

Coordinated by:
Darran Nathan, APBioNet
Tan Tin Wee, APBioNet
Amornrat Phongdara, Prince of Songkla University, Thailand
Jong Bhak, Korea Bioinformation Centre