Scalable Tools - Part II (Adisak Sukul, Ph.D.)

SLIDE 1

Scalable Tools - Part II

Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/Bigdata/

SLIDE 2

Scalable Tools session

  • We will be using Spark, Python and PySpark.
  • We will use Jupyter Notebook as the IDE.

SLIDE 3

Install your software VM on your laptop: create an Ubuntu VM in VirtualBox

  • This lecture will walk through how to download and set up VirtualBox with Ubuntu.
  • Then we will walk through installing Spark, Python, and the Jupyter Notebook on this VirtualBox Ubuntu VM.

SLIDE 4

Option 1: host it locally on your laptop (recommended)

  • This option requires you to download some large files. I will also have those files available locally to copy during the session.
  • 1. Download VirtualBox (for Windows and Mac)
  • https://www.virtualbox.org/wiki/Downloads
  • (108 MB for Windows, 91 MB for Mac)
  • (If you are using Linux, you don't need this)
  • 2. Download Ubuntu
  • https://www.ubuntu.com/download/desktop
  • Select Ubuntu Desktop 18.04 LTS
  • (size 1.8 GB)

SLIDE 5
  • 3. Install VirtualBox
  • 4. Create a VM using the Ubuntu .iso file
  • 5. Log in to Ubuntu
  • We will install PySpark in class

SLIDE 6

VirtualBox (issue if you have Docker)

  • Only 32-bit option available?
  • Disable Hyper-V from Windows Features, then restart Windows.

SLIDE 7

Update your Ubuntu

  • sudo apt-get update
  • sudo apt-get upgrade

SLIDE 8

Verify your python3

  • python
  • python3
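
Once the python3 prompt opens, a quick check like the following confirms which interpreter you are in. This is only a sketch; the helper name is_python3 is made up for illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


# Print the full version string of the interpreter running this code.
print(sys.version)
if is_python3():
    print("This is a Python 3 interpreter")
else:
    print("This is not Python 3; start it with python3 instead")
```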

SLIDE 9

Install Jupyter with pip

  • Install pip3 with
  • sudo apt install python3-pip

SLIDE 10

Install Jupyter with pip3

  • (You can also get the full Anaconda instead)
  • pip3 install jupyter

SLIDE 11

Install Java JDK

  • Spark runs on Java (the JVM), so don’t forget to install it; otherwise you will get weird errors.
  • sudo apt-get install openjdk-8-jdk
  • Or
  • sudo apt-get install default-jdk

SLIDE 12

Start the Jupyter notebook

  • Type “jupyter notebook”
  • If it shows “command not found”, then pip hasn’t placed jupyter on your system path
  • Restart your Ubuntu
  • Or run jupyter this way
  • ~/.local/bin/jupyter-notebook
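
The “command not found” case usually means ~/.local/bin is not among the PATH entries of the current shell. The sketch below checks that from Python; on_path is a hypothetical helper written for this example.

```python
import os


def on_path(directory, path_value):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


# pip3 places user-level scripts such as jupyter-notebook in ~/.local/bin.
local_bin = os.path.join("~", ".local", "bin")
if on_path(local_bin, os.environ.get("PATH", "")):
    print("~/.local/bin is on PATH")
else:
    print("~/.local/bin is not on PATH; run ~/.local/bin/jupyter-notebook directly")
```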

SLIDE 13

Install PySpark

  • PySpark is available on PyPI, but pip3 doesn’t work!!
  • Get pip with
  • sudo apt-get install python-pip
  • pip install pyspark

 Wow, it’s super easy!!

  • If you wish to use conda:
  • conda install -c conda-forge pyspark
  • Make sure you see the Spark logo ->>
  • If not, it’s a trap :P

http://sigdelta.com/blog/how-to-install-pyspark-locally/

SLIDE 14

Verify your pyspark
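
One possible way to do this check from a Python prompt or a notebook cell is sketched below. It only tests whether the package can be imported by the interpreter you are running; pyspark_status is a name invented for this example.

```python
import importlib.util


def pyspark_status():
    """Return a short message saying whether pyspark can be imported here."""
    if importlib.util.find_spec("pyspark") is None:
        return "pyspark is NOT importable from this interpreter"
    import pyspark

    return "pyspark " + pyspark.__version__ + " is importable"


print(pyspark_status())
```

If the message says pyspark is not importable, it was probably installed for a different Python than the one you are running.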

SLIDE 15

Let’s run your first Spark program

SLIDE 16

Now, your PySpark exercise

  • Install Hadoop
  • Use Spark to count the words on your favorite site
  • Example: we are using cs.iastate.edu

SLIDE 17

Installing Hadoop

  • https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-16-04

SLIDE 18

SLIDE 19

SLIDE 20

Run MapReduce example

SLIDE 21
  • 1. Get the text from the website
  • You could use bs4 (BeautifulSoup4) to scrape the web – more elegant
  • Or manually save the web page to a .txt file - less elegant :P
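
The bs4 route could look roughly like the sketch below. It assumes bs4 is installed (pip3 install bs4) and that the page is UTF-8 encoded; the function names are made up for this example, and the URL is yours to supply.

```python
import urllib.request


def normalize_whitespace(text):
    """Collapse runs of spaces and newlines into single spaces."""
    return " ".join(text.split())


def page_text(html):
    """Strip the tags from an HTML string and return its visible text."""
    # Third-party import kept inside the function, so the helper above
    # works even where bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Scripts and style sheets are not part of the readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return normalize_whitespace(soup.get_text(separator=" "))


def fetch_page_text(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return page_text(html)
```

Calling fetch_page_text with the address of your favorite site returns a single string that you can write to a .txt file for the word count.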

SLIDE 22
  • pip3 install bs4
  • Python code to easily get (and process) the text from a web page

SLIDE 23
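  • import urllib.request  # Python 3; urllib2 exists only in Python 2
  • import html2text  # third-party: pip3 install html2text
  • url = ''  # fill in the address of your page
  • page = urllib.request.urlopen(url)
  • html_content = page.read().decode('utf-8')  # assumes a UTF-8 page
  • rendered_content = html2text.html2text(html_content)
  • file = open('file_text.txt', 'w')
  • file.write(rendered_content)
  • file.close()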

SLIDE 24

Create a PySpark word count program

  • https://spark.apache.org/examples.html
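
A word count along the lines of the examples page could be sketched as follows. This is not the exact code from that page: the function names, the local[*] master setting and the top_n parameter are assumptions made for illustration.

```python
import re


def tokenize(line):
    """Lower-case a line and return the alphabetic words in it."""
    return re.findall(r"[a-z]+", line.lower())


def count_words(path, top_n=10):
    """Return the top_n (word, count) pairs for the text file at `path`."""
    # Imported here so that tokenize() can be tried without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("WordCount")
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.textFile(path)
            .flatMap(tokenize)                # one record per word
            .map(lambda word: (word, 1))      # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)  # add up the counts per word
        )
        # Sort by descending count and bring only the top results to the driver.
        return counts.takeOrdered(top_n, key=lambda pair: -pair[1])
    finally:
        spark.stop()
```

With the page text saved as file_text.txt, count_words('file_text.txt') would return the ten most frequent words together with their counts.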

SLIDE 25

Thank you

  • Questions?
  • adisak@iastate.edu
  • http://web.cs.iastate.edu/~adisak/MBDS2018/
