Scalable Tools - Part II


Jan 26, 2023


SLIDE 1

Scalable Tools - Part II

Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/Bigdata/

SLIDE 2

Scalable Tools session

We will be using Spark, Python and PySpark.
We will use Jupyter Notebook as IDE.

SLIDE 3

Install your software VM on your laptop
Create Ubuntu VM in VirtualBox

This lecture will walk through how to download and set up VirtualBox with Ubuntu.

Then we will walk through installing Spark, Python, and the Jupyter Notebook on this VirtualBox Ubuntu VM.

SLIDE 4

Option 1: host it locally on your laptop. (recommended)

This option requires you to download large files. I will also have those files available locally to copy during the session.

1. Download VirtualBox (for Windows and Mac)
https://www.virtualbox.org/wiki/Downloads
(108 MB for Windows, 91 MB for Mac)
(If you are using Linux, you don't need this)
2. Download Ubuntu
https://www.ubuntu.com/download/desktop
Select Ubuntu Desktop 18.04 LTS
(size 1.8GB)

SLIDE 5

3. Install VirtualBox
4. Create a VM using the Ubuntu .iso file
5. Log in to Ubuntu
We will install PySpark in class.

SLIDE 6

VirtualBox (issue if you have Docker)

Only the 32-bit option available?

Disable Hyper-V in Windows Features, then restart Windows.

SLIDE 7

Update your Ubuntu

sudo apt-get update
sudo apt-get upgrade

SLIDE 8

Verify your python3

python
python3
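The same check can be done from inside the interpreter. A minimal sketch; the helper name is_python3 is made up for this example and is not part of the slides.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


# Show which interpreter is actually running this code.
print("Running Python", sys.version.split()[0])
print("Python 3 interpreter:", is_python3())
```

If the second line prints False, the command you typed started Python 2, so use python3 instead.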

SLIDE 9

Install Jupyter with pip

Install pip3 with:
sudo apt install python3-pip

SLIDE 10

Install Jupyter with pip3

(You can also get the full Anaconda distribution instead)
pip3 install jupyter

SLIDE 11

Install the Java JDK

Spark runs on Java, so don't forget to install it (otherwise you will get a weird error).

sudo apt-get install openjdk-8-jdk
Or
sudo apt-get install default-jdk
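To confirm that Java is visible before starting Spark, something like the following may help. It is only a sketch: find_java is a hypothetical helper, and the check assumes Spark will look for java on your PATH.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java not found on PATH; install a JDK first")
else:
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java_path, "-version"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(result.stderr.decode(errors="replace").strip())
```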

SLIDE 12

Start the Jupyter notebook

Type “jupyter notebook”.
If it shows “command not found”, then pip hasn't placed jupyter on your system path.

Restart your Ubuntu,
or run jupyter this way:
~/.local/bin/jupyter-notebook
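A quick way to see whether that directory is on your PATH. This is a sketch under the assumption that pip put jupyter in ~/.local/bin, as the command above suggests; user_bin_on_path is a name invented here.

```python
import os


def user_bin_on_path(path_value, user_bin):
    """Return True if user_bin is one of the entries of a PATH-style string."""
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return os.path.normpath(user_bin) in entries


# Assumption: pip installed the jupyter scripts into ~/.local/bin.
user_bin = os.path.expanduser("~/.local/bin")
if user_bin_on_path(os.environ.get("PATH", ""), user_bin):
    print(user_bin, "is on PATH")
else:
    print(user_bin, "is missing from PATH; add it or restart")
```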

SLIDE 13

Install PySpark

PySpark is available on PyPI, but pip3 didn't work here!!

Get pip with:
sudo apt-get install python-pip
pip install pyspark

Wow, it's super easy!!

If you wish to use conda:
conda install -c conda-forge pyspark
Make sure you see the Spark logo ->>
If not, it’s a trap :P

http://sigdelta.com/blog/how-to-install-pyspark-locally/

SLIDE 14

Verify your pyspark
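One possible check, run inside the same Python that Jupyter uses. A sketch only: is_installed is a hypothetical helper, and it reports "not installed" when pyspark was installed for a different interpreter.

```python
import importlib.util


def is_installed(module_name):
    """Return True if this interpreter can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pyspark"):
    import pyspark
    print("pyspark version:", pyspark.__version__)
else:
    print("pyspark is not installed for this interpreter")
```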

SLIDE 15

Let’s run your 1st spark program
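The program from the slide image is not reproduced here, so the following is a stand-in: a minimal sketch of a first Spark job, assuming pyspark and Java are installed. The names square and sum_of_squares and the app name "FirstSparkProgram" are made up for this example.

```python
def square(x):
    """Function applied to every element of the distributed dataset."""
    return x * x


def sum_of_squares(numbers):
    """Sum the squares of `numbers` with Spark running locally."""
    # Imported here so this file still loads when pyspark is missing.
    from pyspark import SparkContext

    # "local[*]" runs Spark inside this process, using all available cores.
    sc = SparkContext("local[*]", "FirstSparkProgram")
    try:
        return sc.parallelize(numbers).map(square).sum()
    finally:
        # Always stop the context so a later cell can create a new one.
        sc.stop()
```

Try it in a notebook cell with sum_of_squares(range(1, 6)).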

SLIDE 16

Now, your PySpark exercise

Install Hadoop
Use Spark to count words from your favorite site
Example: we are using cs.iastate.edu

SLIDE 17

Installing Hadoop

https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-16-04

SLIDE 18

SLIDE 19

SLIDE 20

Run MapReduce example

SLIDE 21

1. Get the text from the website
You could use bs4 (BeautifulSoup4) to scrape the web - more elegant
Or manually save the web page to a .txt file - less elegant :P

SLIDE 22

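pip3 install bs4 html2text

Python code to easily get (and process) text from the web

import urllib.request
import html2text

# Fetch the example site (URL scheme is an assumption) and convert the HTML to plain text
html_content = urllib.request.urlopen('http://www.cs.iastate.edu').read().decode('utf-8', 'replace')
rendered_content = html2text.html2text(html_content)

with open('file_text.txt', 'w') as file:
    file.write(rendered_content)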
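If neither bs4 nor html2text is installed, the standard library can do a rough version of the same job. This is a swapped-in technique, not the one on the slide: TextExtractor and html_to_text are names invented for this sketch, and the output is plainer than what html2text produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html_content):
    """Return the text pieces of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html_content)
    parser.close()
    return "\n".join(parser.parts)
```

The result of html_to_text(html_content) can be written to file_text.txt the same way as rendered_content.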

SLIDE 24

Create pyspark word count program

https://spark.apache.org/examples.html
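A word count along the lines of the example on that page could look like this. It is a sketch under assumptions: pyspark and Java are installed, file_text.txt is the file saved earlier, and tokenize, count_words, top_n and the app name "WordCount" are names chosen for this example.

```python
import re


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def count_words(path, top_n=10):
    """Return the top_n most frequent (word, count) pairs in a text file."""
    # Imported here so this file still loads when pyspark is missing.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")
    try:
        counts = (sc.textFile(path)
                    .flatMap(tokenize)                  # lines -> words
                    .map(lambda word: (word, 1))        # word -> (word, 1)
                    .reduceByKey(lambda a, b: a + b))   # add up the ones per word
        # Sort by count, largest first, and bring only top_n rows to the driver.
        return counts.takeOrdered(top_n, key=lambda pair: -pair[1])
    finally:
        sc.stop()
```

Run it with count_words('file_text.txt'). Lowercasing and stripping punctuation in tokenize is a design choice, so that "Spark" and "spark," count as the same word.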

SLIDE 25

Thank you

Questions?
adisak@iastate.edu
http://web.cs.iastate.edu/~adisak/MBDS2018/