IF you’re interested, with emphasis on the IF, I provide an easy-to-follow, step-by-step guide for building a predictive analytics / data profiling platform using Hadoop and Spark, to overcome some of the memory limitations that come with running a desktop version of R on extremely large data sets. It also doubles as a set of notes I can refer to if I ever have to rebuild my own. 🙂
The computer used for this installation is:
```
Hardware Overview:

  Model Name:             Mac Pro
  Model Identifier:       MacPro6,1
  Processor Name:         12-Core Intel Xeon E5
  Processor Speed:        2.7 GHz
  Number of Processors:   1
  Total Number of Cores:  12
  L2 Cache (per Core):    256 KB
  L3 Cache:               30 MB
  Memory:                 64 GB
  Boot ROM Version:       MP61.0116.B21
  SMC Version (system):   2.20f18
  Illumination Version:   1.4a6
```
There is no way I could attempt to provide a solution for every platform and operating system. Even on a Mac, this process can vary depending on the version of OS X you are running; I am currently on Sierra 10.12.3.
Installing R on a Mac is very simple: download the correct package for your OS from the Comprehensive R Archive Network (CRAN), then follow the instructions.
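Once the installer finishes, a quick sanity check from a terminal window confirms that R is on your path and reports the version you just installed:

```
$ R --version
```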
The same goes for RStudio: go to their website and download the appropriate FREE desktop version (unless, of course, you are doing this at work). As of the date of this posting, the appropriate version is RStudio 1.0.136 – Mac OS X 10.6+ (64-bit).
Python is just as simple: go to their website and download the appropriate version. I personally prefer 3.x (as of this post, they are on version 3.6). At some point you will have to make the leap from 2.7; I made it about six months ago, and since fewer and fewer new packages are being developed for 2.7, eventually you will have to make that leap as well. As for a Python IDE, I use PyCharm, but there are any number of IDEs available, including free ones like Eclipse. I used Canopy, but at the time it did not support 3.x, so I switched to PyCharm. All things considered, I prefer PyCharm over any other IDE, but a lot of that is personal preference.
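A similar terminal check works for Python, with one caveat: on a Mac, the bare `python` command may still point at the bundled 2.7, so call `python3` explicitly:

```
$ python3 --version   # should report whichever 3.x you installed
```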
Now that you have some of the tools, let’s install Apache Hadoop. But first, a simple suggestion: you can download Apache’s distribution and go through their whole process, or you can install Homebrew and make life much simpler. The Homebrew install is also very simple; go to their website and follow the directions.
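For reference, the one-liner on the Homebrew homepage at the time of this writing looked like the command below; it changes occasionally, so copy the current version from brew.sh rather than from here. `brew doctor` is a handy follow-up check:

```
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew doctor
```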
The instructions for all of these installs assume you know how to open a terminal (⌘ + Space, type ‘terminal’, and hit enter) and are familiar with some basic UNIX commands like ‘cd’ and ‘ls’, plus some sort of Emacs-like editor. I personally prefer vi, but again, personal preference; you just need to be able to edit files. Knowing the UNIX file structure would be helpful, but is not critical.
Back to installing Apache Hadoop as a Single Node Cluster. From a Mac OS X terminal window, enter the command `brew install hadoop` and hit <enter>. That is all there is to the installation. Of course, this assumes you have the proper privileges; if not, you can easily Google the process for changing your permissions.
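The install plus a quick sanity check looks like this (the version brew pulls down may be newer than the 2.7.3 shown throughout this post, so adjust paths accordingly):

```
$ brew install hadoop
$ hadoop version
```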
Once Hadoop is installed, you must make some configuration changes. To determine where brew installed Hadoop, check the output of the installation, or type `find /usr -name hadoop` in a terminal window; you should see a path such as `/usr/local/Cellar/hadoop/2.7.3/bin/hadoop`. From Hadoop’s home directory, `/usr/local/Cellar/hadoop/2.7.3/`, locate the `libexec/etc/hadoop` directory and edit the `core-site.xml` file using your Emacs/vi editor:
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
Save the file, and then edit the `hdfs-site.xml` file in the same directory:
```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
For everything to work properly, you must be able to start an ssh session on localhost without a passphrase. First, go to 'System Preferences' > 'Sharing' and enable 'Remote Login'. Then enter the following command at a terminal prompt to see if you are asked for a passphrase:

```
$ ssh localhost
```
If asked for a passphrase, from your home directory issue the following commands:
```
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost                # should now log in without a passphrase
```
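A couple of hedged notes here. Newer macOS releases disable DSA keys in ssh by default, so if the key above is refused, generating an RSA key the same way (`ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa`, then cat the matching `.pub` file into `authorized_keys`) works identically. Also, before the very first start of the cluster, Apache’s single-node setup guide has you format HDFS; brew links the `hdfs` command into your path (if not, it lives under the Hadoop bin directory found above), so this is simply:

```
$ hdfs namenode -format
```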
Once you can ssh to your localhost without using the passphrase, you might want to add the following two lines to the `.bash_profile` file located in your home directory (just type `cd` <enter> and it will take you home, then `pwd` <enter> to ensure you are in your home directory, and `ls -al` <enter> to make sure the file is present):
```
alias hstart="/usr/local/Cellar/hadoop/2.7.3/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.3/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.7.3/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.3/sbin/stop-dfs.sh"
```
With these lines in your `.bash_profile`, all you have to do is open a terminal window and type `hstart` or `hstop` to start and stop your Single Node Hadoop Cluster. (Open a new terminal window, or run `source ~/.bash_profile`, for the aliases to take effect.) You are now ready to install Spark.
Again, a very simple install on a Mac with brew. From the terminal command line interface, type: `brew install apache-spark`
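If you want a quick confirmation before exploring further, asking Spark for its version should print the release brew installed (a 2.x release as of this post):

```
$ spark-submit --version
```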
To see the different options, type `spark` at the command line and hit the tab key a couple of times. It should display the various modes that Spark can be run in:
```
Mac-Pro:~ RPy$ spark
spark-beeline  spark-class    spark-shell    spark-sql      spark-submit   sparkR
Mac-Pro:~ RPy$ spark
```
Notice that `sparkR` is present, but where is `pySpark`? Well, `pyspark` does not begin with `spark`, so tab completion on `spark` will not show it. It is installed right alongside the others; just type `pyspark` at the command line and it will start.
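To tie Hadoop and Spark together, here is a minimal smoke test. It assumes the `hstart` alias from above, the `hdfs://localhost:9000` setting from `core-site.xml`, and a hypothetical local file named `sample.txt` that you supply yourself:

```
$ hstart                               # start HDFS and YARN
$ hdfs dfs -mkdir -p /user/$(whoami)   # create your HDFS home directory
$ hdfs dfs -put sample.txt .           # copy the (hypothetical) local file into HDFS
$ pyspark
>>> rdd = sc.textFile("sample.txt")    # relative paths resolve against your HDFS home
>>> rdd.count()                        # number of lines in the file
```

If the count comes back without errors, HDFS and Spark are talking to each other, and `hstop` shuts everything back down.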
Hope this is of some value. If you have questions, or run into a problem that I didn’t address, please feel free to comment and I will see if I can help you resolve the problem (or correct my oversight).
Happy predicting!