Spark is a distributed processing framework for Scala that is said to be a replacement for Hadoop's MapReduce.
In this post, we install Scala and Spark and walk through the steps for reading data stored on HDFS.

[root@node01 opt]# wget http://www.scala-lang.org/files/archive/scala-2.12.0-M3.tgz
--2015-12-20 01:16:09--  http://www.scala-lang.org/files/archive/scala-2.12.0-M3.tgz
Resolving www.scala-lang.org... 128.178.154.159
Connecting to www.scala-lang.org|128.178.154.159|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20935936 (20M) [application/x-gzip]
Saving to: `scala-2.12.0-M3.tgz'

100%[======================================>] 20,935,936  52.4K/s   in 7m 18s

2015-12-20 01:23:28 (46.7 KB/s) - `scala-2.12.0-M3.tgz' saved [20935936/20935936]

[root@node01 opt]# tar zxvf scala-2.12.0-M3.tgz
[root@node01 opt]# chown -R hdspark:hdspark scala-2.12.0-M3
[root@node01 opt]# ln -sv scala-2.12.0-M3 scala


First, download the Scala language binaries.
Extract the archive, change the ownership, and create a symlink to the extracted directory.

[root@node01 opt]# wget http://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
--2015-12-20 01:28:17--  http://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
Resolving archive.apache.org... 192.87.106.229, 140.211.11.131, 2001:610:1:80bc:192:87:106:229
Connecting to archive.apache.org|192.87.106.229|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 280901736 (268M) [application/x-tar]
Saving to: `spark-1.5.1-bin-hadoop2.6.tgz'

100%[======================================>] 280,901,736  173K/s   in 45m 8s

2015-12-20 02:13:27 (101 KB/s) - `spark-1.5.1-bin-hadoop2.6.tgz' saved [280901736/280901736]

[root@node01 opt]# tar xvf spark-1.5.1-bin-hadoop2.6.tgz
[root@node01 opt]# chown -R hdspark:hdspark spark-1.5.1-bin-hadoop2.6
[root@node01 opt]# ln -sv spark-1.5.1-bin-hadoop2.6 spark


In the same way, download the Spark framework itself.
Extract it, change the ownership, and create a symlink.

[root@node01 opt]# su - hdspark
[hdspark@node01 ~]$ vi .bash_profile



export SCALA_HOME=/opt/scala
export SPARK_HOME=/opt/spark
export PATH=$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$HIVE_HOME/bin:$PATH

[hdspark@node01 ~]$ source .bash_profile


Switch to the Hadoop execution user with su and append the environment variables to .bash_profile:
the SCALA_HOME and SPARK_HOME lines are added, and a new PATH is defined.
Because $SPARK_HOME/bin is not included in PATH here, spark-shell is started from inside the Spark directory in the next step.

[hdspark@node01 ~]$ cd $SPARK_HOME
[hdspark@node01 spark]$ ./bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/12/21 00:36:29 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/12/21 00:36:34 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar."
15/12/21 00:36:34 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar."
15/12/21 00:36:34 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar."
15/12/21 00:36:34 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 00:36:34 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 00:36:40 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/21 00:36:41 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/12/21 00:36:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/21 00:36:44 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar."
15/12/21 00:36:44 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar."
15/12/21 00:36:44 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar."
15/12/21 00:36:44 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 00:36:44 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/12/21 00:36:50 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/12/21 00:36:50 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.


Now that the installation is done, let's start the Spark shell right away.
Note that spark-shell runs on the Scala version bundled with Spark (2.10.4 in the banner above), not the Scala 2.12.0-M3 installed earlier.
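
As a quick sanity check that the SparkContext (sc) exposed by the shell works, a small in-memory RDD can be created and aggregated, for example:

// Build a small RDD in memory and sum it; no external files are needed.
val nums = sc.parallelize(1 to 100)
println(nums.sum())   // should print 5050.0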

scala> val txtFile = sc.textFile("README.md")
txtFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> txtFile.count()
res0: Long = 98


Here we read a text file on the ordinary (local) file system, in this case the README.md that ships with Spark under $SPARK_HOME, and count its lines.
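
Building on the same txtFile RDD, the classic word-count example can be written in a couple of lines; this is only a sketch to be typed into the same spark-shell session:

// Split each line into words, pair each word with 1, and add up the counts per word.
val counts = txtFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Print a few of the resulting (word, count) pairs.
counts.take(5).foreach(println)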

scala> val txtFile = sc.textFile("hdfs://127.0.0.1:9000/output/part-r-00000")
txtFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> txtFile.count()
res1: Long = 4


Next we read a file that lives on HDFS and count its lines.
It works without any problems.
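
Since the RDD API does not care where the file lives, the same kinds of operations work on the HDFS-backed RDD as well. For instance, counting only the lines that contain a particular word, or writing an RDD back to HDFS, could look like the following sketch (the keyword and the output path are arbitrary examples, not part of the steps above):

// Count only the lines of the HDFS file that contain the word "hadoop" (example keyword).
val hits = txtFile.filter(line => line.contains("hadoop")).count()
// An RDD can also be written back to HDFS as text; the target directory is an example path and must not already exist.
txtFile.saveAsTextFile("hdfs://127.0.0.1:9000/output/spark-copy")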