Getting Started with Spark for Big Data: Environment Setup

I have previously worked on big data analysis projects with Spark. Since I plan to write up my notes for learning and discussion, I am gradually publishing the material I have organized. Feedback and pointers are welcome.


The main setup is as follows. Most of the installation material comes from the official documentation and reliable online sources; the parameters below were chosen after testing.



1. Add the IP mapping of every server on all servers


vi /etc/hosts

Add an entry with the IP address and hostname of every server in the cluster.
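As a minimal sketch, for a hypothetical cluster with one master and two slaves the entries might look like the following (the IP addresses and hostnames are placeholders; use your own):

192.168.1.10 master

192.168.1.11 slave1

192.168.1.12 slave2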

2. Passwordless SSH login

Install SSH on all servers and configure passwordless login between all of them.

ssh-keygen -t rsa    # press Enter at each prompt

If SSH connections between machines still fail, back up the public key (pub) and authorized_keys (auth) files and transfer them manually.

Send each machine's id_rsa.pub to the master node; the public keys can be transferred with scp.

scp ~/.ssh/id_rsa.pub root@master:~/.ssh/id_rsa.pub.slave1

On the master, append all public keys to the authentication file authorized_keys:

cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys

Distribute the authorized_keys file to every slave:

scp ~/.ssh/authorized_keys spark@slave1:~/.ssh/    # repeat for each slave

Verify passwordless SSH communication on every machine.
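For example, from the master you can try something like the following (slave1 is the placeholder hostname from the /etc/hosts sketch above); it should log you in without prompting for a password:

ssh slave1

exit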

3. Install Java

Download Java 8 and extract it.

Edit the environment variables with sudo vi /etc/profile and add the following, replacing the home path with the one on your server:

export WORK_SPACE=/home/spark/workspace/

export JAVA_HOME=$WORK_SPACE/jdk

export JRE_HOME=$JAVA_HOME/jre

export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH

export CLASSPATH=$CLASSPATH:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib

Then reload the environment variables and verify that Java is installed correctly:

$ source /etc/profile    # reload the environment variables

$ java -version          # check the Java version

If version information like the following is printed, the installation succeeded:

java version

Java(TM) SE Runtime Environment

Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

4. Install Scala

Spark 2.0.x is built against Scala 2.11.x, so download Scala 2.11.8 and extract it.

Edit the environment variables again with sudo vi /etc/profile and add the following, replacing the home path with the one on your server:

export SCALA_HOME=$WORK_SPACE/scala

export PATH=$PATH:$SCALA_HOME/bin

Reload the environment variables in the same way and verify that Scala is installed correctly:

$ source /etc/profile    # reload the environment variables

$ scala -version         # check the Scala version

If version information like the following is printed, the installation succeeded:

Scala code runner version -- Copyright 2002-2013, LAMP/EPFL

5. Configure the Hadoop cluster

Download Hadoop 2.7.3 from the official site and extract it.

Run cd ~/workspace/hadoop-2.7.3/etc/hadoop to enter the Hadoop configuration directory. The following files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, capacity-scheduler.xml

In hadoop-env.sh, set JAVA_HOME to the JDK path on that server:

# The java implementation to use.

export JAVA_HOME=/home/spark/workspace/jdk

export HADOOP_HEAPSIZE=6000

In yarn-env.sh, set JAVA_HOME as well:

# some Java parameters

export JAVA_HOME=/home/spark/workspace/jdk

YARN_HEAPSIZE=6000

In the slaves file, list the IP address or hostname of every slave node (the hostnames of all worker servers), one per line:

XX
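A minimal sketch, reusing the placeholder hostnames from the /etc/hosts example above:

slave1

slave2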

Edit core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://xx:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadooptemp</value>
  </property>
</configuration>

Edit hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>xx:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
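The dfs.namenode.name.dir and dfs.datanode.data.dir paths must be writable by the user running Hadoop, so it is usually safest to create them up front on every node. A minimal sketch using the paths from the config above:

mkdir -p /usr/local/hadoop-2.7.3/dfs/name /usr/local/hadoop-2.7.3/dfs/data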

Edit mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>xx:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>xx:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>xx:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>xx:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>xx:8088</value>
  </property>
</configuration>
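Because yarn.nodemanager.aux-services registers spark_shuffle with org.apache.spark.network.yarn.YarnShuffleService, each NodeManager also needs Spark's YARN shuffle jar on its classpath, otherwise the NodeManager will fail to start. A sketch of one common way to do this; the jar location and the Spark install path below are assumptions that depend on your Spark distribution, so adjust them to your layout:

cp /usr/local/spark-2.0.1-bin-hadoop2.7.3/yarn/spark-2.0.1-yarn-shuffle.jar ~/workspace/hadoop-2.7.3/share/hadoop/yarn/lib/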

Edit capacity-scheduler.xml:

<configuration>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run
      application masters, i.e. controls the number of concurrently running applications.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the
      scheduler. The default, DefaultResourceCalculator, only uses Memory, while
      DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources
      such as Memory, CPU etc.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,cdn</value>
    <description>The queues at this level (root is the root queue).</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.cdn.capacity</name>
    <value>30</value>
    <description>cdn queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.cdn.state</name>
    <value>RUNNING</value>
    <description>The state of the cdn queue. State can be one of RUNNING or STOPPED.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
    <description>Default queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1.0</value>
    <description>Default queue user limit as a percentage from 0.0 to 1.0.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.cdn.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the cdn queue.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler
      attempts to schedule rack-local containers. Typically this should be set to the number
      of nodes in the cluster; by default it is set to approximately the number of nodes in
      one rack, which is 40.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>A list of mappings that will be used to assign jobs to queues. The syntax for
      this list is [u|g]:[name]:[queue_name][,next mapping]*. Typically this list will be used
      to map users to queues, for example, u:%user:%user maps all users to queues with the
      same name as the user.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>If a queue mapping is present, will it override the value specified by the
      user? This can be used by administrators to place jobs in queues that are different than
      the one specified by the user. The default is false.</description>
  </property>

</configuration>

Distribute the configured hadoop-2.7.3 folder to all slaves:

scp -r ~/workspace/hadoop-2.7.3 root@slave1:~/workspace/

Start Hadoop

Run the following on the master to start Hadoop.

cd ~/workspace/hadoop-2.7.3      # enter the Hadoop directory

bin/hadoop namenode -format      # format the NameNode

sbin/start-dfs.sh                # start HDFS

sbin/start-yarn.sh               # start YARN

To verify that Hadoop is installed correctly, use the jps command to check that the expected processes are running on each node. On the master you should see the following processes:

$ jps    # run on master

3407 SecondaryNameNode

3218 NameNode

3552 ResourceManager

3910 Jps

On each slave you should see the following processes:

$ jps    # run on slaves

2072 NodeManager

2213 Jps

1962 DataNode

Alternatively, open http://master:8088 in a browser; the Hadoop (YARN) management UI should appear, where you can check the configuration.
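As a quick smoke test of HDFS itself, you can create a directory and list the filesystem root from the Hadoop directory (the path below is just an example):

bin/hdfs dfs -mkdir -p /tmp/smoke-test

bin/hdfs dfs -ls /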

6. Configure the Spark cluster

Download Spark 2.0.1 from the official site and extract it.

Configure Spark. Only the more important environment parameters are listed below; the files spark-env.sh and spark-defaults.conf under the conf folder need to be configured. First, configure spark-env.sh:

export SCALA_HOME=/usr/local/scala-2.11.8

export JAVA_HOME=/usr/local/java

export HADOOP_HOME=/usr/local/hadoop-2.7.3

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

SPARK_MASTER_IP=BJYZ-VA-TJFX1

SPARK_LOCAL_DIRS=/usr/local/spark-2.0.1-bin-hadoop2.7.3

SPARK_DRIVER_MEMORY=2G

SPARK_EXECUTOR_MEMORY=1G
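The standalone start-all.sh script used later reads the worker list from conf/slaves, so it is worth setting that file up and copying the whole Spark directory to every node as well. A minimal sketch, assuming the placeholder hostnames used earlier and that Spark was extracted to /usr/local/spark-2.0.1-bin-hadoop2.7.3:

# conf/slaves (one worker hostname per line)

slave1

slave2

scp -r /usr/local/spark-2.0.1-bin-hadoop2.7.3 root@slave1:/usr/local/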

Then configure spark-defaults.conf:

spark.default.parallelism                      64

spark.executor.extraJavaOptions                -Xss1024k -XX:PermSize=128M -XX:MaxPermSize=256M

spark.dynamicAllocation.enabled                true

spark.shuffle.service.enabled                  true

spark.dynamicAllocation.initialExecutors       4

spark.dynamicAllocation.minExecutors           2

spark.dynamicAllocation.maxExecutors           10

spark.dynamicAllocation.executorIdleTimeout    30s

spark.executor.memory                          6g

spark.executor.cores                           4
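With spark.dynamicAllocation.enabled and spark.shuffle.service.enabled both set to true, a job submitted to YARN should scale its executor count between the configured minimum and maximum. A quick way to exercise the whole stack is the bundled SparkPi example; the jar path below is the usual location inside a Spark 2.0.1 binary distribution, so treat it as an assumption and adjust it if your layout differs:

bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.0.1.jar 100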

Verify that Spark is installed correctly.

Run start-all.sh from Spark's sbin directory:

sbin/start-all.sh

Check with jps; on the master you should see the following processes:

$ jps

7949 Jps

7328 SecondaryNameNode

7805 Master

7137 NameNode

7475 ResourceManager

On each slave you should see the following processes:

$ jps

3132 DataNode

3759 Worker

3858 Jps

3231 NodeManager

Open the Spark web management page: http://master:8080
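To check the standalone cluster itself (as opposed to YARN), you can point spark-shell at the master URL shown on that page; a minimal sketch, where master is the placeholder hostname used throughout and 7077 is the default standalone master port:

bin/spark-shell --master spark://master:7077

scala> sc.parallelize(1 to 1000).count()    // should return res0: Long = 1000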
