[Shanghai Campus] Installing and Configuring Pig on Linux

I. Introduction to Pig

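Apache Pig is a high-level query language for processing large-scale data. Used together with Hadoop, it makes working with massive data sets far more productive: writing a large-scale data processing job in Pig is considerably easier than writing it in Java or C++, and takes much less code for the same result. Apache Pig provides a higher level of abstraction for processing large data sets. On top of the MapReduce framework it implements a SQL-like data processing scripting language, called Pig Latin. In a Pig Latin script we can sort, filter, sum, group (GROUP BY), and join the loaded data. Pig also lets users define their own functions for operating on data sets, known as UDFs (user-defined functions).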
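To give a feel for these operations, here is a minimal sketch that writes a small Pig Latin script from the shell. The file names (sales.tsv, sales_by_region.pig), the field names, and the WORKDIR variable are made up for illustration and are not part of the original post. The script is executed only if a pig command is found on the PATH.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
WORKDIR="${WORKDIR:-$(mktemp -d)}"

# Hypothetical tab-separated input: region, item, amount.
printf 'east\tapple\t10\nwest\tpear\t5\neast\tpear\t7\n' > "$WORKDIR/sales.tsv"

# A Pig Latin script that loads, filters, groups, sums, and sorts.
cat > "$WORKDIR/sales_by_region.pig" <<'EOF'
sales   = LOAD 'sales.tsv' AS (region:chararray, item:chararray, amount:int);
big     = FILTER sales BY amount > 5;
grouped = GROUP big BY region;
totals  = FOREACH grouped GENERATE group AS region, SUM(big.amount) AS total;
sorted  = ORDER totals BY total DESC;
DUMP sorted;
EOF

# Run in Local mode only when Pig is actually installed.
if command -v pig >/dev/null 2>&1; then
    (cd "$WORKDIR" && pig -x local sales_by_region.pig)
else
    echo "pig not found on PATH; script written to $WORKDIR/sales_by_region.pig"
fi
```

Each statement names an intermediate relation, so the data flow reads top to bottom, which is the main stylistic difference from a single nested SQL query.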

II. Installing and Configuring Pig

1. Prerequisites for installing Pig

(1).Hadoop

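Pig has two execution modes: Local mode and MapReduce mode. Hadoop is required only if jobs need to run in a distributed environment; otherwise it can be left out. I installed Hadoop 2.4.0. Other versions work too, but the latest or a recent release is recommended, since each release brings fixes and improvements.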

(2).Java 1.7

Java 1.7 or later is recommended. A Java environment is mandatory for Pig. After installing it, set the environment variables (such as JAVA_HOME).
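The version requirement can be checked from the shell. The sketch below is one possible way to do it; the helper name java_version_ok is hypothetical, and it assumes the version string appears in double quotes on the first line of the java -version output, which is the usual layout but not guaranteed for every JDK vendor.

```shell
#!/bin/sh
# java_version_ok VERSION
# Succeeds when VERSION (e.g. "1.7.0_55" or "11.0.2") is Java 1.7 or newer.
java_version_ok() {
    major=${1%%.*}                 # text before the first dot
    if [ "$major" = "1" ]; then    # legacy scheme: 1.6, 1.7, 1.8
        rest=${1#*.}
        minor=${rest%%[!0-9]*}     # keep the leading digits only
        [ "${minor:-0}" -ge 7 ]
    else                           # newer scheme: 9, 11, 17, ...
        major=${major%%[!0-9]*}
        [ "${major:-0}" -ge 9 ]
    fi
}

# Report on the java found on PATH, if any.
if command -v java >/dev/null 2>&1; then
    # "java -version" prints to stderr, so redirect it into the pipe.
    found=$(java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p')
    if java_version_ok "$found"; then
        echo "Java $found is new enough for Pig"
    else
        echo "Java '$found' is older than 1.7 or could not be parsed"
    fi
else
    echo "java not found; install a JDK and set JAVA_HOME"
fi
```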

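2. Downloading, installing, and configuring Pig

The latest Pig release at the time of writing is 0.13.0, which is the version I use here. Official download page: http://www.apache.org/dyn/closer.cgi/pig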

After downloading, unpack the archive: tar -zvxf pig-0.13.0.tar.gz

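Pig can live anywhere on the system, as long as the environment variables are configured. For easier management, though, it is best to put Pig in the same directory as Hadoop.

After unpacking, set the environment variables. Open /etc/profile in an editor and add the two export lines: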

vim /etc/profile
export PIG_HOME=/opt/pig-0.13.0  # replace with your own directory
export PATH=$PATH:$PIG_HOME/bin:$PIG_HOME/conf

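After saving and exiting, run source /etc/profile to apply the new environment variables. Only $PIG_HOME/bin holds executables, so the conf entry on PATH is optional.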
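Editing /etc/profile by hand is easy to get wrong when the steps are repeated, so the same change can be scripted. The following is a sketch only: the function name add_pig_env is hypothetical, it adds just $PIG_HOME/bin to PATH (the directory that holds the pig launcher), and the demo runs against a scratch file rather than the real /etc/profile.

```shell
#!/bin/sh
# add_pig_env PIG_DIR PROFILE_FILE
# Appends PIG_HOME and PATH exports to PROFILE_FILE unless already present.
add_pig_env() {
    pig_dir=$1
    profile=$2
    if grep -q '^export PIG_HOME=' "$profile" 2>/dev/null; then
        echo "PIG_HOME already set in $profile; nothing changed"
        return 0
    fi
    {
        echo "export PIG_HOME=$pig_dir"
        # Single quotes keep $PATH and $PIG_HOME unexpanded in the file.
        echo 'export PATH=$PATH:$PIG_HOME/bin'
    } >> "$profile"
    echo "added Pig settings to $profile"
}

# Try it on a scratch file first; use /etc/profile (as root) once satisfied.
demo_profile=$(mktemp)
add_pig_env /opt/pig-0.13.0 "$demo_profile"
add_pig_env /opt/pig-0.13.0 "$demo_profile"   # second call changes nothing
cat "$demo_profile"
```

Because the function checks for an existing PIG_HOME line before appending, running it twice does not duplicate the entries.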

Run pig --help to check whether Pig was installed successfully. On success it prints the following:

root@Ubuntu-Kylin:/opt# pig --help
Apache Pig version 0.13.0 (r1606446)
compiled Jun 29 2014, 02:27:58
USAGE: Pig [options] [-] : Run interactively in grunt shell.
Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
Pig [options] [-f[ile]] file : Run cmds found in file.
options include:
-4, -log4jconf - Log4j configuration file, overrides log conf
-b, -brief - Brief logging (no timestamps)
-c, -check - Syntax check
-d, -debug - Debug level, INFO is default
-e, -execute - Commands to execute (within quotes)
-f, -file - Path to the script to execute
-g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
-h, -help - Display this message. You can specify topic to get help for that topic.
properties is the only topic currently supported: -h properties.
-i, -version - Display version information
-l, -logfile - Path to client side log file; default is current working directory.
-m, -param_file - Path to the parameter file
-p, -param - Key value pair of the form param=val
-r, -dryrun - Produces script with substituted parameters. Script is not executed.
-t, -optimizer_off - Turn optimizations off. The following values are supported:
SplitFilter - Split filter conditions
PushUpFilter - Filter as early as possible
MergeFilter - Merge filter conditions
PushDownForeachFlatten - Join or explode as late as possible
LimitOptimizer - Limit as early as possible
ColumnMapKeyPrune - Remove unused data
AddForEach - Add ForEach to remove unneeded columns
MergeForEach - Merge adjacent ForEach
GroupByConstParallelSetter - Force parallel 1 for "group all" statement
All - Disable all optimizations
All optimizations listed here are enabled by default. Optimization values are case insensitive.
-v, -verbose - Print all error messages to screen
-w, -warning - Turn warning logging on; also turns warning aggregation off
-x, -exectype - Set execution mode: local|mapreduce, default is mapreduce.
-F, -stop_on_failure - Aborts execution on the first failed job; default is off
-M, -no_multiquery - Turn multiquery optimization off; default is on
-N, -no_fetch - Turn fetch optimization off; default is on
-P, -propertyFile - Path to property file
-printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including
any environment variables that are set by the pig command.
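If the help text does not appear, the usual cause is that the pig launcher is not reachable through PATH. A small check along these lines can narrow that down; verify_pig is a hypothetical helper name, and -version is the option shown in the list above.

```shell
#!/bin/sh
# verify_pig: succeed only if a pig launcher is reachable through PATH.
verify_pig() {
    if ! command -v pig >/dev/null 2>&1; then
        echo "pig is not on PATH; check PIG_HOME and re-run: source /etc/profile" >&2
        return 1
    fi
    echo "pig found at: $(command -v pig)"
    pig -version    # long form of -i in the option list above
}

verify_pig || echo "Pig installation check failed"
```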

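III. Pig Execution Modes

Pig has two execution modes: Local mode and MapReduce mode. In Local mode, Pig accesses only the single local host. In MapReduce mode, it accesses a Hadoop cluster and an HDFS installation, and Pig allocates and releases cluster resources automatically. Because Pig optimizes the generated MapReduce programs automatically, users writing Pig Latin need not worry much about execution efficiency, which saves a great deal of programming time.

Both Local mode and MapReduce mode can be used in three ways: the Grunt Shell, script files, and embedded programs.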

1. Local mode

(1) Grunt Shell

To use the Grunt Shell, first start it by running the following command in a Linux terminal:

$ pig -x local

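Pig then enters the Grunt Shell in Local mode. If you run pig with no options, Pig first checks its environment settings and then enters the corresponding mode; if no MapReduce environment is configured, Pig goes straight into Local mode. Since the help output above lists mapreduce as the default for -x, passing -x explicitly avoids any ambiguity.

The Grunt Shell is much like the DOS window on Windows: users enter commands one at a time to operate on data.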
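Since the mode Pig ends up in without -x depends on the environment, a wrapper can make the choice explicit. The sketch below uses a non-empty HADOOP_HOME that points to a directory as the signal that a Hadoop installation is configured. That rule is an assumption made for illustration, not Pig's own detection logic, and pick_pig_mode is a hypothetical name.

```shell
#!/bin/sh
# pick_pig_mode: print "mapreduce" when HADOOP_HOME points to a directory,
# otherwise "local". Using HADOOP_HOME as the signal is an assumption.
pick_pig_mode() {
    if [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME" ]; then
        echo mapreduce
    else
        echo local
    fi
}

mode=$(pick_pig_mode)
echo "would start Grunt with: pig -x $mode"
# Uncomment to actually open the Grunt Shell:
# pig -x "$mode"
```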

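(2) Script files

A script file runs Pig commands as a batch job; it is simply a collection of the commands used in the first approach. The following command runs a Pig script in Local mode: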

$ pig -x local script.pig

Here script.pig is the Pig script. Its location must be given correctly, or Pig will not find it. For example, if the script is in /root/pigTmp, write /root/pigTmp/script.pig. Pay attention to the messages Pig prints; they help in using Pig more effectively.
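A wrong script path is the most common mistake here, so it can help to check the path before calling Pig. The wrapper below is a sketch; run_pig_script and its return codes are hypothetical choices, and the example path is the one used in the text.

```shell
#!/bin/sh
# run_pig_script MODE SCRIPT
# Checks that SCRIPT exists before handing it to pig, so a wrong path
# produces a short message instead of a longer Pig error.
run_pig_script() {
    mode=$1
    script=$2
    if [ ! -f "$script" ]; then
        echo "script not found: $script" >&2
        return 2
    fi
    if ! command -v pig >/dev/null 2>&1; then
        echo "pig is not on PATH" >&2
        return 127
    fi
    pig -x "$mode" "$script"
}

# Example with the path used in the text:
run_pig_script local /root/pigTmp/script.pig || echo "run failed with status $?"
```

The same function works for MapReduce mode by passing mapreduce as the first argument.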

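(3) Embedded programs

Pig commands can be embedded in a host language and run as an embedded program. As with any ordinary Java program, this means writing a Java program, compiling it into class files or a packaged jar, and then running it through its main function. The Java source file can be compiled with: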

$javac -cp pig-*.*.*-core.jar local.java

Here pig-*.*.*-core.jar is located in the Pig installation directory and local.java is the Java source file written by the user; the paths to both must be given correctly. For example, if pig-0.20.2-core.jar is in /root/hadoop-0.20.2/ and local.java is in /root/pigTmp, the command becomes:

$ javac -cp /root/hadoop-0.20.2/pig-0.20.2-core.jar /root/pigTmp/local.java

Once compilation finishes, Java produces local.class, which can then be run with the following command.

$ java -cp pig-*.*.*-core.jar:. local
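The post does not show what local.java contains. The sketch below writes one possible version from the shell, using Pig's PigServer class. The input file (/etc/passwd), the output name (id.out), and the WORKDIR and PIG_JAR variables are illustrative assumptions. Depending on the Pig version, additional Hadoop jars may be needed on the classpath, so the compile step runs only if the jar is actually present.

```shell
#!/bin/sh
WORKDIR="${WORKDIR:-$(mktemp -d)}"
# Path to the Pig core jar; adjust to your installation (assumed location).
PIG_JAR="${PIG_JAR:-/root/hadoop-0.20.2/pig-0.20.2-core.jar}"

# A tiny embedded program. The class is named "local" to match local.java.
cat > "$WORKDIR/local.java" <<'EOF'
import java.io.IOException;
import org.apache.pig.PigServer;

public class local {
    public static void main(String[] args) throws IOException {
        // "local" selects Local mode; "mapreduce" would target a cluster.
        PigServer pig = new PigServer("local");
        // Hypothetical input: /etc/passwd, fields separated by ':'.
        pig.registerQuery("A = LOAD '/etc/passwd' USING PigStorage(':');");
        pig.registerQuery("B = FOREACH A GENERATE $0 AS id;");
        // Stores the first field of every line under the output path id.out.
        pig.store("B", "id.out");
    }
}
EOF

# Compile and run only if the jar and a Java compiler are really there.
if [ -f "$PIG_JAR" ] && command -v javac >/dev/null 2>&1; then
    (cd "$WORKDIR" && javac -cp "$PIG_JAR" local.java && java -cp "$PIG_JAR:." local)
else
    echo "Pig jar or javac not available; source written to $WORKDIR/local.java"
fi
```

The heredoc delimiter is quoted so that the shell leaves $0 in the Pig Latin statement untouched.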

2. MapReduce mode

(1) Grunt Shell

Enter the following command in a Linux terminal to start the Grunt Shell in MapReduce mode:

$ pig -x mapreduce

(2) Script files

The following command runs a Pig script file in MapReduce mode.

$ pig -x mapreduce script.pig

(3) Embedded programs

As in Local mode, running an embedded program in MapReduce mode takes two steps, compilation and execution. The following two commands do this.

javac -cp pig-0.7.0-core.jar mapreduce.java

java -cp pig-0.7.0-core.jar:. mapreduce

This completes the installation of Pig.

