Apache Spark is a fast and general engine for large-scale data processing. In other words, Spark is a fast, general-purpose distributed engine for processing data at large scale. As for how fast, see the next line:
Lightning-fast cluster computing. The official site's lightning metaphor feels a bit abstract to me, so I have summarized Spark's main features as follows:
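spark.apache.org/docs/latest… The tutorials on the official site mostly give examples in three languages, Scala, Java, and Python, so knowing any one of them is enough to get started. That said, Spark's source code is written in Scala, so Scala is the more sensible choice if you plan to study the source. Here is the background knowledge to have before you start: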
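spark.apache.org/downloads.h… At the time of writing, the latest release in the 1.x line is 1.6.3 and the latest in the 2.x line is 2.1.0. You can install Hadoop and use its distributed file system, HDFS, or skip Hadoop and use only the local file system. After unpacking the archive, it is a good idea to add Spark to your Linux environment variables so that the Spark commands are easy to run. Spark ships with a very handy interactive tool, which you can launch directly with the following command: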
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures. My notes on this passage: (RDD creation) an RDD comes from a file in a Hadoop-supported file system, from an existing Scala collection in the driver program, or from a transformation applied to another RDD. (RDD persistence) An RDD can be kept in memory so that later parallel operations reuse it efficiently. (Lineage-based fault tolerance) An RDD recovers automatically from node failures.
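To make "partitioned across the nodes and operated on in parallel" more concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the helper names `split_into_partitions` and `map_partitions_in_parallel` are made up for illustration, and worker threads merely stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def map_partitions_in_parallel(partitions, func):
    """Apply func to every element, using one worker thread per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


# Hypothetical data: the numbers 1..10 split over three pretend "nodes".
partitions = split_into_partitions(list(range(1, 11)), 3)
squared = map_partitions_in_parallel(partitions, lambda x: x * x)
total = sum(sum(part) for part in squared)
```

Each partition is processed independently and the partial results are combined only at the end, which is roughly the shape of work that an RDD lets Spark spread over a cluster.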
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. In other words, you either parallelize a collection (that is, call the parallelize method) or point Spark at a dataset that lives in external storage.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. Put simply, a transformation derives a new RDD from an existing RDD, while an action runs a computation on the dataset and hands the resulting value back to the driver program.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
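The split between lazy transformations and actions can be sketched with a toy class in plain Python. This is only an illustration under my own assumptions, not Spark's implementation or API: `LazyDataset` and `length_of` are hypothetical names, and the method names map, filter, collect, and count are borrowed simply because they mirror common RDD operations.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # base dataset, e.g. the lines of a file
        self._steps = steps    # transformations remembered so far

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        # Chain the remembered steps over the base data as lazy iterators.
        data = iter(self._source)
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def length_of(line):
    calls.append(line)  # side effect that reveals when the work really runs
    return len(line)


lines = LazyDataset(["spark", "rdd", "lazy evaluation"])
lengths = lines.map(length_of).filter(lambda n: n > 3)
calls_before_action = len(calls)  # nothing has been computed at this point
result = lengths.collect()        # the action runs the whole pipeline
```

In this sketch every action replays the remembered steps from the base data, so calling collect() a second time would invoke length_of again. Avoiding that kind of repeated work is the motivation for asking Spark to persist an RDD in memory.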
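Spark SQL is the Spark module for structured data processing. One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from Hive. When SQL is run from within a programming language, the results come back as a Dataset / DataFrame. You can also interact with the SQL interface through the command line or over JDBC / ODBC.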