6. 스파크의 핵심 RDD Resilient Distributed Datasets

6. 스파크의 핵심 RDD #RDD #Resilient Distributed Datasets #fault-tolerant #Lineage #DAG #directed acydic graph #Method chaining

RDD Resilient Distributed Datasets

Spark : A Fault-Tolerant Abstraction for In Memory Cluster Computing

Hadoop MapReduce의 단점?

Machine Learning에 적합하지 않다
데이터 처리 시 HDFS(Hadoop Distributed File System)를 거치기 때문에 IO에서 시간이 오래 걸린다

Spark는?

RAM에서 Read-Only로 처리해서 running time이 빠르다!

fault-tolerant?

클러스터에서 데이터 처리 중간에 fault가 나면 어떻게 될까?

복사해 두거나replicating, Disk에 써둬야checkpointing 한다? -> 느려진다

RAM을 read-only로 써보자. 이것이 RDD!

계보 Lineage

read-only는 만들어진 이래 고쳐지지update 않는다 즉, 어떻게 만들었는지만 기록해 두면 언제든지 또 만들 수 있다

부모로부터 어떻게 만들어졌는지 계보Lineage만 기록해도 fault-tolerant하다.

그리고 Lineage는 용량이 적다.

DAG directed acydic graph 디자인

코딩을 하는 것은 실제 계산 작업이 되는 것이 아니라, Lineage 계보를 디자인 해 가는 것

RDD Operator

Transformation : 데이터의 흐름, 계보를 만듬

Action : Transformation에서 작성된 계보를 따라 데이터 처리하여 결과를 생성함 (lazy-execution)

Narrow dependency 형태로 코딩해야 한다

책상 한자리에서 다 처리할 수 있는 일은 모아서 하는게 좋다는 개념
네트워크를 안타고 메모리의 속도로 동작해서 빠르다
파티션이 부셔져도 해당 노드에서 바로 복원 가능하다
map, filter, union, join with inputs co-partitioned

Wide dependency

여러 책상에 있는 자료 훑어와야 한다는 개념
네트워크의 속도로 동작해서 느리다
노드끼리 셔플이 일어나야 한다
파티션이 부셔지면 계산비용이 비싸다
groupByKey, join with inputs not co-partitioned

[Code] *파란색은 스크립트, 검정색은 결과값입니다

1) 로컬에 있는 README 파일을 spark context 객체에서 제공하는 textFile 메서드로 읽어 RDD로 만듬

scala> val lines = sc.textFile("/usr/local/lib/spark/README.md")

lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> lines.take(5).foreach(println)

# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides

high-level APIs in Scala, Java, Python, and R, and an optimized engine that

supports general computation graphs for data analysis. It also supports a

2) 각 라인별로 공백으로 잘라서 단어들의 집합을 만듬

scala> val words = lines.flatMap(line=>line.split(" "))

scala> words.take(10)

Array[String] = Array(#, Apache, Spark, "", Spark, is, a, fast, and, general)

3) 각 단어들을 Key, Value 형태로 만들면서 각 단어별 1씩 넣어줌

scala> val mapResult = words.map(word =>(word,1))

scala> mapResult.take(10)

Array[(String, Int)] = Array((#,1), (Apache,1), (Spark,1), ("",1), (Spark,1), (is,1), (a,1), (fast,1), (and,1), (general,1))

4) 각 Key가 같은 것들끼리 Value sum을 하는 reduce작업 진행

scala> val reduceResult = mapResult.reduceByKey(_+_)

scala> reduceResult.take(10)

Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1))

Method Chaining

위 메서드 들을 연결해서 연속으로 아래와 같이 한꺼번에 쓸 수 있는 메서드 체이닝이 가능 (동일 결과)

scala> val wordsCount = lines.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey(_+_)

scala> wordsCount.take(10)

Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1))

참고 : http://www.slideshare.net/yongho/rdd-paper-review

http://spark.apache.org/

안되는 부분이나 궁금한 점 있으면 댓글달아 주세요 :)

저작자표시 비영리 변경금지

'Data > SPARK' 카테고리의 다른 글

8.스파크 RDD의 연산 기본 함수 예제 (3)	2016.03.02
7. 머신러닝 kmeans 알고리즘 (0)	2016.02.18
5. 웹 기반 명령어 해석기 Zeppelin Install (4)	2016.02.12
4. CentOS 스파크 설치 Spark Install "Hello Spark" (5)	2016.02.05
[Spark] Command (Terminal, Spark, Hadoop) (0)	2016.01.29

On the ball

6. 스파크의 핵심 RDD Resilient Distributed Datasets

'Data > SPARK' 카테고리의 다른 글

티스토리툴바

6. 스파크의 핵심 RDD Resilient Distributed Datasets

'Data > SPARK' 카테고리의 다른 글

관련글

티스토리툴바