Interview Questions for Spark GraphX
Q1 Name a few commonly used components of the Spark ecosystem.
Answer: Spark SQL (the successor to Shark), Spark Streaming, MLlib and GraphX.
Q2 What is “GraphX” in Spark?
Answer: “GraphX” is the Spark component for graph processing and graph-parallel computation. It lets programmers interactively build and transform graphs, and reason about graph-structured data at scale.
Q3 Define “PageRank”.
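Answer: “PageRank” measures the importance of each vertex in a graph, on the assumption that an edge from vertex u to vertex v represents an endorsement of v's importance by u. A vertex that many other vertices link to therefore receives a high rank.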
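To make the idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank in plain Python. It is not GraphX's implementation; the function name `pagerank`, the `damping` value of 0.85, the fixed iteration count and the four-vertex link graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {v: 1.0 / n for v in vertices}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread over every vertex.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {v: base for v in vertices}
        for src in vertices:
            for dst in out_links[src]:
                # Each vertex splits its rank evenly among the vertices it links to.
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        ranks = new_ranks
    return ranks


# Hypothetical link graph: an edge (u, v) means "u links to v".
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(edges)
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 3))
```

In this sketch the ranks always sum to 1, and the vertex with the most incoming links ends up with the largest share.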
Q4 What is lineage graph?
Answer: Each RDD in Spark depends on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Spark uses the lineage graph to compute each RDD on demand, so whenever part of a persisted RDD is lost, the lost data can be recomputed from the lineage information.
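The recovery idea can be sketched with a toy model in plain Python. The `Dataset` class below is a hypothetical stand-in for an RDD, not Spark's API: each dataset only remembers its parent and the function that derives it, so a lost cached result can be rebuilt by replaying the chain.

```python
class Dataset:
    """Toy stand-in for an RDD: stores its parent and how to derive itself."""

    def __init__(self, compute, parent=None, name="source"):
        self.compute = compute  # function from the parent's rows to this dataset's rows
        self.parent = parent    # the dataset this one depends on (None for a source)
        self.name = name
        self.cached = None      # persisted result, which may be lost

    def map(self, fn):
        return Dataset(lambda rows: [fn(r) for r in rows], self, "map")

    def filter(self, pred):
        return Dataset(lambda rows: [r for r in rows if pred(r)], self, "filter")

    def persist(self):
        self.cached = self.collect()
        return self

    def lose_cache(self):
        """Simulate losing the persisted data, for example after a node failure."""
        self.cached = None

    def lineage(self):
        """Return the dependency chain from the source to this dataset."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        if self.cached is not None:
            return self.cached
        parent_rows = self.parent.collect() if self.parent is not None else None
        return self.compute(parent_rows)  # recompute on demand from the lineage


def from_list(values):
    data = list(values)
    return Dataset(lambda _rows: list(data))


numbers = from_list(range(1, 7))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()
before = even_squares.collect()
even_squares.lose_cache()
after = even_squares.collect()  # rebuilt by replaying filter and map
print(even_squares.lineage(), before == after)
```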
Q5 Explain the major libraries that constitute the Spark ecosystem.
Answer: Spark MLlib – Machine learning library in Spark for commonly used learning algorithms such as clustering, regression and classification.
Spark Streaming – This library is used to process real-time streaming data.
Spark GraphX – Spark API for graph-parallel computations, with basic operators such as joinVertices, subgraph and aggregateMessages.
Spark SQL – Helps execute SQL-like queries on Spark data, including from standard visualization or BI tools.
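Of the libraries listed above, the GraphX operators are the least self-explanatory, so here is a rough plain-Python model of what subgraph and aggregateMessages do. This is not the GraphX API (which is Scala); the function names, argument shapes and the small follower graph are assumptions made for illustration.

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """Toy neighborhood aggregation over edges.

    send_msg sees one edge with both endpoint attributes and returns a list of
    (target_id, message) pairs; merge_msg combines two messages for one vertex.
    """
    inbox = {}
    for src, dst, edge_attr in edges:
        messages = send_msg(src, vertices[src], dst, vertices[dst], edge_attr)
        for target, msg in messages:
            if target in inbox:
                inbox[target] = merge_msg(inbox[target], msg)
            else:
                inbox[target] = msg
    return inbox  # vertices that received no message are absent


def subgraph(vertices, edges, vpred):
    """Keep vertices satisfying vpred and the edges whose endpoints both survive."""
    kept = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [(s, d, a) for s, d, a in edges if s in kept and d in kept]
    return kept, kept_edges


# Hypothetical follower graph: vertex attribute is (name, age), edge (u, v) means "u follows v".
vertices = {1: ("alice", 28), 2: ("bob", 27), 3: ("carol", 65)}
edges = [(2, 1, "follows"), (3, 1, "follows"), (1, 3, "follows")]

# For each user, count the followers and sum their ages.
follower_stats = aggregate_messages(
    vertices,
    edges,
    send_msg=lambda src, src_attr, dst, dst_attr, edge_attr: [(dst, (1, src_attr[1]))],
    merge_msg=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
under_60, under_60_edges = subgraph(vertices, edges, lambda vid, attr: attr[1] < 60)
print(follower_stats, under_60_edges)
```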
Q6 Does Apache Spark provide checkpointing?
Answer: Yes. Lineage graphs can always be used to recover RDDs after a failure, but recovery is time-consuming when the RDDs have long lineage chains. Spark provides an API for checkpointing: once a checkpoint directory has been set with SparkContext.setCheckpointDir, calling checkpoint() on an RDD saves it to reliable storage and truncates its lineage. (The original RDD design described this as a REPLICATE flag to persist.) The decision on which data to checkpoint is left to the user. Checkpoints are most useful when lineage graphs are long and contain wide dependencies.
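The trade-off can be illustrated with a toy model in plain Python that counts how many transformations must be replayed to rebuild a dataset. The `Step` class is hypothetical and only mimics the idea; it is not how Spark implements checkpointing.

```python
class Step:
    """One dataset in a lineage chain; parent is None for the source data."""

    def __init__(self, fn, parent=None):
        self.fn = fn
        self.parent = parent
        self.checkpointed = False
        self.saved = None  # models data written to reliable storage

    def then(self, fn):
        return Step(fn, parent=self)

    def checkpoint(self):
        self.saved, _ = self.recover()
        self.checkpointed = True
        self.parent = None  # the lineage before this point is no longer needed
        return self

    def recover(self):
        """Rebuild this dataset; return (value, transformations replayed)."""
        # Walk back to the nearest checkpoint, or to the source if there is none.
        chain, node = [], self
        while not node.checkpointed and node.parent is not None:
            chain.append(node)
            node = node.parent
        value = node.saved if node.checkpointed else node.fn(None)
        replayed = 0
        for step in reversed(chain):
            value = step.fn(value)
            replayed += 1
        return value, replayed


source = Step(lambda _: 0)
tip = source
for _ in range(1000):  # build a long lineage chain
    tip = tip.then(lambda x: x + 1)

value_before, cost_before = tip.recover()  # replays the whole chain
tip.checkpoint()
value_after, cost_after = tip.recover()    # reads the checkpoint instead
later_value, later_cost = tip.then(lambda x: x * 2).recover()
print(cost_before, cost_after, later_cost)
```

In this model, recovery after the checkpoint replays only the steps added since the checkpoint, which is why checkpointing pays off for long chains.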
*************All The Best**************