1. What are the main features of Apache Spark?
The main features of Apache Spark are as follows:
- Performance: The key feature of Apache Spark is its performance. Spark can run programs up to 100 times faster than Hadoop MapReduce when processing in memory, and about 10 times faster when processing on disk.
- Ease of Use: Spark provides APIs in Java, Python, R and Scala, which makes it much easier to develop applications for Apache Spark.
- Integrated Solution: In Spark we can create an integrated solution that combines the power of SQL, streaming and data analytics.
- Run Everywhere: Apache Spark can run on many platforms. It can run on Hadoop, Mesos, in the cloud or standalone. It can also connect to many data sources like HDFS, Cassandra, HBase, S3, etc.
- Stream Processing: Apache Spark supports real-time stream processing, which makes it possible to build real-time analytics solutions.
2. What is a Resilient Distributed Dataset in Apache Spark?
Resilient Distributed Dataset (RDD) is the data abstraction in Apache Spark. It is a distributed and resilient collection of records spread over many partitions.
An RDD hides the data partitioning and distribution behind the scenes. The main features of an RDD are as follows:
- Distributed: Data in an RDD is distributed across multiple nodes.
- Resilient: An RDD is a fault-tolerant dataset. In case of node failure, Spark can recompute the data.
- Dataset: It is a collection of data similar to collections in Scala.
- Immutable: Data in an RDD cannot be modified after creation. But we can transform it using a Transformation.
3. What is a Transformation in Apache Spark?
A Transformation in Apache Spark is a function that can be applied to an RDD. The output of a Transformation is another RDD.
A Transformation in Spark is a lazy operation, which means it is not executed immediately. Once we call an Action, the transformations are executed.
A Transformation does not change the input RDD.
We can also chain several Transformations together to create a data flow, as in the sketch below.
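A minimal sketch of lazy Transformations followed by an Action (assumes a SparkContext named sc, e.g. in spark-shell):
```scala
val numbers = sc.parallelize(1 to 100)     // create an RDD
val squares = numbers.map(n => n * n)      // Transformation: recorded, not executed yet
val evens   = squares.filter(_ % 2 == 0)   // another lazy Transformation
val total   = evens.count()                // Action: triggers execution of the whole chain
```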
4. What are security options in Apache Spark?
Apache Spark provides the following security options (a configuration sketch follows the list):
- Encryption: Apache Spark supports encryption via SSL, so data can be transferred securely over HTTPS and transmitted in encrypted form. We can use the spark.ssl parameters to set the SSL configuration.
- Authentication: We can perform authentication with a shared secret in Apache Spark. We can use spark.authenticate to configure authentication in Spark.
- Event Logging: If we use event logging, we can set the permissions on the directory where event logs are stored. These permissions provide access control for the event logs.
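A hedged configuration sketch in Scala; the exact keys and values depend on the Spark version and cluster manager, so treat the settings and the log directory below as illustrative assumptions:
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")                // shared-secret authentication
  .set("spark.ssl.enabled", "true")                 // enable SSL for Spark services
  .set("spark.eventLog.enabled", "true")            // turn on event logging
  .set("spark.eventLog.dir", "hdfs:///spark-logs")  // example directory; secure it with file permissions
```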
5. How will you monitor Apache Spark?
We can use the Web UI provided by the SparkContext to monitor Spark. We can access this Web UI on port 4040 to get useful information. Some of the information we can monitor includes:
- Scheduler tasks and stages
- RDD Sizes and Memory usage
- Spark Environment Information
- Executors Information
Spark also provides a Metrics library. This library can be used to send Spark information to HTTP endpoints, JMX, CSV files, etc.
This is another option to collect Spark runtime information and feed it into an external monitoring or dashboard tool.
6. What are the main libraries of Apache Spark?
Some of the main libraries of Apache Spark are as follows:
- MLlib: This is Spark’s machine learning library. We can use it to create scalable machine learning systems. We can use various machine learning algorithms as well as features like pipelines, etc., with this library.
- GraphX: This library is used for graph computation. It helps in creating a graph abstraction of the data and then applying various graph operators like subgraph, joinVertices, etc.
- Structured Streaming: This library is used for handling streams in Spark. It is a fault-tolerant system built on top of the Spark SQL engine to process streams.
- Spark SQL: This is another popular component that is used for processing SQL queries on the Spark platform.
- SparkR: This is a package in Spark for using Spark from the R language. We can use R data frames, dplyr, etc., from this package. We can also start SparkR from RStudio.
7. What are the main functions of Spark Core in Apache Spark?
Spark Core is the central component of Apache Spark. It serves the following functions:
- Distributed Task Dispatching
- Job Scheduling
- I/O Functions
8. How will you do memory tuning in Spark?
For memory tuning we have to take care of these points:
- Amount of memory used by objects
- Cost of accessing objects
- Overhead of Garbage Collection
Apache Spark stores objects in memory for caching, so memory tuning is an important part of a Spark application. First we determine the memory usage of the application.
To do this we first create an RDD and put it in the cache. We can then see the size of the RDD on the Storage page of the Web UI, which tells us the amount of memory consumed by the RDD.
Based on this memory usage, we can estimate the amount of memory needed for our task.
In case we need tuning, we can follow these practices to reduce memory usage (a sketch follows the list):
- Use data structures like arrays of objects or primitives instead of LinkedList or HashMap. The fastutil library provides convenient collection classes for primitive types that are compatible with Java.
- Reduce the use of nested data structures with a large number of small objects and pointers; e.g., a linked list has pointers within each node.
- It is a good practice to use numeric IDs instead of Strings for keys.
- We can also use the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
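A sketch of these steps under assumptions: the JVM flag is passed via spark.executor.extraJavaOptions, the input path is hypothetical, and the cached RDD size is then read off the Storage page of the Web UI:
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops") // 4-byte pointers
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs:///data/input.txt") // example path
data.cache()   // mark the RDD for caching
data.count()   // materialize it so its size appears on the Storage page of the Web UI
```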
9. What are the two ways to create RDD in Spark?
We can create an RDD in Spark in the following two ways (see the sketch after this list):
- Internal: We can parallelize an existing collection of data within our Spark driver program and create an RDD out of it.
- External: We can also create an RDD by referencing a dataset in an external data source like AWS S3, HDFS, HBase, etc.
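A minimal sketch of both approaches (assumes a SparkContext named sc; the HDFS path is an example):
```scala
val internalRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))       // from a collection in the driver program
val externalRdd = sc.textFile("hdfs:///data/records.txt")  // from an external data source
```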
10. What are the main operations that can be done on a RDD in Apache Spark?
There are two main operations that can be performed on an RDD in Spark:
- Transformation: This is a function that is used to create a new RDD out of an existing RDD.
- Action: This is a function that returns a value to the driver program after running a computation on the RDD.
11. What are the common Transformations in Apache Spark?
Some of the common transformations in Apache Spark are as follows (a short sketch follows the list):
- map(func): This is a basic transformation that returns a new dataset by passing each element of the input dataset through the function func.
- filter(func): This transformation returns a new dataset of the elements for which func returns true. It is used to filter elements in a dataset based on the criteria in func.
- union(otherDataset): This is used to combine a dataset with another dataset to form the union of the two datasets.
- intersection(otherDataset): This transformation gives the elements common to the two datasets.
- pipe(command, [envVars]): This transformation passes each partition of the dataset through a shell command.
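A short sketch of these transformations (assumes a SparkContext named sc):
```scala
val a = sc.parallelize(Seq(1, 2, 3, 4, 5))
val b = sc.parallelize(Seq(4, 5, 6, 7))

val doubled  = a.map(_ * 2)        // map
val bigOnes  = a.filter(_ > 3)     // filter
val combined = a.union(b)          // union
val common   = a.intersection(b)   // intersection
```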
12. What are the common Actions in Apache Spark?
Some of the commonly used Actions in Apache Spark are as follows (a short sketch follows the list):
- reduce(func): This Action aggregates the elements of a dataset using the function func.
- count(): This Action gives the total number of elements in a dataset.
- collect(): This Action returns all the elements of a dataset as an array to the driver program.
- first(): This Action gives the first element of the dataset.
- take(n): This Action gives the first n elements of the dataset.
- foreach(func): This Action executes the function func on each element of the dataset.
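A short sketch of these Actions (assumes a SparkContext named sc):
```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

val sum   = nums.reduce(_ + _)   // 15
val n     = nums.count()         // 5
val all   = nums.collect()       // Array(1, 2, 3, 4, 5) returned to the driver
val head  = nums.first()         // 1
val top3  = nums.take(3)         // Array(1, 2, 3)
nums.foreach(x => println(x))    // runs on the executors
```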
13. What is a Shuffle operation in Spark?
The shuffle operation is used in Spark to redistribute data across partitions.
It is a costly and complex operation.
In general, a single task in Spark operates on the elements of one partition. To execute a shuffle, an operation has to touch elements of all partitions, which is why it is also called an all-to-all operation.
14. What are the operations that can cause a shuffle in Spark?
Some of the common operations that can cause a shuffle internally in Spark are as follows (see the sketch after this list):
- repartition
- coalesce
- groupByKey
- reduceByKey
- cogroup
- join
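A sketch of a shuffle-inducing pipeline (assumes a SparkContext named sc):
```scala
val words  = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)   // reduceByKey shuffles data across partitions
val fewer  = counts.repartition(2)      // repartition also triggers a shuffle
```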
15. What is the purpose of Spark SQL?
Spark SQL is used for running SQL queries. We can use Spark SQL to work with SQL as well as the Dataset API in Spark.
During execution, Spark SQL uses the same computation engine for SQL as well as the Dataset API.
With Spark SQL we can get more information about the structure of the data as well as the computation being performed.
We can also use Spark SQL to read data from an existing Hive installation.
Spark SQL can also be accessed through the JDBC/ODBC API as well as the command line.
16. What is a DataFrame in Spark SQL?
A DataFrame in Spark SQL is a Dataset organized into named columns. It is conceptually like a table in SQL.
In Java and Scala, a DataFrame is represented by a Dataset of Rows.
We can create a DataFrame from an existing RDD, a Hive table or from other Spark data sources.
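A minimal sketch (assumes a SparkSession named spark, e.g. in spark-shell; the case class and the JSON path are illustrative):
```scala
import spark.implicits._

case class Person(name: String, age: Int)

val df = Seq(Person("Alice", 30), Person("Bob", 25)).toDF()  // DataFrame from a local collection
df.printSchema()
df.filter($"age" > 26).show()

// A DataFrame can also come from an external source, e.g.:
// val jsonDf = spark.read.json("hdfs:///data/people.json")
```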
17. What is a Parquet file in Spark?
Apache Parquet is a columnar storage format that is available to any project in the Hadoop ecosystem. Any data processing framework, data model or programming language can use it.
It provides efficient compression and encoding schemes and is common across Hadoop ecosystem projects.
Spark SQL supports both reading and writing of Parquet files, and the schema of the original data is automatically preserved.
During write operations, by default all columns in a Parquet file are converted to nullable columns.
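A minimal sketch of Parquet I/O (assumes a SparkSession named spark; the paths are examples):
```scala
val people = spark.read.json("hdfs:///data/people.json")    // any existing DataFrame
people.write.parquet("hdfs:///data/people.parquet")         // schema is preserved in the Parquet file

val parquetDf = spark.read.parquet("hdfs:///data/people.parquet")
parquetDf.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()
```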
18. What is the difference between Apache Spark and Apache Hadoop MapReduce?
Some of the main differences between Apache Spark and Hadoop MapReduce are as follows:
- Speed: Apache Spark is 10x to 100x faster than Hadoop MapReduce due to its use of in-memory processing.
- Memory: Apache Spark stores data in memory, whereas Hadoop MapReduce stores data on disk.
- RDD: Spark uses Resilient Distributed Datasets (RDDs) that guarantee fault tolerance, whereas Apache Hadoop achieves fault tolerance by replicating data in multiple copies.
- Streaming: Apache Spark supports streaming with very little administration. This makes it much easier to use than Hadoop for real-time stream processing.
- API: Spark provides a versatile API that can be used with multiple data sources as well as languages. It is more extensible than the API provided by Apache Hadoop.
19. What are the main languages supported by Apache Spark?
Some of the main languages supported by Apache Spark are as follows:
- Java: We can use the JavaSparkContext object to work with Java in Spark.
- Scala: To use Scala with Spark, we create a SparkContext object in Scala.
- Python: We also use a SparkContext to work with Python in Spark.
- R: We can use the SparkR module to work with the R language in the Spark ecosystem.
- SQL: We can also use Spark SQL to work with the SQL language in Spark.
20. What are the file systems supported by Spark?
Some of the popular file systems supported by Apache Spark are as follows:
- HDFS
- S3
- Local File System
- Cassandra
- OpenStack Swift
- MapR File System
21. What is a Spark Driver?
The Spark Driver is the program that runs on the master node of the cluster. It takes care of declaring any operation, Transformation or Action, on an RDD.
With the Spark Driver we can keep track of all the operations on a dataset. It can also be used to rebuild an RDD in Spark.
22. What is an RDD Lineage?
Resilient Distributed Dataset (RDD) Lineage is a graph of all the parent RDDs of an RDD. Since Spark does not replicate data, it is possible to lose some data. In case some dataset is lost, it is possible to use the RDD Lineage to recreate the lost dataset.
RDD Lineage therefore helps Spark avoid the cost of replicating data while still building a resilient system.
23. What are the two main types of Vector in Spark?
There are two main types of Vector in Spark:
- Dense Vector: A dense vector is backed by an array of double values that holds the vector's values, e.g., [1.0, 0.0, 3.0].
- Sparse Vector: A sparse vector is backed by two parallel arrays, one for indices and one for values, e.g., (3, [0, 2], [1.0, 3.0]).
In the sparse representation, the first element is the size of the vector, the second is the array of indices of the non-zero values, and the third is the array of non-zero values. A short sketch of both types follows.
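A minimal sketch of both vector types (uses the spark.ml linear algebra package):
```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

val dense: Vector  = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) // size, indices, values
```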
24. What are the different deployment modes of Apache Spark?
Some of the popular deployment modes of Apache Spark are as follows:
- Amazon EC2: We can use the AWS product Elastic Compute Cloud (EC2) to deploy and run a Spark cluster.
- Mesos: We can deploy a Spark application on a private cluster by using Apache Mesos.
- YARN: We can also deploy Spark on Hadoop YARN (Hadoop NextGen).
- Standalone: This is the mode in which we start Spark by hand; we launch a standalone cluster manually.
25. What is lazy evaluation in Apache Spark?
Apache Spark uses lazy evaluation as a performance optimization technique. With lazy evaluation, a transformation is not applied immediately to an RDD. Instead, Spark records the transformations that have to be applied to the RDD. Once an Action is called, Spark executes all the transformations.
Since Spark does not execute a transformation as soon as it is declared, this behavior is called lazy evaluation.
26. What are the core components of a distributed application in Apache Spark?
Core components of a distributed application in Apache Spark are as follows:
- Cluster Manager: This is the component responsible for launching executors and drivers on multiple nodes. We can use different types of cluster managers based on our requirements. Some of the common types are Standalone, YARN, Mesos, etc.
- Driver: This is the main program in Spark that runs the main() function of an application. The driver program creates the SparkContext. It listens for and accepts incoming connections from its executors, and it can schedule tasks on the cluster. It should run close to the worker nodes.
- Executor: This is a process on a worker node, launched on the node to run an application. It can run tasks and keep data in memory or on disk storage while performing the tasks.
27. What is the difference between the cache() and persist() methods in Apache Spark?
Both cache() and persist() are used to persist an RDD in memory across operations. The key difference is that persist() lets us specify the storage level to use, whereas cache() uses the default storage strategy, which is MEMORY_ONLY.
28. How will you remove data from cache in Apache Spark?
In general, Apache Spark automatically removes unused objects from the cache. It uses a Least Recently Used (LRU) algorithm to drop old partitions, and there are automatic monitoring mechanisms in Spark to track cache usage on each node.
In case we want to forcibly remove an object from the cache, we can use the RDD.unpersist() method, as in the sketch below.
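A minimal sketch of persist(), cache() and unpersist() (assumes a SparkContext named sc):
```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // persist() with an explicit storage level
rdd.count()
rdd.unpersist()                            // forcibly remove it from the cache

val cached = sc.parallelize(1 to 1000).cache()  // cache() uses the default MEMORY_ONLY level
```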
29. What is the use of SparkContext in Apache Spark?
SparkContext is the central object of a Spark application; it coordinates the execution of the application on a cluster.
Through the SparkContext we connect to a cluster manager, which allocates resources across applications.
For any Spark program we first create a SparkContext object, and we access the cluster through it. To create a SparkContext object, we first create a SparkConf object, which contains the configuration information of our application.
In the Spark shell, a SparkContext is created for us by default.
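A minimal sketch of creating a SparkContext from a SparkConf (the app name and master URL are example values):
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")      // in production, point this at the cluster manager
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq(1, 2, 3))
println(rdd.count())

sc.stop()
```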
30. Do we need HDFS for running Spark application?
This is a trick question. Spark supports multiple file systems: HDFS, HBase, the local file system, S3, Cassandra, etc. So HDFS is not mandatory for running a Spark application.
31. What is Spark Streaming?
Spark Streaming is a very popular feature of Spark for processing live streams with large amounts of data.
Spark Streaming uses the Spark API to create a highly scalable, high-throughput and fault-tolerant system to handle live data streams.
Spark Streaming supports ingestion of data from popular sources like Kafka, Kinesis, Flume, etc.
We can apply popular functions like map, reduce, join, etc., on the data processed by Spark Streaming.
The processed data can be written to a file system or sent to databases and live dashboards.
32. How does Spark Streaming work internally?
Spark Streaming listens to live data streams from various sources. On receiving data, it divides the data into small batches that the Spark engine can handle. These small batches are processed by the Spark engine to generate an output stream of result data.
Internally, Spark uses an abstraction called a DStream, or discretized stream. A DStream represents a continuous stream of data. We can create a DStream from Kafka, Flume, Kinesis, etc.
A DStream is nothing but a sequence of RDDs in Spark.
We can apply transformations and actions on this sequence of RDDs to create further RDDs.
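A minimal DStream sketch, a word count over a socket source (the host and port are example values):
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)
val words  = lines.flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```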
33. What is a Pipeline in Apache Spark?
A Pipeline is a concept from machine learning. It is a sequence of algorithms that are executed for processing and learning from data.
A Pipeline is similar to a workflow. There can be one or more stages in a Pipeline.
34. How does Pipeline work in Apache Spark?
A Pipeline is a sequence of stages. Each stage in a Pipeline can be a Transformer or an Estimator. These stages are run in order, and initially a DataFrame is passed as input to the Pipeline.
The DataFrame is transformed as it passes through each stage of the Pipeline. Most of the time, runtime checking is done on the DataFrame passing through the Pipeline. We can also save a Pipeline to disk and read it back at a later point in time.
35. What is the difference between Transformer and Estimator in Apache Spark?
A Transformer is an abstraction for feature transformers and learned models. A Transformer implements the transform() method, which converts one DataFrame into another, generally by appending one or more columns to it.
In a feature transformer, a DataFrame is the input and the output is a new DataFrame with a new mapped column.
An Estimator is an abstraction for a learning algorithm that fits or trains on data. An Estimator implements the fit() method, which takes a DataFrame as input and produces a Model. A sketch of a Pipeline built from Transformers and an Estimator follows.
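A sketch of a Pipeline with Transformer stages (Tokenizer, HashingTF) and an Estimator stage (LogisticRegression); it assumes a DataFrame named training with "text" and "label" columns:
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)   // assumes `training` exists; fit() trains the Estimator stages
// model.transform(testData) would append prediction columns to testData
```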
36. What are the different types of Cluster Managers in Apache Spark?
The main types of cluster managers for Apache Spark are as follows:
- Standalone: It is a simple cluster manager that is included with Spark. We can start Spark manually in this mode.
- Spark on Mesos: In this mode, the Mesos master replaces the Spark master as the cluster manager. When the driver creates a job, Mesos determines which machines will handle the tasks.
- Hadoop YARN: In this setup, Hadoop YARN is used as the cluster manager. There are two modes in this setup. In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster. In client mode, the Spark driver runs in the client process and the application master is used only for requesting resources from YARN.
37. How will you minimize data transfer while working with Apache Spark?
Generally, the shuffle operation in Spark leads to a large amount of data transfer. We can configure the Spark shuffle process for optimal data transfer. Some of the main points are as follows:
- spark.shuffle.compress: This configuration can be set to true to compress map output files, which reduces the amount of data transferred.
- ByKey operations: We can minimize the use of ByKey operations to minimize the shuffle calls.
38. What is the main use of MLlib in Apache Spark?
MLlib is the machine learning library in Apache Spark. Some of the main uses of MLlib in Spark are as follows:
- ML Algorithms: It contains machine learning algorithms such as classification, regression, clustering, and collaborative filtering.
- Featurization: MLlib provides algorithms to work with features, such as feature extraction, transformation, dimensionality reduction, and selection.
- Pipelines: It contains tools for constructing, evaluating, and tuning ML Pipelines.
- Persistence: It also provides methods for saving and loading algorithms, models, and Pipelines.
- Utilities: It contains utilities for linear algebra, statistics, data handling, etc.
39. What is Checkpointing in Apache Spark?
In Spark Streaming there is a concept of checkpointing to add resiliency to the application. In case of a failure, a streaming application needs a checkpoint to recover, so Spark provides checkpointing. There are two types of checkpointing:
- Metadata Checkpointing: Metadata is the configuration information and other information that defines a streaming application. We can create a metadata checkpoint so that the node running the driver application can recover from a failure. Metadata includes the configuration, DStream operations, incomplete batches, etc.
- Data Checkpointing: In this checkpoint we save an RDD to reliable storage. This is useful in stateful transformations where the generated RDD depends on the RDD of the previous batch, which can produce a long chain of RDDs. To avoid such a long recovery time, it is easier to create data checkpoints of intermediate RDDs.
40. What is an Accumulator in Apache Spark?
An Accumulator is a variable in Spark that can only be added to, through an associative and commutative operation, and can therefore be supported efficiently in parallel.
It is generally used to implement a counter or a cumulative sum.
Spark supports Accumulators of numeric types by default.
An Accumulator can be named or unnamed, as in the sketch below.
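A minimal Accumulator sketch (assumes a SparkContext named sc; the data is illustrative):
```scala
val errorCount = sc.longAccumulator("errorCount")   // a named numeric accumulator

sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { record =>
  if (record == "error") errorCount.add(1)          // associative, commutative addition on executors
}

println(errorCount.value)   // the accumulated value, read on the driver
```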
41. What is a Broadcast variable in Apache Spark?
As per the Spark documentation, “A Broadcast variable allows a programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.” Spark also distributes broadcast variables using efficient broadcast algorithms to reduce communication cost.
When tasks across stages need the same common data, Spark can broadcast this data as a broadcast variable. The data in these variables is serialized and deserialized before running a task.
We can use SparkContext.broadcast(v) to create a broadcast variable.
It is recommended to use the broadcast variable instead of the original value v in any functions run on the cluster, as in the sketch below.
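A minimal broadcast sketch (assumes a SparkContext named sc; the lookup table is illustrative):
```scala
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val broadcastLookup = sc.broadcast(lookup)     // cached read-only on each machine

val data   = sc.parallelize(Seq("a", "b", "c", "a"))
val mapped = data.map(k => broadcastLookup.value.getOrElse(k, 0))  // use .value, not the original map
println(mapped.collect().mkString(", "))
```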
42. What is Structured Streaming in Apache Spark?
Structured Streaming is a newer feature of Spark, introduced in the Spark 2.x releases. It is a scalable and fault-tolerant stream-processing engine built on the Spark SQL engine. We can use the Dataset or DataFrame API to express streaming aggregations, event-time windows, etc. The computations are done on the optimized Spark SQL engine.
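A minimal Structured Streaming sketch, a streaming word count over a socket source (assumes a SparkSession named spark; host and port are example values):
```scala
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // keep the full running count table
  .format("console")
  .start()

query.awaitTermination()
```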
43. How will you pass functions to Apache Spark?
The Spark API relies on passing functions in the driver program so that they can be run on the cluster. Two common ways to pass functions in Spark are as follows (see the sketch after this list):
- Anonymous Function Syntax: This is used for passing short pieces of code as an anonymous function.
- Static Methods in a Singleton Object: We can also define static-style methods in an object with only one instance, i.e., a singleton. This object, along with its methods, can be passed to cluster nodes.
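A sketch of both approaches (assumes a SparkContext named sc; the object and data are illustrative):
```scala
object MyFunctions {                            // singleton object with a static-style method
  def toUpper(s: String): String = s.toUpperCase
}

val lines = sc.parallelize(Seq("spark", "scala"))

val upper1 = lines.map(MyFunctions.toUpper)     // passing a method of a singleton object
val upper2 = lines.map(s => s.toUpperCase)      // anonymous function syntax
```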
44. What is a Property Graph?
A Property Graph is a directed multigraph in which we can attach a user-defined object to each vertex and edge.
In a directed multigraph, there can be multiple parallel edges that share the same source and destination vertices.
When modeling data, the option of parallel edges helps in creating multiple relationships between the same pair of vertices.
E.g., two people can have two relationships, Boss as well as Mentor.
45. What is Neighborhood Aggregation in Spark?
Neighborhood Aggregation is a concept in the GraphX module of Spark. It refers to the task of aggregating information about the neighborhood of each vertex.
E.g., we may want to know the number of books that reference a book, or the number of times a tweet is retweeted.
This concept is used in iterative graph algorithms. Some of its popular uses are in PageRank, Shortest Path, etc.
We can use the aggregateMessages operation in GraphX, with a sendMsg and a mergeMsg function, to implement Neighborhood Aggregation, as in the sketch below.
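A sketch of Neighborhood Aggregation that counts each vertex's in-degree via aggregateMessages (assumes a SparkContext named sc; the vertex and edge data are illustrative):
```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(1L, 3L, 1)))
val graph    = Graph(vertices, edges)

val inDegrees = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),   // sendMsg: every edge sends "1" to its destination vertex
  (a, b) => a + b            // mergeMsg: messages arriving at a vertex are summed
)
inDegrees.collect().foreach(println)
```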
46. What are different Persistence levels in Apache Spark?
Different Persistence levels in Apache Spark are as follows:
- MEMORY_ONLY: In this level, the RDD is stored as deserialized Java objects in the JVM. If the RDD doesn't fit in memory, it will be recomputed when needed.
- MEMORY_AND_DISK: In this level, the RDD is stored as deserialized Java objects in the JVM. If the RDD doesn't fit in memory, the remainder is stored on disk.
- MEMORY_ONLY_SER: In this level, the RDD is stored as serialized Java objects in the JVM. This is more space-efficient than deserialized objects.
- MEMORY_AND_DISK_SER: In this level, the RDD is stored as serialized Java objects in the JVM. If the RDD doesn't fit in memory, the remainder is stored on disk.
- DISK_ONLY: In this level, the RDD is stored only on disk.
47. How will you select the storage level in Apache Spark?
We use the storage level to maintain a balance between CPU efficiency and memory usage.
If our RDDs fit in memory, we use the MEMORY_ONLY option. Performance is very good with this option because the objects stay in memory.
If our RDDs cannot fit in memory, we go for the MEMORY_ONLY_SER option and select a serialization library that provides space savings. This option is also quite fast.
If our RDDs still cannot fit in memory and there is a big gap between available memory and total object size, we go for the MEMORY_AND_DISK option, in which some partitions are stored on disk.
For fast fault recovery, we use the replicated storage levels, which store each partition on multiple nodes.
48. What are the options in Spark to create a Graph?
We can create a graph in Spark from a collection of vertices and edges. Some of the options in Spark to create a graph are as follows (a short sketch follows the list):
- Graph.apply: This is the simplest option to create a graph. We use this option to create a graph from RDDs of vertices and edges.
- Graph.fromEdges: We can also create a graph from an RDD of edges only. In this option, vertices are created automatically and a default value is assigned to each vertex.
- Graph.fromEdgeTuples: We can also create a graph from an RDD of edge tuples only.
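A short sketch of these options (assumes a SparkContext named sc; the vertex and edge data are illustrative):
```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "mentor")))

val g1 = Graph(vertices, edges)           // Graph.apply: from RDDs of vertices and edges
val g2 = Graph.fromEdges(edges, "?")      // vertices created automatically with the default value "?"
val g3 = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L))), 1)  // from (src, dst) tuples only
```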
49. What are the basic Graph operators in Spark?
Some of the common Graph operators in Apache Spark are as follows:
- numEdges
- numVertices
- inDegrees
- outDegrees
- degrees
- vertices
- edges
- persist
- cache
- unpersistVertices
- partitionBy
50. What is the partitioning approach used in GraphX of Apache Spark?
GraphX uses a vertex-cut approach to distributed graph partitioning.
In this approach, a graph is not split along edges. Rather, the graph is partitioned along vertices, and these vertices can span multiple machines.
This approach reduces communication and storage overheads.
Edges are assigned to different partitions based on the partitioning strategy that we select.