Top 50 Pig Interview Questions

raju2006
June 17, 2016 0 Comments


Q1 How will you explain co group in Pig?

Answer: COGROUP is an operator in Pig that works on multiple tuples. It can be applied to statements that involve two or more relations, up to a maximum of 127 relations at a time. When you use the operator on two tables, Pig first groups both tables, and after that it joins the two tables on the grouped columns.

 

Q2 What is Pig?

Answer: Pig is an Apache open-source project that runs on Hadoop and provides an engine for parallel data flow on Hadoop. It includes a language called Pig Latin for expressing these data flows. Pig Latin supports operations such as join, sort, and filter, as well as the ability to write User Defined Functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e. storage and processing.

 

Q3 What is BloomMapFile used for?

Answer: The BloomMapFile is a class that extends MapFile. So its functionality is similar to MapFile.

BloomMapFile uses dynamic Bloom filters to provide quick membership test for the keys. It is used in Hbase table format.

 

Q4 What is the difference between Pig and SQL?

Answer: Pig Latin is a procedural version of SQL. Pig has some similarities to SQL, but more differences. SQL is a query language in which the user asks a question in query form: SQL describes what answer is wanted, but not how to compute it. If a user wants to perform multiple operations on tables in SQL, they must write multiple queries and use temporary tables to store intermediate results. SQL does support subqueries, but many SQL users find subqueries confusing and difficult to form properly: using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
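The step-by-step style described above can be sketched as a short Pig Latin pipeline. The file and field names here are hypothetical, chosen only for illustration:

```pig
-- Each step names its result, so no temporary tables or subqueries are needed.
users  = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;            -- keep only adults
by_age = GROUP adults BY age;                  -- group by the age field
counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;
srtd   = ORDER counts BY n DESC;               -- sort by count, descending
STORE srtd INTO 'age_counts';                  -- write the result out
```

Reading top to bottom gives the data flow directly, in the order it executes.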

 

Q5 What is the difference between logical and physical plans?

Answer: Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

 

Q6 Does ‘ILLUSTRATE’ run MR job?

Answer: No, ILLUSTRATE does not launch any MapReduce job; it works on a small sample of the internal data. On the console, ILLUSTRATE simply shows the output of each stage of the script, not the final output.

 

Q7 How does Pig differ from MapReduce?

Answer: In MapReduce, the group-by operation is performed on the reducer side, while filtering and projection can be implemented in the map phase. Pig Latin provides standard operations similar to those in MapReduce, such as order by, filter, and group by. We can analyse a Pig script and understand its data flow, which also makes it easier to catch errors early. Pig Latin is much cheaper to write and maintain than the equivalent Java code for MapReduce.

 

Q8 Is the keyword ‘DEFINE’ like a function name?

Answer: Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered your jar, you have to define an alias for the function. Whatever logic you have written in your Java program is packaged in the exported jar that you registered. The compiler will then check for the function: when it is not present in the built-in library, it looks into your registered jar.
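A minimal sketch of this REGISTER/DEFINE flow; the jar name and class name below are hypothetical:

```pig
REGISTER 'myudfs.jar';                      -- make the exported jar visible to Pig
DEFINE ToUpper com.example.pig.ToUpper();   -- alias the UDF's fully qualified class name
a = LOAD 'names.txt' AS (name:chararray);
b = FOREACH a GENERATE ToUpper(name);       -- call the UDF through its alias
```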

 

Q9 Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?

Answer: No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). When writing a UDF, we have to override certain functions, and the job is done with the help of those functions. The keyword ‘FUNCTIONAL’, however, refers to a built-in, i.e. pre-defined, function, so it does not work as a UDF.

 

Q10 What is Pig useful for?

Answer: We can use Pig in three categories: 1) ETL data pipelines, 2) research on raw data, and 3) iterative processing.

The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data into a warehouse they perform operations on it such as cleaning and aggregation, i.e. transformations on the data.

 

Q11 Why do we need MapReduce during Pig programming?

Answer: Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into MapReduce jobs. Here, MapReduce acts as the execution engine.

 

Q12 What are the scalar datatypes in pig?

Answer: The scalar datatypes are:

  • int    - 4 bytes
  • float  - 4 bytes
  • double - 8 bytes
  • long   - 8 bytes
  • chararray
  • bytearray

 

Q13 What are the different execution modes available in Pig?

Answer: There are 3 modes of execution available in pig

  • Interactive Mode (Also known as Grunt Mode)
  • Batch Mode
  • Embedded Mode

 

Q14 Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?

Answer: Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of different cities, and I want to count the population of two cities using one MapReduce job. Let us assume that one city is Bangalore and the other is Noida. I need to make the key for Bangalore behave like the key for Noida, so that the population data of both cities reaches one reducer. The idea is that I have to instruct the MapReduce program: whenever you find a city named ‘Bangalore’ or a city named ‘Noida’, create an alias name that is common to both cities, so that they share a common key and are passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you create a ‘key’ for a city, the framework treats every distinct city as a different key, so we need a customized partitioner. MapReduce provides for this: you can write your own partitioner and specify that if city = Bangalore or Noida, the same hashcode is produced. However, we cannot create a custom partitioner in Pig: Pig gives us no way to direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.

 

Q15 Is Pig Latin case-sensitive or not?

Answer: Pig Latin is only partly case-sensitive. Keywords are case-insensitive; for example, Load is equivalent to load.

However, relation names are case-sensitive: A = load ‘b’ is not equivalent to a = load ‘b’.

UDF names are also case-sensitive: count is not equivalent to COUNT.

 

Q16 What is the purpose of ‘dump’ keyword in pig?

Answer: dump displays the output of a relation on the screen, for example:

dump processed;

 

Q17 Does Pig give any warning when there is a type mismatch or missing field?

Answer: No, Pig will not show any error or warning on the console if there is a missing field or a type mismatch, and any warning it does log is difficult to find in the log file. When a mismatch is found, Pig substitutes a null value.

 

Q18 What is grunt shell?

Answer: Pig’s interactive shell is known as the Grunt shell. It provides a shell for users to run Pig Latin statements interactively and to interact with HDFS.

 

Q19 What co-group does in Pig?

Answer: Co-group joins data sets by grouping each of them on a particular field. It groups the elements by their common field and then returns a set of records containing two separate bags: the first bag consists of the records of the first data set that share the common field value, and the second bag consists of the matching records of the second data set.
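A sketch of that two-bag output; file and field names here are hypothetical:

```pig
owners = LOAD 'owners.txt' AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets.txt'   AS (pet:chararray, legs:int);
both   = COGROUP owners BY pet, pets BY pet;
-- Each output record has the shape:
--   (group, {matching owners tuples}, {matching pets tuples})
```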

 

Q20 What are the relational operations in Pig Latin?

Answer: They are:

  1.  foreach
  2.  order by
  3.  filter
  4.  group
  5.  distinct
  6.  join
  7.  limit

 

 

Q21 What are the different Relational Operators available in pig language?

 

Answer: Relational operators in pig can be categorized into the following list

 

  • Loading and Storing
  • Filtering
  • Grouping and joining
  • Sorting
  • Combining and Splitting
  • Diagnostic

 

 

Q22 What are the different modes available in Pig?

 

Answer: Two modes are available in Pig.

 

  • Local Mode (Runs on localhost file system)
  • MapReduce Mode (Runs on Hadoop Cluster)

 

 

Q23 Can we say cogroup is a group of more than 1 data set?

 

Answer: Cogroup can operate on a single data set. But in the case of more than one data set, cogroup groups all the data sets and joins them based on the common field. Hence, we can say that cogroup is both a grouping of more than one data set and a join of those data sets on the common field.

 

 

Q23 What does FOREACH do?

 

Answer: FOREACH is used to apply transformations to the data and to generate new data items. The name itself is indicating that for each element of a data bag, the respective action will be performed.

 

Syntax: FOREACH bagname GENERATE expression1, expression2, …

 

The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.
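For example, with a hypothetical input file:

```pig
daily   = LOAD 'daily' AS (exchange:chararray, open:float, close:float);
-- For each record, keep the exchange and compute a new derived field.
changes = FOREACH daily GENERATE exchange, close - open AS change;
```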

 

 

Q24 Why should we use ‘filter’ in pig scripts?

Answer: Filters are similar to the WHERE clause in SQL. A filter contains a predicate: if the predicate evaluates to true for a given record, that record is passed down the pipeline; otherwise, it is not. Predicates can contain comparison operators such as ==, >=, <=, and !=; of these, == and != can also be applied to maps and tuples.

A = load 'inputs' as (symbol, price);

B = filter A by symbol matches 'CM.*';

 

 

Q25 What is bag?

 

Answer: A bag is one of the data models present in Pig. It is an unordered collection of tuples, with duplicates allowed. Bags are used to store collections while grouping. A bag can grow as large as the local disk allows, which means its size is limited but it need not fit into memory: when a bag becomes too big, Pig spills it to local disk and keeps only part of the bag in memory. We represent bags with “{}”.
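Bags appear naturally in the output of GROUP; the names below are hypothetical:

```pig
a = LOAD 'data' AS (k:chararray, v:int);
g = GROUP a BY k;
-- Schema of g: (group:chararray, a:{(k:chararray, v:int)})
-- The second field is a bag holding every tuple of a that shares the key.
sums = FOREACH g GENERATE group, SUM(a.v);
```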

 

 

Q26 why should we use ‘orderby’ keyword in pig scripts?

 

Answer: The order statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group: you indicate a key or set of keys by which you wish to order your data.

input2 = load 'daily' as (exchanges, stocks);

srtd = order input2 by exchanges;

 

 

Q27 Pig Features ?

 

Answer:

i) Data Flow Language – the user specifies a sequence of steps, where each step performs a single high-level data transformation.

ii) User Defined Functions (UDFs)

iii) Debugging Environment

iv) Nested Data Model

 

 

Q28 What are the advantages of pig language?

 

Answer: Pig is easy to learn: it overcomes, to some extent, the need for writing complex MapReduce programs. Pig works in a step-by-step manner, so it is easy to write and, even better, easy to read.

 

It can handle heterogeneous data: Pig can handle all types of data – structured, semi-structured, or unstructured.

 

  • Pig is Faster: Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
  • Pig does more with less: Pig provides the common data operations (filters, joins, ordering, etc.) And nested data types (e.g. Tuples, bags, and maps) which can be used in processing data.
  • Pig is Extensible: Pig is easily extended with UDFs – written in languages including Java, Python, JavaScript, and Ruby – which you can use for loading, aggregation, and analysis. Pig insulates your code from changes to the Hadoop Java API.

 

 

Q29 What is the Physical plan in pig architecture?

 

Answer: The physical form of execution of a pig script happens at this stage. The physical plan converts the logical plan’s operators into the physical operators needed to execute the script.

 

 

Q30 What Is Difference Between Mapreduce and Pig ?

 

Answer:

 

 

  • In MR: you need to write the entire logic for operations like join, group, filter, sum, etc.
  • In Pig: built-in functions are available.
  • In MR: the number of lines of code required is too high, even for simple functionality.
  • In Pig: roughly 10 lines of Pig Latin equal 200 lines of Java.
  • In MR: the time and effort spent coding is high.
  • In Pig: what took 4 hours to write in Java takes about 15 minutes in Pig Latin (approx.).
  • In MR: lower productivity.
  • In Pig: higher productivity.

 

 

Q31 What are the relational operators available related to Grouping and joining in pig language?

 

Answer: Grouping and joining operators are among the most powerful operators in the Pig language, because implementing grouping and joins by hand in low-level MapReduce code is quite involved.

 

  1. JOIN
  2. GROUP
  3. COGROUP
  4. CROSS

 

JOIN is used to join two or more relations. GROUP is used for aggregation of a single relation. COGROUP is used for the aggregation of multiple relations. CROSS is used to create a cartesian product of two or more relations.
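The four operators side by side; the relation and field names are hypothetical:

```pig
a = LOAD 'a.txt' AS (id:int, name:chararray);
b = LOAD 'b.txt' AS (id:int, score:int);
j = JOIN a BY id, b BY id;      -- join two relations on a key
g = GROUP b BY id;              -- aggregate a single relation
c = COGROUP a BY id, b BY id;   -- group multiple relations together
x = CROSS a, b;                 -- cartesian product of two relations
```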

 

 

Q32 Why do we need Pig?

 

Answer: Pig is a high level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition through the User Defined Functions(UDF) facility in Pig you can have Pig invoke code in many languages like JRuby, Jython and Java. Conversely you can execute Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

 

 

Q33 What are the different String functions available in pig?

 

Answer: Below are the most commonly used string functions in Pig:

 

  • UPPER
  • LOWER
  • TRIM
  • SUBSTRING
  • INDEXOF
  • STRSPLIT
  • LAST_INDEX_OF
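A few of these in use, on a hypothetical input file:

```pig
s = LOAD 'names.txt' AS (name:chararray);
t = FOREACH s GENERATE
        UPPER(name),             -- upper-case the string
        TRIM(name),              -- strip leading/trailing whitespace
        SUBSTRING(name, 0, 3),   -- characters at positions 0..2
        INDEXOF(name, 'a', 0);   -- first 'a' at or after position 0
```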

 

 

Q34 What is a relation in Pig?

 

Answer: A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields, or that the fields in the same position (column) have the same type.

 

 

Q35 What is a tuple?

 

Answer: A tuple is an ordered set of fields and A field is a piece of data.

 

 

Q36 What is the MapReduce plan in pig architecture?

 

Answer: In the MapReduce plan stage, the output of the physical plan is converted into an actual MapReduce program, which is then executed across the Hadoop cluster.

 

 

Q37 What is the logical plan in pig architecture?

 

Answer: In the logical plan stage of Pig, statements are parsed for syntax errors. Validation of the input files and of the data structure of each file is also performed. A DAG (Directed Acyclic Graph), with operators as nodes and data flow as edges, is then created. Optimizations of the pig script are also applied to the logical plan.

 

 

Q38 What is UDF in Pig?

 

Answer: Pig has a wide range of built-in functions, but occasionally we need to write complex business logic that cannot be implemented with the primitive functions. Pig therefore provides support for writing User Defined Functions (UDFs) as a way to specify custom processing.

Pig UDFs can presently be implemented in Java, Python, JavaScript, Ruby and Groovy. The most extensive support is provided for Java functions: you can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig itself and because additional interfaces are supported, such as the Algebraic interface and the Accumulator interface. Limited support is provided for Python, JavaScript, Ruby and Groovy functions.
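Registering a Python UDF, for example, looks like this; the script name and function name are hypothetical:

```pig
-- myudfs.py would contain an @outputSchema-annotated function, e.g. clean_name(s).
REGISTER 'myudfs.py' USING jython AS myudfs;
a = LOAD 'data' AS (name:chararray);
b = FOREACH a GENERATE myudfs.clean_name(name);
```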

 

 

Q39 What are the primitive data types in pig?

 

Answer: Following are the primitive data types in pig

 

  1. int
  2. long
  3. float
  4. double
  5. chararray
  6. bytearray

 

 

Q40 What is bag data type in pig?

 

Answer: The bag data type works as a container for tuples and other bags. It is a complex data type in the Pig Latin language.

 

 

Q41 why should we use ‘distinct’ keyword in pig scripts?

 

Answer: The distinct statement is very simple: it removes duplicate records. It works only on entire records, not on individual fields:

input2 = load 'daily' as (exchanges, stocks);

uniq = distinct input2;

 

 

Q42 What are the different math functions available in pig?

 

Answer: Below are the most commonly used math functions in Pig:

 

  • ABS
  • ACOS
  • EXP
  • LOG
  • ROUND
  • CBRT
  • RANDOM
  • SQRT

 

 

Q43 What are the different Eval functions available in pig?

 

Answer: Below are the most commonly used eval functions in Pig:

 

  • AVG
  • CONCAT
  • MAX
  • MIN
  • SUM
  • SIZE
  • COUNT
  • COUNT_STAR
  • DIFF
  • TOKENIZE
  • IsEmpty
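Eval functions are typically applied to the bags produced by GROUP; the input here is hypothetical:

```pig
sales = LOAD 'sales' AS (store:chararray, amount:double);
g     = GROUP sales BY store;
stats = FOREACH g GENERATE
            group,                -- the store
            COUNT(sales),         -- number of records per store
            SUM(sales.amount),    -- total sales
            AVG(sales.amount),    -- average sale
            MAX(sales.amount);    -- largest sale
```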

 

 

Q44 What are the relational operators available related to loading and storing in pig language?

 

Answer: For Loading data and Storing it into HDFS, Pig uses following operators.

 

  1. LOAD
  2. STORE

 

LOAD loads data from the file system. STORE stores data into the file system.
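For example, with hypothetical paths:

```pig
raw = LOAD '/data/input' USING PigStorage('\t') AS (id:int, name:chararray);
STORE raw INTO '/data/output' USING PigStorage(',');
```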

 

 

Q45 Explain about co-group in Pig.

 

Answer: The COGROUP operator in Pig is used to work with multiple tuples. It is applied to statements that contain or involve two or more relations, and can be applied to up to 127 relations at a time. When using the COGROUP operator on two tables at once, Pig first groups both tables and after that joins the two tables on the grouped columns.

 

 

Q46 What are the relational operators available related to combining and splitting in pig language?

 

Answer: UNION and SPLIT are used for combining and splitting relations in Pig.

 

 

Q47 What are different modes of execution in Apache Pig?

 

Answer: Apache Pig runs in 2 modes- one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster.

 

 

Q48 Does Pig support multi-line commands?

 

Answer: Yes. A Pig Latin statement ends only at the semicolon, so a single command can span multiple lines, both in the Grunt shell and in scripts.

 

 

Q49 How would you diagnose or do exception handling in the pig?

 

Answer: For diagnosing and exception handling of a pig script, we can use the following operators.

 

  • DUMP
  • DESCRIBE
  • ILLUSTRATE
  • EXPLAIN

 

DUMP displays the results on screen. DESCRIBE displays the schema of a particular relation. ILLUSTRATE displays step by step execution of a sequence of pig statements. EXPLAIN displays the execution plan for pig latin statements.
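In a Grunt session these look as follows; the relation is hypothetical:

```pig
a = LOAD 'data' AS (id:int, name:chararray);
DESCRIBE a;      -- prints the schema of a
EXPLAIN a;       -- prints the logical, physical and MapReduce plans
ILLUSTRATE a;    -- shows sample data flowing through each step
DUMP a;          -- runs the pipeline and prints a's contents
```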

 

 

Q50 What is the difference between store and dumps commands?

 

Answer: The dump command displays the processed data on the terminal but does not store it anywhere, whereas store writes the output into a folder on the local file system or HDFS. In a production environment, Hadoop developers most often use the ‘store’ command to save data into HDFS.

 

***************All The Best***********