Question # 1
Problem Scenario 77 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id, order_date, order_customer_id, order_status)
Columns of order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish following activities.
1. Copy "retail_db.orders" and "retail_db.order_items" tables to HDFS in respective directories p92_orders and p92_order_items.
2. Join these datasets using order_id in Spark and Python.
3. Calculate total revenue per day and per order.
4. Calculate total and average revenue for each date, using combineByKey and aggregateByKey.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Import single tables.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to HDFS.
Step 2 : Read the data from one of the partitions created using the above commands.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000
Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile("p92_orders")
orderItems = sc.textFile("p92_order_items")
Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value).
#First column is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
#Second column is order_item_order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5 : Join both the RDDs using order_id.
joinedData = orderItemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect(): print(line)
Format of joinedData as below.
(order_id, ('all columns from orderItemsKeyValue', 'all columns from ordersKeyValue'))
Step 6 : Now fetch selected values: order_id, order_date and amount collected on this order.
#Returned row will contain ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(",")[1], row[0]), float(row[1][0].split(",")[4])))
#print the result
for line in revenuePerDayPerOrder.collect(): print(line)
Step 7 : Now calculate total revenue per day and per order.
A. Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
#Generate data as (date, amount_collected) (ignore order_id)
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))
for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8 : Calculate total amount collected for each day, and also the number of records per day.
#Generate output as (date, (total_revenue_for_date, total_number_of_records))
#Line 1 : creates the initial combiner tuple (revenue, 1)
#Line 2 : sums revenues within a partition while incrementing the record counter
#Line 3 : final function to merge all the combiners across partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
lambda revenue: (revenue, 1), \
lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \
lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 9 : Now calculate average for each date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
Step 10 : Using aggregateByKey
#line 1 : initialize both values, revenue and count, as (0,0)
#line 2 : runningRevenueSumTuple (a tuple of total revenue and total record count for each date)
#line 3 : summing revenue and count across all partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
(0,0), \
lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \
lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 11 : Calculate the average revenue per date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
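Note : To sanity-check the combineByKey and aggregateByKey logic above without running Sqoop or HDFS, here is a minimal self-contained PySpark sketch on made-up (date, revenue) pairs; the sample values and the local SparkContext are illustrative assumptions, not part of the original exercise.
from pyspark import SparkContext

sc = SparkContext("local[2]", "revenue-check")  # assumed local test context

# Made-up (date, revenue) pairs standing in for dateAndRevenueTuple
sample = sc.parallelize([("2014-01-01", 100.0), ("2014-01-01", 50.0), ("2014-01-02", 200.0)])

# combineByKey builds the first (sum, count) accumulator from the first value seen per key
viaCombine = sample.combineByKey(
    lambda revenue: (revenue, 1),
    lambda acc, revenue: (acc[0] + revenue, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))

# aggregateByKey starts every accumulator from an explicit zero value (0.0, 0)
viaAggregate = sample.aggregateByKey(
    (0.0, 0),
    lambda acc, revenue: (acc[0] + revenue, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Both print [('2014-01-01', (150.0, 2)), ('2014-01-02', (200.0, 1))]
print(sorted(viaCombine.collect()))
print(sorted(viaAggregate.collect()))
The only real difference is how the per-key accumulator is initialized: combineByKey derives it from the first value for each key, while aggregateByKey takes an explicit neutral zero value, which is why the exercise accepts either approach.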
Question # 2
Problem Scenario 12 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following.
1. Create a table in retail_db with following definition.
CREATE table departments_new (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());
2. Now insert records from departments table to departments_new.
3. Now import data from departments_new table to hdfs.
4. Insert following 5 records in departments_new table.
Insert into departments_new values(110, "Civil", null);
Insert into departments_new values(111, "Mechanical", null);
Insert into departments_new values(112, "Automobile", null);
Insert into departments_new values(113, "Pharma", null);
Insert into departments_new values(114, "Social Engineering", null);
5. Now do the incremental import based on created_date column.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Login to mysql db.
mysql --user=retail_dba --password=cloudera
show databases;
use retail_db;
show tables;
Step 2 : Create a table as given in problem statement.
CREATE table departments_new (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());
show tables;
Step 3 : Insert records from departments table to departments_new.
insert into departments_new select a.*, null from departments a;
Step 4 : Import data from departments_new table to hdfs.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_new \
--target-dir /user/cloudera/departments_new \
--split-by department_id
Step 5 : Check the imported data.
hdfs dfs -cat /user/cloudera/departments_new/part*
Step 6 : Insert following 5 records in departments_new table.
Insert into departments_new values(110, "Civil", null);
Insert into departments_new values(111, "Mechanical", null);
Insert into departments_new values(112, "Automobile", null);
Insert into departments_new values(113, "Pharma", null);
Insert into departments_new values(114, "Social Engineering", null);
commit;
Step 7 : Import incremental data based on created_date column.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_new \
--target-dir /user/cloudera/departments_new \
--append \
--check-column created_date \
--incremental lastmodified \
--split-by department_id \
--last-value "2016-01-30 12:07:37.0"
Step 8 : Check the imported value.
hdfs dfs -cat /user/cloudera/departments_new/part*
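Note : To pick the next --last-value for a subsequent incremental import, you can compute the maximum created_date already present in HDFS. A minimal PySpark sketch, assuming the Sqoop output is comma-delimited with created_date as the third column (the local test context is an assumption):
from pyspark import SparkContext

sc = SparkContext("local[2]", "last-value-check")  # assumed local test context

# Sqoop wrote comma-delimited rows: department_id,department_name,created_date
rows = sc.textFile("/user/cloudera/departments_new")

# Timestamp strings of the form "YYYY-MM-DD HH:MM:SS.f" sort lexicographically,
# so the string max is also the latest created_date
lastValue = rows.map(lambda line: line.split(",")[2]).max()
print(lastValue)  # use this as --last-value in the next incremental run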
Question # 3
Problem Scenario 96 : Your spark application requires extra Java options as below.
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Please replace the XXX value correctly.
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false --conf XXX hadoopexam.jar
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
XXX : "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
Notes:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Here, --conf is used to pass the Spark-related configs which are required for the application to run, like any specific property (e.g. executor memory), or to override a default property set in spark-defaults.conf.
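Note : The same property can also be set programmatically instead of on the spark-submit command line. A minimal PySpark sketch (the application name and local master are assumptions for illustration):
from pyspark import SparkConf, SparkContext

# Equivalent to passing --conf "spark.executor.extraJavaOptions=..." on spark-submit
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("My app")
        .set("spark.executor.extraJavaOptions",
             "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"))
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.executor.extraJavaOptions"))
Note that executor JVM options must be set before the application starts (via SparkConf or spark-submit), since they cannot be changed once executors are launched.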
Question # 4
Problem Scenario 65 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
operation1
Write a correct code snippet for operation1 which will produce desired output, shown below.
Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
c.sortByKey(false).collect
sortByKey [Ordered] : This function sorts the input RDD's data and stores it in a new RDD. The output RDD is a shuffled RDD, because it stores data that has been output by a reducer and shuffled. The implementation of this function is actually very clever. First, it uses a range partitioner to partition the data in ranges within the shuffled RDD. Then it sorts these ranges individually with mapPartitions using standard sort mechanisms. Passing false sorts the keys in descending order, which matches the desired output.
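Note : For readers working through the exam in Python, a PySpark equivalent of the Scala snippet (a minimal sketch using an assumed local context):
from pyspark import SparkContext

sc = SparkContext("local[2]", "sortbykey-check")  # assumed local test context

a = sc.parallelize(["dog", "cat", "owl", "gnu", "ant"], 2)
b = sc.parallelize(range(1, a.count() + 1), 2)
c = a.zip(b)

# ascending=False sorts the string keys in descending order
print(c.sortByKey(False).collect())
# [('owl', 3), ('gnu', 4), ('dog', 1), ('cat', 2), ('ant', 5)]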
Question # 5
Problem Scenario 2 : There is a parent organization called "ABC Group Inc", which has two child companies named Tech Inc and MPTech. Both companies' employee information is given in two separate text files as below. Please do the following activities for employee details.
Tech Inc.txt
1,Alok,Hyderabad
2,Krish,Hongkong
3,Jyoti,Mumbai
4,Atul,Banglore
5,Ishan,Gurgaon
MPTech.txt
6,John,Newyork
7,alp2004,California
8,tellme,Mumbai
9,Gagan21,Pune
10,Mukesh,Chennai
1. Which command will you use to check all the available command line options on HDFS, and how will you get the help for an individual command?
2. Create a new empty directory named Employee using the command line, and also create an empty file named Techinc.txt in it.
3. Load both companies' employee data into the Employee directory (how to override an existing file in HDFS).
4. Merge both the employees' data into a single file called MergedEmployee.txt; merged files should have a new line character at the end of each file's content.
5. Upload the merged file to HDFS and change the file permission on the HDFS merged file, so that owner and group members can read and write, and other users can read the file.
6. Write a command to export an individual file as well as an entire directory from HDFS to the local file system.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Check all available commands.
hdfs dfs
Step 2 : Get help on an individual command.
hdfs dfs -help get
Step 3 : Create a directory in HDFS named Employee and create a dummy file in it called Techinc.txt.
hdfs dfs -mkdir Employee
Now create an empty file in the Employee directory using Hue.
Step 4 : Create a directory on the local file system and then create two files, with the given data in the problem.
Step 5 : Now we have an existing directory with content in it; using the HDFS command line, override this existing Employee directory while copying these files from the local file system to HDFS.
cd /home/cloudera/Desktop/
hdfs dfs -put -f Employee
Step 6 : Check all files in the directory copied successfully.
hdfs dfs -ls Employee
Step 7 : Now merge all the files in the Employee directory.
hdfs dfs -getmerge -nl Employee MergedEmployee.txt
Step 8 : Check the content of the file.
cat MergedEmployee.txt
Step 9 : Copy the merged file into the Employee directory from the local file system to HDFS.
hdfs dfs -put MergedEmployee.txt Employee/
Step 10 : Check whether the file copied or not.
hdfs dfs -ls Employee
Step 11 : Change the permission of the merged file on HDFS (owner and group read/write, others read).
hdfs dfs -chmod 664 Employee/MergedEmployee.txt
Step 12 : Get an individual file, or the whole directory, from HDFS to the local file system.
hdfs dfs -get Employee/MergedEmployee.txt
hdfs dfs -get Employee
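Note : As a rough sanity check that -getmerge produced what the question asks for (each file's content followed by a newline), a small Python sketch run against the local copies; the file names are assumptions matching the files created in Step 4, and exact -nl semantics may differ slightly if the source files already end in newlines:
# Verify MergedEmployee.txt equals both inputs concatenated, each ending in a newline
parts = []
for name in ["Tech Inc.txt", "MPTech.txt"]:  # assumed local file names from Step 4
    with open(name) as f:
        text = f.read()
    # hdfs dfs -getmerge -nl appends a newline after each file's content
    parts.append(text if text.endswith("\n") else text + "\n")

with open("MergedEmployee.txt") as f:
    merged = f.read()

print("merge OK" if merged == "".join(parts) else "merge differs")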