Question # 1
Problem Scenario 87 : You have been given the below three files (create product.csv in HDFS).

product.csv
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502

supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333

products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503

Now accomplish the query given in the solution: using Spark SQL, select each product, its price, and its supplier name where the product price is less than 0.6.
Answer: See the explanation for the step-by-step solution and configuration.

Explanation:
Solution :

Step 1 : Put the files into HDFS.
hdfs dfs -mkdir sparksql2
hdfs dfs -put product.csv sparksql2/
hdfs dfs -put supplier.csv sparksql2/
hdfs dfs -put products_suppliers.csv sparksql2/

Step 2 : Now in the spark shell.

// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._

// Load the data into new RDDs.
val products = sc.textFile("sparksql2/product.csv")
val supplier = sc.textFile("sparksql2/supplier.csv")
val prdsup = sc.textFile("sparksql2/products_suppliers.csv")

// Return the first element of each RDD.
products.first()
supplier.first()
prdsup.first()

// Define the schemas using case classes.
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float, supplierid: Integer)
case class Supplier(supplierid: Integer, name: String, phone: String)
case class PRDSUP(productid: Integer, supplierid: Integer)

// Create RDDs of case class objects.
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
val supRDD = supplier.map(_.split(",")).map(p => Supplier(p(0).toInt, p(1), p(2)))
val prdsupRDD = prdsup.map(_.split(",")).map(p => PRDSUP(p(0).toInt, p(1).toInt))

prdRDD.first()
prdRDD.count()
supRDD.first()
supRDD.count()
prdsupRDD.first()
prdsupRDD.count()

// Convert the RDDs of case class objects to DataFrames.
val prdDF = prdRDD.toDF()
val supDF = supRDD.toDF()
val prdsupDF = prdsupRDD.toDF()

// Register the DataFrames as temp tables.
prdDF.registerTempTable("products")
supDF.registerTempTable("suppliers")
prdsupDF.registerTempTable("products_suppliers")

// Select product, its price, and its supplier name where the product price is less than 0.6.
val results = sqlContext.sql("""SELECT products.name, price, suppliers.name AS sup_name FROM products JOIN suppliers ON products.supplierid = suppliers.supplierid WHERE price < 0.6""")
results.show()
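Note that if the CSV files were created in HDFS with the header lines shown in the problem, the headers would have to be filtered out before the .toInt conversions in Step 2; the solution above assumes header-free data files. For reference only, here is a minimal sketch of the same query written against the Spark 2.x SparkSession API (an alternative, not part of the original answer; it assumes a Spark 2.x spark-shell where spark is the SparkSession, and lets the reader handle headers declaratively):

// Read the CSV files with headers and inferred types (Spark 2.x API).
val prodDF2 = spark.read.option("header", "true").option("inferSchema", "true").csv("sparksql2/product.csv")
val supDF2 = spark.read.option("header", "true").option("inferSchema", "true").csv("sparksql2/supplier.csv")
// Register temp views and run the same join.
prodDF2.createOrReplaceTempView("products")
supDF2.createOrReplaceTempView("suppliers")
spark.sql("SELECT products.name, price, suppliers.name AS sup_name FROM products JOIN suppliers ON products.supplierid = suppliers.supplierid WHERE price < 0.6").show()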
Question # 2
Problem Scenario 70 : Write a Spark application using Python which reads a file "Content.txt" (on HDFS) with the following content, does a word count, and saves the results in a directory called "problem85" (on HDFS).

Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
Answer: See the explanation for the step-by-step solution and configuration.

Explanation:
Solution :

Step 1 : Create an application with the following code and store it in problem85.py.

# Import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf
# Create a configuration object and set the application name
conf = SparkConf().setAppName("CCA 175 Problem 85")
sc = SparkContext(conf=conf)
# Load data from HDFS
contentRDD = sc.textFile("Content.txt")
# Filter out empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)
# Split each line on spaces
words = nonempty_lines.flatMap(lambda x: x.split(' '))
# Do the word count
wordcounts = words.map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])).sortByKey(False)
for word in wordcounts.collect():
    print(word)
# Save the final data
wordcounts.saveAsTextFile("problem85")

Step 2 : Submit this application.
spark-submit --master yarn problem85.py
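Once the job has finished, the output directory can be checked directly on HDFS (a quick verification, not part of the original answer; it assumes "problem85" was created under the submitting user's HDFS home directory):

hdfs dfs -ls problem85
hdfs dfs -cat problem85/part-*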
Question # 3
Problem Scenario 90 : You have been given the below two files.

course.txt
id,course
1,Hadoop
2,Spark
3,HBase

fee.txt
id,fee
2,3900
3,4200
4,2900

Accomplish the following activities.
1. Select all the courses and their fees, whether a fee is listed or not.
2. Select all the available fees and their respective course. If the course does not exist, still list the fee.
3. Select all the courses and their fees, whether a fee is listed or not. However, ignore records having the fee as null.
Answer: See the explanation for the step-by-step solution and configuration.

Explanation:
Solution :

Step 1 : Put the files into HDFS.
hdfs dfs -mkdir sparksql4
hdfs dfs -put course.txt sparksql4/
hdfs dfs -put fee.txt sparksql4/

Step 2 : Now in the spark shell.

// Load the data into new RDDs.
val course = sc.textFile("sparksql4/course.txt")
val fee = sc.textFile("sparksql4/fee.txt")

// Return the first element of each RDD.
course.first()
fee.first()

// Define the schemas using case classes.
case class Course(id: Integer, name: String)
case class Fee(id: Integer, fee: Integer)

// Create RDDs of case class objects.
val courseRDD = course.map(_.split(",")).map(c => Course(c(0).toInt, c(1)))
val feeRDD = fee.map(_.split(",")).map(c => Fee(c(0).toInt, c(1).toInt))

courseRDD.first()
courseRDD.count()
feeRDD.first()
feeRDD.count()

// Convert the RDDs of case class objects to DataFrames.
val courseDF = courseRDD.toDF()
val feeDF = feeRDD.toDF()

// Register the DataFrames as temp tables.
courseDF.registerTempTable("course")
feeDF.registerTempTable("fee")

// Select data from the tables.
val results = sqlContext.sql("""SELECT * FROM course""")
results.show()

val results = sqlContext.sql("""SELECT * FROM fee""")
results.show()

// 1. All courses and their fees, whether a fee is listed or not.
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id""")
results.show()

// 2. All available fees and their respective course; the fee is listed even if the course does not exist.
val results = sqlContext.sql("""SELECT * FROM course RIGHT JOIN fee ON course.id = fee.id""")
results.show()

// 3. All courses and their fees, ignoring records where the fee is null.
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id WHERE fee.id IS NOT NULL""")
results.show()
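For comparison, the same three results can also be produced with the DataFrame API instead of SQL (a sketch, not part of the original answer; it assumes courseDF and feeDF from Step 2 are still in scope in the same spark shell session):

// 1. All courses with their fee, if any (left outer join).
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "left_outer").show()
// 2. All fees with their course, if any (right outer join).
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "right_outer").show()
// 3. Courses with a non-null fee only (an inner join gives the same result as the filtered left join).
courseDF.join(feeDF, courseDF("id") === feeDF("id"), "inner").show()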
Question # 4
Problem Scenario 25 : You have been given the below comma-separated employee information, which needs to be added to the /home/cloudera/flumetest/in.txt file (to be used with a tail source).

sex,name,city
1,alok,mumbai
1,jatin,chennai
1,yogesh,kolkata
2,ragini,delhi
2,jyotsana,pune
1,valmiki,banglore

Create a Flume conf file using the fastest non-durable channel, which writes the data into the Hive warehouse directory, in two separate tables called flumemaleemployee1 and flumefemaleemployee1 (create the Hive tables as well for the given data). Please use a tail source with the /home/cloudera/flumetest/in.txt file.
flumemaleemployee1 : will contain only the male employees' data.
flumefemaleemployee1 : will contain only the female employees' data.
Answer: See the explanation for the step-by-step solution and configuration.

Explanation:
Solution :

Step 1 : Create the Hive tables flumemaleemployee1 and flumefemaleemployee1.
CREATE TABLE flumemaleemployee1 ( sex_type int, name string, city string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE flumefemaleemployee1 ( sex_type int, name string, city string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Step 2 : Create the below directory.
mkdir /home/cloudera/flumetest/
cd /home/cloudera/flumetest/

Step 3 : Create the Flume configuration file with the below configuration for the source, sinks and channels, and save it as flume5.conf.

agent.sources = tailsrc
agent.channels = mem1 mem2
agent.sinks = std1 std2

agent.sources.tailsrc.type = exec
agent.sources.tailsrc.command = tail -F /home/cloudera/flumetest/in.txt
agent.sources.tailsrc.batchSize = 1
agent.sources.tailsrc.interceptors = i1
agent.sources.tailsrc.interceptors.i1.type = regex_extractor
agent.sources.tailsrc.interceptors.i1.regex = ^(\\d)
agent.sources.tailsrc.interceptors.i1.serializers = t1
agent.sources.tailsrc.interceptors.i1.serializers.t1.name = type
agent.sources.tailsrc.selector.type = multiplexing
agent.sources.tailsrc.selector.header = type
agent.sources.tailsrc.selector.mapping.1 = mem1
agent.sources.tailsrc.selector.mapping.2 = mem2

agent.sinks.std1.type = hdfs
agent.sinks.std1.channel = mem1
agent.sinks.std1.hdfs.batchSize = 1
agent.sinks.std1.hdfs.path = /user/hive/warehouse/flumemaleemployee1
agent.sinks.std1.hdfs.rollInterval = 0
agent.sinks.std1.hdfs.fileType = DataStream

agent.sinks.std2.type = hdfs
agent.sinks.std2.channel = mem2
agent.sinks.std2.hdfs.batchSize = 1
agent.sinks.std2.hdfs.path = /user/hive/warehouse/flumefemaleemployee1
agent.sinks.std2.hdfs.rollInterval = 0
agent.sinks.std2.hdfs.fileType = DataStream

agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 100
agent.channels.mem2.type = memory
agent.channels.mem2.capacity = 100

agent.sources.tailsrc.channels = mem1 mem2

Step 4 : Run the below command, which will use this configuration file and append the data into HDFS. Start the Flume agent:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume5.conf --name agent

Step 5 : Open another terminal and create the file /home/cloudera/flumetest/in.txt.

Step 6 : Enter the below data in the file and save it.
1,alok,mumbai
1,jatin,chennai
1,yogesh,kolkata
2,ragini,delhi
2,jyotsana,pune
1,valmiki,banglore
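After a few lines have been appended to in.txt and picked up by the agent, the routing can be sanity-checked from the Hive shell (a verification step, not part of the original answer; it assumes the two tables created in Step 1 point at the warehouse paths used by the sinks):

hive -e "select * from flumemaleemployee1;"
hive -e "select * from flumefemaleemployee1;"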
Question # 5
Problem Scenario 88 : You have been given the below three files (create product.csv in HDFS).

product.csv
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502

supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333

products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503

Now accomplish all the queries given in the solution.
1. It is possible that the same product can be supplied by multiple suppliers. Find each product and its price according to each supplier.
2. Find all the supplier names who are supplying 'Pencil 3B'.
3. Find all the products which are supplied by ABC Traders.
Answer: See the explanation for the step-by-step solution and configuration.

Explanation:
Solution : These queries assume the temp tables products, suppliers and products_suppliers registered in Problem Scenario 87 (Question # 1) are available in the same spark shell session.

Step 1 : It is possible that the same product can be supplied by multiple suppliers. Find each product and its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS `Product Name`, price, suppliers.name AS `Supplier Name` FROM products_suppliers JOIN products ON products_suppliers.productid = products.productid JOIN suppliers ON products_suppliers.supplierid = suppliers.supplierid""")
results.show()

Step 2 : Find all the supplier names who are supplying 'Pencil 3B'.
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name` FROM products_suppliers AS ps JOIN products AS p ON ps.productid = p.productid JOIN suppliers AS s ON ps.supplierid = s.supplierid WHERE p.name = 'Pencil 3B'""")
results.show()

Step 3 : Find all the products which are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name` FROM products AS p, products_suppliers AS ps, suppliers AS s WHERE p.productid = ps.productid AND ps.supplierid = s.supplierid AND s.name = 'ABC Traders'""")
results.show()
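As a cross-check, the 'ABC Traders' query from Step 3 can also be written with the DataFrame API (a sketch, not part of the original answer; it assumes prdDF, supDF and prdsupDF from Problem Scenario 87 are still defined):

// Join the linking table to products and suppliers, then keep only ABC Traders rows.
val abc = prdsupDF
  .join(prdDF, prdsupDF("productid") === prdDF("productid"))
  .join(supDF, prdsupDF("supplierid") === supDF("supplierid"))
  .filter(supDF("name") === "ABC Traders")
  .select(prdDF("name").alias("Product Name"), supDF("name").alias("Supplier Name"))
abc.show()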