apache spark - BigQuery connector for PySpark via Hadoop Input Format example


I have a large dataset stored in a BigQuery table, and I would like to load it into a PySpark RDD for ETL data processing.

I realized that BigQuery supports the Hadoop input/output format:

https://cloud.google.com/hadoop/writing-with-bigquery-connector

and PySpark should be able to use this interface to create an RDD with the method "newAPIHadoopRDD":

http://spark.apache.org/docs/latest/api/python/pyspark.html

Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?

Google has an example on how to use the BigQuery connector with Spark.

There does seem to be a problem using GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working:

    import json
    import pyspark

    sc = pyspark.SparkContext()

    # Look up the GCS bucket configured for the cluster (the value is not used below).
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")

    # Connector settings: your own project and staging bucket, plus the public input table.
    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record is a (key, JSON string) pair: parse the JSON, then sum counts per word.
    tableData = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    ).map(lambda k: json.loads(k[1])) \
     .map(lambda x: (x["word"], int(x["word_count"]))) \
     .reduceByKey(lambda x, y: x + y)

    print(tableData.take(10))
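The per-record logic in the example above can be checked locally before running anything on a cluster. The following is a minimal sketch, not part of the original answer: the helper names `bigquery_input_conf`, `to_word_count` and `sum_by_key` are hypothetical, the sample records are made up, and it assumes each record arrives as a (key, JSON string) pair with `word` and `word_count` fields, as the snippet above implies.

```python
import json
from collections import defaultdict


def bigquery_input_conf(project_id, gcs_bucket, input_project, dataset, table):
    """Hypothetical helper: build the conf dict with the keys used above."""
    return {
        "mapred.bq.project.id": project_id,
        "mapred.bq.gcs.bucket": gcs_bucket,
        "mapred.bq.input.project.id": input_project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }


def to_word_count(record):
    """Turn one (key, json_string) pair into a (word, count) tuple."""
    row = json.loads(record[1])
    return row["word"], int(row["word_count"])


def sum_by_key(pairs):
    """Local stand-in for reduceByKey(lambda x, y: x + y)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up sample records in the shape the example above assumes.
sample_records = [
    (0, '{"word": "brave", "word_count": "3"}'),
    (1, '{"word": "world", "word_count": "1"}'),
    (2, '{"word": "brave", "word_count": "2"}'),
]

conf = bigquery_input_conf("<project_id>", "<bucket>", "publicdata", "samples", "shakespeare")
word_totals = sum_by_key(to_word_count(r) for r in sample_records)
print(word_totals)
```

On a cluster, `to_word_count` could be passed to a single `.map(...)` call in place of the two lambdas, which keeps the parsing step testable on its own.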
