Apache Spark - BigQuery connector for PySpark via Hadoop Input Format example


I have a large dataset stored in a BigQuery table, and I would like to load it into a PySpark RDD for ETL data processing.

I realized that BigQuery supports the Hadoop input/output format:

https://cloud.google.com/hadoop/writing-with-bigquery-connector

and PySpark should be able to use this interface to create an RDD using the method "newAPIHadoopRDD":

http://spark.apache.org/docs/latest/api/python/pyspark.html

Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?

Google has an example of how to use the BigQuery connector with Spark.

There does seem to be a problem using GsonBigQueryInputFormat, but I got a simple Shakespeare word counting example working:

    import json
    import pyspark

    sc = pyspark.SparkContext()

    # Look up the default GCS bucket from the cluster's Hadoop configuration
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")

    # Connector settings: billing project, staging bucket, and the input table
    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record is a (key, JSON string) pair: parse the JSON, then sum counts per word
    tableData = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    ).map(lambda k: json.loads(k[1])) \
     .map(lambda x: (x["word"], int(x["word_count"]))) \
     .reduceByKey(lambda x, y: x + y)

    print(tableData.take(10))
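To check the parsing and aggregation logic without a cluster, here is a minimal local sketch of what the map/reduceByKey chain computes. The helper names bigquery_input_conf and count_words are my own, not part of the connector, and the sample rows are made up. I am also assuming the count can arrive as a string, which the int() call in the example suggests but which I have not verified.

```python
import json
from collections import defaultdict


def bigquery_input_conf(project_id, bucket, input_project, dataset, table):
    """Hypothetical helper: build the settings dict using the keys from the example."""
    return {
        "mapred.bq.project.id": project_id,
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.input.project.id": input_project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }


def count_words(json_records):
    """Local stand-in for the two map steps plus reduceByKey: sum word_count per word."""
    totals = defaultdict(int)
    for raw in json_records:
        row = json.loads(raw)
        totals[row["word"]] += int(row["word_count"])
    return dict(totals)


# Made-up rows in the shape the example reads (assumption: counts may be strings)
sample = [
    '{"word": "king", "word_count": "3"}',
    '{"word": "queen", "word_count": "2"}',
    '{"word": "king", "word_count": "4"}',
]

conf = bigquery_input_conf("<project_id>", "<bucket>", "publicdata", "samples", "shakespeare")
print(count_words(sample))
```

The dict returned by bigquery_input_conf could be passed as conf=conf to newAPIHadoopRDD in place of the literal dict, which makes it easier to point the same job at a different table.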
