apache spark - BigQuery connector for PySpark via Hadoop Input Format example
I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input/Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and PySpark should be able to use this interface in order to create an RDD using the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?
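For reference, the PySpark method mentioned above takes a Hadoop InputFormat class, a key class, a value class and a configuration dict; a minimal sketch of its shape (the class names and config key below are placeholders, not a specific connector):

    # Generic shape of the call per the linked PySpark API docs;
    # the format/key/value class names here are placeholders only.
    rdd = sc.newAPIHadoopRDD(
        "some.package.SomeInputFormat",       # Hadoop InputFormat class (Java)
        "org.apache.hadoop.io.LongWritable",  # key class
        "org.apache.hadoop.io.Text",          # value class
        conf={"some.config.key": "value"},    # job configuration passed to the InputFormat
    )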
Google has an example on how to use the BigQuery connector with Spark.
There seems to be a problem when using GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working with JsonTextBigQueryInputFormat:
    import json
    import pyspark

    sc = pyspark.SparkContext()

    # The GCS connector's system bucket is used for the temporary BigQuery export
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")

    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record comes back as (key, JSON string); parse it and count the words
    tableData = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    ).map(lambda k: json.loads(k[1])) \
     .map(lambda x: (x["word"], int(x["word_count"]))) \
     .reduceByKey(lambda x, y: x + y)

    print(tableData.take(10))
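As a possible follow-up (not part of the example above, and assuming the SQL module is available), the resulting (word, count) RDD can be turned into a DataFrame and written back out; the output path below is a placeholder:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Give the (word, count) pairs named columns
    wordCounts = sqlContext.createDataFrame(tableData, ["word", "word_count"])
    wordCounts.orderBy(wordCounts.word_count.desc()).show(10)

    # Any Hadoop-compatible filesystem works as a sink, e.g. a GCS path
    wordCounts.write.json("gs://<bucket>/wordcounts")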