apache spark - BigQuery connector for PySpark via Hadoop Input Format example
I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input/Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and PySpark should be able to use this interface in order to create an RDD using the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?
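For reference, the PySpark method mentioned above takes a Hadoop InputFormat class, a key class, a value class and a configuration dict; a minimal sketch of its shape (the class names and config key below are placeholders, not a specific connector):

    # Generic shape of the call per the linked PySpark API docs;
    # the format/key/value class names here are placeholders only.
    rdd = sc.newAPIHadoopRDD(
        "some.package.SomeInputFormat",       # Hadoop InputFormat class (Java)
        "org.apache.hadoop.io.LongWritable",  # key class
        "org.apache.hadoop.io.Text",          # value class
        conf={"some.config.key": "value"},    # job configuration passed to the InputFormat
    )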
Google has an example on how to use the BigQuery connector with Spark.
There seems to be a problem when using GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working with JsonTextBigQueryInputFormat:
    import json
    import pyspark

    sc = pyspark.SparkContext()

    # The GCS connector's system bucket is used for the temporary BigQuery export
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")

    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    # Each record comes back as (key, JSON string); parse it and count the words
    tableData = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    ).map(lambda k: json.loads(k[1])) \
     .map(lambda x: (x["word"], int(x["word_count"]))) \
     .reduceByKey(lambda x, y: x + y)

    print(tableData.take(10))
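As a possible follow-up (not part of the example above, and assuming the SQL module is available), the resulting (word, count) RDD can be turned into a DataFrame and written back out; the output path below is a placeholder:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Give the (word, count) pairs named columns
    wordCounts = sqlContext.createDataFrame(tableData, ["word", "word_count"])
    wordCounts.orderBy(wordCounts.word_count.desc()).show(10)

    # Any Hadoop-compatible filesystem works as a sink, e.g. a GCS path
    wordCounts.write.json("gs://<bucket>/wordcounts")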