
Welcome to the spark wiki!

Git LFS on Ubuntu

If you get this error:

tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

the file you checked out is a Git LFS pointer rather than the real archive. You will need Git LFS support to download the large TAR and GZ files:

# enable the git-core PPA to get a recent Git
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update
# install git-lfs from the packagecloud repository
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
# track the archive formats used in this repository
git lfs track "*.tar"
git lfs track "*.gz"
# re-checkout so the LFS pointer files are replaced by the real archives
cd install/
git checkout .

MongoDB integration

Start the Spark container:

docker run -i -t -h my-spark -p 8095:8080 --rm parana/spark bash

or, if you need to see the test folder inside the container, use:

docker run -i -t -h my-spark -p 8095:8080 --rm -v $PWD/test:/mongo parana/spark bash

Inside the container, run:

cd /mongo/myspark 
spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0,br.com.joao-parana:myspark:1.0-SNAPSHOT 
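
Once the shell is running, you can load a collection through the connector. The snippet below is a minimal sketch: the URI, database and collection names are placeholders, and it assumes a mongod instance is reachable from the container.

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Placeholder connection settings -- adjust to your MongoDB deployment
val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://localhost:27017",
  "database"   -> "test",
  "collection" -> "customers"
))

// Load the collection as a DataFrame and inspect it
val customers = MongoSpark.load(spark, readConfig)
customers.printSchema
customers.show(5)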

On your host computer you can open the following URL in a web browser:

http://localhost:8095

For an introduction to the MongoDB connector for Spark, see: https://www.mongodb.com/presentations/webinar-introducing-the-spark-connector-for-mongodb

Reading a CSV file without a header line

val df = spark.read.
  option("header", "false").
  option("inferSchema", "true").
  csv("data/customer.csv")

df.printSchema
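
Without a header line, Spark names the columns _c0, _c1, and so on. If you prefer meaningful names you can rename them with toDF; the names below are placeholders and must match the actual number of columns in data/customer.csv.

// Placeholder column names -- adjust to the real layout of customer.csv
val customers = df.toDF("id", "name", "city")
customers.printSchema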

Reading a CSV file with a header line

val df = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("data/region.csv")

df.printSchema
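
With the header and inferred schema in place, the DataFrame can also be queried with Spark SQL by registering it as a temporary view (the view name region below is arbitrary):

// Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("region")
spark.sql("SELECT * FROM region").show(5)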