
I guarantee that this short tutorial will save you a TON of time compared with wading through the long documentation. Ready to jump on the Big Data train? Let's get to it!


Basic concepts

RDD: the basic abstraction in Spark. It distributes data objects across multiple machines and processes them in memory, which is faster than processing everything on a single machine the way R or Python does. However, it is not organized into named columns.

Dataframe: built on top of RDDs, it can have a schema with column names and data types. It's faster than an RDD because the schema gives Spark more information about the data to optimize with.

Lazy evaluation: Spark doesn't actually execute your transformations until you call an action.
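
A minimal sketch of how this plays out, assuming a dataframe df with an age column already exists (the names here are just placeholders):

# This line only records the transformation; nothing is computed yet
adults = df.filter(df.age > 20)
# The action (count) is what actually triggers the job
adults.count()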

Transformations: select, filter/where, sort/orderBy, join, groupBy, agg

Actions: show, take, count, collect, describe, cache, printSchema

Basic Syntax

Create an RDD

# 1. Read external data in Spark as Rdd
rdd = sc.textFile("path")
# 2. Create rdd from a list
rdd = sc.parallelize(["id","name","3","5"])
# 3. Dataframe to rdd
rdd = df.rdd
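
As a quick sanity check (using the sc from above), you can peek at an rdd with an action such as take:

rdd = sc.parallelize(["id", "name", "3", "5"])
rdd.take(2)   # returns ['id', 'name'] to the driver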

Create a dataframe

# 1. Read in Spark as Dataframe directly
# header and schema are optional
df = sqlContext.read.csv("path", header=True, schema=df_schema)
# 2.1 rdd to dataframe with column names
df = spark.createDataFrame(rdd, ["name", "age"])
# 2.2 rdd to dataframe with a schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
df_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame(rdd, df_schema)
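
A minimal end-to-end sketch, assuming a SparkSession named spark is available; the sample rows are made up for illustration:

# Hypothetical data: a list of (name, age) tuples
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 31)])
df = spark.createDataFrame(rdd, df_schema)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)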

Transformations:

# Basic transformations:
from pyspark.sql.functions import count, avg, split
# 1. select. Refer to a column as df.column_name or as "column_name"
df.select(df.name)
df.select("name")
# 2. filter/where (they are the same)
df.filter(df.age > 20)
df.filter("age > 20")
df.where("age > 20")
df.where(df.age > 20)
# 3. sort/orderBy
df.sort("age", ascending=False)
df.sort(df.age.desc())
# 4. groupBy and agg
df.groupBy("gender").agg(count("name"), avg("age"))
# 5. join
df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2), "left")
# 6. create a new column from an existing column
df.withColumn("first_name", split(df.name, "_").getItem(0))
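
Since every transformation returns a new dataframe, they chain naturally. A small sketch with made-up column names (gender, name, age), reusing the count and avg imports above:

summary = (df.where(df.age > 20)
             .groupBy("gender")
             .agg(count("name").alias("n"), avg("age").alias("avg_age")))
summary.show()   # show is the action that actually runs the query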

Write SQL in Spark

1. Run sqlContext = SQLContext(sc) before creating a dataframe
2. Create a dataframe called df
3. Run df.createOrReplaceTempView("table1") to create a temp table
4. Run table2 = sqlContext.sql("select id, name from table1")
If your query spans multiple lines, use """ like this:
table2 = sqlContext.sql("""
select id, name
from table1
where age > 20
""")

Actions:

1. show: display the first n rows of a dataframe (but not an rdd); pass truncate=False so cells are not truncated
df.show(5, truncate=False)
2. take: return a list of the first n rows of a dataframe or rdd
df.take(5)
3. collect: bring all the data of a dataframe or rdd back to the driver
df.collect()
4. count: count the number of rows
df.count()
5. printSchema: show column names, data types and whether they are nullable
df.printSchema()
6. cache: cache the data in memory if it's going to be reused a lot. Use unpersist() to uncache the data and free memory.
df.cache()
df.unpersist()
# Transformations and actions can be chained and are executed in sequence from left to right.
df1.filter("age>10").join(df2, df1.x == df2.y).sort("age").show()

Takeaways:

  1. Prefer functions that operate on dataframes over rdd functions whenever possible, as dataframe operations are faster.
  2. Avoid collect() unless you really need to pull the whole dataset to the driver; use show or take instead.
  3. Cache the data if you are going to use it a lot later.

Give me a few claps and follow me here if you find it helpful!

Post Author: Zhen Liu

I'm a data scientist interested in helping people make better decisions in business and life. Subscribe to my blog and stay tuned for new ideas and tutorials.
