Comparing SQL, Pandas and Spark
By Diwanshu Shekhar
Most of us are familiar with writing database queries in SQL, but there are other ways to query your data, either from a database or directly from a file. Two of them, the Python package Pandas and Apache Spark, are very popular in the data science field these days. If your data fits in memory on a single computer, I'd suggest using Pandas. If the data is big and you need to process it in memory across a distributed system, Apache Spark is the technology to use.
People who are familiar with Hadoop but not so familiar with Spark may be inclined to use traditional MapReduce to process big data, and that is fine, but Spark comes with built-in packages that let you process your data in a SQL-like manner, which ends up saving a lot of development time.
Today I'm going to compare SQL queries with their Pandas and Spark equivalents, so that if you end up using these technologies, this will hopefully make it slightly easier to get your head around them. Note that I'll be showing Spark examples with the Python API; equivalents are available in Spark's Java and Scala APIs as well.
Employee Table/Dataframe
Id | Employee_Name | Social_Security_Number | Department_Id | Salary |
---|---|---|---|---|
1 | Roger Martin | 546-98-1987 | 2 | 65000 |
2 | Robert Waters | 437-781-4563 | 1 | 70000 |
3 | Michael Peters | 908-809-0897 | 1 | 75000 |
Organization Table/Dataframe
Id | Department_Name |
---|---|
1 | Data Science |
2 | Finance |
3 | Human Resources |
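Before looking at the queries, here is a minimal sketch of how the two sample tables above could be built as Pandas DataFrames. The variable names Employee and Organization are the ones used in the examples that follow; the same data could be loaded into Spark with spark.createDataFrame.

```python
import pandas as pd

# Employee table: one row per employee, keyed by Id
Employee = pd.DataFrame({
    'Id': [1, 2, 3],
    'Employee_Name': ['Roger Martin', 'Robert Waters', 'Michael Peters'],
    'Social_Security_Number': ['546-98-1987', '437-781-4563', '908-809-0897'],
    'Department_Id': [2, 1, 1],
    'Salary': [65000, 70000, 75000],
})

# Organization table: maps a department Id to its name
Organization = pd.DataFrame({
    'Id': [1, 2, 3],
    'Department_Name': ['Data Science', 'Finance', 'Human Resources'],
})
```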
Column Selection
SQL
select Employee_Name, Department_Id
from Employee
Pandas
Employee[['Employee_Name','Department_Id']]
SPARK
Employee.select('Employee_Name', 'Department_Id')
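To see the Pandas version in action, here is a self-contained sketch (the sample DataFrame is rebuilt inline so the snippet runs on its own):

```python
import pandas as pd

Employee = pd.DataFrame({
    'Employee_Name': ['Roger Martin', 'Robert Waters', 'Michael Peters'],
    'Department_Id': [2, 1, 1],
    'Salary': [65000, 70000, 75000],
})

# Double brackets pass a list of column names and return a DataFrame
# restricted to just those columns
cols = Employee[['Employee_Name', 'Department_Id']]
print(list(cols.columns))  # → ['Employee_Name', 'Department_Id']
```

Note that single brackets with one name, e.g. Employee['Employee_Name'], return a Series rather than a DataFrame.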
Row Selection
SQL
select * from
Employee
where Department_Id = 1
Pandas
mask = Employee['Department_Id'] == 1
Employee[mask]
SPARK
from pyspark.sql.functions import col
Employee.where(col('Department_Id') == 1)
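Running the Pandas version against the sample data makes the boolean-mask idea concrete (a self-contained sketch with the DataFrame rebuilt inline):

```python
import pandas as pd

Employee = pd.DataFrame({
    'Employee_Name': ['Roger Martin', 'Robert Waters', 'Michael Peters'],
    'Department_Id': [2, 1, 1],
    'Salary': [65000, 70000, 75000],
})

# The comparison yields a boolean Series: True where Department_Id is 1.
# Indexing with that mask keeps only the matching rows.
mask = Employee['Department_Id'] == 1
dept1 = Employee[mask]
print(dept1['Employee_Name'].tolist())  # → ['Robert Waters', 'Michael Peters']
```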
Group By
SQL
select Department_Id, avg(Salary)
from Employee
group by Department_Id
Pandas
Employee[['Department_Id', 'Salary']].groupby(['Department_Id']).mean()
SPARK
from pyspark.sql.functions import mean
Employee.groupBy('Department_Id').agg(mean('Salary'))
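With the sample salaries, the Pandas group-by gives department 1 an average of (70000 + 75000) / 2 = 72500 and department 2 an average of 65000 (a self-contained sketch, rebuilding the data inline):

```python
import pandas as pd

Employee = pd.DataFrame({
    'Department_Id': [2, 1, 1],
    'Salary': [65000, 70000, 75000],
})

# groupby collects rows by Department_Id; mean() averages each group's Salary.
# The result is indexed by Department_Id.
avg_salary = Employee[['Department_Id', 'Salary']].groupby(['Department_Id']).mean()
print(avg_salary.loc[1, 'Salary'])  # → 72500.0
print(avg_salary.loc[2, 'Salary'])  # → 65000.0
```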
Join
SQL
select t1.Employee_Name, t2.Department_Name
from Employee t1
join Organization t2 on t1.Department_Id = t2.Id
Pandas
import pandas as pd
pd.merge(Employee, Organization, how="inner",
         left_on='Department_Id', right_on='Id')[['Employee_Name', 'Department_Name']]
SPARK
joinexpr = Employee['Department_Id'] == Organization['Id']
Employee.join(Organization, joinexpr, "inner").select('Employee_Name', 'Department_Name')
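Here is the Pandas join run end to end against the sample data (a self-contained sketch with both DataFrames rebuilt inline):

```python
import pandas as pd

Employee = pd.DataFrame({
    'Employee_Name': ['Roger Martin', 'Robert Waters', 'Michael Peters'],
    'Department_Id': [2, 1, 1],
})
Organization = pd.DataFrame({
    'Id': [1, 2, 3],
    'Department_Name': ['Data Science', 'Finance', 'Human Resources'],
})

# Inner join: match Employee.Department_Id against Organization.Id,
# then keep only the name columns, mirroring the SQL select list.
joined = pd.merge(Employee, Organization, how="inner",
                  left_on='Department_Id', right_on='Id')[['Employee_Name', 'Department_Name']]
print(joined)
```

Every employee's Department_Id has a match in Organization, so the inner join keeps all three rows; Roger Martin lands in Finance and the other two in Data Science.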