SQL, Pandas and Spark

Most of us are familiar with writing database queries with SQL. But there are also other ways you can query your data from the database or from a file directly. One way is through a Python package called Pandas or through Apache Spark. Both of them are very popular these days in the Data Science field. If you can fit your data in memory in a single computer, I'd suggest to use Pandas. In case the data is big and you need to process your data in a distributed system in memory, Apache Spark is the technology to use. People who are familiar with Hadoop and not so familiar with Spark may be more inclined to use the traditional MapReduce to process big data, and that is fine but Spark comes with some built-in packages that allow you to process your data in a SQL-like manner which ends up saving a lot of development time. Today I'm going to compare SQL queries with Pandas and Spark, so in case you end up using these technologies, hopefully this will make slightly easier to get your head around it. Note that I'll be showing you examples of Spark with the Python API, whose equivalence is available in JAVA and Scala APIs of Spark as well.

Employee Table/Dataframe

IdEmployee_NameSocial_Security_NumberDepartment_IdSalary
1Roger Martin546-98-1987265000
2Robert Waters437-781-4563170000
3Michael Peters908-809-0897175000

Organization Table/Dataframe

IdDepartment_Name
1Data Science
2Finance
3Human Resources

Column Selection

SQLPandasSpark
select Employee_Name, Department_Id
from Employee
Employee[['Employee_Name','Department_Id']]Employee.select('Employee_Name','Employee_Id')

Row Selection

SQLPandasSpark
select * from
Employee
where Department_Id='1'
mask = Employee['Department_Id'] == 1
Employee[mask]
Employee.where(col('Department_Id') == 1)

Group by

SQLPandasSpark
select Department_Id, avg('Salary')
from Employee
group by Department_Id
Employee[['Department_Id', 'Salary']].groupby(['Department_Id']).mean()Employee.groupBy('Department_Id').agg(mean('Salary'))

Joins

SQLPandasSpark
select t1.Employee_Name, t2.Department_Name
from Employee t1, Organization t2
where t1.Department_Id = t2.Id
import pandas as pd
pd.merge(Employee, Organization, how="inner", left_on='Department_Id', right_on='Id')[['Employee_Name','Department_Id']
]
joinexpr = Employee['Department_Id'] == Organization['Id']
Employee.join(joinexpr, "inner").select('Employee_Name', 'Department_Name')

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.