SQL, Pandas and Spark

Most of us are familiar with writing database queries in SQL. But there are other ways to query your data, whether it lives in a database or in a file. Two of them are Pandas, a Python package, and Apache Spark. Both are very popular these days in the data science field. If your data fits in memory on a single computer, I'd suggest using Pandas. If the data is big and you need to process it in memory on a distributed system, Apache Spark is the technology to use. People who are familiar with Hadoop and not so familiar with Spark may be more inclined to use traditional MapReduce to process big data, and that is fine, but Spark comes with built-in packages that let you process your data in a SQL-like manner, which ends up saving a lot of development time. Today I'm going to compare SQL queries with their Pandas and Spark equivalents, so if you end up using these technologies, hopefully this will make it slightly easier to get your head around them. Note that I'll be showing Spark examples with the Python API; equivalents are available in the Java and Scala APIs of Spark as well.

Employee Table/Dataframe

Id | Employee_Name | Social_Security_Number | Department_Id | Salary
1 | Roger Martin | 546-98-1987 | 2 | 65000
2 | Robert Waters | 437-781-4563 | 1 | 70000
3 | Michael Peters | 908-809-0897 | 1 | 75000

Organization Table/Dataframe

Id | Department_Name
1 | Data Science
2 | Finance
3 | Human Resources
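For readers who want to follow along, here is a minimal sketch that builds the two tables above as Pandas DataFrames. The variable names Employee and Organization simply mirror the table names used in this post. With a running SparkSession (assumed here to be called spark), spark.createDataFrame(Employee) should give the Spark equivalent.

```python
import pandas as pd

# Employee data copied from the Employee table above.
Employee = pd.DataFrame({
    "Id": [1, 2, 3],
    "Employee_Name": ["Roger Martin", "Robert Waters", "Michael Peters"],
    "Social_Security_Number": ["546-98-1987", "437-781-4563", "908-809-0897"],
    "Department_Id": [2, 1, 1],
    "Salary": [65000, 70000, 75000],
})

# Organization data copied from the Organization table above.
Organization = pd.DataFrame({
    "Id": [1, 2, 3],
    "Department_Name": ["Data Science", "Finance", "Human Resources"],
})

print(Employee.shape)      # (3, 5)
print(Organization.shape)  # (3, 2)
```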

Column Selection

SQL:
    select Employee_Name, Department_Id
    from Employee

Pandas:
    Employee[['Employee_Name', 'Department_Id']]

Spark:
    Employee.select('Employee_Name', 'Department_Id')

Row Selection

SQL:
    select *
    from Employee
    where Department_Id = 1

Pandas:
    mask = Employee['Department_Id'] == 1
    Employee[mask]

Spark:
    from pyspark.sql.functions import col
    Employee.where(col('Department_Id') == 1)
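The Pandas column and row selections can be tried end to end with the sketch below. It rebuilds a trimmed copy of the Employee table (only the columns needed here) and also shows DataFrame.query as an alternative to the boolean mask; that alternative is my addition, not something the comparison above relies on.

```python
import pandas as pd

# Trimmed copy of the Employee table: only the columns used below.
Employee = pd.DataFrame({
    "Employee_Name": ["Roger Martin", "Robert Waters", "Michael Peters"],
    "Department_Id": [2, 1, 1],
    "Salary": [65000, 70000, 75000],
})

# Column selection: select Employee_Name, Department_Id from Employee
selected = Employee[["Employee_Name", "Department_Id"]]

# Row selection: select * from Employee where Department_Id = 1
mask = Employee["Department_Id"] == 1
data_science = Employee[mask]

# Same filter written as a query string.
same_rows = Employee.query("Department_Id == 1")

print(data_science["Employee_Name"].tolist())  # ['Robert Waters', 'Michael Peters']
```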

Group by

SQL:
    select Department_Id, avg(Salary)
    from Employee
    group by Department_Id

Pandas:
    Employee[['Department_Id', 'Salary']].groupby(['Department_Id']).mean()

Spark:
    from pyspark.sql.functions import mean
    Employee.groupBy('Department_Id').agg(mean('Salary'))
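As a quick check of the Pandas group-by, the sketch below computes the average salary per department on the sample data. It selects the Salary column after grouping, which returns a Series rather than a one-column DataFrame; that is a stylistic choice of mine and should produce the same averages.

```python
import pandas as pd

# Only the two columns the aggregation needs, taken from the Employee table.
Employee = pd.DataFrame({
    "Department_Id": [2, 1, 1],
    "Salary": [65000, 70000, 75000],
})

# select Department_Id, avg(Salary) from Employee group by Department_Id
avg_salary = Employee.groupby("Department_Id")["Salary"].mean()

print(avg_salary.loc[1])  # 72500.0
print(avg_salary.loc[2])  # 65000.0
```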

Joins

SQL:
    select t1.Employee_Name, t2.Department_Name
    from Employee t1, Organization t2
    where t1.Department_Id = t2.Id

Pandas:
    import pandas as pd
    pd.merge(Employee, Organization, how="inner", left_on='Department_Id', right_on='Id')[['Employee_Name', 'Department_Name']]

Spark:
    joinexpr = Employee['Department_Id'] == Organization['Id']
    Employee.join(Organization, joinexpr, "inner").select('Employee_Name', 'Department_Name')
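To convince yourself that the Pandas merge and the SQL join return the same rows, here is a small sketch that runs both on the sample data. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real database; that choice is an assumption made purely for illustration.

```python
import sqlite3

import pandas as pd

Employee = pd.DataFrame({
    "Employee_Name": ["Roger Martin", "Robert Waters", "Michael Peters"],
    "Department_Id": [2, 1, 1],
})
Organization = pd.DataFrame({
    "Id": [1, 2, 3],
    "Department_Name": ["Data Science", "Finance", "Human Resources"],
})

# Pandas inner join on Employee.Department_Id = Organization.Id
joined = pd.merge(
    Employee, Organization, how="inner",
    left_on="Department_Id", right_on="Id",
)[["Employee_Name", "Department_Name"]]

# The same join in SQL, against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
Employee.to_sql("Employee", conn, index=False)
Organization.to_sql("Organization", conn, index=False)
sql_rows = conn.execute(
    "select t1.Employee_Name, t2.Department_Name "
    "from Employee t1, Organization t2 "
    "where t1.Department_Id = t2.Id"
).fetchall()
conn.close()

# Compare as sorted lists of tuples, since row order is not guaranteed.
pandas_rows = sorted(joined.itertuples(index=False, name=None))
print(pandas_rows == sorted(sql_rows))  # True
```

Human Resources has no employees in the sample data, so an inner join drops it in both versions.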

An Overview of EU's GDPR Policy

These days individuals share personal data more than ever before. Technology has also advanced to allow companies to make use of personal data on an unprecedented scale. With so much personal data out there, you might wonder who holds personal data about you and how they will look after it. Here are some common concerns people have about their personal data these days:

Since I booked a holiday online, I’m getting loads of emails from companies trying to sell me car hire, airport parking, and hotels. I don’t remember signing up for that.

I’ve just downloaded an app on my mobile phone and had to give it access to my contacts. Do they really need it?

What happens to my information on Facebook when I don’t want an account anymore? Can I have it all erased?

I’ve just been speaking to my bank’s customer service department. The customer service representative was very helpful but she said she is based in India. Is my personal data protected there?

Protecting personal data matters because of the growth in online services, companies' ethical and legal responsibilities, and the need for better legislation. For example, you might be refused health insurance based on a Google search you did about a medical condition. You might be shown a credit card with a lower credit limit, not because of your credit history, but because of your race, sex, ZIP code, or the types of websites you visit.

Coming into effect in 2018, the EU General Data Protection Regulation (GDPR) is arguably the most important change in data privacy regulation in 20 years.

Before GDPR:
The EU Data Protection Directive, introduced in 1995, established the basic elements of data protection for EU citizens. Unfortunately, the 1995 law was not equipped to handle the explosion in data collection and storage seen in today’s world. Although it was extended to cover data on the internet, it was not designed for the job. The world has changed since 1995, and new regulation was needed to address these issues.

After GDPR:
The new General Data Protection Regulation (GDPR) is designed to protect personal data today and in the future. The objective is to provide a robust set of rules to protect the fundamental rights and freedoms of EU residents, and to enable the free movement of personal data across EU member states.

The GDPR applies to all organizations that are established in the EU and process personal data in the context of that establishment. It also applies to organizations established outside of the EU if they process data on EU residents when offering them goods and services.

Fundamental privacy principles
The GDPR is based on a number of fundamental privacy principles, explained briefly below:

Fair, Lawful and transparent processing:
A company must have a lawful basis for processing personal data. Fair means that people must be made aware of how their personal data is being processed and why. Lawful means that processing must rest on a legitimate basis set forth in the GDPR or in an applicable law of an EU member state. Transparent means that a company's policy information is easily accessible, easy to understand, and written in clear and plain language. For example, a company must have a legitimate reason for asking people to give their personal data, and it needs to be transparent about what personal data it is collecting or using, and for what purposes.

Purpose Limitation:
A company can only collect personal data for specific and legitimate purposes that it explains to the individual at the time of collection, and it cannot use the data for any new purposes that are incompatible with those purposes. For example, if an online shop asks for your phone number to send delivery texts, it cannot use the number for the new and unrelated purpose of calling you about new offers without your consent.

Data Minimization:
All personal data that a company collects must be limited to what is necessary in connection with its purpose for collecting it. It can't collect personal data just because it would be nice to know or may be useful in the future.

Accuracy:
A company has an obligation to ensure that personal data is accurate and, where necessary, kept up to date.

Storage Limitation:
The data must not be kept longer than necessary. If the original reason for processing the personal data no longer exists, and there is no lawful basis for maintaining it, the data must be destroyed.

Security:
Appropriate security measures to protect personal data must be implemented. Personal data must be protected against unauthorized or unlawful processing and against accidental loss, destruction, or damage.

Accountability:
A company is responsible for compliance with these principles and must be able to demonstrate the compliance.

The EU has strict rules regarding the transfer of its citizens’ personal data outside of the EU. Whenever a company transfers the personal data of EU citizens across borders, it needs to be aware of the cross-border data transfer rules. For example, data about EU residents may only be transferred to the United States when certain safeguards are in place.

Personal Data

The GDPR defines personal data as information relating to an identified or identifiable natural person. This means that if data can be used to directly or indirectly identify an individual, it’s considered personal data. For instance, IP addresses, browser preferences, initials, and email addresses are considered personal data.

Sensitive Personal Data

Personal data includes a category of sensitive information that requires a higher level of protection. Sensitive personal data includes the following categories: data that reveals racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership; data concerning health, sex life, or sexual orientation; and genetic or biometric data.

GDPR Rules In Collecting Personal Information:
Under the GDPR, consent to the collection of personal information must be:

  • Freely Given: People must not be under undue pressure to consent.
  • Informed: People must be provided sufficient information to know and understand what they are consenting to and why.
  • Unambiguous: Consent must be given by a statement or a clear affirmative action, like ticking a box to opt in. Silence, pre-ticked boxes, inactivity, or failure to opt out does not constitute valid consent.
  • Specific: The consent given must be for a specific purpose. Statements such as “I consent to the company doing whatever they want to with my data” would not be specific enough to constitute proper consent for selling the data to third parties or adding a person to a mailing list.

Additionally, a company must provide people with a privacy notice whenever it collects data or plans to use data for a secondary purpose, that is, a purpose for which the data owners did not originally give their consent.

Take a look at this scenario: while signing up for an account on a website, you see a checkbox that says, "Tick this box if you DO NOT want to receive marketing material from us." This is not acceptable under the GDPR because consent requires a clear affirmative action. Here is another one: let's say you decide to purchase a book online. You are offered a free app for your phone that customers can only receive if they consent to the collection of personal data that is not required to read the book. This is not acceptable under the GDPR, as a company can only collect data that is necessary, and in this case the data is not needed to read the book.