SQL, Pandas and Spark

Most of us are familiar with writing database queries in SQL, but there are other ways to query data, whether it lives in a database or in a file. Two popular options in the data science field are Pandas, a Python package, and Apache Spark. If your data fits in memory on a single machine, I'd suggest using Pandas. If the data is too big for that and you need to process it in memory across a distributed system, Apache Spark is the technology to use. People who know Hadoop but not Spark may be more inclined to process big data with traditional MapReduce, and that is fine, but Spark comes with built-in packages that let you process data in a SQL-like manner, which ends up saving a lot of development time.

Today I'm going to compare SQL queries with their Pandas and Spark equivalents, so if you end up using these technologies, this will hopefully make it slightly easier to get your head around them. Note that I'll be showing the Spark examples with the Python API; equivalents are available in Spark's Java and Scala APIs as well.

Employee Table/Dataframe

Id | Employee_Name | Social_Security_Number | Department_Id | Salary
1 | Roger Martin | 546-98-1987 | 2 | 65000
2 | Robert Waters | 437-781-4563 | 1 | 70000
3 | Michael Peters | 908-809-0897 | 1 | 75000

Organization Table/Dataframe

Id | Department_Name
1 | Data Science
2 | Finance
3 | Human Resources
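
To follow along in Pandas, the two tables can be built as DataFrames. This is a minimal sketch that assumes pandas is installed; the variable names Employee and Organization simply mirror the table names above, and the Social_Security_Number values are kept as strings rather than numbers.

```python
import pandas as pd

# Employee table; SSNs are stored as strings because they contain dashes.
Employee = pd.DataFrame({
    "Id": [1, 2, 3],
    "Employee_Name": ["Roger Martin", "Robert Waters", "Michael Peters"],
    "Social_Security_Number": ["546-98-1987", "437-781-4563", "908-809-0897"],
    "Department_Id": [2, 1, 1],
    "Salary": [65000, 70000, 75000],
})

# Organization table; Id is the key that Department_Id refers to.
Organization = pd.DataFrame({
    "Id": [1, 2, 3],
    "Department_Name": ["Data Science", "Finance", "Human Resources"],
})

print(Employee)
print(Organization)
```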

Column Selection

SQL:
select Employee_Name, Department_Id
from Employee

Pandas:
Employee[['Employee_Name', 'Department_Id']]

Spark:
Employee.select('Employee_Name', 'Department_Id')

Row Selection

SQL:
select *
from Employee
where Department_Id = 1

Pandas:
mask = Employee['Department_Id'] == 1
Employee[mask]

Spark:
from pyspark.sql.functions import col
Employee.where(col('Department_Id') == 1)

Group by

SQL:
select Department_Id, avg(Salary)
from Employee
group by Department_Id

Pandas:
Employee[['Department_Id', 'Salary']].groupby(['Department_Id']).mean()

Spark:
from pyspark.sql.functions import mean
Employee.groupBy('Department_Id').agg(mean('Salary'))
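
As a quick sanity check, the SQL form of the group-by can be run against an in-memory SQLite database using Python's built-in sqlite3 module. SQLite is only a stand-in here (an assumption, since no particular database is named), and the table is trimmed to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Employee (Id integer, Employee_Name text, "
    "Department_Id integer, Salary integer)"
)
conn.executemany(
    "insert into Employee values (?, ?, ?, ?)",
    [
        (1, "Roger Martin", 2, 65000),
        (2, "Robert Waters", 1, 70000),
        (3, "Michael Peters", 1, 75000),
    ],
)

# Salary is unquoted so that it refers to the column.
# ORDER BY is added only to make the output order deterministic.
averages = conn.execute(
    "select Department_Id, avg(Salary) from Employee "
    "group by Department_Id order by Department_Id"
).fetchall()
conn.close()

print(averages)  # [(1, 72500.0), (2, 65000.0)]
```

Department 1 averages the two salaries 70000 and 75000, while department 2 has a single employee, so its average is just that salary.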

Joins

SQL:
select t1.Employee_Name, t2.Department_Name
from Employee t1, Organization t2
where t1.Department_Id = t2.Id

Pandas:
import pandas as pd
pd.merge(Employee, Organization, how="inner", left_on='Department_Id', right_on='Id')[['Employee_Name', 'Department_Name']]

Spark:
joinexpr = Employee['Department_Id'] == Organization['Id']
Employee.join(Organization, joinexpr, "inner").select('Employee_Name', 'Department_Name')

Getting Started With ProFTPD Server

ProFTPD is a popular File Transfer Protocol (FTP) server for Unix-like operating systems. This tutorial walks you through the steps needed to get a ProFTPD server up and running on Red Hat Enterprise Linux.

Installation
sudo yum install proftpd

Starting, restarting and stopping the server
sudo service proftpd start
sudo service proftpd restart
sudo service proftpd stop

Configuration
The first step in configuring the proftpd daemon is knowing where its configuration file, usually named proftpd.conf, is located. The default location is /etc/proftpd.conf or /usr/local/etc/proftpd.conf, depending on your installation.
You can edit the file as needed. The following configuration sets up an FTP server for anonymous access, meaning anybody can access the server without entering a username and password. Because the block is wrapped in <IfDefine ANONYMOUS_FTP>, it only takes effect when the daemon is started with that symbol defined (for example, with the -DANONYMOUS_FTP command-line option):

<IfDefine ANONYMOUS_FTP>
  <Anonymous ~ftp>
  User                        ftp
  Group                       ftp
  AccessGrantMsg              "Anonymous login ok, restrictions apply."

  # We want clients to be able to login with "anonymous" as well as "ftp"
  UserAlias                   anonymous ftp

  # Limit the maximum number of anonymous logins
  MaxClients                  10 "Sorry, max %m users -- try again later"


  # We want 'welcome.msg' displayed at login, '.message' displayed in each newly
  # chdired directory and tell users to read README* files.
  DisplayLogin                /welcome.msg
  DisplayChdir                .message
  DisplayReadme               README*

  # Cosmetic option to make all files appear to be owned by user "ftp"
  DirFakeUser                 on ftp
  DirFakeGroup                on ftp

  # An upload directory that allows reading and storing files but not deletion
  <Directory /uploads>
     AllowOverwrite            no

     <Limit DELE RMD>
        AllowUser ftpadm
        DenyAll
     </Limit>

  </Directory>

  # Don't write anonymous accesses to the system wtmp file (good idea!)
  WtmpLog                     off

  # Logging for the anonymous transfers
  ExtendedLog                 /var/log/proftpd/access.log WRITE,READ default
  ExtendedLog                 /var/log/proftpd/auth.log AUTH auth

  </Anonymous>
</IfDefine>
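
Once the server is running with anonymous access enabled, the login can be checked from a client. The sketch below uses Python's standard ftplib module; the function name can_login_anonymously and the host in the usage note are hypothetical placeholders, not part of ProFTPD.

```python
import ftplib


def can_login_anonymously(host, port=21, timeout=5.0):
    """Return True if an anonymous FTP login to host:port succeeds."""
    ftp = ftplib.FTP(timeout=timeout)
    try:
        ftp.connect(host, port)
        ftp.login()  # no arguments means user "anonymous"
        ftp.quit()
        return True
    except ftplib.all_errors:
        # Covers refused connections, timeouts and rejected logins.
        return False
    finally:
        # Safe to call even if the connection was never opened.
        ftp.close()
```

For example, can_login_anonymously("ftp.example.com") would return True only if that (placeholder) host accepts the "anonymous" user, which the UserAlias line above maps to the ftp account.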