Loading Local Data in MySQL 8 in Windows

After creating your table in your Database, the next task is to populate it using either the LOAD DATA or the INSERT INTO command. If you have your data in a file, let’s say ‘pet.txt’, you can load from the file to your table using the command below:

LOAD DATA LOCAL INFILE ‘path/to/your/pet.txt’ fields terminated by ‘,’ lines terminated by ‘\n’;

If the statement fails, it is likely that your MySQL installation does not have local file capability enabled by default. In order for the above command to work, both the MySQL Server and Client must have local file capability enabled. This tutorial walks you through enabling local file in both MySQL Server and MySQL client. We assume you have installed MYSQL 8.0 installed with at least these packages: MySQL Server, MySQL shell, and MySQL Workbench. MySQL Workbench is not necessary but it will allow you to see where the MySQL configuration is being read from, and start and stop the MySQL server. Below is a screenshot of MySQL workbench showing the configuration file the MySQL server is reading from.

To enable local file in MySQL server, we need to edit the configuration file shown in your MySQL Workbench and add a line that says “local-infile” under [mysqld] group. When you open the file to edit in your text editor such as Notepad, you’ll notice that some lines start with brackets. That’s called group. Any configuration parameters that come after the line that says [mysqld] apply to MySQL Server. In my case, [mysqld] was the last group in the file, so I simply added local-infile at the end of the file as shown in the screenshot below:

Now, for the MySQL server to take effect of the change, you need to restart the server which you can do using MySQL workbench. See the screenshot below. Simply stop the server and start again and click on “Server Status” to see if it started properly. If everything goes well, congratulations, you just enabled your MySQL Server to load local file data.

Now, the next step is to enable the same in the MySQL client. If you are new to computer science and not quite sure what a client and server is, here is a quick overview. A server is an application that is serving the client and a client is an application that is requesting service from a server. This can be confusing, but when you installed your MySQL, you decided to install both the server and the client. When you go to command line and type “mysql -u username -p” and a black screen appears with a prompt that says msqi>, you are using the client which is an interface that takes your commands and forwards that command to the server and, when the server responds, gives the response back to you. You can enable your MySQL client to enable local data load by using the option –local-infile=1 when you are starting your clientas shown below.

Now, you will be able to load local data to your table. See the screenshot below from my MySQL client.

Please note that there are several ways to do this, but the one I’ve described in this tutorial requires the least amount of work. Some mac users may have the same problem, and if you do make sure both your MySQL server and MySQL client has local file enabled. The process of enabling in the MySQl server is the same in principle – you just have to find out where the configuration file is located so you can edit the file and restart the server. To find out where your configuration file is located: type mysqld –verbose –help from your terminal. See the screenshot below. It will tell you where the configuration is being read from. You just have to edit one of these files as described above, restart the server and when you start you client make sure you use –local-infile=1 at the time of start.

SQL, Pandas and Spark

Most of us are familiar with writing database queries with SQL. But there are also other ways you can query your data from the database or from a file directly. One way is through a Python package called Pandas or through Apache Spark. Both of them are very popular these days in the Data Science field. If you can fit your data in memory in a single computer, I’d suggest to use Pandas. In case the data is big and you need to process your data in a distributed system in memory, Apache Spark is the technology to use. People who are familiar with Hadoop and not so familiar with Spark may be more inclined to use the traditional MapReduce to process big data, and that is fine but Spark comes with some built-in packages that allow you to process your data in a SQL-like manner which ends up saving a lot of development time. Today I’m going to compare SQL queries with Pandas and Spark, so in case you end up using these technologies, hopefully this will make slightly easier to get your head around it. Note that I’ll be showing you examples of Spark with the Python API, whose equivalence is available in JAVA and Scala APIs of Spark as well.

Employee Table/Dataframe

IdEmployee_NameSocial_Security_NumberDepartment_IdSalary
1Roger Martin546-98-1987265000
2Robert Waters437-781-4563170000
3Michael Peters908-809-0897175000

Organization Table/Dataframe

IdDepartment_Name
1Data Science
2Finance
3Human Resources

Column Selection

SQLPandasSpark
select Employee_Name, Department_Id
from Employee
Employee[[‘Employee_Name’,’Department_Id’]]Employee.select(‘Employee_Name’,’Employee_Id’)

Row Selection

SQLPandasSpark
select * from
Employee
where Department_Id=’1′
mask = Employee[‘Department_Id’] == 1
Employee[mask]
Employee.where(col(‘Department_Id’) == 1)

Group by

SQLPandasSpark
select Department_Id, avg(‘Salary’)
from Employee
group by Department_Id
Employee[[‘Department_Id’, ‘Salary’]].groupby([‘Department_Id’]).mean()Employee.groupBy(‘Department_Id’).agg(mean(‘Salary’))

Joins

SQLPandasSpark
select t1.Employee_Name, t2.Department_Name
from Employee t1, Organization t2
where t1.Department_Id = t2.Id
import pandas as pd
pd.merge(Employee, Organization, how=”inner”, left_on=’Department_Id’, right_on=’Id’)[[‘Employee_Name’,’Department_Id’]
]
joinexpr = Employee[‘Department_Id’] == Organization[‘Id’]
Employee.join(joinexpr, “inner”).select(‘Employee_Name’, ‘Department_Name’)