I just completed the ‘Data Manipulation’ module with DataCamp. In this module, I gained basic knowledge of visualising data, handling missing values, creating dataframes from CSV files, and creating a CSV file from a dataframe. This blog post will hopefully reinforce some of the principles I learned in this module, and it might inspire someone who wants to learn about data but does not know where to get started.
Visualising Data
In this module, there were a few things I learned about visualising data. Prior to this, I had happened to come across a very interesting cheat sheet on LinkedIn, posted by DataCamp, about which kind of visualisation to use with matplotlib.pyplot.
Here is the actual code with some comments to help you understand what it does.
import matplotlib.pyplot as plt
# this allows you to use plt as a reference to the library

dataframe['column'].hist()
# histograms are useful for distributions
# also accepts an argument for bins

plt.show()
# like print but for graphs

avg_weight = dataframe.groupby('___')['weight_kg'].mean()
# using groupby to get the mean weight based on '___'

avg_weight.plot(kind='bar')
# the kind argument defines the plot to be a bar chart; it accepts a title arg too

avg_weight.plot(title='', x='', y='', kind='', rot='')
# example arguments it can take:
# title defines the chart title
# x defines the x axis title
# y defines the y axis title
# kind defines the type of chart
# rot determines the rotation of the x axis labels

# scatter plots happen to be great for relationships
# plots can be layered too (see below)
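Since the snippet above uses placeholders, here is a minimal, self-contained version you can actually run; the dogs table and its values are made up for illustration, as the module's dataset isn't reproduced in this post.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the module's dogs dataset
dogs = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F"],
    "weight_kg": [24.0, 21.0, 32.0, 28.0, 7.5],
})

# Mean weight per sex, drawn as a bar chart with a title and unrotated labels
avg_weight = dogs.groupby("sex")["weight_kg"].mean()
avg_weight.plot(kind="bar", title="Average weight by sex", rot=0)
plt.show()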
Here is a layered chart comparing conventional and organic avocado sizes and prices.
You might also have noticed that there is a legend and that the graphs have a lower opacity so they can be layered. The legend can be added like this:
plt.legend(['F', 'M'])
The opacity is added as an argument to .plot():
alpha=x
Here x is a value between 0.0 and 1.0 which determines the opacity, where 0.0 is fully transparent and 1.0 is fully opaque.
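Putting the legend and opacity together, here is a minimal sketch of a layered histogram; the avocado numbers below are invented for illustration only.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical avocado data for illustration
avocados = pd.DataFrame({
    "type": ["conventional"] * 4 + ["organic"] * 4,
    "avg_price": [1.10, 1.25, 0.95, 1.30, 1.60, 1.75, 1.55, 1.80],
})

# Layer one histogram per type; alpha=0.5 keeps both visible
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5)
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5)

plt.legend(["conventional", "organic"])
plt.show()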
Missing Values
There are a few commands that can help detect and fix issues that may come up when working with datasets with missing values.
These missing values usually present themselves as NaN, which stands for ‘Not a Number’. This can be tricky when we want to run a numerical operation on a dataset. To alleviate this, we can fill in the missing values.
# detects missing values
dogs.isna()

# same but presents concisely in console/when printed
# values == True mean NaN
dogs.isna().any()

# counts missing values in each column
dogs.isna().sum()

# creates a bar chart of missing values in each column
dogs.isna().sum().plot(kind='bar')

# prints the graph
plt.show()
It is now clear to us that there are missing values in 3 columns. We can fix this with the following code.
# not ideal because data is 'lost'
dogs.dropna()

# argument decides what fills the missing values
dogs.fillna(0)
With this, we can alleviate some of the issues that arise from missing data, whether it is caused by how data collection is implemented or by user error.
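To make the detection and filling steps concrete, here is a small self-contained sketch; the dogs table below is made up, since the module's actual dataset isn't shown here.

import numpy as np
import pandas as pd

# Hypothetical dogs data with deliberate gaps
dogs = pd.DataFrame({
    "name": ["Bella", "Max", "Luna"],
    "weight_kg": [24.0, np.nan, 7.5],
    "height_cm": [56.0, 49.0, np.nan],
})

print(dogs.isna().sum())   # number of missing values per column
print(dogs.fillna(0))      # every NaN replaced with 0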
Creating Dataframes (and CSV files)
There are two ways to create a dataframe.
A dataframe created from a list of dictionaries, which is built row by row:
import pandas as pd

# Create a list of dictionaries with new data
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)
And a dataframe created from a dictionary of lists, which is built column by column:
# Create a dictionary of lists with new data
avocados_dict = {
    "date": ["2019-11-17", "2019-12-01"],
    "small_sold": [10859987, 9291631],
    "large_sold": [7674135, 6238096],
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)
As you can see, there is no actual difference in how the two DataFrames present to you as a data scientist, but it is still important to understand how they are built.
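As a quick sanity check (my own addition, not part of the module), you can build the same rows both ways and confirm that pandas produces identical DataFrames:

import pandas as pd

rows = [{"date": "2019-11-17", "small_sold": 10859987},
        {"date": "2019-12-01", "small_sold": 9291631}]
cols = {"date": ["2019-11-17", "2019-12-01"],
        "small_sold": [10859987, 9291631]}

# Both construction styles yield the same DataFrame
print(pd.DataFrame(rows).equals(pd.DataFrame(cols)))  # True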
# Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv("airline_bumping.csv")

# Take a look at the DataFrame
print(airline_bumping.head())

# Sum nb_bumped and total_passengers for each airline
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000
The above code uses the .groupby method to group by airline and calculates the sum for two columns: nb_bumped and total_passengers.
A new column is then created to give us data that tells us how many bumps per 10k passengers have occurred on each airline.
We can now turn this new dataframe back into a CSV file. The code for this is quite simple.
# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values('bumps_per_10k', ascending=False)

# Print airline_totals_sorted
print(airline_totals_sorted)

# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv('airline_totals_sorted.csv')
We sort our data by the ‘bumps_per_10k’ column in descending order, store the result in a new variable, and call the .to_csv method on it with the output filename as an argument.
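One detail worth knowing (my own note, not from the module): because we grouped by airline, the airline names are the DataFrame's index, and .to_csv writes that index as the first column of the file. When loading the file again, you can restore it with index_col:

import pandas as pd

# Reload the saved file, restoring 'airline' as the index
airline_totals_sorted = pd.read_csv('airline_totals_sorted.csv', index_col='airline')
print(airline_totals_sorted.head())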
That wraps up this post. I was happy to learn about these concepts since I can see how they might be directly needed by someone I would be working with. I really liked the idea of saving a sorted dataframe as a CSV file that can be picked up and worked on again in the future. I imagine that a data scientist ends up creating a lot of CSV files sorted in slightly different ways so that they can choose the right one for the job.