Visualising Data, Missing Values and Creating Dataframes

Kasim Ali
5 min read · Sep 4, 2022


I just completed the ‘Data Manipulation’ module with DataCamp. In this module, I gained a basic knowledge of visualising data, what to do when encountering missing values, how to create dataframes from CSV files, and how to create a CSV file from a dataframe. This blog post will hopefully reinforce some of the principles I learned in the module, and it could potentially inspire someone who wants to learn about data but does not know where to get started.

Visualising Data

In this module, there were a few things I learned about visualising data. Prior to this, I had happened to come across a very interesting cheat sheet on which kind of visualisation to use with matplotlib.pyplot. I found it on LinkedIn, posted by DataCamp.

Here is the actual code with some comments to help you understand what it does.

import matplotlib.pyplot as plt
# this allows you to use plt as a reference to the library
dataframe['column'].hist()
# histograms are useful for distributions
# .hist() also accepts a bins argument, e.g. .hist(bins=20)
plt.show()
# like print, but for graphs
[Image: bar chart with the size of avocados on the x-axis and average price on the y-axis]
avg_weight = dataframe.groupby('___')['weight_kg'].mean()
# using groupby to get the mean weight for each value of '___'
avg_weight.plot(kind='bar')
# the kind argument makes the plot a bar chart; .plot() accepts a title argument too
avg_weight.plot(title='', x='', y='', kind='', rot=0)
# example arguments it can take:
# title defines the chart title
# x defines the x-axis label
# y defines the y-axis label
# kind defines the type of chart
# rot rotates the x-axis tick labels (it takes an integer, e.g. rot=45)
# scatter plots happen to be great for relationships
# plots can be layered too (see below)
[Image: scatter plot of the number of avocados sold against average price]
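My notes from the module did not include a scatter plot snippet, so here is a minimal sketch of what one might look like; the nb_sold and avg_price column names are my own assumptions:

# a minimal sketch of a scatter plot; 'nb_sold' and 'avg_price' are assumed column names
dataframe.plot(x='nb_sold', y='avg_price', kind='scatter')
plt.show()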

Here is a layered chart comparing conventional and organic avocado sizes and prices.

You might have also noticed that there is a legend and that the graphs have a lower opacity so that they can be layered. The legend can be added like this:

plt.legend(['F', 'M'])
# the list provides a label for each layered plot, in plotting order

The opacity is added as an argument to .plot():

alpha=x

Here, x is a value between 0.0 (fully transparent) and 1.0 (fully opaque) that determines the opacity of the plot.
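Putting the legend and alpha pieces together, here is a minimal sketch of how a layered chart like the one above could be built; the 'type' and 'price' column names are my own assumptions:

# a sketch of layering two histograms with reduced opacity; column names are assumed
dataframe[dataframe['type'] == 'conventional']['price'].hist(alpha=0.5)
dataframe[dataframe['type'] == 'organic']['price'].hist(alpha=0.5)
plt.legend(['conventional', 'organic'])
plt.show()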

Missing Values

There are a few commands that can help detect and fix issues that come up when working with datasets that have missing values.

These missing values usually present themselves as NaN, which stands for ‘Not a Number’. This can be tricky when we want to run a numerical operation on a dataset. To alleviate this, we can fill in the missing values.

# detects missing values
dogs.isna()

[Image: the table of booleans returned by dogs.isna()]

# same check, but summarised per column when printed
# values == True mean the column contains a NaN
dogs.isna().any()
# counts missing values in each column
dogs.isna().sum()

[Image: the boolean table returned by dogs.isna().any(). Much tidier.]

# creates a bar chart of missing values in each column
dogs.isna().sum().plot(kind='bar')
# displays the graph
plt.show()

It is now clear to us that there are missing values in 3 columns. We can fix this with the following code.

# drops rows with missing values; not ideal, because data is 'lost'
dogs.dropna()
# the argument decides what fills the missing values
dogs.fillna(0)

[Image: dogs.isna().any() now returns False for all columns]
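One thing worth noting (my own aside, not from the module) is that .dropna() and .fillna() return a new dataframe rather than changing dogs in place, so a quick sanity check could look like this:

# .fillna() returns a new dataframe, so assign the result
dogs_filled = dogs.fillna(0)
print(dogs_filled.isna().any())
# every column should now print False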

With this, we can alleviate some of the issues we face when data is missing due to flaws in how data collection was implemented, or due to user error.

Creating Dataframes (and CSV files)

There are two ways to create a dataframe.

The first is a dataframe created from a list of dictionaries, which is built row by row.

import pandas as pd

# Create a list of dictionaries with new data
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]
# Convert the list into a DataFrame and print it
avocados_2019 = pd.DataFrame(avocados_list)
print(avocados_2019)
[Image: the new dataframe printed in the console, built row by row]

The second is a dataframe created from a dictionary of lists, which is built column by column.

# Create a dictionary of lists with new data
avocados_dict = {
    "date": ["2019-11-17", "2019-12-01"],
    "small_sold": [10859987, 9291631],
    "large_sold": [7674135, 6238096]
}
# Convert the dictionary into a DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)
# Print the new DataFrame
print(avocados_2019)
[Image: the new dataframe printed in the console, built column by column]

As you can see, there is no difference in how the two dataframes present to you as a data scientist, but it is important to understand how they are built anyway.
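As a quick check (my own sketch, not from the module), you can build the same data both ways and compare the results:

# build the same two rows, row by row and column by column
by_rows = pd.DataFrame([
    {"date": "2019-11-17", "small_sold": 10859987},
    {"date": "2019-12-01", "small_sold": 9291631},
])
by_cols = pd.DataFrame({
    "date": ["2019-11-17", "2019-12-01"],
    "small_sold": [10859987, 9291631],
})
# .equals() checks that both values and dtypes match
print(by_rows.equals(by_cols))
# True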

# Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv('airline_bumping.csv')
# Take a look at the DataFrame
print(airline_bumping.head())

Since we used .head(), the first 5 rows are shown to us as a summary.

# From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())

# Sum nb_bumped and total_passengers for each airline
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

The above code uses the .groupby() method to group the data by airline and calculates the sum of two columns: nb_bumped and total_passengers.

A new column is then created to give us data that tells us how many bumps per 10k passengers have occurred on each airline.

Our new column, which provides the business with new data, is now in our table.

We can now turn this new dataframe back into a CSV file. The code for this is quite simple.

# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values('bumps_per_10k', ascending=False)
# Print airline_totals_sorted
print(airline_totals_sorted)
# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv('airline_totals_sorted.csv')

We sort our data on the ‘bumps_per_10k’ column in descending order, store the sorted dataframe in a new variable, and call the .to_csv() method on it with the output filename as an argument.
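If you later want to pick that file back up, a sketch like this should work; since .to_csv() also wrote the ‘airline’ index, I am assuming you want it restored as the index:

# read the saved CSV back, restoring 'airline' as the index
airline_totals_sorted = pd.read_csv('airline_totals_sorted.csv', index_col='airline')
print(airline_totals_sorted.head())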

That wraps up this post. I was happy to learn about these concepts since I could see how they might be directly required by someone I would be working with. I really liked the idea of saving a sorted dataframe as a CSV file, which can then be picked up and worked on again in the future. I imagine that a data scientist would create a lot of CSV files sorted in slightly different ways, so that they can choose the right file for the job.

I would also love to connect on Twitter, LinkedIn, or my Website. I am not sure what value I could provide you with, but if you reach out, I would be more than happy to have a conversation or talk about what I have written here.
