How To Analyze And Visualize Data Using Python (Beginner Explained)

One of the main reasons Python is so widely used among data scientists and data analysts is how easily it lets you manipulate and analyze data. On top of that, Python comes with many freely available visualization packages and libraries, which means you can visualize data in a wide variety of charts, colors, and sizes.

By using Python we can often uncover deeper insights than conventional tools such as Excel allow, and, just as importantly, demonstrate our data analytics skills if we plan on applying for an analytical job.

That is why, in this article, we will show how you can analyze and visualize data using Python.

Although Python can also gather data from APIs, web scraping, and more, we plan to keep this tutorial simple by importing a CSV-formatted dataset from Kaggle.

If you’re interested in learning more about data analytics in Python, we suggest signing up for one of our free courses below.

Introduction to Data science in Python

Affiliate Disclaimer: We sometimes use affiliate links in our content. This won’t cost you anything but it helps keep our lights on and pays our writing and developer teams. We appreciate your support!

1. Download and Explore Our Data Using Pandas

The first step in any data analysis project is to gather and explore our data; this is called exploratory data analysis. The reason we want to explore our data first is to get an overarching view of the type of data we are dealing with.

The dataset we will be using can be downloaded on the Kaggle website here: https://www.kaggle.com/jaytilala/global-power-plant

If that link doesn’t work, you can directly download the CSV file here:

This dataset contains information about powerplants, such as their location, the fuel type used to generate electricity, and so on.

Let’s now import our dataset and check its contents. We will be using a library called “pandas”, a powerful module that lets us analyze tabular data.

You can import and visualize the CSV file from the code below:

import pandas as pd #importing our pandas library
df = pd.read_csv('Global Power Plant.csv') #reading our CSV file
print(df) #printing the dataframe

What I have done is create a Python file in the same folder as my CSV file, then call pandas’ pd.read_csv function to read the CSV file and print the first 5 and last 5 rows of each of our attributes using the print function.

Python Data analysis
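
Before moving on, a more compact way to get an overview of the data is to use pandas’ built-in inspection methods. Here is a minimal sketch (it assumes the same df loaded above):

print(df.shape)   # number of rows and columns
print(df.head())  # first 5 rows only
df.info()         # column names, data types and non-null counts per column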

Here are a few things I noticed after exploring the CSV file:

  • We have some missing values in the dataset. Missing values can lead to imbalanced observations, cause biased estimates, and, in extreme cases, even lead to invalid conclusions.

  • Some attributes have no bearing on our analysis. For example, unique identifiers are probably only useful for those who want to look up the ID of each powerplant, so the gppd_idnr attribute can be ignored (the short sketch after this list shows how to verify both points).
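
If you want to confirm these two observations yourself, a quick sketch using the column names from the dataset above:

print(df.isnull().sum())                    # count of missing values per column
print(df['gppd_idnr'].nunique() == len(df)) # True if gppd_idnr is just a unique identifier per row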

2. Clean Our Dataset Using Pandas

In this step we need to clean our dataset. The reason we need to clean it is that missing data may negatively affect the information we can extract from it.

There are many different ways to clean a dataset, but to keep it simple let’s purge all the missing values and drop our gppd_idnr attribute. We are dropping the gppd_idnr column because it does not provide any informational value; its values are just the unique IDs of each powerplant.

Below is the code to drop our gppd_idnr column and purge all missing values.

df.pop('gppd_idnr') # drop our gppd_idnr attribute
dropped_na = df.dropna() # purge all rows containing missing values
print(dropped_na) # print the cleaned dataframe

To drop the entire gppd_idnr column we can use the dataframe’s pop() method. Next, we drop all null values, meaning rows that contain missing entries, using the dropna() function.

Lastly, we assign the resulting dataframe to a new variable named dropped_na.

As you can see, the gppd_idnr column has been dropped from our new dataframe “dropped_na”.

Python Data visualisation
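
To double-check that the cleaning worked, we can compare the shapes before and after and confirm no missing values remain; a minimal sketch:

print(df.shape, dropped_na.shape)         # rows/columns before and after dropping missing values
print(dropped_na.isnull().sum().sum())    # should print 0 if every missing value was removed
print('gppd_idnr' in dropped_na.columns)  # should print False since the column was dropped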

3. Visualize Our CSV File Dataset

Now that we have cleaned our dataset and removed all unnecessary data, we can visualize each attribute in the dataset.

We will be using the new dataframe “dropped_na” to visualize our data. I will also be using a range of different techniques from multiple packages/libraries to show the wealth of visualization options that Python offers.

Great analysts always have a wealth of visualization techniques under their belt.

When it comes to visualizing data, determining the best chart to use is primarily a judgement call. Every chart has its own strengths; for instance, line charts, bar charts, and scatter graphs are all good at visualizing trends.

Based on the information you are dealing with, you need to pick the chart that is best suited to presenting it.

3.1 Visualizing Data As A Bar Chart Using Pandas Dataframe (Country)

Pandas is primarily a package for analyzing tabular data, but its dataframes also come with built-in visualization tools such as bar charts.

In this section we will analyze the countries where the powerplants are located, and one way to visualize this data is with a bar chart. We use a bar chart because it shows the distribution of data points across countries and lets us compare the frequency of powerplants per country.

In effect, we are using it as a comparative bar chart, placing the bar for each country adjacent to the others. This allows for a quick visual comparison of the data.

The bar chart can be plotted with the code below.

dropped_na['Country'].value_counts().plot(kind='bar', figsize=(20, 6))
Bar chart using pandas dataframe

Plotting this bar chart shows that the United States of America has the largest number of powerplants. The issue, however, is that there are many other countries with only a handful of powerplants each, which gives the chart a long tail of tiny bars and makes it difficult to present our key information: the countries with the most powerplants.

To filter for countries that have more than 100 powerplants we can use the pandas loc indexer, which lets us select values from a dataframe or series by label or by a condition.

Therefore we can chain loc[lambda x : x > 100] onto value_counts() to keep only the countries with more than 100 powerplants.

dropped_na['Country'].value_counts().loc[lambda x : x>100].plot(kind='bar', figsize=(20, 6))
Bar chart using pandas dataframe
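
As a side note, if the lambda syntax feels unfamiliar, the same filter can be written with a plain boolean mask; this sketch assumes the same 'Country' column:

country_counts = dropped_na['Country'].value_counts() # powerplants per country
top_countries = country_counts[country_counts > 100]  # keep only countries with more than 100 powerplants
top_countries.plot(kind='bar', figsize=(20, 6))       # same bar chart as above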

3.2 Visualizing Data As A Pie Chart Using Matplotlib Pyplot (Primary Fuel)

Next we will visualize the primary fuel. Since there are multiple categories of primary fuel across the different powerplants, we need to group our rows by the unique primary fuel values and count the frequency of each.

We can do that with the code below, which uses the pandas “groupby” and “size” functions: it groups the rows by primary fuel category and then counts how many rows fall into each category.

PrimaryFuel = dropped_na.groupby(["Primary Fuel"]).size() #Group by primary fuel and count the frequency
print(PrimaryFuel) #Print the PrimaryFuel series

Output:

Primary Fuel
Biomass            619
Coal              2090
Cogeneration        41
Gas               3152
Geothermal         127
Hydro             3545
Nuclear            134
Oil               1419
Other               36
Petcoke             13
Solar             4604
Storage             58
Waste             1017
Wave and Tidal       9
Wind              2667
dtype: int64

From the output we can see that the data consists of labelled categories with a numerical count for each. A pie chart should therefore be a good way of visualizing it.

For our pie chart we already have the numerical data, but we also need to create our labels. PrimaryFuel is stored as a pandas Series.

This means its format is very similar to a dictionary, so we can retrieve the names of the fuels using the keys() function.

Below is the code to create our labels and plot our pie chart.

!pip install matplotlib # Install matplotlib visualization package
import matplotlib.pyplot as plt #Import matplotlib

labels = PrimaryFuel.keys() #Create our label name
plt.pie(PrimaryFuel, labels = labels, radius = 3) #Create our pie chart
Pie chart using matplotlib pyplot
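
If you also want each slice to show its percentage share, matplotlib’s pie() accepts an autopct argument; here is a small variation of the chart above:

plt.pie(PrimaryFuel, labels = labels, autopct = '%1.1f%%') # add a percentage label to each slice
plt.show() # display the pie chart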

3.3 Visualizing Data As Multiple Pie Charts Using Matplotlib Pyplot (Country and Primary Fuel)

In the previous pie chart we visualized the primary fuel used in powerplants across the world. However, this does not tell the full story, as some countries may rely on a certain fuel far more than others.

That is why we need to dig deeper. In this step we will graph the top 4 countries and analyze their primary fuel.

We will be using Matplotlib’s subplots function, which allows us to graph multiple charts within the same figure. To create our multiple pie charts, we first need to filter the Country column down to the specified top 4 countries, which are “United States of America”, “United Kingdom”, “China” and “Canada”.

import matplotlib.pyplot as plt # Importing our graphical package

# Sorting our dataframe of country that equals to USA, etc and group by primary fuel
America = dropped_na[dropped_na['Country'] == 'United States of America'].groupby(["Primary Fuel"]).size()
United_Kingdom = dropped_na[dropped_na['Country'] == 'United Kingdom'].groupby(["Primary Fuel"]).size()
China = dropped_na[dropped_na['Country'] == 'China'].groupby(["Primary Fuel"]).size()
Canada = dropped_na[dropped_na['Country'] == 'Canada'].groupby(["Primary Fuel"]).size()

Next, we use Matplotlib’s subplots function to create our subplots. The axis array specifies where each chart is plotted within the figure; as you can see, the coordinates [0, 0] refer to the top left. At each coordinate we create a pie chart and set a title.

# Creating our subplots with adjusted figure size
figure, axis = plt.subplots(2, 2, figsize=(15,15))

#Graphing each pie charts with its related countries
axis[0, 0].pie(America, labels = America.keys())
axis[0, 0].set_title("America")
axis[0, 1].pie(United_Kingdom, labels = United_Kingdom.keys())
axis[0, 1].set_title("United Kingdom")
axis[1, 0].pie(China, labels = China.keys())
axis[1, 0].set_title("China")
axis[1, 1].pie(Canada, labels = Canada.keys())
axis[1, 1].set_title("Canada")
Pie chart visualisation using python
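
As a side note, the four blocks above are almost identical, so if you prefer less repetition you could loop over the countries instead; a minimal sketch assuming the same dropped_na dataframe:

countries = ['United States of America', 'United Kingdom', 'China', 'Canada']
figure, axis = plt.subplots(2, 2, figsize=(15, 15))
for country, ax in zip(countries, axis.flatten()): # pair each country with one subplot
    fuel_counts = dropped_na[dropped_na['Country'] == country].groupby(["Primary Fuel"]).size()
    ax.pie(fuel_counts, labels = fuel_counts.keys()) # pie chart of that country's fuel mix
    ax.set_title(country)
plt.show()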

3.4 Visualizing Data As A Tree Map Using Squarify (Country and Capacity (MW))

In this section we will visualize the country and the average Capacity (MW). Together, these two attributes tell us which countries, on average, generate the most electricity per powerplant.

To visualize our data we will use a tree map, which represents each country’s average Capacity (MW) as the area of a rectangle: the larger a country’s rectangle, the more electricity its powerplants generate on average.

There are plenty of packages available for visualizing tree maps, but we will use the package Squarify for simplicity’s sake.

What we are doing is grouping the rows by country and then calculating the average Capacity (MW) per country. The code below shows how it is done.

country_average = dropped_na.groupby(['Country'])['Capacity (MW)'].mean()

Next, we will install Squarify and graph our chart.

#Installing and Importing Squarify
!pip install squarify
import squarify

#Increasing Figure Size (set before plotting so it takes effect)
plt.rcParams["figure.figsize"] = (50,15)

# Plotting a Python Treemap
squarify.plot(sizes=country_average.values, label=country_average.index, alpha = 0.6)

# Displaying the plot
plt.show()
Tree map using squarify
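
One caveat: with every country included, the smallest rectangles can become unreadable. If that happens, one option is to keep only, say, the 20 countries with the highest average capacity; a quick sketch (the cut-off of 20 is an arbitrary choice):

top20 = country_average.sort_values(ascending=False).head(20) # 20 countries with the highest average capacity
plt.figure(figsize=(20, 10)) # create the figure before plotting
squarify.plot(sizes=top20.values, label=top20.index, alpha = 0.6)
plt.axis('off') # hide the axes for a cleaner tree map
plt.show()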

3.5 Visualizing Data As A Geographical Map (Longitude and Latitude Data)

In this section we will analyze the longitude and latitude data. By doing so we can visualize the spread of locations where the powerplants are situated and get a sense of which countries have the most powerplants.

To graph our data we can use the matplotlib pyplot package and plot the data as a scatter graph.

Below is the code to visualize the data as a scatter graph.

import matplotlib.pyplot as plt # Importing our graphical package
plt.scatter(x=dropped_na['Longitude'], y=dropped_na['Latitude']) #Plotting our Latitude and Longitude Data
plt.show() #Outputting Graph
Analysing data using scatter

Because longitude and latitude values can be positive or negative, the data maps naturally onto a scatter graph, as shown above.

From the graph we can make out the outline of the regions where most of the powerplants are located. These locations are most likely where the majority of the population lives.

However, the graph above does not give the full picture: some countries are missing, and so are the outlines of each country’s borders.

To better represent the data we can use the plotting package Plotly and its scatter_geo function, which is essentially the same scatter graph as the matplotlib one but with the outline of the world map drawn underneath.

Below is the code to output our scatter_geo graph.

import plotly.express as px # Importing our graphical package

fig = px.scatter_geo(dropped_na,lat='Latitude',lon='Longitude', hover_name="Powerplant Name") #plotting our data
fig.update_layout(title = 'Power Plants', title_x=0.5) #Inputting title 
fig.show() # outputting graph
Analysing geographical data using plotly
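
If you want to pack a little more information into the map, Plotly’s scatter_geo can also color each point by a column; for example, coloring each powerplant by its primary fuel (assuming the same column names as above):

fig = px.scatter_geo(dropped_na, lat='Latitude', lon='Longitude',
                     hover_name="Powerplant Name", color='Primary Fuel') # color each powerplant by fuel type
fig.update_layout(title = 'Power Plants by Primary Fuel', title_x=0.5) # set a centered title
fig.show() # output the graph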

Conclusion

In this tutorial we have cleaned data and analyzed categorical, numerical and geographical data using Python.

You learned how to:

  • Explore and clean data using Pandas Dataframe
  • Graph data using Pandas Dataframe, Matplotlib Pyplot, Squarify and Plotly
  • Manipulate data using Pandas Dataframe

With these powerful tools, you can now go out and investigate data!

If you want to learn more about analysing data and becoming a data scientist using Python, we recommend trying our free course by clicking the link below:

Frequently Asked Questions

Is Python necessary for data analysis?

Although there are many tools such as Excel, Google Sheets and Tableau for conducting data analysis, Python is by far one of the most widely used programming languages for analyzing data. This is because it can analyze and manipulate data very quickly, and almost everything is customisable. It also comes with neatly built visualization libraries that make presenting data more distinctive.

How Python is used in data analysis?

Python can be used for data analysis through its ability to manipulate and clean data, such as removing empty values or converting data to a specific format. It also has a plethora of freely available visualization libraries and packages for analyzing data.