Indian Cars Data Analysis and Visualization

Data Science Project for Beginners

Published in Python in Plain English

6 min read · Oct 1, 2021

Everyone looks at cars differently. Some go for luxury, some for vintage models, and some choose based on their budget. Given all these factors, how many car models do the different companies offer to Indian buyers?

Let’s analyze that. In this article, I analyze data on the car companies that sell cars in India. Let’s start with a cup of coffee.

Code With Analysis

You can perform the task in either Google Colab or Jupyter Notebook. The link to the dataset used in this project is given at the end of this article.

Import the following libraries

# for mathematical computation
import numpy as np
import pandas as pd
import scipy.stats as stats

# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
from matplotlib.pyplot import figure

%matplotlib inline

Let’s load the data and take a sneak peek at it. Download the dataset and place it at the path used below. Then display the first 5 rows of the dataset.

df = pd.read_csv("/content/cars_ds_final.csv", encoding='latin-1')
df.head()

Run the cell and you will see the first five rows of the dataset on your screen.

Get some more information about the data

df.describe()
df.info()
# count the missing values in each column
df.isnull().sum()
# count the rows that duplicate an earlier row
df.duplicated().sum()

Run each of the above lines in a separate cell to get an overview of the data.

df.describe() reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numeric column. It does not report the mode.

df.info() lists each column with its data type and its number of non-null values.

df.isnull().sum() counts the null values in each column of the dataset.

df.duplicated().sum() counts the duplicate rows in the dataset.
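To make the missing-value and duplicate checks concrete, here is a minimal sketch on a tiny made-up table. The rows and the variable names (toy, missing_per_column, duplicate_rows) are invented for illustration and are not taken from the cars dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the cars dataset
toy = pd.DataFrame({
    "Make": ["Tata", "Tata", "Hyundai", "Tata"],
    "Model": ["Nano", "Nano", "i20", None],
    "Doors": [4, 4, 5, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Here the second row repeats the first, so one duplicate is reported, and the Model and Doors columns each have one missing value.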

Cleaning of data

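# Replace missing values and single-space cells with empty strings
df = df.fillna('')
df = df.replace(' ', '')
# Strip the commas, the 'Rs.' prefix, and the 'cc' suffix as literal text
df['Ex-Showroom_Price'] = df['Ex-Showroom_Price'].str.replace(',', '', regex=False)
df['Ex-Showroom_Price'] = df['Ex-Showroom_Price'].str.replace('Rs.', '', regex=False)
df['Displacement'] = df['Displacement'].str.replace('cc', '', regex=False)
# Convert the cleaned strings to numbers; blanks become NaN
df['Ex-Showroom_Price'] = pd.to_numeric(df['Ex-Showroom_Price'], errors='coerce')
df['Displacement'] = pd.to_numeric(df['Displacement'], errors='coerce')

We only want the numeric value of the Ex-Showroom Price and the Displacement of each car. I also tried to clean the Power column of the dataset, but it is tedious because its values come in mixed formats. You will notice this when you go through the Power column.

Now, the Displacement column and the Ex-Showroom Price column are fully numeric.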
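The cleaning step can be tried on a few made-up strings before running it on the full dataset. This is only a sketch: the sample values are invented to resemble the raw columns, and to_number is a hypothetical helper, not part of pandas or of the original notebook.

```python
import pandas as pd

# Made-up values in the same style as the raw columns; not real dataset rows
toy = pd.DataFrame({
    "Ex-Showroom_Price": ["Rs. 2,92,667", "Rs. 10,50,000", ""],
    "Displacement": ["624 cc", "1497 cc", ""],
})

def to_number(column, tokens):
    """Remove each literal token, trim whitespace, and convert to numbers."""
    cleaned = column
    for token in tokens:
        cleaned = cleaned.str.replace(token, "", regex=False)
    # errors="coerce" turns anything unparseable (such as "") into NaN
    return pd.to_numeric(cleaned.str.strip(), errors="coerce")

toy["Ex-Showroom_Price"] = to_number(toy["Ex-Showroom_Price"], ["Rs.", ","])
toy["Displacement"] = to_number(toy["Displacement"], ["cc"])

print(toy.dtypes)
print(toy)
```

Passing regex=False treats 'Rs.' as plain text, so the dot is not interpreted as a regular-expression wildcard.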

Insert the following code in the next cell and run the cell.

df.describe()

Columns for Ex-Showroom Price and Displacement will now appear in the summary. Each column shows the mean, the median (the 50% row), and the other summary statistics of the relevant data.

Correlation

num_cols = ['Cylinders', 'Valves_Per_Cylinder', 'Doors', 'Seating_Capacity', 'Number_of_Airbags', 'Ex-Showroom_Price', 'Displacement']
df[num_cols] = df[num_cols].apply(pd.to_numeric)
f, ax = plt.subplots(figsize=(14, 10))
# Correlate only the numeric columns; text columns cannot be correlated
sns.heatmap(df[num_cols].corr(), annot=True, fmt=".2f", ax=ax)
plt.show()

Ex-Showroom Price is positively correlated with Displacement.
Ex-Showroom Price is positively correlated with the number of Cylinders. In other words, cars with more cylinders tend to have a higher ex-showroom price.
The more cylinders a car has, the larger its displacement tends to be. Generally speaking, the higher an engine’s displacement, the more power it can create.
The number of doors is strongly negatively correlated with Displacement. That makes sense, right?
You can read more relationships off the heatmap in the same way.
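If reading values off a heatmap is awkward, single correlations can also be looked up directly. The sketch below uses made-up numbers (not the cars dataset) purely to show the mechanics of corr() and how to pick out one pair of columns.

```python
import pandas as pd

# Made-up numbers for illustration; not taken from the cars dataset
toy = pd.DataFrame({
    "Cylinders": [3, 4, 4, 6, 8],
    "Displacement": [800, 1200, 1500, 3000, 5000],
    "Doors": [5, 5, 4, 4, 2],
})

# Pearson correlation between every pair of columns (values range from -1 to 1)
corr = toy.corr()

# Look up single pairs instead of reading them off a heatmap
cyl_vs_disp = corr.loc["Cylinders", "Displacement"]
doors_vs_disp = corr.loc["Doors", "Displacement"]

print(corr.round(2))
```

A value close to 1 means the two columns rise together, and a value close to -1 means one falls as the other rises. Correlation alone does not show that one causes the other.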

Brand-wise data

If you want to look at the data for a single company, you can do it with just one line of code. Suppose I choose all the Tata cars in the Indian market.

df[df.Make =='Tata'].tail()

I used the tail() function to look at the last 5 rows.
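The one-line filter combines two steps, a comparison that builds a mask and an indexing operation that applies it. Here is a small sketch of both steps on made-up rows; the model names are placeholders.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the cars dataset
toy = pd.DataFrame({
    "Make": ["Tata", "Hyundai", "Tata", "Toyota", "Tata"],
    "Model": ["A", "B", "C", "D", "E"],
})

# The comparison builds a True/False mask with one entry per row
is_tata = toy["Make"] == "Tata"

# Indexing with the mask keeps only the rows marked True
tata_cars = toy[is_tata]

# tail(n) returns the last n rows; n defaults to 5
last_two = tata_cars.tail(2)
print(last_two)
```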

Brands with the most cars in the Indian market

fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df.Make.value_counts().plot(ax=ax, kind='pie')
ax.set_ylabel("")
ax.set_title("Top Car Making Companies in India")
plt.show()

Maruti Suzuki has more car variants than any other company in India.
The top 5 companies with the most car variants in India are Maruti Suzuki, Hyundai, Mahindra, Tata, and Toyota.
The number of sports car variants is very low.
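A pie chart with many slices can be hard to read, so it may help to print the leading brands as numbers too. This is a sketch on invented counts, where each row stands for one variant; the figures do not come from the dataset.

```python
import pandas as pd

# Made-up rows; each row stands for one car variant
toy = pd.DataFrame({
    "Make": ["Maruti Suzuki"] * 4 + ["Hyundai"] * 3 + ["Tata"] * 2 + ["Toyota"],
})

# value_counts() sorts brands by number of rows, largest first
variants_per_make = toy["Make"].value_counts()

# Keep the two brands with the most variants
top_two = variants_per_make.head(2)
print(top_two)
```

On the real data, df.Make.value_counts().head(5) would list the five brands with the most variants.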

Cars by car body Type

plt.figure(figsize=(16,7))
sns.countplot(data=df, y='Body_Type',alpha=.6,color='blue')
plt.title('Cars by car body type',fontsize=20)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('')
plt.ylabel('');

The Indian market favours SUVs, sedans, and hatchbacks.

Graph of Body Type and Ex-Showroom Price

# Sum only the price column for each body type
PriceByType = df.groupby('Body_Type')['Ex-Showroom_Price'].sum().sort_values(ascending=False)
PriceByType = PriceByType.reset_index()
px.bar(x='Body_Type', y="Ex-Showroom_Price", data_frame=PriceByType)

If we sum up the ex-showroom prices of all the SUVs in the dataset, the total is nearly 2B INR.
Sports cars show only a minimal spike in the graph.
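A summed price depends on how many variants a body type has as well as on how expensive each one is. The sketch below, on made-up prices, puts the total, the average, and the variant count side by side so the two effects can be told apart.

```python
import pandas as pd

# Made-up prices for illustration; not taken from the cars dataset
toy = pd.DataFrame({
    "Body_Type": ["SUV", "SUV", "SUV", "Sports", "Hatchback", "Hatchback"],
    "Ex-Showroom_Price": [1_500_000, 2_000_000, 2_500_000, 5_000_000, 500_000, 700_000],
})

# Total, average, and number of variants per body type
summary = (
    toy.groupby("Body_Type")["Ex-Showroom_Price"]
    .agg(total="sum", average="mean", variants="count")
    .sort_values("total", ascending=False)
)
print(summary)
```

In this toy table the SUVs have the largest total only because there are three of them, while the single sports car has the highest average price.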

Cars count by Engine Fuel Type

fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df.Fuel_Type.value_counts().plot(ax=ax, kind='pie')
ax.set_ylabel("")
ax.set_title("Cars Count by Engine Fuel Type")
plt.show()

Almost 90% of Indian cars run on petrol or diesel. This is worrying from an environmental point of view.
These numbers are likely to change now that electric vehicles have arrived in India.
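A share such as "almost 90%" can be computed directly rather than estimated from the pie chart. Here is a minimal sketch on invented rows; the resulting percentages are made up and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the cars dataset
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol"] * 5 + ["Diesel"] * 4 + ["Electric"],
})

# normalize=True returns fractions of the total instead of raw counts
share = toy["Fuel_Type"].value_counts(normalize=True) * 100

petrol_plus_diesel = share["Petrol"] + share["Diesel"]
print(share.round(1))
print("Petrol + Diesel:", round(petrol_plus_diesel, 1), "%")
```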

Relation Between Price and Displacement

plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='Displacement', y='Ex-Showroom_Price', hue='Body_Type', palette='viridis', alpha=.89, s=120)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
# Label the x-axis with the column that is actually plotted
plt.xlabel('Displacement', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Relation between Displacement and Price', fontsize=20);

The plot is largely self-explanatory. Sports cars have the highest prices and the largest engine displacement.

Plot pairwise relationships

sns.pairplot(df,vars=[ 'Displacement', 'Ex-Showroom_Price'], hue= 'Fuel_Type', palette=sns.color_palette('magma'),diag_kind='kde',height=2, aspect=1.8);

Pair plots come in handy for exploratory data analysis (EDA). A pair plot draws the relationship between every pair of the selected numeric variables, and a categorical column such as Fuel_Type can colour the points through the hue argument.

3D graph of Displacement, Price, and Fuel Type

fig = px.scatter_3d(df, x='Displacement', z='Ex-Showroom_Price', y='Fuel_Type',color='Make',width=800,height=750)
fig.update_layout(showlegend=True)
fig.show();

This looks quite cool because the data is aligned by their fuel type.

Well, that’s it.

Congrats, you analyzed the Indian cars dataset. You can dig deeper on your own, because there is a lot you can do with data, and the information you get is valuable.

The dataset and full GitHub source code are here.

More Data Science Projects

Top Cyber Data Breaches (2004–2021): Data Analysis and Visualization

Medium Articles Data Visualization and Analysis using Python

Spotify Data Visualization and Analysis using Python

IPL Data Analysis (2008–2020) using Python

Zomato Data Analysis with Jupyter Notebook

Data Analysis and Visualization of Co2 Emission by Different Countries

Hello, my name is Rohit Kumar Thakur. I am open to freelancing. I build React Native projects and am currently working on Python Django. Feel free to contact me at freelance.rohit7@gmail.com.

Written by Rohit Kumar Thakur