Indian Cars Data Analysis and Visualization
Data Science Project for Beginners
Everyone has a different perspective on cars. Some go for luxury. Some go for vintage cars. Some choose a car based on their budget. Given all these preferences, how many car models from different companies are available to Indian buyers?
Let’s analyze that. In this article, I am going to do some data analysis of the different car companies that sell cars in India. Let’s start with a cup of coffee.
Code With Analysis
You can perform the task in either Google Colab or Jupyter Notebook. The link to the dataset used in this project is given at the end of this article.
- Import the following libraries
# for mathematical computation
import numpy as np
import pandas as pd
import scipy.stats as stats
# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
from matplotlib.pyplot import figure
%matplotlib inline
- Let’s load the data and take a sneak peek at it. Download the dataset and add it to the path. After that, display the first 5 rows of the dataset.
df = pd.read_csv("/content/cars_ds_final.csv", encoding='latin-1')
df.head()
Run the cell and you will see the first five rows of the dataset on your screen.
- Get some more information about the data
df.describe()
df.info()
df.isnull().sum()
df.duplicated().sum()
Run each of the above lines in its own cell and you will get some basic information about the data.
df.describe() reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numeric column.
df.info() lists each column with its data type and number of non-null values.
df.isnull().sum() counts the null values in each column.
df.duplicated().sum() counts the rows that duplicate an earlier row.
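To see how null and duplicate counting behaves, here is a minimal sketch on a tiny made-up table. The rows and the name toy are invented for illustration and are not taken from the cars dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the cars dataset.
toy = pd.DataFrame({
    "Make": ["Tata", "Tata", "Hyundai", "Tata"],
    "Model": ["Nano", "Nano", "i20", "Tiago"],
    "Doors": [4.0, 4.0, np.nan, 5.0],
})

# isnull() gives a True/False table; sum() counts the True cells per column.
null_counts = toy.isnull().sum()

# duplicated() marks every row that exactly repeats an earlier row.
duplicate_rows = int(toy.duplicated().sum())

print(null_counts)
print("duplicate rows:", duplicate_rows)
```

Here the second row repeats the first, and one Doors value is missing, so each count comes out as one.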
- Cleaning of data
df=df.fillna('')
df=df.replace(' ', '')
df['Ex-Showroom_Price']=df['Ex-Showroom_Price'].str.replace(',','')
df['Ex-Showroom_Price']=df['Ex-Showroom_Price'].str.replace('Rs.','', regex=False)
df['Displacement']=df['Displacement'].str.replace('cc','')
We only want to keep the numeric values of the Ex-Showroom Price and Displacement columns. I also tried to clean the Power column, but that is trickier because its values come in mixed formats. You will notice this when you go through the Power column.
Now the Displacement and Ex-Showroom Price columns contain only digits.
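Insert the following code in the next cell and run it.
df.describe()
The output summarizes the numeric columns, showing the count, mean, standard deviation, minimum, quartiles, and maximum of each. Ex-Showroom Price and Displacement show up in this summary once they are stored as numbers; the pd.to_numeric call in the correlation step does that conversion.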
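The cleaning steps above can be sketched on a couple of invented values. The sample strings below only mimic the kind of formatting described (a currency prefix, thousands separators, a cc suffix); the real file may format them differently, and the .str.strip() and errors="coerce" steps are my additions, not part of the original notebook.

```python
import pandas as pd

# Invented sample values; the actual dataset formatting may differ.
prices = pd.Series(["Rs. 2,92,667", "Rs. 12,50,000", ""])
displacement = pd.Series(["624 cc", "1497 cc", ""])

# regex=False treats "Rs." and "cc" as literal text rather than patterns.
prices_clean = (
    prices.str.replace(",", "", regex=False)
          .str.replace("Rs.", "", regex=False)
          .str.strip()
)
displacement_clean = displacement.str.replace("cc", "", regex=False).str.strip()

# errors="coerce" turns anything that is still not a number (such as "") into NaN.
prices_num = pd.to_numeric(prices_clean, errors="coerce")
displacement_num = pd.to_numeric(displacement_clean, errors="coerce")

print(prices_num.describe())
```

After the conversion, describe() treats the column as numeric and reports the mean, quartiles, and so on.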
- Correlation
numeric_cols = ['Cylinders', 'Valves_Per_Cylinder', 'Doors', 'Seating_Capacity', 'Number_of_Airbags', 'Ex-Showroom_Price', 'Displacement']
# convert the cleaned text columns to numbers
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
f, ax = plt.subplots(figsize=(14,10))
# numeric_only=True skips text columns such as Make (needed on pandas 2.0 and later)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", ax=ax)
plt.show()
Ex-Showroom Price is positively correlated with Displacement.
Ex-Showroom Price is positively correlated with the number of cylinders. In other words, the more cylinders a car has, the higher its ex-showroom price tends to be.
The more cylinders a car has, the larger its displacement. Generally speaking, the higher an engine’s displacement, the more power it can produce.
The number of doors is strongly negatively correlated with Displacement. That makes sense, right? High-displacement cars such as sports cars often have just two doors.
You can read more relationships off the heatmap in the same way.
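To make the sign of a correlation concrete, here is a small sketch with invented numbers. It assumes pandas 1.5 or later, where DataFrame.corr accepts numeric_only; the values are chosen only so that cylinders rise with displacement while doors fall.

```python
import pandas as pd

# Invented values for illustration; not rows from the cars dataset.
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Cylinders": [3, 4, 6, 8],
    "Displacement": [1000, 1500, 3000, 4000],
    "Doors": [5, 5, 4, 2],
})

# Pearson correlation between every pair of numeric columns; "Make" is skipped.
corr = toy.corr(numeric_only=True)

print(corr.round(2))
```

A value near +1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no clear linear relationship.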
- Brand-wise data
If you want to take a look at company-wise data, you can do it with just one line of code. Suppose I want all the Tata cars in the Indian market.
df[df.Make =='Tata'].tail()
I used the tail() function to look at the last 5 rows.
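That one-liner works through a boolean mask. The sketch below spells out the two steps on invented rows (the model names are placeholders, not real entries).

```python
import pandas as pd

# Placeholder rows for illustration only.
toy = pd.DataFrame({
    "Make": ["Tata", "Hyundai", "Tata", "Toyota"],
    "Model": ["Model A", "Model B", "Model C", "Model D"],
})

# Step 1: compare every value in the column, giving one True/False per row.
is_tata = toy["Make"] == "Tata"

# Step 2: index the DataFrame with the mask to keep only the True rows.
tata_cars = toy[is_tata]

print(tata_cars)
```

Swapping "Tata" for any other value in the Make column gives that brand's rows instead.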
- Brands with the most cars in the Indian market
fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df.Make.value_counts().plot(ax=ax, kind='pie')
ax.set_ylabel("")
ax.set_title("Top Car Making Companies in India")
plt.show()
Maruti Suzuki has more car variants than any other company in India.
The top 5 companies by number of car variants in India are Maruti Suzuki, Hyundai, Mahindra, Tata, and Toyota.
Sports car variants are very few.
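The ranking behind the pie chart can also be read as plain numbers. This is a minimal sketch on an invented list of makes; the counts are made up and do not reflect the dataset.

```python
import pandas as pd

# Invented list of makes; the counts here are not the real ones.
toy_makes = pd.Series(
    ["Maruti Suzuki"] * 3 + ["Hyundai"] * 2 + ["Tata"],
    name="Make",
)

# value_counts() sorts from the most frequent value to the least frequent.
variant_counts = toy_makes.value_counts()

# head(n) keeps the n most frequent makes.
top_two = variant_counts.head(2)

print(top_two)
```

On the real data, df.Make.value_counts().head(5) would list the five makes with the most variants.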
- Cars by car body Type
plt.figure(figsize=(16,7))
sns.countplot(data=df, y='Body_Type',alpha=.6,color='blue')
plt.title('Cars by car body type',fontsize=20)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('')
plt.ylabel('');
The Indian market favours SUVs, sedans, and hatchbacks.
- Graph of Body Type and Ex-Showroom Price
# sum only the price column so that text columns are left out of the aggregation
PriceByType = df.groupby('Body_Type')['Ex-Showroom_Price'].sum().sort_values(ascending=False)
PriceByType = PriceByType.reset_index()
px.bar(x='Body_Type', y ="Ex-Showroom_Price", data_frame=PriceByType)
If we sum up the ex-showroom prices of all the SUVs in the dataset, the total comes to nearly 2B INR.
Sports cars show only a minimal spike in the graph.
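A summed price grows with the number of variants in a group, so it helps to look at the mean and the count next to it. The sketch below uses invented prices to show how a body type with many cheaper variants can have the larger total while another has the higher average.

```python
import pandas as pd

# Invented prices for illustration only.
toy = pd.DataFrame({
    "Body_Type": ["SUV", "SUV", "SUV", "Sports"],
    "Ex-Showroom_Price": [1_000_000, 1_200_000, 800_000, 2_500_000],
})

# Aggregate the price column three ways for each body type.
by_type = toy.groupby("Body_Type")["Ex-Showroom_Price"].agg(["sum", "mean", "count"])

print(by_type)
```

In this toy table the SUV total is larger only because there are three SUV rows; the single sports car is still the most expensive on average.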
- Cars count by Engine Fuel Type
fig = plt.figure(figsize = (10,10))
ax = fig.subplots()
df.Fuel_Type.value_counts().plot(ax=ax, kind='pie')
ax.set_ylabel("")
ax.set_title("Cars Count by Engine Fuel Type")
plt.show()
Almost 90% of the cars on sale in India run on petrol or diesel, which is worrying from an environmental point of view.
This share is likely to change now that electric vehicles have arrived in India.
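A percentage like the one above can be computed directly instead of being read off the pie chart. Here is a minimal sketch on an invented fuel-type column; the 90% in this toy example is built in by construction and is not a result from the dataset.

```python
import pandas as pd

# Invented fuel types: 5 petrol, 4 diesel, 1 electric.
fuel = pd.Series(["Petrol"] * 5 + ["Diesel"] * 4 + ["Electric"], name="Fuel_Type")

# normalize=True returns fractions of the total instead of raw counts.
share = fuel.value_counts(normalize=True) * 100

petrol_diesel_share = share["Petrol"] + share["Diesel"]

print(share)
print("petrol + diesel share (%):", petrol_diesel_share)
```

On the real data, df.Fuel_Type.value_counts(normalize=True) would give the share of each fuel type.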
- Relation Between Price and Displacement
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='Displacement', y='Ex-Showroom_Price',hue='Body_Type',palette='viridis',alpha=.89, s=120 );
plt.xticks(fontsize=13);
plt.yticks(fontsize=13)
plt.xlabel('Displacement (cc)',fontsize=14)
plt.ylabel('Ex-Showroom Price',fontsize=14)
plt.title('Relation between Displacement and price',fontsize=20);
This plot is largely self-explanatory. Sports cars have the highest price and displacement.
- Plot pairwise relationships
sns.pairplot(df,vars=[ 'Displacement', 'Ex-Showroom_Price'], hue= 'Fuel_Type', palette=sns.color_palette('magma'),diag_kind='kde',height=2, aspect=1.8);
A pair plot comes in handy for exploratory data analysis (EDA). It plots each selected variable against every other one, so you can spot relationships between them; the variables can be continuous or categorical.
- 3D graph of Displacement, Price, and Fuel Type
fig = px.scatter_3d(df, x='Displacement', z='Ex-Showroom_Price', y='Fuel_Type',color='Make',width=800,height=750)
fig.update_layout(showlegend=True)
fig.show();
This looks quite cool because the points line up by fuel type.
Well, that’s it.
Congrats, you analyzed the Indian cars dataset. You can dig deeper on your own, because there is a lot more you can do with this data, and the insights you get are valuable.
The dataset and the full GitHub source code are here.
Hello, my name is Rohit Kumar Thakur. I am open to freelancing. I build React Native projects and am currently working with Python Django. Feel free to contact me at freelance.rohit7@gmail.com.