# The Happiness Vaccines: Exploratory Data Analysis

## 1. Introduction

Following the outbreak of Covid-19 and the subsequent development of vaccines, the world now faces the challenge of distributing them globally in the hope of returning to the now-acclaimed 'normality'. Progress looks positive at first glance. But with a limited supply of doses, the problem becomes one of distributing the vaccine fairly, so that less wealthy countries also have access to it.

This notebook analyses the latest Covid-19 vaccination status of countries worldwide as of 30 June 2021 (https://www.kaggle.com/anandhuh/latest-worldwide-vaccine-data) and compares it with data from the World Happiness Report 2021 (https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021).

What story is behind the data? Is the world succeeding in distributing the vaccine fairly? Is a happy country also a vaccinated one? We will find out.

### 1.1 Objectives

The notebook focuses mainly on the factors that make one country more vaccinated than another.

1) Find the key characteristics of the leading vaccinated countries.

2) Develop strategies and recommendations based on the findings.

3) Build a model that predicts how vaccinated a country is as of 30 June 2021.

Download the dataset using the opendatasets Python library

# Kaggle URL of datasets
vaccine_url = 'https://www.kaggle.com/anandhuh/latest-worldwide-vaccine-data'

happiness_url = 'https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021'

import opendatasets as od

# Download the vaccine dataset (prompts for Kaggle credentials)
od.download(vaccine_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds

100%|██████████| 3.09k/3.09k [00:00<00:00, 2.43MB/s]
Downloading latest-worldwide-vaccine-data.zip to ./latest-worldwide-vaccine-data



od.download(happiness_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds

100%|██████████| 55.2k/55.2k [00:00<00:00, 26.4MB/s]
Downloading world-happiness-report-2021.zip to ./world-happiness-report-2021



# Directory containing the downloaded vaccine data
data_dirone = './latest-worldwide-vaccine-data'

# Directory containing the downloaded happiness data
data_dirtwo = './world-happiness-report-2021'


# Installing all the libraries
!pip install numpy pandas matplotlib seaborn --upgrade --quiet

# Importing all the libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# Reading the data using pandas dataframe
vaccine_df = pd.read_csv(data_dirone + "/Worldwide Vaccine Data.csv")

vaccine_df

|     | Country | Doses administered per 100 people | Total doses administered | % of population vaccinated | % of population fully vaccinated |
| --- | --- | --- | --- | --- | --- |
| 0 | U.A.E. | 156.0 | 15198661.0 | NaN | NaN |
| 1 | Malta | 131.0 | 659488.0 | 71.0 | 63.0 |
| 2 | Bahrain | 129.0 | 2116497.0 | 64.0 | 60.0 |
| 3 | Aruba | 119.0 | 126387.0 | 64.0 | 55.0 |
| 4 | Israel | 119.0 | 10749083.0 | 62.0 | 57.0 |
| ... | ... | ... | ... | ... | ... |
| 177 | South Sudan | 0.4 | 44920.0 | 0.4 | 0.0 |
| 178 | Benin | 0.4 | 46108.0 | 0.3 | 0.1 |
| 179 | Burkina Faso | 0.1 | 25833.0 | 0.1 | NaN |
| 180 | Congo | 0.1 | 59443.0 | 0.1 | 0.0 |
| 181 | Chad | 0.1 | 8981.0 | 0.1 | NaN |

182 rows × 5 columns

# Finding the number of rows and columns in the dataframe
vaccine_df.shape

(182, 5)

# Some basic information about the columns of the dataframe
vaccine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 5 columns):
#   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
0   Country                            182 non-null    object
1   Doses administered per 100 people  177 non-null    float64
2   Total doses administered           177 non-null    float64
3   % of population vaccinated         175 non-null    float64
4   % of population fully vaccinated   160 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.2+ KB

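# Reading the happiness data into a pandas dataframe
# (file name assumed from the Kaggle dataset; adjust it if your download differs)
happiness_df = pd.read_csv(data_dirtwo + "/world-happiness-report-2021.csv")

happiness_df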

Country name Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
0 Finland Western Europe 7.842 0.032 7.904 7.780 10.775 0.954 72.000 0.949 -0.098 0.186 2.43 1.446 1.106 0.741 0.691 0.124 0.481 3.253
1 Denmark Western Europe 7.620 0.035 7.687 7.552 10.933 0.954 72.700 0.946 0.030 0.179 2.43 1.502 1.108 0.763 0.686 0.208 0.485 2.868
2 Switzerland Western Europe 7.571 0.036 7.643 7.500 11.117 0.942 74.400 0.919 0.025 0.292 2.43 1.566 1.079 0.816 0.653 0.204 0.413 2.839
3 Iceland Western Europe 7.554 0.059 7.670 7.438 10.878 0.983 73.000 0.955 0.160 0.673 2.43 1.482 1.172 0.772 0.698 0.293 0.170 2.967
4 Netherlands Western Europe 7.464 0.027 7.518 7.410 10.932 0.942 72.400 0.913 0.175 0.338 2.43 1.501 1.079 0.753 0.647 0.302 0.384 2.798
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
144 Lesotho Sub-Saharan Africa 3.512 0.120 3.748 3.276 7.926 0.787 48.700 0.715 -0.131 0.915 2.43 0.451 0.731 0.007 0.405 0.103 0.015 1.800
145 Botswana Sub-Saharan Africa 3.467 0.074 3.611 3.322 9.782 0.784 59.269 0.824 -0.246 0.801 2.43 1.099 0.724 0.340 0.539 0.027 0.088 0.648
146 Rwanda Sub-Saharan Africa 3.415 0.068 3.548 3.282 7.676 0.552 61.400 0.897 0.061 0.167 2.43 0.364 0.202 0.407 0.627 0.227 0.493 1.095
147 Zimbabwe Sub-Saharan Africa 3.145 0.058 3.259 3.030 7.943 0.750 56.201 0.677 -0.047 0.821 2.43 0.457 0.649 0.243 0.359 0.157 0.075 1.205
148 Afghanistan South Asia 2.523 0.038 2.596 2.449 7.695 0.463 52.493 0.382 -0.102 0.924 2.43 0.370 0.000 0.126 0.000 0.122 0.010 1.895

149 rows × 20 columns

# No. of rows and columns in the dataset
happiness_df.shape

(149, 20)

# Some basic information about the dataset
happiness_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 20 columns):
#   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
0   Country name                                149 non-null    object
1   Regional indicator                          149 non-null    object
2   Ladder score                                149 non-null    float64
3   Standard error of ladder score              149 non-null    float64
4   upperwhisker                                149 non-null    float64
5   lowerwhisker                                149 non-null    float64
6   Logged GDP per capita                       149 non-null    float64
7   Social support                              149 non-null    float64
8   Healthy life expectancy                     149 non-null    float64
9   Freedom to make life choices                149 non-null    float64
10  Generosity                                  149 non-null    float64
11  Perceptions of corruption                   149 non-null    float64
12  Ladder score in Dystopia                    149 non-null    float64
13  Explained by: Log GDP per capita            149 non-null    float64
14  Explained by: Social support                149 non-null    float64
15  Explained by: Healthy life expectancy       149 non-null    float64
16  Explained by: Freedom to make life choices  149 non-null    float64
17  Explained by: Generosity                    149 non-null    float64
18  Explained by: Perceptions of corruption     149 non-null    float64
19  Dystopia + residual                         149 non-null    float64
dtypes: float64(18), object(2)
memory usage: 23.4+ KB


### 2.2 Data Cleaning

Let's start by cleaning both datasets. We will check each one for missing values and duplicates, and handle them where they occur.

It is important to keep in mind that the two datasets are going to be merged. They therefore need a key column whose values match, so we will also check that the country names are consistent across both datasets.
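The key-column check described above can be sketched as follows. This is a minimal illustration on toy tables, not the real data: the rows in `vaccine_toy` and `happiness_toy` are placeholders chosen only to show the technique, and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two tables; the rows are illustrative, not the real data.
vaccine_toy = pd.DataFrame({"Country": ["U.A.E.", "Malta", "Aruba", "Israel"]})
happiness_toy = pd.DataFrame({"Country name": ["Finland", "Malta", "Israel"]})

# 1) The key column should be unique in each table before merging.
vaccine_keys_unique = vaccine_toy["Country"].is_unique
happiness_keys_unique = happiness_toy["Country name"].is_unique

# 2) Names present in only one table are dropped by an inner join.
vaccine_names = set(vaccine_toy["Country"].str.strip())
happiness_names = set(happiness_toy["Country name"].str.strip())
only_in_vaccine = sorted(vaccine_names - happiness_names)
only_in_happiness = sorted(happiness_names - vaccine_names)

print("Keys unique:", vaccine_keys_unique, happiness_keys_unique)
print("Only in vaccine table:", only_in_vaccine)
print("Only in happiness table:", only_in_happiness)
```

Running the same two set differences on the real `Country` and `Country name` columns would list the countries whose names need harmonising (for example abbreviations) before the merge.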

# Duplicate check in vaccine_df

if not vaccine_df.duplicated().any():
    print("There are not any duplicates")
else:
    print("There are duplicates")

There are not any duplicates

# Check if there is any null value in vaccine_df
# Using the Heat Map to check the existence of null values
plt.figure(figsize=(12,8))
sns.heatmap(vaccine_df.isnull());


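Yes, there are some null values.

We are going to replace those nulls with 0s. Note that this treats a missing figure as 0%, which understates vaccination for countries such as the U.A.E. that have doses recorded but no percentage reported.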

#Replace all the null values with zero
vaccine_df.fillna(0, inplace=True)

# Check again for null values

sns.heatmap(vaccine_df.isnull());


Now we can clearly see that there are no null values left in vaccine_df.
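As a numeric complement to the heat map, the sketch below counts nulls per column and shows how zero-filling shifts a mean. It uses four rows copied from the vaccine_df preview; the frame name `toy` is hypothetical and the result only illustrates the effect, not the full dataset.

```python
import numpy as np
import pandas as pd

# Four rows copied from the vaccine_df preview (U.A.E. has no reported percentage).
toy = pd.DataFrame({
    "Country": ["U.A.E.", "Malta", "Bahrain", "Chad"],
    "% of population vaccinated": [np.nan, 71.0, 64.0, 0.1],
})

# Count missing values per column.
null_counts = toy.isnull().sum()

# Mean ignoring NaN (the pandas default) versus mean after replacing NaN with 0.
mean_skip_nan = toy["% of population vaccinated"].mean()
mean_zero_fill = toy["% of population vaccinated"].fillna(0).mean()

print(null_counts)
print(f"mean ignoring NaN: {mean_skip_nan:.2f}")
print(f"mean after fillna(0): {mean_zero_fill:.2f}")
```

The zero-filled mean is lower because the missing value is counted as 0%; whether that is acceptable depends on how many rows are affected in the columns used later.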

# Now check for duplicate rows in happiness_df

if not happiness_df.duplicated().any():
    print("There are not any duplicates")
else:
    print("There are duplicates")

There are not any duplicates

# Check if there is any null value in happiness_df
sns.heatmap(happiness_df.isnull());

We can clearly see that there are no null values in happiness_df, so the happiness dataset needs no further cleaning.

### 2.3 Merging the Dataset

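It's time to merge the two dataframes, using the country name as the key column.

One point should be mentioned:

The happiness dataset has fewer countries than the vaccine dataset (149 versus 182). An inner join keeps only the countries present in both, so some information is lost: the merged dataframe has 135 rows, which means 47 rows from the vaccine dataset and 14 from the happiness dataset are dropped, either because the country is absent from the other dataset or because its name is spelled differently. This shouldn't be critical for the analysis.

Let's proceed with the merge.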

# Inner join on the country name, sorted by the key
merged_df = vaccine_df.merge(happiness_df, left_on='Country', right_on='Country name', how="inner", sort=True)

# Dropping the redundant 'Country name' column
del merged_df['Country name']

# Our newly merged dataframe
merged_df

Country Doses administered per 100 people Total doses administered % of population vaccinated % of population fully vaccinated Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker ... Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
0 Afghanistan 2.2 835694.0 1.7 0.5 South Asia 2.523 0.038 2.596 2.449 ... -0.102 0.924 2.43 0.370 0.000 0.126 0.000 0.122 0.010 1.895
1 Albania 33.0 943439.0 19.0 14.0 Central and Eastern Europe 5.117 0.059 5.234 5.001 ... -0.030 0.901 2.43 1.008 0.529 0.646 0.491 0.168 0.024 2.250
2 Algeria 5.8 2500000.0 5.8 0.0 Middle East and North Africa 4.887 0.053 4.991 4.783 ... -0.067 0.752 2.43 0.946 0.765 0.552 0.119 0.144 0.120 2.242
3 Argentina 45.0 20221697.0 36.0 8.9 Latin America and Caribbean 5.929 0.056 6.040 5.819 ... -0.182 0.834 2.43 1.162 0.980 0.646 0.544 0.069 0.067 2.461
4 Armenia 2.2 64293.0 1.8 0.4 Commonwealth of Independent States 5.283 0.058 5.397 5.168 ... -0.168 0.629 2.43 0.996 0.758 0.585 0.540 0.079 0.198 2.127
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130 Venezuela 5.1 1466988.0 4.3 0.8 Latin America and Caribbean 4.892 0.064 5.017 4.767 ... -0.169 0.827 2.43 0.852 0.897 0.574 0.284 0.078 0.072 2.135
131 Vietnam 3.7 3593970.0 3.5 0.2 Southeast Asia 5.411 0.039 5.488 5.334 ... -0.098 0.796 2.43 0.817 0.873 0.616 0.679 0.124 0.091 2.211
132 Yemen 0.9 268753.0 0.9 0.0 Middle East and North Africa 3.658 0.070 3.794 3.521 ... -0.147 0.800 2.43 0.329 0.831 0.272 0.268 0.092 0.089 1.776
133 Zambia 0.8 151205.0 0.8 0.1 Sub-Saharan Africa 4.073 0.069 4.209 3.938 ... 0.061 0.823 2.43 0.528 0.552 0.231 0.487 0.227 0.074 1.975
134 Zimbabwe 8.9 1299154.0 5.2 3.7 Sub-Saharan Africa 3.145 0.058 3.259 3.030 ... -0.047 0.821 2.43 0.457 0.649 0.243 0.359 0.157 0.075 1.205

135 rows × 24 columns
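To see which rows an inner join drops, one option is an outer merge with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`. The sketch below uses small toy tables (the frames `left` and `right` are hypothetical placeholders; the membership of each country is illustrative and not a claim about the real datasets).

```python
import pandas as pd

# Toy tables; which country appears in which table is illustrative only.
left = pd.DataFrame({"Country": ["Malta", "Aruba", "Denmark"],
                     "% of population vaccinated": [71.0, 64.0, 56.0]})
right = pd.DataFrame({"Country name": ["Malta", "Denmark", "Finland"],
                      "Ladder score": [6.602, 7.620, 7.842]})

# Outer join keeps every row; _merge records where each row came from.
audit = left.merge(right, left_on="Country", right_on="Country name",
                   how="outer", indicator=True)
counts = audit["_merge"].value_counts()

# Rows that an inner join would have discarded.
dropped = audit[audit["_merge"] != "both"]

print(counts)
print(dropped[["Country", "Country name", "_merge"]])
```

Applied to `vaccine_df` and `happiness_df`, the `left_only` and `right_only` rows would list exactly the countries missing from `merged_df`.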

## 3. Exploratory Analysis & Visualization

We ended up with 135 countries and lots of useful data for our analysis. Let's now proceed with the descriptive analysis.

Since we have 24 features, we will start by selecting the variables that we are going to use.

I will select the variables that I find interesting for the analysis. The selection criterion is the relationship each variable has with the feature '% of population vaccinated', since our analysis is based on that.

# Extracting the columns from merged_df and creating a new dataframe for our analysis
# ('Ladder score' is included because the info() output and later plots use it)
database_df = merged_df[['Country',
                         'Regional indicator',
                         '% of population vaccinated',
                         '% of population fully vaccinated',
                         'Ladder score',
                         'Logged GDP per capita',
                         'Social support',
                         'Healthy life expectancy']]

database_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 135 entries, 0 to 134
Data columns (total 8 columns):
#   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
0   Country                           135 non-null    object
1   Regional indicator                135 non-null    object
2   % of population vaccinated        135 non-null    float64
3   % of population fully vaccinated  135 non-null    float64
4   Ladder score                      135 non-null    float64
5   Logged GDP per capita             135 non-null    float64
6   Social support                    135 non-null    float64
7   Healthy life expectancy           135 non-null    float64
dtypes: float64(6), object(2)
memory usage: 9.5+ KB


With 2 categorical variables and 6 floats, we can start doing the descriptive analysis

database_df

|     | Country | Regional indicator | % of population vaccinated | % of population fully vaccinated | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | South Asia | 1.7 | 0.5 | 2.523 | 7.695 | 0.463 | 52.493 |
| 1 | Albania | Central and Eastern Europe | 19.0 | 14.0 | 5.117 | 9.520 | 0.697 | 68.999 |
| 2 | Algeria | Middle East and North Africa | 5.8 | 0.0 | 4.887 | 9.342 | 0.802 | 66.005 |
| 3 | Argentina | Latin America and Caribbean | 36.0 | 8.9 | 5.929 | 9.962 | 0.898 | 69.000 |
| 4 | Armenia | Commonwealth of Independent States | 1.8 | 0.4 | 5.283 | 9.487 | 0.799 | 67.055 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 130 | Venezuela | Latin America and Caribbean | 4.3 | 0.8 | 4.892 | 9.073 | 0.861 | 66.700 |
| 131 | Vietnam | Southeast Asia | 3.5 | 0.2 | 5.411 | 8.973 | 0.850 | 68.034 |
| 132 | Yemen | Middle East and North Africa | 0.9 | 0.0 | 3.658 | 7.578 | 0.832 | 57.122 |
| 133 | Zambia | Sub-Saharan Africa | 0.8 | 0.1 | 4.073 | 8.145 | 0.708 | 55.809 |
| 134 | Zimbabwe | Sub-Saharan Africa | 5.2 | 3.7 | 3.145 | 7.943 | 0.750 | 56.201 |

135 rows × 8 columns

# Finding the list of unique Regional indicator in the dataframe
database_df['Regional indicator'].unique()

array(['South Asia', 'Central and Eastern Europe',
'Middle East and North Africa', 'Latin America and Caribbean',
'Commonwealth of Independent States', 'North America and ANZ',
'Western Europe', 'Sub-Saharan Africa', 'Southeast Asia',
'East Asia'], dtype=object)

# Creating a new dataframe with the mean of every numeric column per Regional indicator

regional_df = database_df.groupby('Regional indicator').mean(numeric_only=True)
regional_df

| Regional indicator | % of population vaccinated | % of population fully vaccinated | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy |
| --- | --- | --- | --- | --- | --- | --- |
| Central and Eastern Europe | 32.693750 | 25.981250 | 5.960562 | 10.158500 | 0.891563 | 68.621250 |
| Commonwealth of Independent States | 8.333333 | 4.808333 | 5.467000 | 9.401833 | 0.872500 | 65.009500 |
| East Asia | 38.000000 | 25.166667 | 5.820667 | 10.220667 | 0.872667 | 70.500000 |
| Latin America and Caribbean | 22.283333 | 12.516667 | 6.055611 | 9.451167 | 0.855389 | 67.762167 |
| Middle East and North Africa | 22.060000 | 12.366667 | 5.177200 | 9.650267 | 0.792667 | 65.718133 |
| North America and ANZ | 40.000000 | 22.700000 | 7.128500 | 10.809500 | 0.933500 | 72.325000 |
| South Asia | 15.914286 | 7.400000 | 4.441857 | 8.682571 | 0.703429 | 62.681000 |
| Southeast Asia | 15.922222 | 8.966667 | 5.407556 | 9.421444 | 0.820333 | 64.888444 |
| Sub-Saharan Africa | 3.290625 | 1.246875 | 4.523531 | 8.094188 | 0.702656 | 55.983125 |
| Western Europe | 53.947368 | 33.631579 | 6.979632 | 10.841789 | 0.918421 | 73.015632 |

This is an overview of the dataset by Regional indicator: each value is the mean of that variable across the countries in the region.

### Which Regional Indicator has the highest Ladder Score?

# Draw a bar graph indicating the Ladder score for every region

plt.figure(figsize=(12,8))
barplot1 = sns.barplot(x='Regional indicator', y='Ladder score', data=database_df, palette="Blues_d");

barplot1.set_xticklabels(barplot1.get_xticklabels(), rotation=90);


From the above bar graph, we can see that 'North America and ANZ' has the highest ladder score and the smallest spread between its countries (the shortest error bar).

### Relation between Social support and Ladder Score

Let's explore whether there is any relation between Social support and Ladder score.

# Draw a scatter plot to find the relation between Social support and ladder Score
plt.figure(figsize=(12,8))
sns.scatterplot(x= database_df['Social support'], y= database_df['Ladder score']);


Yes, there is a positive relation between ladder score and social support: countries with higher social support tend to have higher ladder scores.
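A visual trend like this can be backed by a number. The sketch below computes the Pearson correlation with `Series.corr` on the ten rows visible in the database_df preview; it is only a subset, so the coefficient for all 135 countries will differ, and the frame name `preview` is hypothetical.

```python
import pandas as pd

# The ten rows shown in the database_df preview (a subset, not the full data).
preview = pd.DataFrame({
    "Social support": [0.463, 0.697, 0.802, 0.898, 0.799,
                       0.861, 0.850, 0.832, 0.708, 0.750],
    "Ladder score": [2.523, 5.117, 4.887, 5.929, 5.283,
                     4.892, 5.411, 3.658, 4.073, 3.145],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
r = preview["Social support"].corr(preview["Ladder score"])
print(f"Pearson r on the preview rows: {r:.2f}")
```

On the full dataframe the same call would be `database_df['Social support'].corr(database_df['Ladder score'])`.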

### Let's try to explore the effect of Social support on healthy life expectancy

# Draw a plot to find the relation between Social support and healthy life Expectancy

plt.figure(figsize=(12,8))
sns.regplot(x= database_df['Social support'], y= database_df['Healthy life expectancy']);


The graph tells us that there is indeed a positive relationship between the healthy life expectancy and social support.

### 4.1 Which regions have most of their population vaccinated?

plt.figure(figsize=(12,8))
barplot2 = sns.barplot(x='Regional indicator', y='% of population vaccinated', data=database_df, palette="Blues_d");

barplot2.set_xticklabels(barplot2.get_xticklabels(), rotation=90);


Note:

The height of each bar is the regional mean, and the line on top of each bar is its error bar (by default, seaborn draws a 95% confidence interval for the mean), so a longer line indicates more variation between the countries in that region.

We can see that Western Europe, North America and ANZ, and East Asia lead the vaccination process.

Sub-Saharan Africa and the Commonwealth of Independent States sit at the bottom, with regional means of about 3% and 8%.

It is important to note the large variation within each region except Western Europe. In most regions some countries have far more vaccinated people than others, whereas Western European countries are at similar levels. Could that similarity be a sign of regional cooperation between the countries? My intuition tells me yes.

North America and ANZ, on the other hand, doesn't seem to be cooperating.
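The spread behind the error bars can also be computed directly. The sketch below is a rough illustration using values copied from this notebook: the Western Europe list is complete (19 countries), while the Latin America group contains only the two rows visible in the preview, so that group is a partial sample included purely to show the comparison. The names `sample` and `spread` are hypothetical.

```python
import pandas as pd

# % of population vaccinated, copied from the tables in this notebook.
# Western Europe is complete; the Latin America rows are only Argentina and
# Venezuela from the preview, i.e. a partial, illustrative sample.
western_europe = [53.0, 61.0, 38.0, 56.0, 58.0, 50.0, 55.0, 45.0, 72.0, 48.0,
                  56.0, 54.0, 71.0, 58.0, 47.0, 54.0, 53.0, 46.0, 50.0]
latin_america_partial = [36.0, 4.3]

sample = pd.DataFrame({
    "Regional indicator": ["Western Europe"] * len(western_europe)
                          + ["Latin America and Caribbean"] * len(latin_america_partial),
    "% of population vaccinated": western_europe + latin_america_partial,
})

# Mean, sample standard deviation and number of countries per region.
spread = (sample.groupby("Regional indicator")["% of population vaccinated"]
                .agg(["mean", "std", "count"]))

# Coefficient of variation: spread relative to the mean, comparable across regions.
spread["cv"] = spread["std"] / spread["mean"]
print(spread.round(2))
```

On the full data, `database_df.groupby('Regional indicator')['% of population vaccinated'].agg(['mean', 'std', 'count'])` would give the same summary for all ten regions.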

# Western Europe has the highest % of population vaccinated of all the regions.
# Let's look at all the countries in the Western Europe region
western_europe_df = database_df[database_df['Regional indicator']=='Western Europe']
western_europe_df

|     | Country | Regional indicator | % of population vaccinated | % of population fully vaccinated | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | Austria | Western Europe | 53.0 | 34.0 | 7.268 | 10.906 | 0.934 | 73.300 |
| 11 | Belgium | Western Europe | 61.0 | 34.0 | 6.834 | 10.823 | 0.906 | 72.199 |
| 28 | Cyprus | Western Europe | 38.0 | 27.0 | 6.223 | 10.576 | 0.802 | 73.898 |
| 30 | Denmark | Western Europe | 56.0 | 32.0 | 7.620 | 10.933 | 0.954 | 72.700 |
| 36 | Finland | Western Europe | 58.0 | 18.0 | 7.842 | 10.775 | 0.954 | 72.000 |
| 37 | France | Western Europe | 50.0 | 30.0 | 6.690 | 10.704 | 0.942 | 74.000 |
| 41 | Germany | Western Europe | 55.0 | 37.0 | 7.155 | 10.873 | 0.903 | 72.500 |
| 43 | Greece | Western Europe | 45.0 | 35.0 | 5.723 | 10.279 | 0.823 | 72.600 |
| 48 | Iceland | Western Europe | 72.0 | 50.0 | 7.554 | 10.878 | 0.983 | 73.000 |
| 53 | Ireland | Western Europe | 48.0 | 20.0 | 7.085 | 11.342 | 0.947 | 72.400 |
| 55 | Italy | Western Europe | 56.0 | 31.0 | 6.483 | 10.623 | 0.880 | 73.800 |
| 71 | Luxembourg | Western Europe | 54.0 | 32.0 | 7.324 | 11.647 | 0.908 | 72.600 |
| 77 | Malta | Western Europe | 71.0 | 63.0 | 6.602 | 10.674 | 0.931 | 72.200 |
| 89 | Netherlands | Western Europe | 58.0 | 35.0 | 7.464 | 10.932 | 0.942 | 72.400 |
| 95 | Norway | Western Europe | 47.0 | 29.0 | 7.392 | 11.053 | 0.954 | 73.300 |
| 102 | Portugal | Western Europe | 54.0 | 33.0 | 5.929 | 10.421 | 0.879 | 72.600 |
| 115 | Spain | Western Europe | 53.0 | 36.0 | 6.491 | 10.571 | 0.932 | 74.700 |
| 117 | Sweden | Western Europe | 46.0 | 29.0 | 7.363 | 10.867 | 0.934 | 72.700 |
| 118 | Switzerland | Western Europe | 50.0 | 34.0 | 7.571 | 11.117 | 0.942 | 74.400 |

# Let's plot the % of population vaccinated for every country in the Western Europe region
plt.figure(figsize=(12,8))
barplot2 = sns.barplot(x='Country', y='% of population vaccinated', data=western_europe_df, palette="Blues_d");

barplot2.set_xticklabels(barplot2.get_xticklabels(), rotation=90);


Countries in Western Europe have a broadly similar % of vaccinated population, ranging from 38% in Cyprus to 72% in Iceland.

### 4.2 Does a higher GDP mean a higher vaccinated population?

# Here we are going to do a scatterplot of the % of people vaccinated vs the GDP per Capita.

plt.figure(figsize=(12,8))
sns.scatterplot(x='Logged GDP per capita', y='% of population vaccinated', data=database_df, s=100);


We can see from the graph that GDP per capita is positively associated with the percentage of population vaccinated: the higher the GDP, the higher the share of people vaccinated.
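One simple way to put a number on this trend is a straight-line least-squares fit. The sketch below uses `numpy.polyfit` on the ten regional means copied from regional_df above; this is an assumption-laden simplification, since regional averages hide the country-level scatter, and the fitted values are illustrative rather than a finished model.

```python
import numpy as np

# Regional means copied from regional_df above (10 regions, same order).
logged_gdp = np.array([10.158500, 9.401833, 10.220667, 9.451167, 9.650267,
                       10.809500, 8.682571, 9.421444, 8.094188, 10.841789])
pct_vaccinated = np.array([32.693750, 8.333333, 38.000000, 22.283333, 22.060000,
                           40.000000, 15.914286, 15.922222, 3.290625, 53.947368])

# Degree-1 polynomial fit: pct_vaccinated is approximated by slope * logged_gdp + intercept.
slope, intercept = np.polyfit(logged_gdp, pct_vaccinated, deg=1)

# Residuals show which regions sit above or below the fitted line.
predicted = slope * logged_gdp + intercept
residuals = pct_vaccinated - predicted

print(f"slope: {slope:.2f} percentage points per unit of logged GDP per capita")
print("residuals:", residuals.round(1))
```

Because the x-axis is a logarithm of GDP per capita, a one-unit step corresponds to a multiplicative change in GDP, so the slope should be read with that scale in mind.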

### 4.3 Do Social support and Healthy life expectancy influence the % of vaccinated people?

In this section we are going to use a scatter plot again. Since we have three variables, we will not cluster them but use the "hue" parameter to show the third variable as the colour of each point.

plt.figure(figsize=(12,8))
sns.scatterplot(x='Healthy life expectancy', y='% of population vaccinated', hue='Social support', data=database_df, s=100);


The graph tells us that there is indeed a positive relationship between the % of vaccinated people, healthy life expectancy and social support.
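With three variables, a correlation matrix summarises every pair at once. The following sketch applies `DataFrame.corr` to the regional means copied from regional_df; it is a coarse check under that assumption (ten aggregated points), and the frame name `regional_means` is hypothetical.

```python
import pandas as pd

# Regional means copied from regional_df above (10 regions, same order).
regional_means = pd.DataFrame({
    "% of population vaccinated": [32.693750, 8.333333, 38.000000, 22.283333, 22.060000,
                                   40.000000, 15.914286, 15.922222, 3.290625, 53.947368],
    "Social support": [0.891563, 0.872500, 0.872667, 0.855389, 0.792667,
                       0.933500, 0.703429, 0.820333, 0.702656, 0.918421],
    "Healthy life expectancy": [68.621250, 65.009500, 70.500000, 67.762167, 65.718133,
                                72.325000, 62.681000, 64.888444, 55.983125, 73.015632],
})

# Pairwise Pearson correlations; the diagonal is always 1.
corr = regional_means.corr()
print(corr.round(2))
```

The same call on the three columns of `database_df` would give the country-level version, which can then be passed to `sns.heatmap(corr, annot=True)` for a visual summary.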

### 4.4 Is a vaccinated country a happy one?

To answer this question we are going to use the ladder score. The higher the ladder score, the happier a country is.

plt.figure(figsize=(12,8))
sns.regplot(x='Ladder score', y='% of population vaccinated', data=database_df);


There is indeed a trend showing that the higher a country sits on the happiness report ladder, the larger the share of its people that is vaccinated. It is important to note that countries reach this level of happiness through a combination of economic and social factors, some analysed above and others dropped to keep the analysis focused.

## 5. Conclusions & Recommendations

### 5.1 Conclusions

1) Countries within the same region are at different vaccination stages, which suggests there is little evidence of cooperation between countries in their regions; this shows up as the long error bars in section 4.1. That's not the case for Western Europe, where the countries have a similar % of vaccinated population.

2) The trend in section 4.2 shows that the % of vaccinated population is positively related to the GDP per capita of a country: the higher the GDP, the higher the % of vaccinated population. Some countries are exceptions: they have a medium-level GDP but a higher % of vaccinated population. This could be explained by the emphasis those countries place on social support, a factor on which they score higher than almost all of the high-GDP countries.

3) A happy country is a vaccinated one. But do not rush to infer direct causation. A happy country is the result of a combination of economic and social factors that usually characterise a developed country. I would therefore rather say that a developed country is a vaccinated one.

### 5.2 Recommendations

1) What is seen as a competition over which country has the highest vaccinated population should instead be seen in terms of which region is distributing its resources best and even helping poorer countries. It is thus recommended to strengthen relationships within each region for better distribution and further cooperation in other areas.

2) The role of social support is also evident, and the data back this up through the level of vaccinated population. A country that cares about its people will also provide for them. The recommendation is to invest in entities and programs that promote social support.


## Future Work

As a next step, you can try the Exploratory Data analysis of -