Challenge 1: Come up with 3 Questions

A big part of data science is coming up with questions that you'd like to explore. This is the most difficult aspect to teach in a tutorial because it's completely open-ended and requires some creativity. Often you will ask questions that the data cannot actually answer, and that's ok. That's all part of the process of discovery.

Pause here for a moment and think about the kind of data you saw in the columns. Write down at least 3 questions that you'd like to explore as part of this analysis. For example, your question might be: "What percentage of the Nobel laureates were women?" or "How many prizes were given out in each category?" Practice coming up with a few of your own questions.

In the upcoming lessons, you might find that we write the code to answer some of your questions. If not, your questions make for a great exercise to take this analysis even further.


The challenges below are all based on questions we're going to ask the data:


Challenge 2

Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?


Challenge 3


Challenge 4

Did some people get a Nobel Prize more than once? If so, who were they?


Challenge 5


Challenge 6


Challenge 7

Create a plotly bar chart that shows the split between men and women by category.



.

.

..

...

..

.

.



Solution 2: Creating a Donut Chart with Plotly

To create the chart, we use the .value_counts() method together with plotly's .pie() function. We see that out of all the Nobel laureates since 1901, only about 6.2% were women.

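# Assumes plotly.express is imported as px and df_data is already loaded
biology = df_data.sex.value_counts()
# names sets the slice labels; px.pie's labels argument expects a dict
# for renaming columns, so it is not needed here
fig = px.pie(names=biology.index,
             values=biology.values,
             title="Percentage of Male vs. Female Winners",
             hole=0.4)  # hole > 0 turns the pie into a donut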

fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')

fig.show()
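
The percentage shown on the chart can also be computed directly. The snippet below is a minimal sketch on a small made-up table (toy is a hypothetical stand-in for df_data, not the real dataset), using normalize=True so that .value_counts() returns shares instead of counts:

```python
import pandas as pd

# Hypothetical stand-in for df_data: one row per prize.
# The None row mimics a prize awarded to an organisation.
toy = pd.DataFrame({'sex': ['Male', 'Male', 'Male', 'Female', None]})

# normalize=True returns fractions instead of counts.
# Missing values are dropped by default, so the shares add up to 1.
share = toy.sex.value_counts(normalize=True)

pct_female = round(share['Female'] * 100, 1)
print(f'{pct_female}% of the individual winners in the toy data are women.')
```

Because rows with a missing sex are dropped, the percentage is relative to individual laureates only, which is also what the donut chart shows.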

Solution 3: The first 3 women to win

Even without looking at the data, you might have already guessed one of the famous names: Marie Curie.

df_data[df_data.sex == 'Female'].sort_values('year', ascending=True)[:3]
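
As a possible alternative to sorting and slicing, pandas also offers .nsmallest(). The sketch below uses a made-up table with placeholder names (toy and the single-letter names are hypothetical) to show that filtering and then taking the 3 smallest years gives the earliest rows:

```python
import pandas as pd

# Hypothetical stand-in for df_data with placeholder names
toy = pd.DataFrame({
    'year': [1911, 1901, 1905, 1903, 1909],
    'sex': ['Female', 'Male', 'Female', 'Female', 'Female'],
    'full_name': ['W', 'X', 'Y', 'Z', 'V'],
})

# Keep only the women, then take the 3 rows with the smallest year
women = toy[toy.sex == 'Female']
first_three = women.nsmallest(3, 'year')
print(first_three)
```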


Solution 4: The Repeat Winners

Winning a Nobel prize is quite an achievement. However, some folks have actually won the prize multiple times. There are several ways to find them. One approach is to look for duplicates in the full_name column:

is_winner = df_data.duplicated(subset=['full_name'], keep=False)
multiple_winners = df_data[is_winner]
print(f'There are {multiple_winners.full_name.nunique()}'
      ' winners who were awarded the prize more than once.')

There are 6 winners who were awarded the prize more than once.

col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]

Only 4 of the repeat laureates were individuals.

We see that Marie Curie actually got the Nobel prize twice - once in physics and once in chemistry. Linus Carl Pauling got it first in chemistry and later for peace given his work in promoting nuclear disarmament. Also, the International Red Cross was awarded the Peace prize a total of 3 times. The first two times were both during the devastating World Wars.
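
To see how the duplicate check behaves, here is a minimal sketch on a tiny hand-made table (toy is hypothetical and the names are simplified, so they may not match the spelling in the full_name column of the real data). It also shows one of the other approaches: counting prizes per name with .value_counts().

```python
import pandas as pd

# Tiny hand-made table; names are simplified for illustration
toy = pd.DataFrame({
    'year': [1903, 1911, 1921, 1954, 1962],
    'category': ['Physics', 'Chemistry', 'Physics', 'Chemistry', 'Peace'],
    'full_name': ['Marie Curie', 'Marie Curie', 'Albert Einstein',
                  'Linus Pauling', 'Linus Pauling'],
})

# keep=False flags every row whose name occurs more than once,
# not just the second and later occurrences
flagged = toy[toy.duplicated(subset=['full_name'], keep=False)]

# Alternative: count the prizes per name and keep names with more than one
prize_counts = toy.full_name.value_counts()
repeat_names = prize_counts[prize_counts > 1]

print(flagged)
print(repeat_names)
```

With the default keep='first', the first prize of each repeat winner would be left out, which is why keep=False is the convenient choice here.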


Solution 5: Number of Prizes per Category

To find the number of unique categories in the category column, we can use:

df_data.category.nunique()

To generate the vertical plotly bar chart, we again use .value_counts():

prizes_per_category = df_data.category.value_counts()
v_bar = px.bar(
        x = prizes_per_category.index,
        y = prizes_per_category.values,
        color = prizes_per_category.values,
        color_continuous_scale='Aggrnyl',
        title='Number of Prizes Awarded per Category')

v_bar.update_layout(xaxis_title='Nobel Prize Category', 
                    coloraxis_showscale=False,
                    yaxis_title='Number of Prizes')
v_bar.show()
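
The difference between .nunique() and .value_counts() is easiest to see on a small example. This is a sketch on made-up data (toy is a hypothetical stand-in for df_data):

```python
import pandas as pd

# Hypothetical stand-in for df_data: one row per prize
toy = pd.DataFrame({'category': ['Physics', 'Peace', 'Physics',
                                 'Economics', 'Physics', 'Peace']})

# .nunique() returns a single number: how many distinct categories exist
n_categories = toy.category.nunique()

# .value_counts() returns one count per category, sorted from most to
# least frequent, which is why the bars come out in descending order
prizes_per_category = toy.category.value_counts()

print(n_categories)
print(prizes_per_category)
```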


Solution 6: The Economics Prize

The chart above raises the question: "Why are there so few prizes in the field of economics?" Looking at the first few winners in the economics category, we have our answer:

df_data[df_data.category == 'Economics'].sort_values('year')[:3]

The economics prize is much newer. It was first awarded in 1969, compared to 1901 for physics.
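
The same comparison can be made for all categories at once by taking the earliest year per category. The following is a sketch on a made-up table (toy is hypothetical and holds only a few rows):

```python
import pandas as pd

# Hypothetical stand-in for df_data with just two categories
toy = pd.DataFrame({
    'year': [1901, 1969, 1905, 1970],
    'category': ['Physics', 'Economics', 'Physics', 'Economics'],
})

# Earliest award year in each category
first_year = toy.groupby('category').year.min()
print(first_year)
```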


Solution 7: Male and Female Winners by Category

We already saw that overall, only 6.2% of Nobel prize winners were female. Does this vary by category?

cat_men_women = df_data.groupby(['category', 'sex'], 
                               as_index=False).agg({'prize': pd.Series.count})
cat_men_women.sort_values('prize', ascending=False, inplace=True)

We can combine .groupby() and .agg() with the .count() function. This way we can count the number of men and women by prize category.
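
To make the shape of the result concrete, here is a minimal sketch of the same grouping on a small made-up table (toy and its prize labels are hypothetical). It passes the string 'count' to .agg(), which counts non-missing values in the same way as pd.Series.count:

```python
import pandas as pd

# Hypothetical stand-in for df_data
toy = pd.DataFrame({
    'category': ['Physics', 'Physics', 'Physics', 'Peace', 'Peace'],
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'prize': ['p1', 'p2', 'p3', 'p4', 'p5'],
})

# One row per (category, sex) pair, with the number of prizes in each
counts = toy.groupby(['category', 'sex'], as_index=False).agg({'prize': 'count'})
counts = counts.sort_values('prize', ascending=False)
print(counts)
```

With as_index=False, category and sex stay as regular columns, so they can be passed straight to a plotting function.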

We can then use the color parameter of the .bar() function to distinguish men and women on the chart:

v_bar_split = px.bar(x = cat_men_women.category,
                     y = cat_men_women.prize,
                     color = cat_men_women.sex,
                     title='Number of Prizes Awarded per Category split by Men and Women')

v_bar_split.update_layout(xaxis_title='Nobel Prize Category', 
                          yaxis_title='Number of Prizes')
v_bar_split.show()

We see that the imbalance is largest in Physics, Economics, and Chemistry. Women are somewhat better represented in Medicine, Literature, and Peace. Splitting bar charts like this is a powerful way to show a more granular picture.
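
To put numbers on the split, one option is to compute the share of women within each category. The sketch below uses pd.crosstab on a made-up table (toy is a hypothetical stand-in for df_data, so the percentages are not the real ones):

```python
import pandas as pd

# Hypothetical stand-in for df_data
toy = pd.DataFrame({
    'category': ['Physics', 'Physics', 'Physics', 'Peace', 'Peace'],
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male'],
})

# normalize='index' turns each row (category) into shares that sum to 1
share = pd.crosstab(toy.category, toy.sex, normalize='index') * 100
pct_female = share['Female'].round(1)
print(pct_female)
```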