The first step, as always, is getting a better idea of what we're dealing with.


Preliminary Data Exploration


Challenge: How many rows and columns does df_apps have? What are the column names? What does the data look like? Look at a random sample of 5 different rows with .sample()



Solution


Compared to the previous projects, we are working with a fairly large DataFrame this time.

df_apps.shape

tells us we have 10,841 rows and 12 columns.

We can already see that there are some data issues that we need to fix. In the Rating and Type columns there are NaN (not-a-number) values, and in the Price column there are dollar signs that will cause problems.


The .sample(n) method will give us n random rows. This is another handy way to inspect our DataFrame.
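As a minimal sketch of this first look, here is the same inspection on a small made-up DataFrame. The name df_toy and its contents are assumptions for illustration only; the real df_apps has different columns and values.

```python
import pandas as pd

# Toy stand-in for df_apps; column names and values are illustrative only.
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E", "F"],
    "Rating": [4.5, None, 3.9, 4.1, None, 4.8],
    "Price": ["0", "$1.99", "0", "0", "$4.99", "0"],
})

n_rows, n_cols = df_toy.shape         # .shape is a (rows, columns) tuple
column_names = list(df_toy.columns)   # the column labels
random_rows = df_toy.sample(5)        # 5 different rows, picked at random

print(n_rows, n_cols)
print(column_names)
print(random_rows)
```

Because .sample() draws without replacement by default, the 5 rows are all different, and a new call will usually return a different selection.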

Challenge: Remove the columns called Last_Updated and Android_Ver from the DataFrame. We will not use these columns.


Challenge: How many rows have a NaN (not-a-number) value in the Rating column? Create a DataFrame called df_apps_clean that does not include these rows.






Solution: Dropping Unused Columns and Removing NaN Values


To remove the unwanted columns, we simply provide a list of the column names ['Last_Updated', 'Android_Ver'] to the .drop() method. By setting axis=1 we specify that we want to drop columns rather than rows.
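A minimal sketch of dropping columns, using a made-up df_toy (an assumption for illustration) that only borrows the two column names from the lesson:

```python
import pandas as pd

# Toy frame; only the two column names are taken from the lesson.
df_toy = pd.DataFrame({
    "App": ["A", "B"],
    "Last_Updated": ["January 1, 2018", "March 3, 2018"],
    "Android_Ver": ["4.1 and up", "5.0 and up"],
})

# axis=1 tells drop() that the labels are column names.
# drop() returns a new DataFrame and leaves df_toy unchanged.
df_trimmed = df_toy.drop(["Last_Updated", "Android_Ver"], axis=1)

print(list(df_trimmed.columns))
```

Since .drop() does not modify the DataFrame it is called on, the result has to be assigned to a name, either a new one as here or the original one.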

To find the rows with NaN values, we can create a subset of the DataFrame based on where .isna() evaluates to True for the Rating column. We see that NaN values in Rating are associated with no reviews (and no installs). That makes sense.
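The filtering step can be sketched like this. The df_toy data is invented for illustration; only the Rating column name comes from the lesson, and the Reviews values are an assumption chosen to mirror the pattern described above.

```python
import pandas as pd
import numpy as np

# Toy frame: two apps have no rating (and, here, no reviews).
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.9, np.nan],
    "Reviews": [120, 0, 45, 0],
})

# Boolean mask: True wherever Rating is NaN.
missing_mask = df_toy["Rating"].isna()

nan_rows = df_toy[missing_mask]   # only the rows with a missing rating
n_missing = missing_mask.sum()    # True counts as 1, so this counts NaN rows

print(nan_rows)
print(n_missing)
```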

We can drop the NaN values with .dropna(), which by default removes every row that contains a NaN in any column:


df_apps_clean = df_apps.dropna()
df_apps_clean.shape

This leaves us with 9,367 entries in our DataFrame. But there may be other problems with the data too:


Challenge: Are there any duplicates in the data? Check for duplicates using the .duplicated() method. How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.





Solution: Finding and Removing Duplicates


There are indeed duplicates in the data. We can show them using the .duplicated() method, which brings up 476 rows:


duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
duplicated_rows.head()


We can actually check for an individual app like Instagram by looking up all the entries with that name in the App column.
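A minimal sketch of that lookup, on a made-up df_toy (the rows and review counts are assumptions for illustration; only the App column name and the Instagram example come from the lesson):

```python
import pandas as pd

# Toy frame with several entries for the same app.
df_toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Maps", "Instagram"],
    "Reviews": [100, 100, 50, 101],
})

# Keep only the rows whose App value is exactly "Instagram".
instagram_rows = df_toy[df_toy["App"] == "Instagram"]

print(instagram_rows)
print(len(instagram_rows))
```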

So how do we get rid of duplicates? Can we simply call .drop_duplicates()?


df_apps_clean = df_apps_clean.drop_duplicates()

Not really. If we do this without specifying how to identify duplicates, we see that 3 copies of Instagram are retained because they have different numbers of reviews. We need to provide the column names that should be used in the comparison to identify duplicates, which .drop_duplicates() accepts through its subset argument.

This leaves us with 8,199 entries after removing duplicates. Huzzah!
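To illustrate the difference between the two calls, here is a sketch on a made-up df_toy. The data and the choice of ["App", "Type"] as comparison columns are assumptions for illustration, not necessarily the columns used to arrive at the count above.

```python
import pandas as pd

# Toy frame: three Instagram entries that differ only in Reviews.
df_toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Maps"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Reviews": [100, 101, 102, 50],
})

# Without subset, rows must match in every column to count as duplicates,
# so all three Instagram rows survive.
naive = df_toy.drop_duplicates()

# With subset, only the listed columns are compared; the first match is kept.
by_columns = df_toy.drop_duplicates(subset=["App", "Type"])

print(len(naive), len(by_columns))
print(by_columns)
```

Which columns belong in subset is a judgment call: too few and distinct apps may be merged, too many and near-identical entries slip through.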



What else should I know about the data?


So we can see that 13 different features were originally scraped from the Google Play Store.



Here's what you would see under an Android app listing on the Google Play Store: