Data Science

Lesson 7

Gathering Data

***Original Tutorial:***
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

OVERVIEW

Your analysis and summary are only as good as the data you've collected. To tackle any problem, you'll need to gather data from whatever sources are available.

If you're working for a company, it will often provide the data for you. In many of these cases the data is raw and dirty. As a data scientist, you'll learn how to filter this data, clean it up, and make sure it's usable for your models.
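As a rough illustration of what "raw and dirty" can mean, here is a minimal sketch using pandas. The table is made up for this example (the column names `name`, `age`, and `city` and all values are hypothetical, not taken from the lesson's dataset), and filling gaps with the median is just one possible cleaning choice, not the method this lesson prescribes.

```python
import io

import pandas as pd

# Hypothetical raw data: made-up rows, used only to illustrate "dirty" input.
raw_csv = io.StringIO(
    "name,age,city\n"
    "Ann,34,Oslo\n"
    "Bob,,Lima\n"
    "Ann,34,Oslo\n"
    "Cy,51,\n"
)
raw = pd.read_csv(raw_csv)

# Count missing values per column to see how dirty the data is.
missing_per_column = raw.isna().sum()

# Drop exact duplicate rows, then fill missing ages with the median age.
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

print(missing_per_column)
print(clean)
```

Note that the `city` column still contains a missing value after this step; how to treat each column is a decision you make per dataset.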

Think of it like this: if you were a competitive athlete, you wouldn't feed your body fast food; you'd make sure you were eating clean, nutritious meals. The same goes for our models: we want to give them only the best data. Feeding them dirty data can be detrimental to the results we want.

NOTE

>- For this exercise, you will be provided with the dataset, which you can find in the same directory as the notebook you are currently in.
>- There are two files. 'titanic_train.csv' is the file we'll use to train our model (feeding our athlete good food), and 'titanic_test.csv' is the file we'll use to test our model (observing how our athlete performs).

In [ ]:

import pandas as pd

# Load the training set from the notebook's directory
data = pd.read_csv("./titanic_train.csv")
# Preview the first five rows
data.head()

*Observe the output of the training data above.

This is a sample of what the data looks like. We will use some of these attributes to predict whether a new passenger would have survived if the circumstances of the Titanic were ever repeated.*
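Beyond looking at the first few rows, it can help to check each attribute's type and how many of its values are missing. The sketch below defines a small helper, `summarize`, which is a hypothetical name introduced here and not part of pandas. It is demonstrated on a stand-in frame with made-up values; the column names `Survived`, `Age`, and `Sex` are assumptions about what the training file contains.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype and how many values are missing."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


# Stand-in frame with made-up values (assumed column names), so the
# example runs without the CSV file.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, None, 35.0],
    "Sex": ["male", "female", "female"],
})

print(sample.shape)  # (number of rows, number of columns)
print(summarize(sample))
```

In the notebook, the same helper could be called as `summarize(data)` on the DataFrame loaded from the training file.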

In [ ]: