Student UpSourcing 2015
Data Mining Session #2

Deolu Adeleye
17th April, 2015

studentupsourcing.com
@supsourcing

To Be Covered

These are the topics we'll be covering today (hopefully):

  • Observation
  • Confounding Variables
  • Brief Intro to R and RStudio

Observation

“There is nothing more elusive than an obvious fact.”

– Sherlock Holmes

It is my belief that there are signals everywhere, but we just have bad receptors. As such, a lot escapes our attention, especially what we term 'the obvious'.

You can NOT afford to be unobservant in this field!

So why do we need to be observant? Mainly:

  • to detect hidden patterns in data
  • to avoid wrong inferences

Assignment!

Look at the assignment given, and compare your groupings with those of others.

(if you didn't get the assignment, inform me: deoluadeleye@gmail.com)

Confounding Variables

However, as much as we stress observation, there is an extreme we have to watch out for…

Humans generally have an innate ability to figure out, or, in some cases, seek out patterns in things, even where there are none. A lot of people will look at a shapeless cloud in the sky and swear it resembles an animal or someone they know… Many people look at the front of a car and remark that the headlamps and grille form a 'smile' or a 'frown'…

How does this apply to data science? I want you to take a look at the following plot:

[Plot: shoe size and IQ]

Say this was a result of observing a group of people, and the plot you see was the researcher's observed pattern. What would you deduce from this? Easy: that a person's shoe size is directly related to their IQ! Right?

Well…

If you said 'yes', you'd have fallen victim to confounding, which is a surprisingly common problem in data mining and statistics.

Let's investigate this matter more closely. Take a look at this plot:

[Plot]

This makes more intuitive sense, and feels more correct than the first one, doesn't it?

Following on from this more intuitive plot, let's use age instead of shoe size as our independent variable. Now look at both on the same page:

[Plot: both plots on the same page]

By investigating further, we're able to see that the variable that actually connects the two is age. Both shoe size and IQ follow age here, so they rise together without one causing the other.
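If you'd like to see this effect for yourself, here is a small simulation sketch. Everything in it is made up for illustration: the variable names (`age`, `shoe_size`, `score`), the age range and the coefficients are assumptions, not the data behind the plots above, and `score` is only a stand-in for the IQ axis. Both quantities are generated from age alone, so any link between them comes purely through age.

```r
set.seed(42)                         # make the random numbers reproducible

n   <- 200
age <- runif(n, min = 5, max = 15)   # hypothetical children aged 5 to 15

# Assumption: both quantities depend on age only, plus independent noise.
shoe_size <- 0.8 * age + rnorm(n, sd = 0.7)
score     <- 5.0 * age + rnorm(n, sd = 5)

# The raw correlation between shoe size and score looks impressive...
raw_cor <- cor(shoe_size, score)

# ...but it shrinks towards zero once age is accounted for. Here we remove
# the part of each variable explained by age and correlate what is left
# (a partial correlation).
shoe_resid  <- resid(lm(shoe_size ~ age))
score_resid <- resid(lm(score ~ age))
partial_cor <- cor(shoe_resid, score_resid)

print(raw_cor)
print(partial_cor)
```

The first number should come out strongly positive and the second close to zero, which is the signature of a confounder: once age is held fixed, shoe size tells you nothing more about the score.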

Now, it's easy to laugh at the previous examples as being so silly that no one would make such mistakes in real life. NOT SO!

The plot below was published in the New England Journal of Medicine in 2012, where the authors erroneously (almost hilariously so) attempted to prove that countries that consumed more chocolate produced more Nobel laureates, because consuming chocolate could 'hypothetically improve cognitive function'. This very same mistake was made by well-seasoned professionals!

Chocolate Laureates!

So we can see that not only must we be observant to seek out (subtly) hidden patterns in data, we must also be observant to avoid wrong or confounding inferences! If you've not got anything from all the previous slides, please make sure you get this:

\[ OBSERVE!! \]

Brief Intro To R and RStudio

You'll need to install R first, then RStudio, if you haven't done so already. (Linux versions of both are also available at those links.)

After both are installed, open RStudio and you should see a screen that looks like this: [screenshot]

No. 1 is the console - it's where you write your code and input commands, see the results of your input, etc.

So, you may click on your console, and type the following:

print("Hello World!")

…and then press the 'Enter' key. Congratulations! You've just written your first R program!
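The console will evaluate anything you type, not just print(). Here are a few more lines you might try, as a quick sketch; the name `greeting` is just an example I've picked.

```r
2 + 3                       # arithmetic is evaluated immediately
sqrt(16)                    # built-in functions work the same way

greeting <- "Hello World!"  # store a value under a name of your choosing
print(greeting)
nchar(greeting)             # number of characters in the string
```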

No. 2 is the Environment and History window - it's where info about the variables and functions you create is displayed.

So, if you type the following command:

x = 2

and look in the Environment window, voilà! Your variable 'x' appears, with other information about it. This window gives you a very quick and helpful glance at all the variables currently in your workspace.
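What the Environment window shows can also be checked from the console. A short sketch (the names `y` and `z` are just examples): note that most R code you'll meet uses '<-' for assignment, although '=' works too at the top level.

```r
x = 2          # the assignment from above
y <- 3         # '<-' is the more conventional assignment operator in R
z <- x + y

ls()           # names of the objects currently in your workspace
class(z)       # "numeric"

rm(y)          # remove an object; it disappears from the Environment window
exists("y")    # FALSE once 'y' has been removed
```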

If you click the History tab, you'll see a history of all the commands you've typed thus far.

Or, an easier way to do this is to just press the 'Up' and/or 'Down' keys on your keyboard.

No. 3 is the Files, Plots, Packages, Help and Viewer windows.

'Packages' gives you a list of all the packages you currently have on your system. (More on this in a subsequent session)

'Files' lists all the files and folders in the directory of your system that you're currently working from.
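Both of these lists can also be pulled up from the console. A minimal sketch using base R functions only; what you see will depend on your own machine, so no output is shown, and the names `wd`, `here` and `pkgs` are just examples.

```r
wd <- getwd()            # path of the directory you're currently working from
print(wd)

here <- list.files()     # files and folders in that directory
print(here)

# Names of the packages installed on your system
pkgs <- rownames(installed.packages())
head(pkgs)
```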

'Viewer' is…well, let's leave this juicy window for later… ;)

'Help' is where you can find help and additional info about packages and their functions.

You can access help by clicking the 'Help' tab, and then searching in the Search bar to your upper right.

Or, from your console, just put a '?' in front of a function.

So,

?mean

will launch the help file on the mean function.

If you aren't sure of the function name, but want to search for a keyword instead, you can try using two question marks: ??mean
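A few related helpers are worth knowing, sketched below. The question-mark forms are shorthand for help() and help.search(); since those open help pages, I've wrapped them so they only run when you're typing at the console yourself.

```r
args(mean)          # show the arguments a function accepts
apropos("^mean")    # names of available objects that start with "mean"

# These open help pages, so run them only in an interactive session.
if (interactive()) {
  help("mean")         # same as ?mean
  help.search("mean")  # same as ??mean
  example(mean)        # run the examples from the help page
}
```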

The 'Plots' window shows you every plot you generate.

Quickly, let's generate the plot of a straight line, say from 1 to 100.

plot(1:100)

A plot appears in the 'Plots' window. (Not an outstanding plot, I know, but baby steps… ;) )

Notice the colon ':' operator - it is used to automatically create a sequence of values.

Of course, there is also the sequence function 'seq()', but more on that later…

Okay! We've got to know our environment! In the next session, we'll be learning about Data Types and begin playing even more in R!
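For the curious, here is a quick taste of both, as a sketch you can paste into the console. The names `tens` and `quarters` are just examples.

```r
1:10                                   # 1 2 3 ... 10
10:1                                   # the colon also counts downwards

tens <- seq(1, 100, by = 10)           # 1 11 21 ... 91 (steps of 10)
print(tens)

quarters <- seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
print(quarters)

seq_len(5)                             # 1 2 3 4 5
```

The colon always steps by 1, while seq() lets you choose either the step size ('by') or how many values you want ('length.out').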

Till then: Happy Hacking!