About This Guide
This guide is highly recommended for students enrolled in HINF 5008 - Computational Methods in Health Informatics.
The resources provided here aim to help you apply computational methods in your research, while learning more about data-mining topics and how to get help with new tools.
This guide focuses specifically on using the open-source statistical programming language R, but more information on other data analytics resources can be found through the Data-Intensive Research LibGuide.
Familiarize Yourself With These Concepts
Getting a better understanding of how the following terms are used will help you as you explore these resources and learn to apply them in your research.
- Data Science
- Big Data
- Machine Learning
- Data Mining
How to read the Data Science Venn Diagram:
The primary colors of data: hacking skills, math and stats knowledge, and substantive expertise
- Data Science is inherently interdisciplinary. Each of these skills is very valuable on its own, but a combination of only two of them is at best simply not data science, and at worst potentially dangerous.
- Data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. This does not require a background in computer science. Being able to manipulate text files at the command line, understanding vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker.
- Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.
- Substance is the next piece. Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Data Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. Substantive expertise plus math and statistics knowledge is where most traditional researchers fall. Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology.
- The "Danger Zone" is where people "know enough to be dangerous," and it is the most problematic area of the diagram. In this area are people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and who probably even know enough R to run a linear regression and report the coefficients, but who lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase "lies, damned lies, and statistics" emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created.
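To make the point about ordinary least squares regression more concrete, here is a minimal sketch of fitting a one-predictor regression and reading its coefficients. It is written in Python, using only the standard library, purely so that it runs without any extra packages; in R the same kind of fit is usually done with the built-in lm() function. The data values, the variable names (hours, heart_rate), and the helper functions (ols_fit, r_squared) are all invented for illustration and do not come from any real study.

```python
# Hypothetical illustration data (not real measurements):
# weekly exercise hours and resting heart rate in beats per minute.
hours = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
heart_rate = [80.0, 77.0, 75.0, 72.0, 70.0, 67.0]


def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_resid / ss_total


intercept, slope = ols_fit(hours, heart_rate)
r2 = r_squared(hours, heart_rate, intercept, slope)

print(f"intercept: {intercept:.2f}")
print(f"slope:     {slope:.2f}")
print(f"R-squared: {r2:.3f}")
```

Reporting these numbers is the easy part. Interpreting them is what separates analysis from the "Danger Zone": the slope is the estimated change in resting heart rate associated with one additional hour of exercise per week, the intercept is the predicted heart rate at zero hours, and R-squared describes how much of the variation the line accounts for. None of these, on their own, shows that exercise causes the change.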