2.3 What is Data Science?

At its core, Data Science is using data to answer questions. It combines mathematics, computer science, and knowledge of the domain environment that the data pertains to. The domain environment is the field that the data belongs to, for example finance, environment, biology, genetics, or psychology.

library(tidyverse)
library(ggforce)

## Data frame: one row per circle in the Venn diagram.
VennDS <- tibble(
  # Center of each circle
  x = c(0, 1, -1),
  y = c(-0.5, 1, 1),
  # Field each circle represents (used only to pick the fill color)
  cat = c("Domain Environment", "Math & Statistics", "Computer Science")
)

## Text labels and their coordinates, one row per label.
VennLabels <- tribble(
  ~x,   ~y,   ~label,
  -1.5, 1,    "Computer Science",
  1.5,  1,    "Math & Statistics",
  1.5,  0.7,  "Knowledge",
  0,    -1,   "Domain Environment",
  0,    0.75, "Data Science",
  0,    1.5,  "Machine",
  0,    1.2,  "Learning",
  -0.9, 0,    "Danger Zone",
  0.9,  0.3,  "Traditional",
  0.9,  0,    "Research"
)

# geom_circle() requires the x0, y0, and r aesthetics.
v1 <- ggplot(VennDS, aes(x0 = x, y0 = y, r = 1.5, fill = cat)) +
  geom_circle(alpha = 0.25, color = "transparent", show.legend = FALSE) +
  # Draw each label once; inherit.aes = FALSE keeps the circle
  # aesthetics (x0, y0, r, fill) out of the text layer.
  geom_text(data = VennLabels,
            aes(x = x, y = y, label = label),
            size = 5,
            inherit.aes = FALSE) +
  # Equal scaling on both axes so the circles are not drawn as ellipses
  coord_fixed() +
  # Remove axis titles, tick labels, and tick marks
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

v1

Figure 2.7: Data Science Venn Diagram (Conway 2010)

The Data Science Venn diagram in Figure 2.7 shows the combination of three fields: computer science, math and statistics, and domain environment (e.g. genetics, finance). The core of Data Science is using data to answer a question. Formulating the right question helps us decide which data to bring in and which to leave out. An example of an ineffective question is “What are the sales for company abc?” It names neither a time frame nor a level of detail. An effective version is specific about both: “What are the quarterly sales for the last 5 years for company abc? What are the top 10 products by sales, profit, or units sold in each month of the last year?”
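
A well-formed question maps almost directly onto code. The following is a minimal sketch, assuming a hypothetical `sales` table with `date`, `product`, and `revenue` columns and an assumed reporting date `as_of`. The table, its values, and all object names are invented for illustration and do not describe a real company.

```r
library(dplyr)

# Hypothetical transaction records, invented purely for illustration
sales <- tibble(
  date    = as.Date(c("2017-06-30", "2022-11-03", "2023-01-15", "2023-02-10",
                      "2023-02-24", "2023-04-05", "2023-05-20", "2023-08-01")),
  product = c("A", "C", "A", "B", "A", "A", "C", "B"),
  revenue = c(999, 80, 100, 200, 40, 150, 50, 300)
)

# Assumed reporting date; both time windows are measured back from here
as_of      <- as.Date("2023-12-31")
five_years <- seq(as_of, by = "-5 years", length.out = 2)[2]
one_year   <- seq(as_of, by = "-1 year",  length.out = 2)[2]

# "What are the quarterly sales for the last 5 years?"
quarterly_sales <- sales %>%
  filter(date > five_years, date <= as_of) %>%
  mutate(year = format(date, "%Y"), quarter = quarters(date)) %>%
  group_by(year, quarter) %>%
  summarise(revenue = sum(revenue), .groups = "drop")

# "What are the top 10 products by sales in each month of the last year?"
top_products <- sales %>%
  filter(date > one_year, date <= as_of) %>%
  mutate(month = format(date, "%Y-%m")) %>%
  group_by(month, product) %>%
  summarise(revenue = sum(revenue), .groups = "drop") %>%
  group_by(month) %>%
  slice_max(revenue, n = 10) %>%
  ungroup()

quarterly_sales
top_products
```

The ineffective question gives no such mapping: without a time window or a grouping, there is nothing to put in `filter()` or `group_by()`.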

Once the question is formulated, the basic routine in R/RStudio is as follows:

  1. Grabbing the data
  2. Wrangling the data and making it tidy
  3. Exploratory Analysis
  4. Reproducible Research
  5. Statistical Inference
  6. Regression Models
  7. Practical Machine Learning
  8. Insights
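
The routine above can be sketched end to end on a small example. This is a minimal illustration, assuming the built-in `mtcars` data set stands in for real data; the choice of variables, the models, and the 24-row training split are assumptions made for the sketch, not a prescribed workflow.

```r
library(dplyr)

# 1. Grab the data: mtcars ships with R and stands in for a real source
cars_raw <- mtcars

# 2. Wrangle and tidy: move row names into a column, label the transmission
cars <- cars_raw %>%
  tibble::rownames_to_column("model") %>%
  mutate(transmission = factor(am, levels = c(0, 1),
                               labels = c("automatic", "manual")))

# 3. Exploratory analysis: average fuel economy by transmission type
mpg_by_transmission <- cars %>%
  group_by(transmission) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# 4. Reproducible research: keep every step in a script and fix the seed
set.seed(42)

# 5. Statistical inference: do the two groups differ in mean mpg?
mpg_test <- t.test(mpg ~ transmission, data = cars)

# 6. Regression model: how does mpg change with weight?
mpg_fit <- lm(mpg ~ wt, data = cars)

# 7. Practical machine learning: fit on a training split,
#    then measure prediction error on the held-out rows
train_rows <- sample(nrow(cars), size = 24)
fit_train  <- lm(mpg ~ wt + hp, data = cars[train_rows, ])
pred       <- predict(fit_train, newdata = cars[-train_rows, ])
rmse       <- sqrt(mean((cars$mpg[-train_rows] - pred)^2))

# 8. Insights: a negative weight coefficient means heavier cars
#    travel fewer miles per gallon in this data set
mpg_by_transmission
coef(mpg_fit)[["wt"]]
rmse
```

Each numbered comment corresponds to one step of the list; in practice each step is far larger than a single call.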

References

Conway, Drew. 2010. “The Data Science Venn Diagram.” http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.