Audio version of the article
What is Data Science?
In recent years the phrase “data science” has become a buzzword in the tech industry. The demand for data scientists has surged since the late 1990s, presenting new job opportunities and research areas for computer scientists. Before we delve into the computer science aspect of data science, it’s useful to know exactly what data science is and to explore the skills required to become a successful data scientist.
Data science is a field of study that involves the processing of large sets of data with statistical methods to extract trends, patterns, or other relevant information. In short, data science encapsulates anything related to obtaining insights, trends, or any other valuable information from data. The foundations of these tasks originate from the fields of statistics, programming, and visualization. In short, a successful data scientist has in-depth knowledge in these four pillars:
- Math and Statistics: From modeling to experimental design, encountering something math-related is inevitable, as data almost always requires quantitative analysis.
- Programming and Database: Knowing how to navigate program data hierarchies, or big data, and query certain datasets alongside knowing how to code algorithms and develop models is invaluable to a data scientist (more on this below).
- Domain Knowledge and Soft Skills: A successful and effective data scientist is knowledgeable about the company or firm at which they are working and proactive at strategizing and/or creating innovative solutions to data issues.
- Communication and Visualization: To make their work viable for all audiences, data scientists must be able to weave a coherent and impactful story through visuals and facts to convey the importance of their work. This is usually completed with certain programming languages or data visualization software, such as Tableau or Excel.
Does Data Science Require Coding?
Short answer: yes. As described in points 2 and 4, coding plays a significant role in data science, making appearances in almost every step of the process. Though, how is coding utilized in every step of solving a data science problem? Below, you’ll find the different stages of a typical data science experiment and a detailed account of how coding is integrated within the process. It’s important to remember that this process is not always linear; data scientists tend to ping-pong back and forth between different steps depending on the nature of the problem at hand.
Preplanning and Experimental Design
Before coding anything, it’s necessary for data scientists to understand the problem that is being solved and the desired objective. This step also requires data scientists to figure out which tools, software, and data be used throughout the process. Although coding is not involved in this phase, it can’t be skipped, as it allows a data scientist to keep his or her focus on their objective and not let white noise or unrelated data or results to distract.
The world has a massive amount of data that is growing constantly. In fact, Forbes reports that humans create 2.5 quintillion bytes of data daily. From such vast amounts of data arise vast amounts of data quality issues. These issues can be anything, ranging from duplicate or missing datasets and values, inconsistent data, misentered data, or even outdated data. Obtaining relevant and comprehensive datasets is tedious and difficult. Oftentimes, data scientists use multiple datasets, pulling the data they need from each one. This step requires coding with querying languages, such as SQL and NoSQL.
After all the necessary data is compiled in one location, the data needs to be cleaned. For example, data which is inconsistently labeled “doctor” or “Dr.” can cause problems when it is analyzed. Labeling errors, minor spelling mistakes, and other minutiae can cause major problems along the road. Data scientists can use languages like Python and R to clean data. They can also use applications, such as OpenRefine or Trifecta Wrangler, which are specifically made to clean data and transform it into different formats.
Once a dataset is clean and uniformly formatted, it is ready to be analyzed. Data analytics is a broad term with definitions that differ from application to application. When it comes to data analysis, Python is ubiquitous in the data science community. R and MATLAB are popular as well, as they were created to be used in data analysis. Though these languages have a steeper learning curve than Python, they are useful for an aspiring data scientist, as they are so widely used. Beyond these languages, there are a plethora of tools available online to help expedite and streamline data analysis.
Visualizing the results of data analysis helps data scientists convey the importance of their work as well as their findings. This can be done done using graphs, charts, and other easy-to-read visuals, which can allow broader audiences to understand a data scientist’s work. Python is a commonly used language for this step; packages such as seaborn and prettyplotlib can help data scientists make visuals. Other software, such as Tableau and Excel, are also readily available and are widely used to create graphics.
Programming Languages used in Data Science
Python is a household name in data science. It can be used to obtain, clean, analyze, and visualize data, and is often considered the programming language that serves as the foundation of data science. In fact, 40% of data scientists who responded to an O’Reilly survey claimed they used Python as their main coding language. The language has contributors that have created libraries solely dedicated to data science operations and extensions into artificial intelligence/machine learning, making it an ideal choice.
Common packages, such as numpy and pandas, can compute complex calculations with matrices of data, making it easier for data scientists to focus on solutions instead of mathematical formulas and algorithms. Even though these packages (along with others, such as sklearn) already take care of the mathematical formulas and calculations, it’s still important to have a solid understanding of said concepts in order to implement the correct procedure through code. Beyond these foundational packages, Python also has many specialized packages that can help with specific tasks.
R and MATLAB are also popular tools used in data science. They are often used for data analysis and can allow for hypothesis testing to validate statistical models. Though these languages have different setups and syntaxes than Python, the basic logic of the former two languages is based off of the latter, further affirming that Python is a keystone language in data science.
Other popular programming languages, such as Java, can be useful for the aspiring data scientist to learn as well. Java is used in a vast number of workplaces, and plenty of tools in the big data realm are written in Java. For example, TensorFlow is a software library that is available for Java. The list of coding languages that are relevant or being used directly in the field of data science goes on and on, just as the benefits of learning a new computing language are endless.
Case Study: Python, MATLAB, and R
- At ForecastWatch, Python was used to write a parser to harvest forecasts from other websites.
- Financial industries leveraged time-series data in MATLAB to backtest statistical models that are used to engineer fund portfolios.
- In 2014, Facebook transitioned to using mostly Python for data analysis since it was already used widely throughout the firm.
- R is widely used in healthcare industries, ranging from drug discovery, pre-clinical trial testing, and drug safety data analysis.
- Sports analysts use R to analyze time-series sports data on certain players in predicting future performances.
Database and Querying
Beyond data analysis, it is imperative to be knowledgeable in querying languages. When obtaining data, data scientists oftentimes navigate multiple databases within different data hierarchies. Languages, such as SQL and its successors, as well as firm-specific cloud navigation systems are key in expediting the data wrangling process. Beyond this, querying languages can also compute basic formulas and operations based on the programmer’s preference.
Case Study: Querying in Data Science
- The U.S. Congress Database is an open source database that can be queried using pSQL, and can answer questions about the demographics of our legislative branch.
- When companies acquire smaller firms or startups, they often run into the issue of navigating multiple databases. To ease the process, SQL is a popular language used to navigate data.
Data Science is Growing
In almost every step of the data science process, programming is used to achieve different goals. As the field intensifies and becomes more complex, data scientists will rely more and more heavily on coding to ensure that they can successfully solve more complex problems. For these reasons, it is integral that aspiring data scientists learn to utilize coding to ensure that they are prepared for any role. Because of the rapid amounts of innovation, the field is constantly expanding and data scientist positions are constantly opening at companies of all sizes and fields. In short, data science and its future are nothing short of exciting!