What is Data Science?

 

The field of data science seems to get bigger and more popular every day. According to LinkedIn, data science was one of the fastest growing fields of work in 2017, and in 2020 Glassdoor ranked data science work as one of the top three jobs in the United States. Given the growing popularity of data science, it's no surprise that more and more people are taking an interest in the industry. But what exactly is data science?

In this article we will define data science, explore how big data and artificial intelligence are changing the field, cover some common data science tools, and look at some examples of data science in practice.

Definition of Data Science

Before we can explore any data science tools or examples, we need a concise definition of data science.

Defining "data science" is actually a bit complicated, because the term is applied to many different tasks and methods of investigation and analysis. We can start by remembering what the term "science" means. Science is the systematic study of the physical and natural world through observation and experimentation, with the aim of advancing human understanding of natural processes. The important words in that definition are "observation" and "understanding".

If data science is the process of understanding the world from patterns in data, then the responsibility of a data scientist is to transform the data, analyze it, and extract those patterns. In other words, a data scientist is given data and uses a variety of tools and techniques to preprocess it (prepare it for analysis) and then analyze it for meaningful patterns.

The role of a data scientist is similar to the role of a traditional scientist. Both analyze data to support or reject hypotheses about how the world operates, trying to make sense of patterns in the data to improve our understanding of the world. Data scientists use the same scientific method as a traditional scientist. A data scientist begins by collecting observations about some phenomenon they would like to study. Then they formulate a hypothesis about the phenomenon in question and try to find data that contradicts their hypothesis.

If the hypothesis is not contradicted by the data, they may be able to build a theory, or a model, of how the phenomenon works, which they can continue to test again and again to see if it holds true for other similar datasets. If a model is robust enough, meaning it explains the patterns well and is not contradicted in other tests, it can also be used to predict future occurrences of that phenomenon.

A data scientist will typically not collect their data through an experiment. They usually do not design experiments with controls and double-blind trials to discover confounding variables that could interfere with a hypothesis. Most of the data analyzed by a data scientist will be data obtained through observational studies and systems, which is one way that the work of a data scientist may differ from that of a traditional scientist, who tends to carry out more experiments.

That said, a data scientist might be called upon to do a form of experimentation called A/B testing, where changes are made to a system that collects data in order to see how the patterns in the data change.
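As a rough illustration of how the result of an A/B test might be evaluated, the sketch below compares the conversion rates of two variants with a two-proportion z-test. The function name and the visitor counts are made up for this example, and the z-test is only one common choice among several; this is a minimal sketch, not a prescribed method.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B (hypothetical helper).

    Returns the z statistic and a two-sided p-value under the null
    hypothesis that both variants share the same conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers only: 1,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference between the variants is unlikely to be due to chance alone, although the threshold used to decide this is a judgment call.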

Regardless of the techniques and tools used, data science ultimately aims to improve our understanding of the world by making sense of data, and data is acquired through observation and experimentation. Data science is the process of using algorithms, statistical principles, and various tools and machines to extract insights from data, insights that help us understand patterns in the world around us.

What do data scientists do?

As you can see, any activity that involves analyzing data in a scientific way can be called data science, which is part of what makes defining data science so difficult. To clarify, let's explore some of the activities a data scientist might perform on a daily basis.

Data science brings together many different disciplines and specialties. Photo: Calvin Andrus via Wikimedia Commons, CC BY SA 3.0 (https://commons.wikimedia.org/wiki/File:DataScienceDisciplines.png)

On any given day, a data scientist might be asked to: create a data storage and retrieval scheme; create ETL (extract, transform, load) data pipelines and clean up data; employ statistical methods; create data visualizations and dashboards; implement artificial intelligence and machine learning algorithms; and make recommendations for data-driven actions.

Let's break down the activities listed above a bit.

Data storage, retrieval, ETL and cleanup

A data scientist may be required to manage the installation of the technologies necessary to store and retrieve data, paying attention to both hardware and software. The person responsible for this work may also be referred to as a "data engineer". However, some companies include these responsibilities under the role of data scientist. A data scientist may also need to create or help create ETL pipelines. Data is very rarely formatted exactly as a data scientist needs it. Instead, the data will need to be extracted in raw form from the data source, transformed into a usable format, and preprocessed (things like standardizing data, eliminating redundancies, and removing corrupted data).
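The extract, transform, and load steps described above can be sketched in a few lines of Python. Everything here is invented for illustration: the sample records, the column names, and the in-memory list standing in for a data store. A real pipeline would read from and write to actual systems.

```python
import csv
import io

# Made-up raw data with inconsistent formatting, a duplicate,
# and two corrupted age values.
RAW_CSV = """name,city,age
 Alice ,new york,34
Bob,Chicago,not_a_number
alice,New York,34
Carol,boston,
Dan,Boston,41
"""

def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize text, drop corrupted rows, remove duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        try:
            age = int(row["age"])
        except (ValueError, TypeError):
            continue  # corrupted or missing age: skip the row
        key = (name, city, age)
        if key in seen:
            continue  # redundant record: skip the row
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned

def load(rows, store):
    """Load: append the cleaned rows to the destination store."""
    store.extend(rows)
    return store

warehouse = load(transform(extract(RAW_CSV)), [])
print(warehouse)
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own.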

Statistical methods

Applying statistics is what turns simply looking at data and interpreting it into a real science. Statistical methods are used to extract relevant patterns from datasets, and a data scientist must have a good understanding of statistical concepts. They must be able to distinguish meaningful correlations from spurious ones by checking for confounding variables. They also need to know the right tools for determining which features in the dataset are important to their model, that is, which ones have predictive power. A data scientist needs to know when to use a regression approach versus a classification approach, and when to care about the mean of a sample versus the median of a sample. A data scientist would not be a scientist without these crucial skills.
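To illustrate why the choice between mean and median matters, the short sketch below uses a made-up list of salaries containing one extreme value. The numbers are purely illustrative.

```python
import statistics

# Hypothetical salaries: five typical values and one extreme outlier.
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 1_000_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # stays near the typical values

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
```

In this sample the mean lands far above what most people earn, while the median remains close to the typical salary, which is why the median is often preferred for skewed data.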

Data visualization

A crucial part of a data scientist's job is to communicate their findings to others. If a data scientist can't communicate their findings effectively, the implications of those findings don't matter. A data scientist should therefore also be an effective storyteller. This means producing visualizations that communicate relevant points about the dataset and the patterns discovered within it. There are a large number of data visualization tools a data scientist could use, and they may visualize data for initial, basic exploration (exploratory data analysis) or to present the results produced by a model.

Business recommendations and applications

A data scientist must have some insight into the requirements and goals of their organization or business. They need to understand these things in order to know which variables and features they should be analyzing, exploring patterns that will help their organization achieve its goals. Data scientists must also be aware of the constraints they are operating under and the assumptions that the organization's leadership is making.

Machine learning and artificial intelligence

Machine learning and artificial intelligence algorithms and models are tools used by data scientists to analyze data, identify patterns within the data, discern the relationships between variables and make predictions about future events.

Traditional Data Science vs. Big Data Science

As data collection methods have become more sophisticated and databases larger, a difference has emerged between traditional data science and "big data" science.

Traditional data analysis and data science are performed with descriptive and exploratory analyses, with the aim of finding patterns and analyzing the performance results of projects. Traditional methods of data analysis often focus only on past and current data. Data analysts often deal with data that has already been cleaned and standardized, while data scientists often deal with complex, dirty data. More advanced data analysis and data science techniques can be used to predict future behavior, although this is more often done with big data, as predictive models often need large amounts of data to be built reliably.

“Big data” refers to data that is too large and complex to be handled with traditional data analysis techniques and tools. Big data is often collected through online platforms, and advanced data transformation tools are used to make large volumes of data ready for inspection by data scientists. As more and more data is collected all the time, more of a data scientist's job involves the analysis of big data.

Data science tools

Common data science tools include tools for storing data, performing exploratory data analysis, modeling data, running ETL, and visualizing data. Platforms like Amazon Web Services, Microsoft Azure, and Google Cloud all offer tools to help data scientists store, transform, analyze, and model data. There are also standalone data science tools such as Airflow (data infrastructure) and Tableau (data visualization and analysis).

The machine learning and artificial intelligence algorithms used to model data are often provided through data science libraries and platforms such as TensorFlow, PyTorch, and Azure Machine Learning Studio. These platforms let data scientists make changes to their datasets, compose machine learning architectures, and train machine learning models.

Other common data science tools and libraries include SAS (for statistical modeling), Apache Spark (for the analysis of streaming data), D3.js (for interactive in-browser visualizations), and Jupyter (for interactive, shareable blocks of code and visualizations).

Photo: Seonjae Jo via Flickr, CC BY SA 2.0 (https://www.flickr.com/photos/130860834@N02/19786840570)

Examples of data science

Examples of data science and its applications are everywhere. Data science has applications in everything from food delivery, to sports, traffic and health. Data is everywhere and therefore data science can be applied to everything.

In terms of food, Uber is investing in Uber Eats, an expansion of its ride-sharing system focused on food delivery. Uber Eats needs to get people their food in a timely manner, while it is still hot and fresh. For this to happen, the company's data scientists must use statistical modeling that takes into account factors such as the distance from restaurants to delivery points, holiday periods, cooking times, and even weather conditions, all with the aim of optimizing delivery times.
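As a toy illustration of the kind of statistical modeling described above, the sketch below fits a straight line relating delivery distance to delivery time using ordinary least squares. The data points are invented and the model uses only a single factor; it does not represent Uber's actual approach, which the article does not describe in detail.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations: distance in km and delivery time in minutes.
distances_km = [1.0, 2.0, 3.0, 4.0, 5.0]
delivery_minutes = [14.0, 18.0, 22.0, 26.0, 30.0]

slope, intercept = fit_line(distances_km, delivery_minutes)

def predict_minutes(distance_km):
    """Predict delivery time for a new distance using the fitted line."""
    return slope * distance_km + intercept

print(f"estimated time for 6 km: {predict_minutes(6.0):.1f} minutes")
```

A realistic model would add further inputs such as cooking time and weather, and would be validated on data that was not used to fit it.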

Sports statistics are used by team managers to determine who the best players are and to form strong, reliable teams that will win matches. A notable example is the data science documented by Michael Lewis in the book Moneyball, in which the general manager of the Oakland Athletics analyzed a variety of statistics to identify quality players who could be signed to the team at relatively low cost.

The analysis of traffic patterns is essential for the creation of self-driving vehicles. Self-driving vehicles must be able to predict the activity around them and respond to changes in road conditions, such as the greater stopping distance required when it rains, as well as the presence of more cars on the road during rush hour. In addition to self-driving vehicles, apps like Google Maps analyze traffic patterns to tell commuters how long it will take to get to their destination using various routes and forms of transportation.

In terms of health data science, computer vision is often combined with machine learning and other artificial intelligence techniques to create image classifiers that can examine things like X-rays, fMRIs, and ultrasounds to see if there are any potential medical problems that show up in the scan. These algorithms can be used to help doctors diagnose disease.

Ultimately, data science covers numerous activities and brings together aspects of different disciplines. However, data science is always concerned with telling interesting stories from data and using data to better understand the world.

