We all make small or big decisions every day. We spend a considerable amount of time and effort on making the right ones. The art of decision-making is just that: choosing a course of action in the face of uncertainty. There are two ways to make a decision. One is the intuitive way, based on "gut feeling". The other uses data or information: a logical, scientific way to arrive at the right approach given the available data. This quantitative approach to decision-making is at the heart of Data Science.
Data science and statistics
Statistics is about attaching numbers to such questions so that we can reason about them. Establishing quantitative links to largely qualitative questions is at the heart of statistics. It is also a set of principles and methods for obtaining information to make decisions in the face of uncertainty. Data Science sits at the intersection of computer programming, statistics and the domain in which the analysis is performed. Data scientists solve problems and help someone make a decision based on the available data. So what do they do to achieve this?
- Define the problem
- Collect the right kind of data for the analysis
- Explore the data
- Use various techniques to draw conclusions from the data
- Validate the conclusions and forecasts
To do all of the above, a Data Scientist needs a solid understanding of the domain to which the problem belongs. A data scientist should be able to obtain the data, clean it, read it, analyze it, and apply the appropriate method in a fairly short time. To do this, they must have computer programming skills. Not all of the steps mentioned are performed directly by the data scientist; most are carried out by a computer instructed by them.
A deeper dive into the world of statistics
Traditional statistics has three important measures of central tendency: the mean, the median and the mode. The idea of central tendency is that a single value can best describe the data. Let’s look at it in more detail.
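As a quick illustration, here is a minimal sketch (with made-up numbers) of the three measures, computed with Python’s built-in statistics module:

```python
import statistics

# Made-up sample: nine observations
data = [2, 3, 3, 5, 7, 7, 7, 9, 11]

print(statistics.mean(data))    # 6 – the arithmetic average
print(statistics.median(data))  # 7 – the middle value of the sorted data
print(statistics.mode(data))    # 7 – the most frequent value
```

Beyond central tendency, the statistical toolbox breaks down into a few broad areas: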
- Descriptive statistics. These methods describe the data to us in the form of tables and charts. In effect, we explore the data to find out where the answer lies.
- Sampling in statistics. Sampling is extremely important in statistics. In a sense, a statistic is itself a quantity computed from a sample, from which we can estimate or make forecasts about a whole population.
- Simple random sampling. Each element in the population has an equal chance of being included in the sample. Random samples are usually quite representative as they do not favor certain members.
- Stratified random sampling. The population is first divided into groups (strata). The overall sample consists of some elements from each group, selected at random. A stratified sample ensures that elements from every group are represented, so this method is good when we want some elements from each group.
- Cluster random sampling. The population is first divided into groups, or clusters. The overall sample consists of every element from certain clusters, and the clusters are chosen at random. A cluster sample takes every member of the selected groups, so it works well when each group reflects the entire population. You must also remember that incorrect sampling methods will lead to skewed or biased results. To give an example from machine learning: to train a specific algorithm, we take sample data and train the algorithm on that sample. The effectiveness of the algorithm then depends essentially on the quality of the sample data, and a poor sampling method, such as convenience sampling, may lead to incorrect predictions. A sketch of all three sampling methods follows after this list.
- Inferential statistics. This is statistical inference proper: we draw conclusions about a large population by testing a small sample of data. In practical situations, statistical inference may involve estimating a population parameter or deciding on a parameter’s value. The latter amounts to asking a ‘hypothetical’ question about the population and finding the answer by testing a small sample of the data; a sketch of a simple hypothesis test also follows below.
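Below is a minimal sketch of the three sampling methods, using pandas on a made-up population of 120 students; the column names and sample sizes are illustrative assumptions, not prescriptions.

```python
import pandas as pd

# Made-up population: 120 students spread evenly over six class groups.
population = pd.DataFrame({
    "student_id": range(1, 121),
    "class_group": [f"class_{i % 6}" for i in range(120)],
})

# Simple random sampling: every element has an equal chance of selection.
simple = population.sample(n=20, random_state=42)

# Stratified random sampling: a few randomly chosen elements from every group.
stratified = population.groupby("class_group").sample(n=3, random_state=42)

# Cluster random sampling: pick whole groups at random, keep all their elements.
clusters = pd.Series(population["class_group"].unique()).sample(n=2, random_state=42)
cluster = population[population["class_group"].isin(clusters)]

print(len(simple), len(stratified), len(cluster))  # 20, 18, 40
```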
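And here is a minimal sketch of inferential statistics in the hypothesis-testing sense, assuming SciPy is available; the measurements and the hypothesized mean of 6.0 are made up for illustration.

```python
from scipy import stats

# Made-up sample of eight measurements drawn from some population.
sample = [5.1, 6.4, 5.8, 7.0, 6.2, 5.5, 6.8, 6.1]

# "Hypothetical" question: is the population mean equal to 6.0?
# H0: mean == 6.0, H1: mean != 6.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=6.0)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p-value means the sample gives us no reason to reject H0.
```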
This is just the tip of the iceberg. The next level is using statistics to predict outcomes, and that is when we enter the world of machine learning.
Until now, we have used data from the past to understand, to infer and to predict the future. To do this well, we need to understand another important topic, probability theory, which deals with the likelihood of an event occurring. What matters is not only what happened in the past, but also the probability of it happening again in the future.
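As a closing illustration, here is a minimal sketch of the frequentist idea behind that statement: estimating the probability of an event from how often it occurs over many repeated trials (a simulated fair coin is assumed here).

```python
import random

random.seed(0)
trials = 100_000

# Simulate a repeatable event: a fair coin landing heads.
heads = sum(random.random() < 0.5 for _ in range(trials))

# The relative frequency approaches the true probability, 0.5.
print(heads / trials)
```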