Data Science Learning Roadmap
From statistics to machine learning and AI, master the art of data analysis
Duration: 32 weeks | 3 steps | 37 topics
Career Opportunities
- Data Scientist
- Machine Learning Engineer
- Data Analyst
- AI Researcher
Step 1: Python & Statistics
Learn Python programming and essential statistical concepts for data analysis
Time: 8 weeks | Level: beginner
- Python Syntax & Data Types (required) — Variables, data types (int, float, str, bool), operators, type casting, and basic I/O
- Python uses dynamic typing; variables do not need explicit type declarations
- Understand core data types: int, float, str, bool, and None
- Use input() for user input and print() with f-strings for formatted output
- Python uses indentation (not braces) to define code blocks
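The points above can be sketched in a few lines; the variable names and values here are purely illustrative:

```python
# Dynamic typing: types are inferred at runtime, no declarations needed.
x = 42            # int
pi = 3.14159      # float
name = "Ada"      # str
ready = True      # bool
nothing = None    # NoneType

# Type casting converts between types explicitly.
age = int("30")       # str -> int
price = float("9.99") # str -> float
label = str(100)      # int -> str

# f-strings embed expressions directly in formatted output.
print(f"{name} is {age}; pi rounded is {pi:.2f}")
```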
- Control Flow & Functions (required) — if/elif/else, for loops, while loops, function definitions, arguments, and return values
- Use if/elif/else for branching logic and for/while loops for iteration
- Define reusable functions with def; use default parameters and *args/**kwargs for flexibility
- Understand variable scope: local variables inside functions vs global variables
- Use break, continue, and else clauses on loops for advanced flow control
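A minimal sketch tying these ideas together; the functions and inputs are invented for illustration:

```python
def describe(n):
    """Classify an integer using if/elif/else branching."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

def summarize(*args, sep=", ", **kwargs):
    """*args collects positional values; **kwargs collects keyword pairs."""
    parts = [str(a) for a in args]
    parts += [f"{k}={v}" for k, v in kwargs.items()]
    return sep.join(parts)

def first_even(numbers):
    """Demonstrates break plus the loop's else clause."""
    found = None
    for n in numbers:
        if n % 2 == 0:
            found = n
            break
    else:  # runs only when the loop finished without hitting break
        found = -1
    return found
```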
- Lists, Tuples, Dicts & Sets (required) — Core collection types, their methods, iteration patterns, and when to use each
- Lists are ordered and mutable; tuples are ordered and immutable (use for fixed data)
- Dictionaries store key-value pairs with O(1) lookup; use for structured data and caching
- Sets store unique elements and support mathematical operations (union, intersection, difference)
- Choose the right collection: list for ordered sequences, dict for mappings, set for uniqueness checks
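A quick sketch of each collection in action, with made-up data:

```python
scores = [88, 92, 75]          # list: ordered and mutable
scores.append(90)

point = (3.0, 4.0)             # tuple: ordered and immutable

ages = {"ann": 31, "bob": 27}  # dict: key-value pairs, O(1) lookup
ages["cara"] = 45

# Sets: unique elements with mathematical operations.
a = {1, 2, 3}
b = {2, 3, 4}
union = a | b                  # union: {1, 2, 3, 4}
inter = a & b                  # intersection: {2, 3}
diff = a - b                   # difference: {1}
```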
- File I/O & String Processing (required) — Reading and writing files, CSV handling, string methods, and text manipulation
- Use the 'with' statement for file operations to ensure proper resource cleanup
- Read files with read(), readline(), or readlines(); write with write() or writelines()
- Use the csv module or pandas for structured CSV data processing
- Master string methods: split(), join(), strip(), replace(), find(), and format()/f-strings
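A small round-trip example with the `csv` module and the `with` statement; the file contents are invented, and a temporary directory stands in for a real data path:

```python
import csv
import tempfile
from pathlib import Path

tmpdir = tempfile.mkdtemp()
path = Path(tmpdir) / "people.csv"

# Write structured rows; 'with' guarantees the file is closed.
rows = [{"name": "Ann", "age": "31"}, {"name": "Bob", "age": "27"}]
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back as a list of dicts.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

# Basic string processing on the raw text.
text = path.read_text()
first_line = text.strip().split("\n")[0]
```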
- Object-Oriented Python (required) — Classes, objects, inheritance, encapsulation, polymorphism, and magic methods
- Classes bundle data (attributes) and behavior (methods) into reusable blueprints
- Use __init__ to initialize object state; 'self' refers to the current instance
- Inheritance lets child classes extend or override parent behavior; favor composition when appropriate
- Magic methods (__str__, __repr__, __len__, __eq__) integrate your objects with Python's built-in operations
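A compact sketch of a class hierarchy with `__init__`, inheritance, and magic methods; the shape classes are a standard teaching example, not from any particular library:

```python
class Shape:
    def __init__(self, name):
        self.name = name

    def area(self):
        raise NotImplementedError

    def __repr__(self):
        # Integrates with print() and the REPL.
        return f"{self.name}(area={self.area():.2f})"

class Rectangle(Shape):
    def __init__(self, w, h):
        super().__init__("Rectangle")  # extend parent initialization
        self.w, self.h = w, h

    def area(self):
        # Overrides (polymorphism): same call, subclass-specific behavior.
        return self.w * self.h

    def __eq__(self, other):
        # Value-based equality via a magic method.
        return isinstance(other, Rectangle) and (self.w, self.h) == (other.w, other.h)
```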
- Descriptive Statistics (required) — Mean, median, mode, variance, standard deviation, percentiles, and data distribution shapes
- Mean, median, and mode measure central tendency; choose based on data distribution and outliers
- Variance and standard deviation quantify how spread out data points are from the mean
- Understand skewness (left/right) and kurtosis (tail heaviness) to describe distribution shapes
- Use percentiles and box plots to summarize data ranges and identify outliers
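The standard library's `statistics` module covers all of these measures; the sample data below is invented:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # 5.0
median = statistics.median(data)       # 4.5 (average of the two middle values)
mode = statistics.mode(data)           # 4 (most frequent)

variance = statistics.pvariance(data)  # population variance: 4.0
stdev = statistics.pstdev(data)        # population std dev: 2.0

# Quartiles via quantiles(n=4) -> Q1, Q2 (median), Q3; IQR flags outliers.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```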
- Probability Theory (required) — Probability distributions, Bayes' theorem, conditional probability, and expected value
- Probability ranges from 0 to 1; understand independent vs dependent events
- Know key distributions: normal (bell curve), binomial (yes/no trials), Poisson (event counts)
- Bayes' theorem updates the probability of a hypothesis given new evidence: P(A|B) = P(B|A)*P(A)/P(B)
- Conditional probability P(A|B) is the probability of A given that B has occurred
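Bayes' theorem is easiest to see in a worked example. The diagnostic-test numbers below (1% prevalence, 99% sensitivity, 95% specificity) are hypothetical and chosen for illustration:

```python
# P(disease), P(positive | disease), P(positive | healthy)
p_disease = 0.01
p_pos_given_disease = 0.99   # sensitivity
p_pos_given_healthy = 0.05   # 1 - specificity (false positive rate)

# Law of total probability: P(positive) over both cases.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes: P(disease | positive) = P(pos | disease) * P(disease) / P(pos)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# Despite the accurate test, the posterior is only ~16.7% because the
# disease is rare — the classic base-rate result.
```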
- Hypothesis Testing (required) — t-tests, chi-square tests, p-values, confidence intervals, and statistical significance
- Formulate null (H0) and alternative (H1) hypotheses before collecting data
- The p-value is the probability of observing your results (or more extreme) if H0 is true; reject H0 if p < alpha (typically 0.05)
- Use t-tests for comparing means, chi-square for categorical data, and ANOVA for multiple groups
- Confidence intervals provide a range of plausible values for a population parameter (e.g., 95% CI)
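A one-sample t-test computed by hand with the standard library (in practice you would use `scipy.stats.ttest_1samp`); the sample values are invented, and the critical value 2.365 for df = 7 is taken from a standard t-table:

```python
import math
import statistics

# H0: population mean is mu0 = 100; H1: it is not.
sample = [102, 98, 105, 103, 99, 104, 101, 100]
mu0 = 100

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)      # sample std dev (n-1 denominator)
se = s / math.sqrt(n)             # standard error of the mean
t_stat = (mean - mu0) / se

# 95% confidence interval using the t critical value for df = 7.
t_crit = 2.365
ci = (mean - t_crit * se, mean + t_crit * se)
# Here the CI contains mu0, so we fail to reject H0 at alpha = 0.05.
```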
- NumPy Fundamentals (recommended) — Arrays, broadcasting, vectorized operations, and numerical computing with NumPy
- NumPy arrays are faster than Python lists for numerical computation due to contiguous memory and C-level operations
- Broadcasting allows operations on arrays of different shapes without explicit looping
- Use vectorized operations (element-wise math, np.sum, np.mean) instead of Python for-loops for performance
- Master array slicing, reshaping, and indexing for efficient data manipulation
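A short sketch of broadcasting and vectorized reductions, with toy arrays:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)
row = np.array([10.0, 20.0])            # shape (2,)

# Broadcasting: 'row' is stretched across each row of 'a' — no loop needed.
shifted = a + row                       # [[11, 22], [13, 24]]

# Vectorized reductions instead of Python for-loops.
col_means = a.mean(axis=0)              # per-column means: [2.0, 3.0]
total = a.sum()                         # 10.0

# Reshaping gives a flat view of the same data.
flat = a.reshape(-1)                    # [1, 2, 3, 4]
```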
- Virtual Environments & pip (recommended) — Creating isolated environments with venv, managing packages with pip, and requirements.txt
- Virtual environments isolate project dependencies so different projects can use different package versions
- Create with 'python -m venv .venv' and activate before installing packages
- Use 'pip freeze > requirements.txt' to capture and 'pip install -r requirements.txt' to reproduce environments
- Regular Expressions in Python (optional) — Pattern matching, the re module, and common text processing use cases
- Use re.search(), re.match(), re.findall(), and re.sub() for pattern matching and replacement
- Raw strings (r'pattern') avoid double-escaping backslashes in regex patterns
- Master groups, quantifiers, and character classes for extracting structured data from unstructured text
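A small sketch of groups, character classes, and substitution; the log string is made up:

```python
import re

log = "2024-05-01 ERROR disk full; 2024-05-02 INFO ok"

# Raw string avoids double-escaping; groups capture date and level.
pattern = r"(\d{4}-\d{2}-\d{2}) (ERROR|INFO)"
matches = re.findall(pattern, log)

# re.sub replaces every match; re.search finds only the first.
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", log)
first = re.search(r"ERROR", log)
```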
- Python Comprehensions & Generators (optional) — List/dict/set comprehensions, generator expressions, and the yield keyword
- List comprehensions are a concise, Pythonic way to create lists: [expr for item in iterable if condition]
- Dict and set comprehensions follow the same pattern: {k: v for ...} and {expr for ...}
- Generators produce items lazily with yield, using constant memory for large datasets
- Use generator expressions (parentheses instead of brackets) for memory-efficient pipelines
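All four forms side by side, with invented data:

```python
# List, dict, and set comprehensions.
squares = [n * n for n in range(6) if n % 2 == 0]  # [0, 4, 16]
lengths = {w: len(w) for w in ["a", "bb", "ccc"]}  # {'a': 1, 'bb': 2, 'ccc': 3}
unique = {c for c in "banana"}                     # {'b', 'a', 'n'}

def countdown(n):
    """Generator: yields n, n-1, ..., 1 lazily, one item at a time."""
    while n > 0:
        yield n
        n -= 1

# Generator expression (parentheses): constant memory, consumed by sum().
total = sum(x * x for x in countdown(3))           # 9 + 4 + 1 = 14
```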
Step 2: Data Analysis & Visualization
Master data analysis tools and create compelling visualizations with Python libraries
Time: 10 weeks | Level: intermediate
- Pandas DataFrames (required) — Creating, indexing, selecting, filtering, and manipulating DataFrames and Series
- DataFrames are 2D labeled tables; Series are 1D labeled arrays — the core Pandas structures
- Select data with loc (label-based) and iloc (position-based) indexing
- Filter rows with boolean indexing: df[df['column'] > value]
- Use groupby(), merge(), concat(), and pivot_table() for aggregation and reshaping
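The selection and aggregation patterns above in one sketch; the sales table is invented:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [100, 120, 80, 95],
})

row0 = df.iloc[0]                   # position-based: first row
nyc = df.loc[df["city"] == "NYC"]   # label-based selection with a boolean mask
big = df[df["sales"] > 90]          # boolean indexing

# groupby + aggregation: total sales per city (a Series indexed by city).
per_city = df.groupby("city")["sales"].sum()
```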
- Data Cleaning & Preprocessing (required) — Handling missing values, duplicates, outliers, type conversions, and data normalization
- Identify and handle missing values with isnull(), dropna(), and fillna() (mean, median, forward fill)
- Remove duplicates with drop_duplicates() and standardize column names and types with astype()
- Detect outliers using IQR, z-scores, or visual methods (box plots) and decide whether to cap, remove, or keep them
- Normalize and standardize numerical features so they are on comparable scales for analysis
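A cleaning pass over a deliberately messy toy table — one missing value, one duplicate row, and one implausible outlier:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 31, 200],  # NaN plus an implausible 200
    "city": ["NY", "LA", "LA", "LA", "NY"],
})

# Drop exact duplicate rows (removes the repeated (31, 'LA') row).
df = df.drop_duplicates()

# Impute missing ages with the median.
df["age"] = df["age"].fillna(df["age"].median())

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
```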
- Exploratory Data Analysis (required) — Systematic approach to understanding data through summary statistics, distributions, and correlations
- Start with df.info(), df.describe(), and df.shape to understand data types, ranges, and completeness
- Use histograms, box plots, and KDE plots to visualize distributions of individual variables
- Compute correlation matrices and visualize with heatmaps to find relationships between features
- Document insights and hypotheses as you explore; EDA guides feature engineering and model selection
- Matplotlib Fundamentals (required) — Figure and axes objects, plot types, customization, and publication-quality figures
- Use the object-oriented API (fig, ax = plt.subplots()) for full control over figure layout
- Know common plot types: line, bar, scatter, histogram, pie, and when each is appropriate
- Customize titles, labels, legends, colors, and annotations to tell a clear data story
- Use subplots to compare multiple views side by side in a single figure
- Seaborn Statistical Visualization (required) — Statistical plotting with Seaborn: distribution plots, regression plots, categorical plots, and themes
- Seaborn builds on Matplotlib with higher-level functions for statistical visualization
- Use displot/histplot for distributions, scatterplot/regplot for relationships, and catplot for categorical comparisons
- Passing a correlation matrix to heatmap() reveals feature relationships at a glance
- Apply built-in themes (set_theme, set_style) and color palettes for polished, consistent visuals
- SQL for Data Scientists (required) — Writing queries to extract, filter, aggregate, and join data from relational databases
- Master SELECT, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT for data extraction
- Use JOIN (INNER, LEFT, RIGHT) to combine tables and subqueries for complex filtering
- Aggregate functions (COUNT, SUM, AVG, MIN, MAX) summarize data across groups
- Window functions (ROW_NUMBER, RANK, LAG, LEAD) compute values across rows without collapsing groups
- Plotly & Interactive Dashboards (recommended) — Creating interactive charts with Plotly and building dashboards with Dash or Streamlit
- Plotly creates interactive, web-based charts with hover tooltips, zoom, and pan out of the box
- plotly.express provides high-level functions (px.scatter, px.bar, px.line) for quick interactive plots
- Dash lets you build full web dashboard applications in Python with callbacks for interactivity
- Use Streamlit as a lightweight alternative for rapid data app prototyping with minimal code
- Feature Engineering (recommended) — Creating new features, encoding categorical variables, scaling, and feature selection techniques
- Create new features from existing data: date parts, ratios, binning, polynomial features
- Encode categorical variables with one-hot encoding, label encoding, or target encoding
- Scale numerical features with StandardScaler (z-score) or MinMaxScaler (0-1 range) depending on the algorithm
- Use feature selection methods (correlation, mutual information, recursive feature elimination) to reduce dimensionality
- Time Series Data (recommended) — Working with datetime data, resampling, rolling windows, and trend/seasonality decomposition
- Convert columns to datetime with pd.to_datetime() and set as index for time-based operations
- Resample time series to different frequencies (daily to monthly) with .resample()
- Apply rolling windows (.rolling()) for moving averages and smoothing noisy data
- Decompose time series into trend, seasonal, and residual components with statsmodels
- Web Scraping (BeautifulSoup, Requests) (recommended) — Fetching web pages, parsing HTML, extracting data, and handling pagination and rate limiting
- Use the requests library to fetch web pages and BeautifulSoup to parse the HTML
- Navigate the HTML tree with find(), find_all(), and CSS selectors to extract specific data
- Respect robots.txt, add delays between requests, and set a User-Agent header to scrape responsibly
- Store scraped data in CSV or directly into a Pandas DataFrame for immediate analysis
- Geospatial Data (optional) — Working with geographic data, mapping with Folium and GeoPandas, and spatial analysis
- GeoPandas extends Pandas with geometry columns for spatial operations (intersections, buffers, distances)
- Use Folium to create interactive Leaflet maps with markers, choropleths, and heatmaps in Python
- Understand coordinate reference systems (CRS) and when to reproject data for accurate spatial calculations
- Advanced Pandas (MultiIndex, pivot) (optional) — Hierarchical indexing, pivot tables, method chaining, and performance optimization
- MultiIndex (hierarchical index) enables multi-level grouping and cross-sectional slicing with xs()
- pivot_table() creates spreadsheet-style summaries with aggregation functions across row and column groups
- Method chaining (.assign().query().groupby()...) produces readable, pipeline-style transformations
- Use categorical dtypes and eval()/query() for memory and speed optimizations on large datasets
Step 3: Machine Learning Fundamentals
Learn the basics of machine learning algorithms and their applications
Time: 14 weeks | Level: advanced
- Supervised Learning Overview (required) — Classification vs regression, training/test splits, bias-variance tradeoff, and the ML workflow
- Supervised learning maps labeled inputs to outputs; classification predicts categories, regression predicts continuous values
- Always split data into training and test sets (typically 80/20) to evaluate generalization
- The bias-variance tradeoff: high bias underfits (too simple), high variance overfits (too complex)
- The ML workflow: collect data, clean/preprocess, engineer features, train, evaluate, iterate
- Linear & Logistic Regression (required) — Linear regression for continuous prediction, logistic regression for binary classification, and their assumptions
- Linear regression finds the best-fit line by minimizing the sum of squared residuals (ordinary least squares)
- Key assumptions: linearity, independence, homoscedasticity (constant variance), and normally distributed errors
- Logistic regression uses the sigmoid function to output probabilities for binary classification
- Regularization (L1/Lasso, L2/Ridge) penalizes large coefficients to prevent overfitting
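Ordinary least squares can be solved directly from the normal equations — an educational sketch on synthetic data (in practice use `np.linalg.lstsq` or scikit-learn's `LinearRegression`; logistic regression swaps the linear output for a sigmoid and minimizes cross-entropy instead):

```python
import numpy as np

# Synthetic data with known truth: y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

# Add a bias column, then solve (X^T X) beta = X^T y.
Xb = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
intercept, slope = beta  # should recover roughly 1.0 and 2.0
```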
- Decision Trees & Random Forests (required) — Tree-based splitting, pruning, random forest ensembles, and feature importance
- Decision trees split data on features that maximize information gain (Gini impurity or entropy)
- Trees are prone to overfitting; control depth, min_samples_split, and min_samples_leaf to prune
- Random forests aggregate many decorrelated trees (bagging + feature randomness) for robust predictions
- Use feature_importances_ to understand which features drive the model's decisions
- Model Evaluation & Validation (required) — Cross-validation, accuracy, precision, recall, F1, ROC-AUC, and the confusion matrix
- Accuracy alone is misleading on imbalanced datasets; use precision, recall, and F1-score
- The confusion matrix shows true/false positives/negatives — the foundation for all classification metrics
- K-fold cross-validation rotates the test set across k folds for a more reliable performance estimate
- ROC-AUC measures the model's ability to distinguish classes across all classification thresholds
- Scikit-learn Pipeline (required) — Building reproducible ML workflows with Pipeline, ColumnTransformer, and custom transformers
- Pipeline chains preprocessing steps and the estimator into a single object that prevents data leakage
- ColumnTransformer applies different transformations to numerical and categorical columns in parallel
- Pipelines integrate seamlessly with cross_val_score and GridSearchCV for end-to-end validation
- Build custom transformers by implementing fit() and transform() with BaseEstimator and TransformerMixin
- Unsupervised Learning (required) — K-Means clustering, PCA dimensionality reduction, DBSCAN, and discovering hidden patterns
- K-Means partitions data into k clusters by minimizing within-cluster distances; use the elbow method to choose k
- PCA reduces dimensionality by projecting data onto principal components that capture the most variance
- DBSCAN finds clusters of arbitrary shape and automatically identifies noise points (outliers)
- Unsupervised learning is used for customer segmentation, anomaly detection, and dimensionality reduction
- Neural Networks Basics (required) — Perceptrons, activation functions, forward pass, backpropagation, and gradient descent
- A neural network is layers of neurons; each neuron computes a weighted sum, adds a bias, and applies an activation function
- Activation functions (ReLU, sigmoid, softmax) introduce non-linearity, enabling the network to learn complex patterns
- Backpropagation calculates gradients of the loss with respect to each weight using the chain rule
- Gradient descent iteratively updates weights to minimize the loss function; learning rate controls step size
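The forward pass, chain-rule gradients, and gradient descent updates can be seen end to end in a single sigmoid neuron. This is a toy sketch on an invented threshold task (is x > 0.5?) with hand-picked hyperparameters, not a production setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary problem: label is 1 when x > 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = (X[:, 0] > 0.5).astype(float)

w, b, lr = 0.0, 0.0, 1.0
for _ in range(2000):
    z = X[:, 0] * w + b                 # forward pass: weighted sum + bias
    p = sigmoid(z)                      # activation -> probability
    # Gradient of mean cross-entropy loss w.r.t. z is (p - y)/n; the
    # chain rule then gives the gradients for w and b.
    grad_z = (p - y) / len(y)
    w -= lr * np.sum(grad_z * X[:, 0])  # gradient descent update
    b -= lr * np.sum(grad_z)

accuracy = np.mean((sigmoid(X[:, 0] * w + b) > 0.5) == (y == 1.0))
```

The learning rate `lr` controls the step size exactly as the bullet above describes: too small and convergence crawls, too large and the updates overshoot.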
- Deep Learning with TensorFlow/Keras (recommended) — Building, training, and evaluating deep neural networks with the Keras high-level API
- Keras provides a high-level Sequential and Functional API for building neural networks with minimal code
- Define layers (Dense, Conv2D, LSTM), compile with an optimizer and loss function, then fit on data
- Use callbacks (EarlyStopping, ModelCheckpoint, ReduceLROnPlateau) to control training behavior
- Monitor training/validation loss curves to diagnose overfitting and adjust architecture or regularization
- Natural Language Processing Intro (recommended) — Text preprocessing, tokenization, TF-IDF, word embeddings, and sentiment analysis basics
- Text preprocessing: lowercasing, removing punctuation/stopwords, stemming, and lemmatization
- TF-IDF converts text to numerical vectors by weighting term frequency against document frequency
- Word embeddings (Word2Vec, GloVe) capture semantic relationships in dense vector representations
- Pre-trained transformer models (BERT, GPT) achieve state-of-the-art results; fine-tune with Hugging Face
- Ensemble Methods (recommended) — XGBoost, gradient boosting, bagging, stacking, and combining models for better performance
- Bagging (Random Forest) trains models on random subsets in parallel; boosting trains them sequentially to correct errors
- XGBoost and LightGBM are optimized gradient boosting libraries that dominate tabular data competitions
- Stacking uses the predictions of base models as features for a meta-learner to combine their strengths
- Ensemble methods often outperform individual models by reducing variance and/or bias
- Hyperparameter Tuning (recommended) — GridSearchCV, RandomizedSearchCV, Bayesian optimization, and efficient search strategies
- Hyperparameters (learning rate, max_depth, n_estimators) are set before training and control model complexity
- GridSearchCV exhaustively searches all parameter combinations; slow but thorough for small grids
- RandomizedSearchCV samples random combinations, offering better efficiency for large search spaces
- Bayesian optimization (Optuna, Hyperopt) uses past results to intelligently explore the search space
- Model Deployment (optional) — Serving models via Flask APIs, building Streamlit apps, and tracking with MLflow
- Save trained models with joblib or pickle; load them in a web server to serve predictions
- Flask provides lightweight REST APIs: expose a /predict endpoint that accepts JSON input
- Streamlit turns Python scripts into interactive web apps for model demos and data exploration
- MLflow tracks experiments (parameters, metrics, artifacts) and manages model versioning and deployment
- Computer Vision Basics (optional) — Image classification, convolutional neural networks, transfer learning, and data augmentation
- CNNs use convolutional layers to detect spatial features (edges, textures, objects) in images
- Transfer learning leverages pre-trained models (ResNet, VGG, EfficientNet) by fine-tuning on your dataset
- Data augmentation (rotation, flip, zoom, crop) artificially increases training set size and reduces overfitting
- Use ImageDataGenerator or tf.data pipelines for efficient image loading and preprocessing
