Data Science Learning Roadmap

From statistics to machine learning and AI, master the art of data analysis

Duration: 32 weeks | 3 steps | 37 topics

Career Opportunities

  • Data Scientist
  • Machine Learning Engineer
  • Data Analyst
  • AI Researcher

Step 1: Python & Statistics

Learn Python programming and essential statistical concepts for data analysis

Time: 8 weeks | Level: beginner

  • Python Syntax & Data Types (required) — Variables, data types (int, float, str, bool), operators, type casting, and basic I/O
    • Python uses dynamic typing; variables do not need explicit type declarations
    • Understand core data types: int, float, str, bool, and None
    • Use input() for user input and print() with f-strings for formatted output
    • Python uses indentation (not braces) to define code blocks
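
The points above can be sketched in a few lines (a minimal illustration; the variable names and values are arbitrary):

```python
# Dynamic typing: no type declarations needed
count = 3            # int
price = 9.99         # float
name = "widget"      # str
in_stock = True      # bool

# Type casting between core types
total = count * price            # float result
count_str = str(count)           # "3"
parsed = int("42")               # 42

# f-strings for formatted output; indentation (not braces) defines the block
if in_stock:
    print(f"{count} x {name} = {total:.2f}")
```
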
  • Control Flow & Functions (required) — if/elif/else, for loops, while loops, function definitions, arguments, and return values
    • Use if/elif/else for branching logic and for/while loops for iteration
    • Define reusable functions with def; use default parameters and *args/**kwargs for flexibility
    • Understand variable scope: local variables inside functions vs global variables
    • Use break, continue, and else clauses on loops for advanced flow control
  • Lists, Tuples, Dicts & Sets (required) — Core collection types, their methods, iteration patterns, and when to use each
    • Lists are ordered and mutable; tuples are ordered and immutable (use for fixed data)
    • Dictionaries store key-value pairs with O(1) lookup; use for structured data and caching
    • Sets store unique elements and support mathematical operations (union, intersection, difference)
    • Choose the right collection: list for ordered sequences, dict for mappings, set for uniqueness checks
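
The "right collection for the job" rule above can be seen side by side (made-up sample data):

```python
# List: ordered, mutable sequence
temps = [21.5, 23.0, 19.8]
temps.append(22.1)

# Tuple: ordered, immutable -- good for fixed records
point = (4.0, 2.5)

# Dict: key-value mapping with O(1) average lookup
ages = {"alice": 34, "bob": 29}
ages["carol"] = 41

# Set: unique elements with mathematical operations
seen = {"alice", "bob"}
new = {"bob", "carol"}
both = seen & new          # intersection: {"bob"}
everyone = seen | new      # union of both sets
```
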
  • File I/O & String Processing (required) — Reading and writing files, CSV handling, string methods, and text manipulation
    • Use the 'with' statement for file operations to ensure proper resource cleanup
    • Read files with read(), readline(), or readlines(); write with write() or writelines()
    • Use the csv module or pandas for structured CSV data processing
    • Master string methods: split(), join(), strip(), replace(), find(), and format()/f-strings
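
A small sketch of the CSV and string-method workflow; `io.StringIO` stands in for a file on disk so the example is self-contained (`with open("data.csv") as f:` works the same way):

```python
import csv
import io

raw = "name,score\nalice,90\nbob,85\n"

# The 'with' statement guarantees the handle is closed afterwards
with io.StringIO(raw) as f:
    reader = csv.DictReader(f)
    rows = [row for row in reader]

# String methods for light text cleanup and formatting
names = ", ".join(row["name"].strip().title() for row in rows)
```
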
  • Object-Oriented Python (required) — Classes, objects, inheritance, encapsulation, polymorphism, and magic methods
    • Classes bundle data (attributes) and behavior (methods) into reusable blueprints
    • Use __init__ to initialize object state; 'self' refers to the current instance
    • Inheritance lets child classes extend or override parent behavior; favor composition when appropriate
    • Magic methods (__str__, __repr__, __len__, __eq__) integrate your objects with Python's built-in operations
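
The class ideas above in one small sketch (a hypothetical `Vector` example, not a library API):

```python
class Vector:
    """A 2D vector bundling data (x, y) with behavior."""

    def __init__(self, x, y):
        self.x = x          # 'self' is the current instance
        self.y = y

    def __repr__(self):     # developer-facing string form
        return f"Vector({self.x}, {self.y})"

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):   # makes v1 + v2 work
        return Vector(self.x + other.x, self.y + other.y)


class UnitVector(Vector):
    """Inheritance: the child extends the parent's __init__."""

    def __init__(self, x, y):
        length = (x ** 2 + y ** 2) ** 0.5
        super().__init__(x / length, y / length)
```
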
  • Descriptive Statistics (required) — Mean, median, mode, variance, standard deviation, percentiles, and data distribution shapes
    • Mean, median, and mode measure central tendency; choose based on data distribution and outliers
    • Variance and standard deviation quantify how spread out data points are from the mean
    • Understand skewness (left/right) and kurtosis (tail heaviness) to describe distribution shapes
    • Use percentiles and box plots to summarize data ranges and identify outliers
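
The standard library's `statistics` module is enough to compute all of these on a toy dataset:

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(data)            # 5.0
median = st.median(data)        # 4.5
mode = st.mode(data)            # 4
variance = st.pvariance(data)   # population variance: 4.0
stdev = st.pstdev(data)         # population std dev: 2.0

# Quartiles via quantiles(); IQR is the box-plot rule for outliers
q1, q2, q3 = st.quantiles(data, n=4)
iqr = q3 - q1
```
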
  • Probability Theory (required) — Probability distributions, Bayes' theorem, conditional probability, and expected value
    • Probability ranges from 0 to 1; understand independent vs dependent events
    • Know key distributions: normal (bell curve), binomial (yes/no trials), Poisson (event counts)
    • Bayes' theorem updates the probability of a hypothesis given new evidence: P(A|B) = P(B|A)*P(A)/P(B)
    • Conditional probability P(A|B) is the probability of A given that B has occurred
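
Bayes' theorem is easiest to internalize with numbers. A classic diagnostic-test sketch (all probabilities here are made up for illustration):

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 90% specificity
p_disease = 0.01
p_pos_given_disease = 0.95       # sensitivity
p_pos_given_healthy = 0.10       # false positive rate (1 - specificity)

# Law of total probability gives the evidence P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# Despite a positive result, the posterior is still under 10%,
# because the disease is rare (the prior dominates)
```
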
  • Hypothesis Testing (required) — t-tests, chi-square tests, p-values, confidence intervals, and statistical significance
    • Formulate null (H0) and alternative (H1) hypotheses before collecting data
    • The p-value is the probability of observing your results (or more extreme) if H0 is true; reject H0 if p < alpha (typically 0.05)
    • Use t-tests for comparing means, chi-square for categorical data, and ANOVA for multiple groups
    • Confidence intervals provide a range of plausible values for a population parameter (e.g., 95% CI)
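
A one-sample t-test worked by hand shows where the pieces come from (toy sample; the critical value for df = 7 is hardcoded from a t-table, where `scipy.stats` would normally supply it):

```python
import math
import statistics as st

# Is the sample mean different from mu0 = 5.0?
sample = [5.1, 4.8, 5.3, 5.0, 4.9, 5.4, 5.2, 4.7]
mu0 = 5.0

n = len(sample)
mean = st.mean(sample)
s = st.stdev(sample)                    # sample std dev (divides by n - 1)
se = s / math.sqrt(n)                   # standard error of the mean
t_stat = (mean - mu0) / se

# 95% CI for the mean; 2.365 is the two-sided t critical value for df = 7
t_crit = 2.365
ci = (mean - t_crit * se, mean + t_crit * se)
# The CI contains mu0 = 5.0, so we fail to reject H0 at alpha = 0.05
```
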
  • NumPy Fundamentals (recommended) — Arrays, broadcasting, vectorized operations, and numerical computing with NumPy
    • NumPy arrays are faster than Python lists for numerical computation due to contiguous memory and C-level operations
    • Broadcasting allows operations on arrays of different shapes without explicit looping
    • Use vectorized operations (element-wise math, np.sum, np.mean) instead of Python for-loops for performance
    • Master array slicing, reshaping, and indexing for efficient data manipulation
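
Broadcasting and vectorization in a few lines (a minimal sketch with a tiny array):

```python
import numpy as np

# Vectorized math: no Python loops
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

col_means = a.mean(axis=0)       # shape (3,)
centered = a - col_means         # broadcasting: (2,3) - (3,) -> (2,3)

# Slicing and reshaping
first_row = a[0, :]
flat = a.reshape(-1)             # shape (6,)
total = flat.sum()               # 21.0
```
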
  • Virtual Environments & pip (recommended) — Creating isolated environments with venv, managing packages with pip, and requirements.txt
    • Virtual environments isolate project dependencies so different projects can use different package versions
    • Create with 'python -m venv .venv' and activate before installing packages
    • Use 'pip freeze > requirements.txt' to capture and 'pip install -r requirements.txt' to reproduce environments
  • Regular Expressions in Python (optional) — Pattern matching, the re module, and common text processing use cases
    • Use re.search(), re.match(), re.findall(), and re.sub() for pattern matching and replacement
    • Raw strings (r'pattern') avoid double-escaping backslashes in regex patterns
    • Master groups, quantifiers, and character classes for extracting structured data from unstructured text
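
A quick sketch of the main `re` functions on a made-up log string:

```python
import re

log = "2024-01-15 ERROR disk full; 2024-01-16 INFO ok; 2024-01-17 ERROR timeout"

# Raw string pattern with a capture group: dates that precede ERROR
dates = re.findall(r"(\d{4}-\d{2}-\d{2}) ERROR", log)

# re.sub replaces every match
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", log)

# re.search returns the first match object (or None)
m = re.search(r"ERROR (\w+)", log)
first_error = m.group(1)
```
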
  • Python Comprehensions & Generators (optional) — List/dict/set comprehensions, generator expressions, and the yield keyword
    • List comprehensions are a concise, Pythonic way to create lists: [expr for item in iterable if condition]
    • Dict and set comprehensions follow the same pattern: {k: v for ...} and {expr for ...}
    • Generators produce items lazily with yield, using constant memory for large datasets
    • Use generator expressions (parentheses instead of brackets) for memory-efficient pipelines
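
All four forms side by side (toy data throughout):

```python
# List comprehension: build a list eagerly
squares = [x * x for x in range(10) if x % 2 == 0]   # [0, 4, 16, 36, 64]

# Dict and set comprehensions follow the same pattern
lengths = {word: len(word) for word in ("data", "science")}
initials = {word[0] for word in ("data", "dashboard", "science")}

# Generator function: yields items lazily, one at a time
def running_total(values):
    total = 0
    for v in values:
        total += v
        yield total

totals = list(running_total([1, 2, 3]))              # [1, 3, 6]

# Generator expression: constant memory even for a huge range
big_sum = sum(x * x for x in range(1_000_000))
```
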

Step 2: Data Analysis & Visualization

Master data analysis tools and create compelling visualizations with Python libraries

Time: 10 weeks | Level: intermediate

  • Pandas DataFrames (required) — Creating, indexing, selecting, filtering, and manipulating DataFrames and Series
    • DataFrames are 2D labeled tables; Series are 1D labeled arrays — the core Pandas structures
    • Select data with loc (label-based) and iloc (position-based) indexing
    • Filter rows with boolean indexing: df[df['column'] > value]
    • Use groupby(), merge(), concat(), and pivot_table() for aggregation and reshaping
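
The core selection and aggregation patterns on a small synthetic DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "temp": [-4.0, -3.0, 1.5, 2.0],
})

# Boolean indexing: rows where temp is above freezing
warm = df[df["temp"] > 0]

# iloc is position-based; loc would select by label instead
first_temp = df.iloc[0]["temp"]

# groupby + aggregation: mean temperature per city
city_means = df.groupby("city")["temp"].mean()
```
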
  • Data Cleaning & Preprocessing (required) — Handling missing values, duplicates, outliers, type conversions, and data normalization
    • Identify and handle missing values with isnull(), dropna(), and fillna() (mean, median, forward fill)
    • Remove duplicates with drop_duplicates() and standardize column names and types with astype()
    • Detect outliers using IQR, z-scores, or visual methods (box plots) and decide whether to cap, remove, or keep them
    • Normalize and standardize numerical features so they are on comparable scales for analysis
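
A minimal cleaning pipeline on made-up data, combining the steps above (median imputation, deduplication, and the 1.5 × IQR outlier rule):

```python
import pandas as pd

df = pd.DataFrame({
    "score": [10.0, None, 12.0, 11.0, 200.0, 12.0],
    "name":  ["a",  "b",  "c",  "d",  "e",   "c"],
})

# Fill missing values with the median
df["score"] = df["score"].fillna(df["score"].median())

# Drop exact duplicate rows
df = df.drop_duplicates()

# Flag outliers with the IQR rule; here we choose to remove them
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range]
```
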
  • Exploratory Data Analysis (required) — Systematic approach to understanding data through summary statistics, distributions, and correlations
    • Start with df.info(), df.describe(), and df.shape to understand data types, ranges, and completeness
    • Use histograms, box plots, and KDE plots to visualize distributions of individual variables
    • Compute correlation matrices and visualize with heatmaps to find relationships between features
    • Document insights and hypotheses as you explore; EDA guides feature engineering and model selection
  • Matplotlib Fundamentals (required) — Figure and axes objects, plot types, customization, and publication-quality figures
    • Use the object-oriented API (fig, ax = plt.subplots()) for full control over figure layout
    • Know common plot types: line, bar, scatter, histogram, pie, and when each is appropriate
    • Customize titles, labels, legends, colors, and annotations to tell a clear data story
    • Use subplots to compare multiple views side by side in a single figure
  • Seaborn Statistical Visualization (required) — Statistical plotting with Seaborn: distribution plots, regression plots, categorical plots, and themes
    • Seaborn builds on Matplotlib with higher-level functions for statistical visualization
    • Use displot/histplot for distributions, scatterplot/regplot for relationships, and catplot for categorical comparisons
    • heatmap() with a correlation matrix instantly reveals feature relationships
    • Apply built-in themes (set_theme, set_style) and color palettes for polished, consistent visuals
  • SQL for Data Scientists (required) — Writing queries to extract, filter, aggregate, and join data from relational databases
    • Master SELECT, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT for data extraction
    • Use JOIN (INNER, LEFT, RIGHT) to combine tables and subqueries for complex filtering
    • Aggregate functions (COUNT, SUM, AVG, MIN, MAX) summarize data across groups
    • Window functions (ROW_NUMBER, RANK, LAG, LEAD) compute values across rows without collapsing groups
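
These clauses can be practiced without a server: Python's built-in `sqlite3` with an in-memory database stands in for a real warehouse (table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'alice', 50.0), (2, 'bob', 20.0),
        (3, 'alice', 30.0), (4, 'carol', 70.0);
""")

# GROUP BY + aggregate + HAVING + ORDER BY in one query
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total, COUNT(*) AS n
    FROM orders
    GROUP BY customer
    HAVING total >= 50
    ORDER BY total DESC
""").fetchall()
conn.close()
```
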
  • Plotly & Interactive Dashboards (recommended) — Creating interactive charts with Plotly and building dashboards with Dash or Streamlit
    • Plotly creates interactive, web-based charts with hover tooltips, zoom, and pan out of the box
    • plotly.express provides high-level functions (px.scatter, px.bar, px.line) for quick interactive plots
    • Dash lets you build full web dashboard applications in Python with callbacks for interactivity
    • Use Streamlit as a lightweight alternative for rapid data app prototyping with minimal code
  • Feature Engineering (recommended) — Creating new features, encoding categorical variables, scaling, and feature selection techniques
    • Create new features from existing data: date parts, ratios, binning, polynomial features
    • Encode categorical variables with one-hot encoding, label encoding, or target encoding
    • Scale numerical features with StandardScaler (z-score) or MinMaxScaler (0-1 range) depending on the algorithm
    • Use feature selection methods (correlation, mutual information, recursive feature elimination) to reduce dimensionality
  • Time Series Data (recommended) — Working with datetime data, resampling, rolling windows, and trend/seasonality decomposition
    • Convert columns to datetime with pd.to_datetime() and set as index for time-based operations
    • Resample time series to different frequencies (daily to monthly) with .resample()
    • Apply rolling windows (.rolling()) for moving averages and smoothing noisy data
    • Decompose time series into trend, seasonal, and residual components with statsmodels
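
Resampling and rolling windows on a synthetic hourly series (the values are just a counter, so the expected means are easy to verify by hand):

```python
import pandas as pd

# Hourly readings over two days
idx = pd.date_range("2024-01-01", periods=48, freq="h")
ts = pd.Series(range(48), index=idx, dtype=float)

# Resample hourly -> daily mean
daily = ts.resample("D").mean()

# 3-hour rolling average smooths noise (first 2 values are NaN)
smooth = ts.rolling(window=3).mean()
```
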
  • Web Scraping (BeautifulSoup, Requests) (recommended) — Fetching web pages, parsing HTML, extracting data, and handling pagination and rate limiting
    • Use the requests library to fetch web pages and BeautifulSoup to parse the HTML
    • Navigate the HTML tree with find(), find_all(), and CSS selectors to extract specific data
    • Respect robots.txt, add delays between requests, and set a User-Agent header to scrape responsibly
    • Store scraped data in CSV or directly into a Pandas DataFrame for immediate analysis
  • Geospatial Data (optional) — Working with geographic data, mapping with Folium and GeoPandas, and spatial analysis
    • GeoPandas extends Pandas with geometry columns for spatial operations (intersections, buffers, distances)
    • Use Folium to create interactive Leaflet maps with markers, choropleths, and heatmaps in Python
    • Understand coordinate reference systems (CRS) and when to reproject data for accurate spatial calculations
  • Advanced Pandas (MultiIndex, pivot) (optional) — Hierarchical indexing, pivot tables, method chaining, and performance optimization
    • MultiIndex (hierarchical index) enables multi-level grouping and cross-sectional slicing with xs()
    • pivot_table() creates spreadsheet-style summaries with aggregation functions across row and column groups
    • Method chaining (.assign().query().groupby()...) produces readable, pipeline-style transformations
    • Use categorical dtypes and eval()/query() for memory and speed optimizations on large datasets

Step 3: Machine Learning Fundamentals

Learn the basics of machine learning algorithms and their applications

Time: 14 weeks | Level: advanced

  • Supervised Learning Overview (required) — Classification vs regression, training/test splits, bias-variance tradeoff, and the ML workflow
    • Supervised learning maps labeled inputs to outputs; classification predicts categories, regression predicts continuous values
    • Always split data into training and test sets (typically 80/20) to evaluate generalization
    • The bias-variance tradeoff: high bias underfits (too simple), high variance overfits (too complex)
    • The ML workflow: collect data, clean/preprocess, engineer features, train, evaluate, iterate
  • Linear & Logistic Regression (required) — Linear regression for continuous prediction, logistic regression for binary classification, and their assumptions
    • Linear regression finds the best-fit line by minimizing the sum of squared residuals (ordinary least squares)
    • Key assumptions: linearity, independence, homoscedasticity (constant variance), and normally distributed errors
    • Logistic regression uses the sigmoid function to output probabilities for binary classification
    • Regularization (L1/Lasso, L2/Ridge) penalizes large coefficients to prevent overfitting
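
The mechanics can be seen without scikit-learn: a tiny ordinary-least-squares fit with NumPy, plus the sigmoid that logistic regression applies to a linear score (synthetic data roughly following y = 2x + 1; a minimal sketch, not a full implementation):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.8])

# OLS: add an intercept column and solve the least-squares problem
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Logistic regression wraps a linear score in the sigmoid
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prob = sigmoid(intercept + slope * 2.0)   # a probability in (0, 1)
```
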
  • Decision Trees & Random Forests (required) — Tree-based splitting, pruning, random forest ensembles, and feature importance
    • Decision trees split data on features that maximize information gain (Gini impurity or entropy)
    • Trees are prone to overfitting; control depth, min_samples_split, and min_samples_leaf to prune
    • Random forests aggregate many decorrelated trees (bagging + feature randomness) for robust predictions
    • Use feature_importances_ to understand which features drive the model's decisions
  • Model Evaluation & Validation (required) — Cross-validation, accuracy, precision, recall, F1, ROC-AUC, and the confusion matrix
    • Accuracy alone is misleading on imbalanced datasets; use precision, recall, and F1-score
    • The confusion matrix shows true/false positives/negatives — the foundation for all classification metrics
    • K-fold cross-validation rotates the test set across k folds for a more reliable performance estimate
    • ROC-AUC measures the model's ability to distinguish classes across all classification thresholds
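
Every classification metric above falls out of the four confusion-matrix counts (the counts here are made up, chosen to show accuracy looking better than recall):

```python
# Counts for a binary classifier: true/false positives/negatives
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)        # 0.7
precision = tp / (tp + fp)                        # of predicted positives, how many were right
recall = tp / (tp + fn)                           # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy reads 70%, but recall shows a third of positives were missed --
# on imbalanced data, always read precision and recall alongside accuracy
```
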
  • Scikit-learn Pipeline (required) — Building reproducible ML workflows with Pipeline, ColumnTransformer, and custom transformers
    • Pipeline chains preprocessing steps and the estimator into a single object that prevents data leakage
    • ColumnTransformer applies different transformations to numerical and categorical columns in parallel
    • Pipelines integrate seamlessly with cross_val_score and GridSearchCV for end-to-end validation
    • Build custom transformers by implementing fit() and transform() with BaseEstimator and TransformerMixin
  • Unsupervised Learning (required) — K-Means clustering, PCA dimensionality reduction, DBSCAN, and discovering hidden patterns
    • K-Means partitions data into k clusters by minimizing within-cluster distances; use the elbow method to choose k
    • PCA reduces dimensionality by projecting data onto principal components that capture the most variance
    • DBSCAN finds clusters of arbitrary shape and automatically identifies noise points (outliers)
    • Unsupervised learning is used for customer segmentation, anomaly detection, and dimensionality reduction
  • Neural Networks Basics (required) — Perceptrons, activation functions, forward pass, backpropagation, and gradient descent
    • A neural network is layers of neurons; each neuron computes a weighted sum, adds a bias, and applies an activation function
    • Activation functions (ReLU, sigmoid, softmax) introduce non-linearity, enabling the network to learn complex patterns
    • Backpropagation calculates gradients of the loss with respect to each weight using the chain rule
    • Gradient descent iteratively updates weights to minimize the loss function; learning rate controls step size
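
The whole loop can be seen on a single linear neuron: forward pass, gradients via the chain rule, and gradient-descent updates (a toy sketch learning y = 2x with squared-error loss, not a real framework):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])     # target: y = 2x

w, b = 0.0, 0.0                   # weight and bias, both start at zero
lr = 0.05                         # learning rate controls step size

for _ in range(500):
    pred = w * x + b              # forward pass
    err = pred - y
    grad_w = 2 * np.mean(err * x) # chain rule: dL/dw for MSE loss
    grad_b = 2 * np.mean(err)     # dL/db
    w -= lr * grad_w              # gradient descent update
    b -= lr * grad_b
# w converges toward 2, b toward 0
```
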
  • Deep Learning with TensorFlow/Keras (recommended) — Building, training, and evaluating deep neural networks with the Keras high-level API
    • Keras provides a high-level Sequential and Functional API for building neural networks with minimal code
    • Define layers (Dense, Conv2D, LSTM), compile with an optimizer and loss function, then fit on data
    • Use callbacks (EarlyStopping, ModelCheckpoint, ReduceLROnPlateau) to control training behavior
    • Monitor training/validation loss curves to diagnose overfitting and adjust architecture or regularization
  • Natural Language Processing Intro (recommended) — Text preprocessing, tokenization, TF-IDF, word embeddings, and sentiment analysis basics
    • Text preprocessing: lowercasing, removing punctuation/stopwords, stemming, and lemmatization
    • TF-IDF converts text to numerical vectors by weighting term frequency against document frequency
    • Word embeddings (Word2Vec, GloVe) capture semantic relationships in dense vector representations
    • Pre-trained transformer models (BERT, GPT) achieve state-of-the-art results; fine-tune with Hugging Face
  • Ensemble Methods (recommended) — XGBoost, gradient boosting, bagging, stacking, and combining models for better performance
    • Bagging (Random Forest) trains models on random subsets in parallel; boosting trains them sequentially to correct errors
    • XGBoost and LightGBM are optimized gradient boosting libraries that dominate tabular data competitions
    • Stacking uses the predictions of base models as features for a meta-learner to combine their strengths
    • Ensemble methods almost always outperform individual models by reducing variance and/or bias
  • Hyperparameter Tuning (recommended) — GridSearchCV, RandomizedSearchCV, Bayesian optimization, and efficient search strategies
    • Hyperparameters (learning rate, max_depth, n_estimators) are set before training and control model complexity
    • GridSearchCV exhaustively searches all parameter combinations; slow but thorough for small grids
    • RandomizedSearchCV samples random combinations, offering better efficiency for large search spaces
    • Bayesian optimization (Optuna, Hyperopt) uses past results to intelligently explore the search space
  • Model Deployment (optional) — Serving models via Flask APIs, building Streamlit apps, and tracking with MLflow
    • Save trained models with joblib or pickle; load them in a web server to serve predictions
    • Flask provides lightweight REST APIs: expose a /predict endpoint that accepts JSON input
    • Streamlit turns Python scripts into interactive web apps for model demos and data exploration
    • MLflow tracks experiments (parameters, metrics, artifacts) and manages model versioning and deployment
  • Computer Vision Basics (optional) — Image classification, convolutional neural networks, transfer learning, and data augmentation
    • CNNs use convolutional layers to detect spatial features (edges, textures, objects) in images
    • Transfer learning leverages pre-trained models (ResNet, VGG, EfficientNet) by fine-tuning on your dataset
    • Data augmentation (rotation, flip, zoom, crop) artificially increases training set size and reduces overfitting
    • Use ImageDataGenerator or tf.data pipelines for efficient image loading and preprocessing