Data Analysis Learning Roadmap

Master data analysis from fundamentals through SQL, visualization, and business intelligence for data-driven decision making

Duration: 24 weeks | 4 steps | 40 topics

Career Opportunities

  • Data Analyst
  • Business Intelligence Analyst
  • Data Visualization Specialist
  • Reporting Analyst
  • Business Analyst
  • Market Research Analyst

Step 1: Data Analysis Fundamentals

Build a strong foundation in data types, descriptive statistics, data cleaning, and spreadsheet-based analysis

Time: 6 weeks | Level: beginner

  • Data Types & Structures (required) — Understand the different types of data (numerical, categorical, ordinal) and how data is structured in tables, arrays, and files.
    • Quantitative data is numerical (discrete or continuous); qualitative data is categorical (nominal or ordinal)
    • Structured data fits neatly into rows and columns; unstructured data includes text, images, and logs
    • Understanding data types determines which statistical methods and visualizations are appropriate
    • Common file formats include CSV, JSON, Excel, Parquet, and SQL databases
  • Descriptive Statistics (required) — Calculate and interpret measures of central tendency, spread, and distribution to summarize datasets.
    • Mean, median, and mode measure central tendency; each is appropriate in different contexts
    • Standard deviation and IQR measure spread and variability in the data
    • Histograms and box plots visualize distribution shape, skewness, and outliers
    • Percentiles and z-scores describe where individual values fall relative to the entire dataset
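The measures above can be computed directly with Python's standard library. A minimal sketch with invented sample values (note how the outlier pulls the mean away from the median):

```python
# Descriptive statistics for a small dataset using only the standard library.
# The sample values below are invented for illustration.
import statistics

data = [12, 15, 15, 18, 22, 25, 31, 40, 95]  # note the outlier at 95

mean = statistics.mean(data)      # sensitive to the outlier
median = statistics.median(data)  # robust to the outlier
mode = statistics.mode(data)      # most frequent value
stdev = statistics.stdev(data)    # sample standard deviation

# Interquartile range from quartiles (n=4 cut points split the data into quarters)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"stdev={stdev:.1f} IQR={iqr}")
```

Comparing `mean` (about 30.3) against `median` (22) is itself a quick skewness check: a mean well above the median suggests a right-skewed distribution.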
  • Data Cleaning Techniques (required) — Identify and resolve data quality issues including missing values, duplicates, inconsistencies, and formatting errors.
    • Handle missing values by deletion, imputation (mean/median/mode), or flagging depending on context
    • Remove duplicate records after verifying they are true duplicates and not legitimate repeated entries
    • Standardize formats for dates, currencies, phone numbers, and text casing
    • Validate data against business rules and expected ranges to catch anomalies early
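The same cleaning steps can be expressed in code. A minimal sketch with the standard library; the records, field names, and rules below are hypothetical:

```python
# A toy cleaning pass: dedupe, impute a missing value, standardize text.
import statistics

records = [
    {"id": 1, "name": " Alice ", "age": 34},
    {"id": 2, "name": "BOB",     "age": None},  # missing value
    {"id": 3, "name": "Carol",   "age": 29},
    {"id": 3, "name": "Carol",   "age": 29},    # exact duplicate
]

# 1. Remove exact duplicates while preserving order.
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Impute missing ages with the median of the known values.
known_ages = [r["age"] for r in deduped if r["age"] is not None]
median_age = statistics.median(known_ages)
for r in deduped:
    if r["age"] is None:
        r["age"] = median_age

# 3. Standardize text casing and strip stray whitespace.
for r in deduped:
    r["name"] = r["name"].strip().title()

print(deduped)
```

Note the order of operations: deduplicate before imputing, so duplicated rows do not distort the median used for imputation.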
  • Excel for Analysis (required) — Use Microsoft Excel as a powerful analysis tool with formulas, functions, conditional formatting, and data analysis add-ins.
    • VLOOKUP/XLOOKUP, INDEX/MATCH, and SUMIFS are essential for data retrieval and aggregation
    • Conditional formatting highlights patterns, outliers, and trends visually in large datasets
    • Data validation and named ranges improve accuracy and usability of analytical workbooks
    • The Analysis ToolPak add-in provides regression, ANOVA, and other statistical tools
  • Spreadsheet Formulas & Pivot Tables (required) — Master pivot tables for dynamic summarization and advanced formulas for complex calculations across datasets.
    • Pivot tables summarize, group, and cross-tabulate data without writing formulas
    • Use calculated fields and items in pivot tables for custom metrics
    • Slicers and timelines provide interactive filtering for pivot table dashboards
    • Array formulas and dynamic arrays enable advanced multi-cell calculations
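What a pivot table does under the hood is grouping by two dimensions and aggregating a measure. A sketch in Python with invented sales rows (region as the pivot rows, quarter as the columns, SUM of amount as the values):

```python
# A pivot table in miniature: cross-tabulate region x quarter, summing amounts.
from collections import defaultdict

sales = [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 200), ("West", "Q2", 50), ("West", "Q2", 75),
]

pivot = defaultdict(lambda: defaultdict(int))
for region, quarter, amount in sales:
    pivot[region][quarter] += amount

for region in sorted(pivot):
    print(region, dict(pivot[region]))
```

Seeing the mechanic spelled out makes it clearer why pivot tables need tidy input: one row per observation, with the grouping dimensions as columns.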
  • Sampling & Distributions (recommended) — Understand sampling methods and probability distributions that underpin statistical inference and hypothesis testing.
    • Random sampling ensures every member of a population has an equal chance of selection, reducing bias
    • The normal distribution is the most important distribution; the Central Limit Theorem explains why
    • Sample size affects the reliability of estimates; larger samples reduce standard error
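Both claims can be checked with a quick simulation: means of samples drawn from a uniform (decidedly non-normal) population cluster around the population mean, and their spread shrinks as sample size grows. The seed and sample sizes below are arbitrary choices for illustration:

```python
# Simulating the Central Limit Theorem and the effect of sample size.
import random
import statistics

random.seed(42)  # reproducible illustration

def sample_means(sample_size, num_samples=2000):
    return [
        statistics.mean(random.uniform(0, 100) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

small = sample_means(5)   # standard error ~ sigma / sqrt(5)
large = sample_means(50)  # standard error ~ sigma / sqrt(50)

print("mean of sample means (n=5):", round(statistics.mean(small), 1))
print("spread of sample means (n=5): ", round(statistics.stdev(small), 1))
print("spread of sample means (n=50):", round(statistics.stdev(large), 1))
```

The spread of the n=50 means comes out roughly sqrt(10) times smaller than the n=5 means, matching the standard-error formula.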
  • Data Collection Methods (recommended) — Learn primary and secondary data collection methods including surveys, APIs, web scraping, and public datasets.
    • Primary data is collected directly for a specific purpose; secondary data already exists from other sources
    • Surveys must be carefully designed to avoid leading questions and response bias
    • APIs provide structured, programmatic access to data from platforms and services
  • Data Ethics & Privacy (recommended) — Understand ethical responsibilities when handling personal data, including consent, anonymization, and regulatory compliance.
    • Always obtain informed consent before collecting and analyzing personal data
    • Anonymize or pseudonymize data to protect individual identities in analysis and reporting
    • Understand GDPR, CCPA, and other privacy regulations that govern data handling
  • Google Sheets Advanced (optional) — Leverage Google Sheets for collaborative analysis with QUERY function, Apps Script automation, and connected sheets.
    • The QUERY function uses a SQL-like syntax to filter, sort, and aggregate data within a sheet
    • IMPORTDATA, IMPORTHTML, and IMPORTXML pull external data directly into spreadsheets
    • Google Apps Script enables automation and custom functions for repetitive workflows
  • Intro to Python for Analysis (optional) — Get started with Python and the pandas library for programmatic data manipulation that scales beyond spreadsheets.
    • Pandas DataFrames are the primary data structure for tabular data analysis in Python
    • Use read_csv, read_excel, and read_sql to load data from various sources into DataFrames
    • Chaining methods like .groupby(), .filter(), .apply() enables expressive data transformations
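A first taste of pandas, assuming it is installed (`pip install pandas`); the rows are invented. Filtering plus `groupby` is the pandas analogue of SQL's WHERE and GROUP BY:

```python
# Build a DataFrame, filter rows, and group-aggregate with pandas.
import pandas as pd

df = pd.DataFrame({
    "city":  ["Austin", "Austin", "Boston", "Boston"],
    "year":  [2023, 2024, 2023, 2024],
    "sales": [250, 300, 400, 380],
})

# Boolean indexing filters rows; groupby + sum aggregates per group.
recent = df[df["year"] == 2024]
totals = df.groupby("city")["sales"].sum()

print(recent)
print(totals)
```

In real work the DataFrame would come from `pd.read_csv`, `pd.read_excel`, or `pd.read_sql` rather than an inline dictionary.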

Step 2: SQL and Database Management

Master SQL for querying, aggregating, and managing data in relational databases used across every industry

Time: 6 weeks | Level: intermediate

  • SELECT & Filtering (required) — Write foundational SQL queries to retrieve specific columns and filter rows using WHERE clauses and logical operators.
    • SELECT specifies which columns to retrieve; * selects all columns (avoid in production queries)
    • WHERE filters rows using comparison operators (=, <>, >, <), LIKE, IN, BETWEEN, and IS NULL
    • AND, OR, and NOT combine multiple conditions; use parentheses to control evaluation order
    • ORDER BY sorts results; LIMIT/TOP restricts the number of rows returned
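These clauses can be tried immediately with Python's built-in sqlite3 module, so no database server is needed. The table and rows are invented for illustration:

```python
# SELECT / WHERE / ORDER BY / LIMIT against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Alice", 120.0), (2, "Bob", 35.5), (3, "Alice", 80.0), (4, "Cara", 210.0)],
)

# Name only the columns you need, filter with WHERE, sort, and cap the rows.
rows = conn.execute(
    """
    SELECT customer, total
    FROM orders
    WHERE total BETWEEN 50 AND 250
    ORDER BY total DESC
    LIMIT 2
    """
).fetchall()
print(rows)
```

Note that `LIMIT` applies after `ORDER BY`, so the query returns the top two qualifying orders, not an arbitrary pair.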
  • JOINs & Subqueries (required) — Combine data from multiple tables using different join types and use subqueries for complex multi-step logic.
    • INNER JOIN returns only rows with matching keys in both tables
    • LEFT JOIN returns all rows from the left table and matched rows from the right (NULLs for no match)
    • Subqueries (nested SELECT statements) can appear in WHERE, FROM, and SELECT clauses
    • Correlated subqueries reference the outer query and execute once per row; use with caution for performance
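The difference between INNER and LEFT JOIN is easiest to see side by side on tiny tables (sqlite3 again; schema and rows invented):

```python
# INNER JOIN vs LEFT JOIN: the customer with no orders only survives the LEFT JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Cara');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0), (12, 2, 20.0);
""")

# INNER JOIN: only customers with at least one matching order appear.
inner = conn.execute("""
    SELECT c.name, o.total
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
""").fetchall()

# LEFT JOIN: every customer appears; Cara gets NULL (None) for order columns.
left = conn.execute("""
    SELECT c.name, o.total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
""").fetchall()

print(inner)
print(left)
```

Counting the rows of each result before and after a join is a quick sanity check that the join key and type are what you intended.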
  • Aggregate Functions & GROUP BY (required) — Summarize data using COUNT, SUM, AVG, MIN, MAX and group results by categorical columns.
    • GROUP BY collapses rows with the same values into summary rows for aggregation
    • HAVING filters groups after aggregation (unlike WHERE which filters individual rows before)
    • COUNT(*) counts every row regardless of NULLs; COUNT(column) counts only rows where that column is non-NULL
    • Combine multiple aggregates in a single query for comprehensive summary statistics
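A runnable GROUP BY / HAVING example with sqlite3 and invented data, showing the WHERE-versus-HAVING distinction in practice:

```python
# Aggregate per customer, then keep only groups passing a post-aggregation filter.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("Alice", 50.0), ("Alice", 75.0), ("Bob", 20.0), ("Bob", 30.0), ("Cara", 200.0),
])

rows = conn.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(total) AS revenue
    FROM orders
    GROUP BY customer
    HAVING SUM(total) > 100   -- filters whole groups, not individual rows
    ORDER BY revenue DESC
""").fetchall()
print(rows)
```

Bob's group (revenue 50) is dropped by HAVING; a WHERE clause on `total > 100` would instead have dropped individual rows before grouping and changed every group's sum.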
  • Database Design & Normalization (required) — Design efficient relational database schemas by applying normalization rules to eliminate redundancy and ensure data integrity.
    • First Normal Form (1NF) requires atomic values and no repeating groups in columns
    • Second Normal Form (2NF) removes partial dependencies on composite primary keys
    • Third Normal Form (3NF) removes transitive dependencies where non-key columns depend on other non-key columns
    • Primary keys uniquely identify rows; foreign keys establish relationships between tables
  • Window Functions (required) — Perform calculations across sets of rows related to the current row without collapsing the result set like GROUP BY.
    • Window functions use OVER() to define the window (partition and order) without collapsing rows
    • ROW_NUMBER(), RANK(), and DENSE_RANK() assign rankings within partitions
    • LAG() and LEAD() access values from previous or next rows for period-over-period comparisons
    • Running totals and moving averages use SUM()/AVG() with ROWS BETWEEN in the OVER clause
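LAG and a running total can be demonstrated in sqlite3 as well, assuming the bundled SQLite is 3.25 or newer (the version that added window functions; true for most Python 3.8+ builds). Data invented:

```python
# LAG() for period-over-period comparison plus a running total via OVER.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2024-01", 100.0), ("2024-02", 150.0), ("2024-03", 120.0),
])

rows = conn.execute("""
    SELECT month,
           amount,
           LAG(amount) OVER (ORDER BY month)            AS prev_amount,
           SUM(amount) OVER (ORDER BY month
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW)  AS running_total
    FROM sales
    ORDER BY month
""").fetchall()
for r in rows:
    print(r)
```

Every input row survives into the output, each annotated with its window calculations; a GROUP BY on the same data would have collapsed the three rows into one.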
  • CTEs & Temp Tables (recommended) — Use Common Table Expressions and temporary tables to break complex queries into readable, maintainable steps.
    • CTEs (WITH clauses) create named temporary result sets that improve query readability
    • Recursive CTEs traverse hierarchical data like org charts and category trees
    • Temporary tables persist for the session and are useful for multi-step analysis workflows
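A recursive CTE walking a tiny, invented org chart, runnable with sqlite3:

```python
# WITH RECURSIVE traverses hierarchical data: here, a reporting chain.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Dana', NULL),   -- top of the org chart
        (2, 'Eli',  1),
        (3, 'Fay',  2),
        (4, 'Gus',  2);
""")

rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)
```

The base case selects the root (no manager), and the recursive member repeatedly joins employees onto the rows found so far, tracking depth as it goes.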
  • Query Optimization (recommended) — Improve SQL query performance through indexing strategies, execution plan analysis, and query restructuring.
    • Use EXPLAIN/EXPLAIN ANALYZE to view query execution plans and identify bottlenecks
    • Indexes speed up WHERE, JOIN, and ORDER BY operations but add overhead to writes
    • Avoid SELECT *; retrieve only the columns you need to reduce I/O and memory usage
  • PostgreSQL Features (recommended) — Explore PostgreSQL-specific features like JSONB support, array types, full-text search, and extensions.
    • JSONB columns store semi-structured data and support indexing and querying with operators like ->> and @>
    • Array types allow storing multiple values in a single column with array functions and operators
    • PostgreSQL's full-text search provides powerful text matching without external search engines
  • NoSQL Basics (optional) — Understand when and why to use NoSQL databases like MongoDB, and how they differ from relational databases.
    • NoSQL databases trade strict consistency for flexibility, scalability, and performance at scale
    • Document databases (MongoDB) store JSON-like documents without requiring a fixed schema
    • Choose NoSQL when data is highly variable, rapidly changing, or needs horizontal scaling
  • Database Administration (optional) — Learn basic database administration tasks including backups, user management, permissions, and monitoring.
    • Regular backups (pg_dump, pg_basebackup) and tested restore procedures protect against data loss
    • Use GRANT and REVOKE to manage user permissions following the principle of least privilege
    • Monitor query performance, disk usage, and connection counts to maintain database health

Step 3: Data Visualization

Transform data into compelling visual stories using Tableau, Power BI, and Python libraries to communicate insights effectively

Time: 6 weeks | Level: intermediate

  • Visualization Principles (required) — Learn the foundational principles of effective data visualization including clarity, accuracy, and audience-appropriate design.
    • Maximize the data-ink ratio: every visual element should serve a purpose
    • Avoid chartjunk, 3D effects, and unnecessary decorations that distort or distract from data
    • Consider your audience's data literacy when choosing complexity and chart types
    • Pre-attentive attributes (color, size, position) can highlight key insights instantly
  • Tableau Fundamentals (required) — Build interactive visualizations and dashboards in Tableau using drag-and-drop with dimensions, measures, and calculated fields.
    • Tableau distinguishes between dimensions (categorical) and measures (numerical) automatically
    • Drag fields to Rows, Columns, Color, Size, and Detail shelves to build visualizations
    • Calculated fields and table calculations enable custom metrics and advanced analysis
    • Publish dashboards to Tableau Public or Tableau Server for sharing with stakeholders
  • Dashboard Design (required) — Design effective analytical dashboards with clear layout, logical flow, interactivity, and performance optimization.
    • Follow a Z-pattern or F-pattern layout that matches how users naturally scan information
    • Place the most important KPIs and insights at the top or top-left of the dashboard
    • Use filters, parameters, and drill-down actions for interactive exploration
    • Limit dashboards to 5-9 visualizations to avoid cognitive overload
  • Chart Types & When to Use (required) — Select the right chart type for your data and message, from bar charts and line graphs to heatmaps and scatter plots.
    • Bar charts compare categorical values; horizontal bars are better when labels are long
    • Line charts show trends over time; multiple lines compare trends across categories
    • Scatter plots reveal relationships and correlations between two numerical variables
    • Avoid pie charts for more than 5 categories; use bar charts or treemaps instead
  • Power BI Basics (required) — Build reports and dashboards in Microsoft Power BI using data modeling, DAX basics, and the report canvas.
    • Power BI Desktop connects to data sources, transforms data with Power Query, and builds report visuals
    • The data model defines relationships between tables for cross-filtering in visuals
    • DAX (Data Analysis Expressions) creates calculated columns and measures for custom analytics
    • Publish reports to the Power BI Service for sharing, scheduling refreshes, and collaboration
  • Interactive Visualizations (recommended) — Add interactivity to visualizations with filters, parameters, drill-throughs, and linked selections for exploratory analysis.
    • Filter actions allow clicking on one chart to filter all other charts in the dashboard
    • Parameters let users dynamically change what data or metric is displayed
    • Drill-through pages provide detail views when users click on summary-level data points
  • Python Visualization (matplotlib/seaborn) (recommended) — Create publication-quality static visualizations with matplotlib and seaborn for data analysis notebooks and reports.
    • Matplotlib provides low-level control; seaborn builds on it with attractive statistical visualizations
    • Use plt.subplots() for multi-panel figures and fine-grained layout control
    • Seaborn's built-in themes and color palettes produce professional-looking charts with minimal code
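A minimal matplotlib figure, rendered headlessly so it runs without a display. This assumes matplotlib is installed (`pip install matplotlib`); the numbers plotted are invented:

```python
# A small line chart saved to an in-memory PNG buffer (no window needed).
import io

import matplotlib
matplotlib.use("Agg")  # render to a buffer instead of opening a window
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 160]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue")
ax.set_ylabel("Revenue ($k)")

buf = io.BytesIO()
fig.savefig(buf, format="png")  # in scripts, savefig replaces plt.show()
print(f"rendered {len(buf.getvalue())} PNG bytes")
```

In a notebook you would simply call `plt.show()` instead of saving; seaborn charts follow the same figure/axes model, so the saving pattern is identical.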
  • Color Theory for Data (recommended) — Apply color theory principles to data visualization for accessibility, emphasis, and clear categorical distinction.
    • Use sequential palettes for ordered data, diverging palettes for data with a meaningful midpoint
    • Limit categorical colors to 7-10 distinct hues; use labeling when more categories are needed
    • Ensure sufficient contrast for colorblind accessibility (use colorblind-safe palettes like Viridis)
  • D3.js Introduction (optional) — Get started with D3.js for building custom, interactive, web-based data visualizations with full creative control.
    • D3 binds data to DOM elements and applies data-driven transformations for custom visualizations
    • SVG elements (rect, circle, path, text) are the building blocks for D3 charts
    • D3's enter-update-exit pattern manages dynamic data changes and transitions
  • Geospatial Visualization (optional) — Create maps and geographic visualizations to represent spatial data patterns, regional comparisons, and location-based insights.
    • Choropleth maps use color intensity to represent values across geographic regions
    • Point maps display individual locations with size and color encoding for additional dimensions
    • Use geographic context carefully: map projections can distort area comparisons

Step 4: Business Intelligence and Reporting

Apply BI architecture, ETL processes, KPI tracking, and stakeholder communication to drive data-informed business decisions

Time: 6 weeks | Level: advanced

  • BI Architecture & Tools (required) — Understand the components of a BI architecture including data warehouses, ETL pipelines, semantic layers, and reporting tools.
    • BI architecture typically flows from source systems through ETL to a data warehouse and into reporting tools
    • Star and snowflake schemas organize data warehouses with fact tables (metrics) and dimension tables (context)
    • Semantic layers (data models) translate raw data into business-friendly terms and calculations
    • Modern data stacks use cloud-based tools (Snowflake, dbt, Looker, Fivetran) for scalable BI
  • ETL/ELT Processes (required) — Design data pipelines that Extract data from sources, Transform it for analysis, and Load it into target systems.
    • ETL transforms data before loading (traditional); ELT loads raw data first and transforms in the warehouse (modern)
    • Common transformations include cleaning, deduplication, type casting, joining, and aggregation
    • Tools like dbt enable version-controlled, tested, and documented SQL-based transformations
    • Schedule pipelines to run on a cadence (hourly, daily) and monitor for failures
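The Extract, Transform, Load stages can be sketched end to end in a few lines of stdlib Python; every name, column, and rule below is invented for illustration:

```python
# A toy ETL pipeline: CSV text in, cleaned rows into a SQLite "warehouse" table.
import csv
import io
import sqlite3

raw_csv = """date,product,amount
2024-01-05,widget,19.99
2024-01-06,widget,
2024-01-06,gadget,45.00
"""

# Extract: parse rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop rows with a missing amount, cast types.
clean = [
    {"date": r["date"], "product": r["product"], "amount": float(r["amount"])}
    for r in rows
    if r["amount"]
]

# Load: insert into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (:date, :product, :amount)", clean
)

total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(f"loaded {len(clean)} rows, total {total:.2f}")
```

In the ELT variant, the raw rows would be loaded first and the filtering and casting would happen as SQL inside the warehouse, which is essentially what tools like dbt orchestrate.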
  • KPI Definition & Tracking (required) — Define meaningful Key Performance Indicators aligned to business goals and build systems to track them over time.
    • KPIs should be Specific, Measurable, Achievable, Relevant, and Time-bound (SMART)
    • Leading indicators predict future performance; lagging indicators measure past results
    • Limit the number of KPIs to maintain focus; 5-7 per area or role is a practical guideline
    • Track KPIs against targets and benchmarks to surface when metrics are off track
  • Automated Reporting (required) — Automate the generation and distribution of recurring reports to reduce manual effort and ensure consistency.
    • Schedule data refreshes in BI tools so dashboards always show current data
    • Use email subscriptions and alerts to push reports to stakeholders automatically
    • Python scripts (with pandas and openpyxl) can generate and email formatted Excel reports
    • Automation reduces human error and frees analysts for higher-value exploratory work
  • Stakeholder Communication (required) — Present data findings to non-technical audiences with clear narratives, appropriate visualizations, and actionable recommendations.
    • Lead with the insight and recommendation, not the methodology or raw data
    • Tailor the level of detail to your audience: executives want summaries, analysts want depth
    • Use the situation-complication-resolution framework to structure data stories
    • Anticipate questions and prepare supporting analysis for drill-down discussions
  • Data Modeling for BI (recommended) — Design dimensional data models optimized for analytical queries with star schemas, slowly changing dimensions, and hierarchies.
    • Star schemas place a central fact table surrounded by denormalized dimension tables for fast queries
    • Fact tables contain measurable events (sales, clicks, transactions); dimension tables describe context (who, what, where, when)
    • Slowly changing dimensions (SCD Type 1, 2, 3) track how dimension attributes evolve over time
  • Power BI Advanced (DAX) (recommended) — Write advanced DAX formulas for complex business calculations including time intelligence, iterators, and context manipulation.
    • CALCULATE modifies filter context and is the most important function in DAX
    • Time intelligence functions (TOTALYTD, SAMEPERIODLASTYEAR, DATEADD) simplify period comparisons
    • Iterator functions (SUMX, AVERAGEX, MAXX) evaluate expressions row-by-row before aggregating
  • Predictive Analytics Intro (recommended) — Apply basic statistical models and forecasting techniques to predict future trends from historical data.
    • Regression analysis models the relationship between dependent and independent variables
    • Time series forecasting uses historical patterns (trend, seasonality) to project future values
    • Built-in forecasting in Tableau and Power BI provides quick predictive visualizations
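The simplest version of trend forecasting is an ordinary least-squares line fitted to historical periods and extended one step ahead. A hand-rolled sketch with invented sales figures:

```python
# Least-squares trend line fitted by hand, then used to project the next period.
periods = [1, 2, 3, 4, 5]
sales = [100.0, 108.0, 115.0, 121.0, 130.0]

n = len(periods)
mean_x = sum(periods) / n
mean_y = sum(sales) / n

# slope = covariance(x, y) / variance(x); intercept falls out of the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(periods, sales)) \
        / sum((x - mean_x) ** 2 for x in periods)
intercept = mean_y - slope * mean_x

forecast_period_6 = intercept + slope * 6
print(f"trend: y = {intercept:.1f} + {slope:.2f}x")
print(f"period 6 forecast: {forecast_period_6:.1f}")
```

This captures trend only; seasonal data needs models that also account for repeating patterns, which is what the built-in forecasting in Tableau and Power BI adds on top.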
  • Data Governance (optional) — Establish policies and processes for data quality, lineage, cataloging, and access control across the organization.
    • Data governance defines who owns data, who can access it, and how it should be managed
    • Data catalogs document available datasets, their lineage, and definitions for discoverability
    • Data quality rules and monitoring ensure ongoing accuracy and completeness of analytical data
  • Self-Service BI (optional) — Enable business users to explore data and create their own reports without depending on the data team for every request.
    • Self-service BI empowers non-technical users to answer their own data questions
    • Provide curated, validated datasets and semantic models to ensure consistency and accuracy
    • Balance flexibility for users with governance controls to prevent data misinterpretation