Data Analysis Learning Roadmap
Master data analysis from fundamentals through SQL, visualization, and business intelligence for data-driven decision making
Duration: 24 weeks | 4 steps | 40 topics
Career Opportunities
- Data Analyst
- Business Intelligence Analyst
- Data Visualization Specialist
- Reporting Analyst
- Business Analyst
- Market Research Analyst
Step 1: Data Analysis Fundamentals
Build a strong foundation in data types, descriptive statistics, data cleaning, and spreadsheet-based analysis
Time: 6 weeks | Level: beginner
- Data Types & Structures (required) — Understand the different types of data (numerical, categorical, ordinal) and how data is structured in tables, arrays, and files.
- Quantitative data is numerical (discrete or continuous); qualitative data is categorical (nominal or ordinal)
- Structured data fits neatly into rows and columns; unstructured data includes text, images, and logs
- Understanding data types determines which statistical methods and visualizations are appropriate
- Common file formats include CSV, JSON, Excel, Parquet, and SQL databases
- Descriptive Statistics (required) — Calculate and interpret measures of central tendency, spread, and distribution to summarize datasets.
- Mean, median, and mode measure central tendency; each is appropriate in different contexts
- Standard deviation and IQR measure spread and variability in the data
- Histograms and box plots visualize distribution shape, skewness, and outliers
- Percentiles and z-scores describe where individual values fall relative to the entire dataset
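A quick way to internalize these measures is to compute them on a small sample. The sketch below uses Python's standard `statistics` module on hypothetical daily sales figures; note how the single outlier pulls the mean well above the median.

```python
import statistics

# Hypothetical daily sales figures; 900 is a deliberate outlier
sales = [120, 135, 150, 110, 160, 145, 900]

mean = statistics.mean(sales)      # sensitive to the outlier
median = statistics.median(sales)  # robust to the outlier
stdev = statistics.stdev(sales)    # sample standard deviation

# quantiles() with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(sales, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}, IQR={iqr:.1f}")
```

Because the mean is dragged far above the median here, the median is the better summary of a "typical" day — exactly the context-dependence the first bullet describes.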
- Data Cleaning Techniques (required) — Identify and resolve data quality issues including missing values, duplicates, inconsistencies, and formatting errors.
- Handle missing values by deletion, imputation (mean/median/mode), or flagging depending on context
- Remove duplicate records after verifying they are true duplicates and not legitimate repeated entries
- Standardize formats for dates, currencies, phone numbers, and text casing
- Validate data against business rules and expected ranges to catch anomalies early
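The cleaning steps above map directly onto pandas operations. A minimal sketch, assuming pandas is installed; the customer table and its quality issues are invented for illustration:

```python
import pandas as pd

# Toy customer table with typical quality issues (hypothetical data)
df = pd.DataFrame({
    "name":   ["Ann", "ann ", "Bob", "Bob", None],
    "joined": ["2024-01-05", "2024-01-05", "2024-01-20", "2024-01-20", "2024-02-10"],
    "spend":  [100.0, 100.0, None, 250.0, 80.0],
})

df["name"] = df["name"].str.strip().str.title()         # standardize casing and whitespace
df["spend"] = df["spend"].fillna(df["spend"].median())  # impute; the right strategy depends on context
df["joined"] = pd.to_datetime(df["joined"])             # enforce one consistent date type
df = df.dropna(subset=["name"])                         # or flag, depending on the analysis
df = df.drop_duplicates()                               # only true duplicates compare equal now

print(df)
```

Standardizing formats *before* deduplicating matters: "Ann" and "ann " only become duplicates once casing and whitespace are fixed.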
- Excel for Analysis (required) — Use Microsoft Excel as a powerful analysis tool with formulas, functions, conditional formatting, and data analysis add-ins.
- VLOOKUP/XLOOKUP, INDEX/MATCH, and SUMIFS are essential for data retrieval and aggregation
- Conditional formatting highlights patterns, outliers, and trends visually in large datasets
- Data validation and named ranges improve accuracy and usability of analytical workbooks
- The Analysis ToolPak add-in provides regression, ANOVA, and other statistical tools
- Spreadsheet Formulas & Pivot Tables (required) — Master pivot tables for dynamic summarization and advanced formulas for complex calculations across datasets.
- Pivot tables summarize, group, and cross-tabulate data without writing formulas
- Use calculated fields and items in pivot tables for custom metrics
- Slicers and timelines provide interactive filtering for pivot table dashboards
- Array formulas and dynamic arrays enable advanced multi-cell calculations
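Pivot tables have a direct programmatic equivalent in `pandas.pivot_table`, which is useful once data outgrows a spreadsheet. A minimal sketch with hypothetical sales records (assumes pandas is installed):

```python
import pandas as pd

# Hypothetical sales records to cross-tabulate, as a spreadsheet pivot table would
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [100, 150, 200, 120, 80],
})

# rows=region, columns=product, SUM of revenue, with grand totals ("All")
pivot = pd.pivot_table(
    sales, index="region", columns="product",
    values="revenue", aggfunc="sum", fill_value=0, margins=True,
)
print(pivot)
```

`margins=True` plays the role of a pivot table's grand-total row and column.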
- Sampling & Distributions (recommended) — Understand sampling methods and probability distributions that underpin statistical inference and hypothesis testing.
- Random sampling ensures every member of a population has an equal chance of selection, reducing bias
- The normal distribution is the most important distribution; the Central Limit Theorem explains why: means of sufficiently large random samples are approximately normal regardless of the population's shape
- Sample size affects the reliability of estimates; larger samples reduce standard error
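Both effects — sample means clustering near the population mean, and standard error shrinking with sample size — can be demonstrated in a few lines of standard-library Python. The skewed population and the sample sizes below are illustrative choices:

```python
import random
import statistics

random.seed(42)

# A deliberately skewed (exponential-like) population, far from normal
population = [random.expovariate(1.0) for _ in range(100_000)]

# Means of many random samples cluster near the population mean (~1.0),
# and their spread (the standard error) shrinks as sample size grows
spreads = {}
for n in (5, 50, 500):
    sample_means = [statistics.mean(random.sample(population, n)) for _ in range(200)]
    spreads[n] = statistics.stdev(sample_means)
    print(f"n={n:3d}: mean of sample means={statistics.mean(sample_means):.3f}, "
          f"spread={spreads[n]:.3f}")
```

The spread falls roughly as 1/sqrt(n), which is why quadrupling the sample size only halves the standard error.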
- Data Collection Methods (recommended) — Learn primary and secondary data collection methods including surveys, APIs, web scraping, and public datasets.
- Primary data is collected directly for a specific purpose; secondary data already exists from other sources
- Surveys must be carefully designed to avoid leading questions and response bias
- APIs provide structured, programmatic access to data from platforms and services
- Data Ethics & Privacy (recommended) — Understand ethical responsibilities when handling personal data, including consent, anonymization, and regulatory compliance.
- Always obtain informed consent before collecting and analyzing personal data
- Anonymize or pseudonymize data to protect individual identities in analysis and reporting
- Understand GDPR, CCPA, and other privacy regulations that govern data handling
- Google Sheets Advanced (optional) — Leverage Google Sheets for collaborative analysis with QUERY function, Apps Script automation, and connected sheets.
- The QUERY function uses a SQL-like syntax to filter, sort, and aggregate data within a sheet
- IMPORTDATA, IMPORTHTML, and IMPORTXML pull external data directly into spreadsheets
- Google Apps Script enables automation and custom functions for repetitive workflows
- Intro to Python for Analysis (optional) — Get started with Python and the pandas library for programmatic data manipulation that scales beyond spreadsheets.
- Pandas DataFrames are the primary data structure for tabular data analysis in Python
- Use read_csv, read_excel, and read_sql to load data from various sources into DataFrames
- Chaining methods like .groupby(), .filter(), .apply() enables expressive data transformations
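A short taste of the chained pandas workflow described above, assuming pandas is installed; the orders table and the 30-unit threshold are invented for illustration:

```python
import pandas as pd

# Hypothetical orders table -- the kind of data that outgrows a spreadsheet
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount":   [20.0, 35.0, 10.0, 55.0, 5.0],
})

# Method chaining: aggregate per customer, then keep only high-value customers
summary = (
    orders
    .groupby("customer", as_index=False)
    .agg(total=("amount", "sum"), n_orders=("amount", "count"))
    .query("total > 30")
    .sort_values("total", ascending=False)
)
print(summary)
```

Each step returns a new DataFrame, so the pipeline reads top-to-bottom like a recipe — one reason chaining scales better than nested spreadsheet formulas.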
Step 2: SQL and Database Management
Master SQL for querying, aggregating, and managing data in relational databases used across every industry
Time: 6 weeks | Level: intermediate
- SELECT & Filtering (required) — Write foundational SQL queries to retrieve specific columns and filter rows using WHERE clauses and logical operators.
- SELECT specifies which columns to retrieve; * selects all columns (avoid in production queries)
- WHERE filters rows using comparison operators (=, <>, >, <), LIKE, IN, BETWEEN, and IS NULL
- AND, OR, and NOT combine multiple conditions; use parentheses to control evaluation order
- ORDER BY sorts results; LIMIT/TOP restricts the number of rows returned
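These SELECT patterns can be tried without installing a database server, using Python's built-in `sqlite3` module. The employees table and its values are hypothetical:

```python
import sqlite3

# In-memory SQLite database with a hypothetical employees table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 'Sales', 52000), ('Bob', 'IT', 61000),
        ('Cy',  'Sales', 48000), ('Dee', NULL, 45000);
""")

# Filter with combined conditions, sort, and limit -- the core SELECT patterns
rows = conn.execute("""
    SELECT name, salary
    FROM employees
    WHERE dept = 'Sales' AND salary BETWEEN 40000 AND 60000
    ORDER BY salary DESC
    LIMIT 10
""").fetchall()
print(rows)
```

Note that Dee (NULL dept) is excluded: `dept = 'Sales'` never matches NULL, which is why `IS NULL` exists as a separate test.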
- JOINs & Subqueries (required) — Combine data from multiple tables using different join types and use subqueries for complex multi-step logic.
- INNER JOIN returns only rows with matching keys in both tables
- LEFT JOIN returns all rows from the left table and matched rows from the right (NULLs for no match)
- Subqueries (nested SELECT statements) can appear in WHERE, FROM, and SELECT clauses
- Correlated subqueries reference the outer query and execute once per row; use with caution for performance
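A runnable sketch of a LEFT JOIN and a nested subquery, again using the stdlib `sqlite3` module with invented customers and orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 100), (1, 40), (2, 70);
""")

# LEFT JOIN keeps customers with no orders (amount comes back NULL for Cy)
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()
print(rows)

# Subquery in WHERE: customers with an order above the overall average amount
big = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders
                 WHERE amount > (SELECT AVG(amount) FROM orders))
""").fetchall()
print(big)
```

An INNER JOIN on the same data would silently drop Cy — the practical difference between the two join types.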
- Aggregate Functions & GROUP BY (required) — Summarize data using COUNT, SUM, AVG, MIN, MAX and group results by categorical columns.
- GROUP BY collapses rows with the same values into summary rows for aggregation
- HAVING filters groups after aggregation (unlike WHERE which filters individual rows before)
- COUNT(*) counts all rows including NULLs; COUNT(column) counts only non-NULL values
- Combine multiple aggregates in a single query for comprehensive summary statistics
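The GROUP BY/HAVING distinction is easiest to see in a running query. A sketch with hypothetical regional sales, using the stdlib `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('East', 100), ('East', 150), ('West', 200),
        ('West', 120), ('North', 30);
""")

# GROUP BY collapses rows per region; HAVING then filters the groups,
# which is why North (total 30) disappears from the result
rows = conn.execute("""
    SELECT region, COUNT(*) AS n, SUM(amount) AS total, AVG(amount) AS avg_amt
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 100
    ORDER BY total DESC
""").fetchall()
print(rows)
```

A WHERE clause could not express this filter: the condition depends on the aggregated total, which only exists after grouping.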
- Database Design & Normalization (required) — Design efficient relational database schemas by applying normalization rules to eliminate redundancy and ensure data integrity.
- First Normal Form (1NF) requires atomic values and no repeating groups in columns
- Second Normal Form (2NF) removes partial dependencies on composite primary keys
- Third Normal Form (3NF) removes transitive dependencies where non-key columns depend on other non-key columns
- Primary keys uniquely identify rows; foreign keys establish relationships between tables
- Window Functions (required) — Perform calculations across sets of rows related to the current row without collapsing the result set like GROUP BY.
- Window functions use OVER() to define the window (partition and order) without collapsing rows
- ROW_NUMBER(), RANK(), and DENSE_RANK() assign rankings within partitions
- LAG() and LEAD() access values from previous or next rows for period-over-period comparisons
- Running totals and moving averages use SUM()/AVG() with ROWS BETWEEN in the OVER clause
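All four patterns above — ranking, LAG, and a running total over a partition — fit in one query. The sketch uses the stdlib `sqlite3` module with invented revenue data; note that window functions require SQLite 3.25+, which ships with recent Python versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
conn.executescript("""
    CREATE TABLE revenue (month TEXT, dept TEXT, amount INTEGER);
    INSERT INTO revenue VALUES
        ('2024-01', 'A', 100), ('2024-02', 'A', 120), ('2024-03', 'A', 90),
        ('2024-01', 'B', 200), ('2024-02', 'B', 180);
""")

# Rank within each dept, compare to the prior month, and keep a running total --
# all without collapsing rows the way GROUP BY would
rows = conn.execute("""
    SELECT month, dept, amount,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rnk,
           LAG(amount) OVER (PARTITION BY dept ORDER BY month)  AS prev_amount,
           SUM(amount) OVER (PARTITION BY dept ORDER BY month
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW)          AS running_total
    FROM revenue
    ORDER BY dept, month
""").fetchall()
for r in rows:
    print(r)
```

Every input row survives into the output with its window values attached — the defining contrast with GROUP BY.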
- CTEs & Temp Tables (recommended) — Use Common Table Expressions and temporary tables to break complex queries into readable, maintainable steps.
- CTEs (WITH clauses) create named temporary result sets that improve query readability
- Recursive CTEs traverse hierarchical data like org charts and category trees
- Temporary tables persist for the session and are useful for multi-step analysis workflows
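A recursive CTE walking a small invented org chart, runnable via the stdlib `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Ada", None), (2, "Ben", 1), (3, "Cal", 2), (4, "Di", 1)])

# Recursive CTE: start from the CEO (no manager), then repeatedly join in
# direct reports, tracking how deep in the hierarchy each person sits
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e
        JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)
```

The anchor member seeds the result set and the recursive member extends it one level at a time — a pattern plain JOINs cannot express for arbitrary depth.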
- Query Optimization (recommended) — Improve SQL query performance through indexing strategies, execution plan analysis, and query restructuring.
- Use EXPLAIN/EXPLAIN ANALYZE to view query execution plans and identify bottlenecks
- Indexes speed up WHERE, JOIN, and ORDER BY operations but add overhead to writes
- Avoid SELECT *; retrieve only the columns you need to reduce I/O and memory usage
- PostgreSQL Features (recommended) — Explore PostgreSQL-specific features like JSONB support, array types, full-text search, and extensions.
- JSONB columns store semi-structured data and support indexing and querying with operators like ->> and @>
- Array types allow storing multiple values in a single column with array functions and operators
- PostgreSQL's full-text search provides powerful text matching without external search engines
- NoSQL Basics (optional) — Understand when and why to use NoSQL databases like MongoDB, and how they differ from relational databases.
- NoSQL databases trade strict consistency for flexibility, scalability, and performance at scale
- Document databases (MongoDB) store JSON-like documents without requiring a fixed schema
- Choose NoSQL when data is highly variable, rapidly changing, or needs horizontal scaling
- Database Administration (optional) — Learn basic database administration tasks including backups, user management, permissions, and monitoring.
- Regular backups (pg_dump, pg_basebackup) and tested restore procedures protect against data loss
- Use GRANT and REVOKE to manage user permissions following the principle of least privilege
- Monitor query performance, disk usage, and connection counts to maintain database health
Step 3: Data Visualization
Transform data into compelling visual stories using Tableau, Power BI, and Python libraries to communicate insights effectively
Time: 6 weeks | Level: intermediate
- Visualization Principles (required) — Learn the foundational principles of effective data visualization including clarity, accuracy, and audience-appropriate design.
- Maximize the data-ink ratio: every visual element should serve a purpose
- Avoid chartjunk, 3D effects, and unnecessary decorations that distort or distract from data
- Consider your audience's data literacy when choosing complexity and chart types
- Pre-attentive attributes (color, size, position) can highlight key insights instantly
- Tableau Fundamentals (required) — Build interactive visualizations and dashboards in Tableau using drag-and-drop with dimensions, measures, and calculated fields.
- Tableau distinguishes between dimensions (categorical) and measures (numerical) automatically
- Drag fields to Rows, Columns, Color, Size, and Detail shelves to build visualizations
- Calculated fields and table calculations enable custom metrics and advanced analysis
- Publish dashboards to Tableau Public or Tableau Server for sharing with stakeholders
- Dashboard Design (required) — Design effective analytical dashboards with clear layout, logical flow, interactivity, and performance optimization.
- Follow a Z-pattern or F-pattern layout that matches how users naturally scan information
- Place the most important KPIs and insights at the top or top-left of the dashboard
- Use filters, parameters, and drill-down actions for interactive exploration
- Limit dashboards to 5-9 visualizations to avoid cognitive overload
- Chart Types & When to Use (required) — Select the right chart type for your data and message, from bar charts and line graphs to heatmaps and scatter plots.
- Bar charts compare categorical values; horizontal bars are better when labels are long
- Line charts show trends over time; multiple lines compare trends across categories
- Scatter plots reveal relationships and correlations between two numerical variables
- Avoid pie charts for more than 5 categories; use bar charts or treemaps instead
- Power BI Basics (required) — Build reports and dashboards in Microsoft Power BI using data modeling, DAX basics, and the report canvas.
- Power BI Desktop connects to data sources, transforms data with Power Query, and builds report visuals
- The data model defines relationships between tables for cross-filtering in visuals
- DAX (Data Analysis Expressions) creates calculated columns and measures for custom analytics
- Publish reports to the Power BI Service for sharing, scheduling refreshes, and collaboration
- Interactive Visualizations (recommended) — Add interactivity to visualizations with filters, parameters, drill-throughs, and linked selections for exploratory analysis.
- Filter actions allow clicking on one chart to filter all other charts in the dashboard
- Parameters let users dynamically change what data or metric is displayed
- Drill-through pages provide detail views when users click on summary-level data points
- Python Visualization (matplotlib/seaborn) (recommended) — Create publication-quality static visualizations with matplotlib and seaborn for data analysis notebooks and reports.
- Matplotlib provides low-level control; seaborn builds on it with attractive statistical visualizations
- Use plt.subplots() for multi-panel figures and fine-grained layout control
- Seaborn's built-in themes and color palettes produce professional-looking charts with minimal code
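A minimal matplotlib sketch (seaborn, which layers on top of it, would shorten the styling further). The revenue numbers are hypothetical, and the Agg backend is selected so the figure renders without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a display
import matplotlib.pyplot as plt

# Hypothetical monthly revenue to plot as a simple trend line
months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [120, 135, 128, 160, 172]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue (hypothetical)")
ax.set_ylabel("Revenue ($k)")
ax.spines[["top", "right"]].set_visible(False)  # trim chart clutter
fig.tight_layout()
fig.savefig("revenue_trend.png", dpi=150)
```

Hiding the top and right spines is a small data-ink improvement that the earlier visualization-principles topic argues for.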
- Color Theory for Data (recommended) — Apply color theory principles to data visualization for accessibility, emphasis, and clear categorical distinction.
- Use sequential palettes for ordered data, diverging palettes for data with a meaningful midpoint
- Limit categorical colors to 7-10 distinct hues; use labeling when more categories are needed
- Ensure sufficient contrast for colorblind accessibility (use colorblind-safe palettes like Viridis)
- D3.js Introduction (optional) — Get started with D3.js for building custom, interactive, web-based data visualizations with full creative control.
- D3 binds data to DOM elements and applies data-driven transformations for custom visualizations
- SVG elements (rect, circle, path, text) are the building blocks for D3 charts
- D3's enter-update-exit pattern manages dynamic data changes and transitions
- Geospatial Visualization (optional) — Create maps and geographic visualizations to represent spatial data patterns, regional comparisons, and location-based insights.
- Choropleth maps use color intensity to represent values across geographic regions
- Point maps display individual locations with size and color encoding for additional dimensions
- Use geographic context carefully: map projections can distort area comparisons
Step 4: Business Intelligence and Reporting
Apply BI architecture, ETL processes, KPI tracking, and stakeholder communication to drive data-informed business decisions
Time: 6 weeks | Level: advanced
- BI Architecture & Tools (required) — Understand the components of a BI architecture including data warehouses, ETL pipelines, semantic layers, and reporting tools.
- BI architecture typically flows from source systems through ETL to a data warehouse and into reporting tools
- Star and snowflake schemas organize data warehouses with fact tables (metrics) and dimension tables (context)
- Semantic layers (data models) translate raw data into business-friendly terms and calculations
- Modern data stacks use cloud-based tools (Snowflake, dbt, Looker, Fivetran) for scalable BI
- ETL/ELT Processes (required) — Design data pipelines that Extract data from sources, Transform it for analysis, and Load it into target systems.
- ETL transforms data before loading (traditional); ELT loads raw data first and transforms in the warehouse (modern)
- Common transformations include cleaning, deduplication, type casting, joining, and aggregation
- Tools like dbt enable version-controlled, tested, and documented SQL-based transformations
- Schedule pipelines to run on a cadence (hourly, daily) and monitor for failures
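A toy end-to-end run in standard-library Python: extract from a CSV (inlined here so the sketch is self-contained), transform (validate, deduplicate, standardize), and load into SQLite standing in for the warehouse. The column names, values, and rules are all invented:

```python
import csv
import io
import sqlite3

# Extract: a CSV export from a hypothetical source system (inlined for the sketch)
raw = io.StringIO("order_id,amount,region\n1, 100 ,east\n2,250,West\n2,250,West\n3,,East\n")

# Transform: clean types, standardize casing, drop invalid and duplicate records
seen, rows = set(), []
for rec in csv.DictReader(raw):
    if not rec["amount"].strip():
        continue                 # skip records failing validation (missing amount)
    if rec["order_id"] in seen:
        continue                 # deduplicate on the business key
    seen.add(rec["order_id"])
    rows.append((int(rec["order_id"]), float(rec["amount"]), rec["region"].strip().title()))

# Load: write the cleaned rows into the target table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

In an ELT variant, the raw CSV would be loaded first and the transform step would run as SQL inside the warehouse (the approach tools like dbt formalize).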
- KPI Definition & Tracking (required) — Define meaningful Key Performance Indicators aligned to business goals and build systems to track them over time.
- KPIs should be Specific, Measurable, Achievable, Relevant, and Time-bound (SMART)
- Leading indicators predict future performance; lagging indicators measure past results
- Limit the number of KPIs to maintain focus; 5-7 per area or role is a practical guideline
- Track KPIs against targets and benchmarks to surface when metrics are off track
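Tracking against targets can be sketched in a few lines; the metrics, values, and thresholds below are hypothetical, and direction matters (for churn, lower is better):

```python
# Hypothetical KPIs with targets; "lower_is_better" flags inverted metrics
kpis = {
    "monthly_active_users": {"actual": 11800, "target": 12000},
    "churn_rate_pct":       {"actual": 3.1,   "target": 5.0, "lower_is_better": True},
    "avg_order_value":      {"actual": 46.0,  "target": 42.0},
}

statuses = {}
for name, k in kpis.items():
    if k.get("lower_is_better"):
        on_track = k["actual"] <= k["target"]
    else:
        on_track = k["actual"] >= k["target"]
    statuses[name] = "on track" if on_track else "off track"
    print(f"{name}: {k['actual']} vs target {k['target']} -> {statuses[name]}")
```

In practice this logic lives in a BI tool's alerting layer, but encoding the direction of "good" explicitly is the same discipline either way.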
- Automated Reporting (required) — Automate the generation and distribution of recurring reports to reduce manual effort and ensure consistency.
- Schedule data refreshes in BI tools so dashboards always show current data
- Use email subscriptions and alerts to push reports to stakeholders automatically
- Python scripts (with pandas and openpyxl) can generate and email formatted Excel reports
- Automation reduces human error and frees analysts for higher-value exploratory work
- Stakeholder Communication (required) — Present data findings to non-technical audiences with clear narratives, appropriate visualizations, and actionable recommendations.
- Lead with the insight and recommendation, not the methodology or raw data
- Tailor the level of detail to your audience: executives want summaries, analysts want depth
- Use the situation-complication-resolution framework to structure data stories
- Anticipate questions and prepare supporting analysis for drill-down discussions
- Data Modeling for BI (recommended) — Design dimensional data models optimized for analytical queries with star schemas, slowly changing dimensions, and hierarchies.
- Star schemas place a central fact table surrounded by denormalized dimension tables for fast queries
- Fact tables contain measurable events (sales, clicks, transactions); dimension tables describe context (who, what, where, when)
- Slowly changing dimensions (SCD Type 1, 2, 3) track how dimension attributes evolve over time
- Power BI Advanced (DAX) (recommended) — Write advanced DAX formulas for complex business calculations including time intelligence, iterators, and context manipulation.
- CALCULATE modifies filter context and is the most important function in DAX
- Time intelligence functions (TOTALYTD, SAMEPERIODLASTYEAR, DATEADD) simplify period comparisons
- Iterator functions (SUMX, AVERAGEX, MAXX) evaluate expressions row-by-row before aggregating
- Predictive Analytics Intro (recommended) — Apply basic statistical models and forecasting techniques to predict future trends from historical data.
- Regression analysis models the relationship between dependent and independent variables
- Time series forecasting uses historical patterns (trend, seasonality) to project future values
- Built-in forecasting in Tableau and Power BI provides quick predictive visualizations
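The regression idea reduces to a few lines for a single predictor: fit a least-squares trend line, then extrapolate one step. The monthly sales figures are invented, and real forecasts should also account for seasonality and uncertainty:

```python
# Least-squares trend line fit to hypothetical monthly sales, then a one-step forecast
months = [1, 2, 3, 4, 5, 6]
sales  = [100, 104, 111, 115, 123, 126]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

forecast_month_7 = intercept + slope * 7
print(f"trend: y = {intercept:.2f} + {slope:.2f}x, month 7 forecast = {forecast_month_7:.1f}")
```

This is the same calculation behind a spreadsheet trendline or Tableau's simple trend model — useful for building intuition before reaching for dedicated forecasting tools.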
- Data Governance (optional) — Establish policies and processes for data quality, lineage, cataloging, and access control across the organization.
- Data governance defines who owns data, who can access it, and how it should be managed
- Data catalogs document available datasets, their lineage, and definitions for discoverability
- Data quality rules and monitoring ensure ongoing accuracy and completeness of analytical data
- Self-Service BI (optional) — Enable business users to explore data and create their own reports without depending on the data team for every request.
- Self-service BI empowers non-technical users to answer their own data questions
- Provide curated, validated datasets and semantic models to ensure consistency and accuracy
- Balance flexibility for users with governance controls to prevent data misinterpretation
