Data Visualization


Lecture 04

January 30, 2026

Review

EDA

  • EDA: Understand structure of data while making as few assumptions as possible (e.g. look at the raw data, not “smoothed” versions).
  • You might be asking the “right” questions but have the “wrong” data: be open-minded and rely on domain knowledge.
  • Quantitative EDA can be a starting point, but EDA also requires visualization.
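
As a small illustration of why summary statistics alone are not enough, the sketch below uses two made-up series (not from the lecture) that contain the same values in a different order. Their mean and standard deviation are identical, yet only one of them has a clear relationship with `x`, which a plot would show immediately.

```julia
using Statistics

x = 1:6
y_trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical: rises steadily with x
y_shuffled = [1.0, 6.0, 2.0, 5.0, 3.0, 4.0]   # the same values in a different order

# Identical one-variable summaries...
println("means: ", mean(y_trend), " vs ", mean(y_shuffled))
println("std:   ", std(y_trend), " vs ", std(y_shuffled))
# ...but very different structure with respect to x.
println("cor with x: ", cor(x, y_trend), " vs ", round(cor(x, y_shuffled), digits=3))
```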

Data Visualization

Challenges for Effective Visualization

  • Limits From Cognitive Processes
  • No “Optimal” Visualization
  • Temptation To Overload Figures
  • Easy to “Lie” About The Data

Further Challenges

Following Munzner (2014):

  • Many possible designs are a bad match with human perceptual and cognitive systems;
  • Many possible designs are a bad match with the intended task;
  • Only a small number of possibilities are reasonable choices;
  • Randomly choosing possibilities is a bad idea because the odds of finding a very good solution are very low.

Remember: Data Never Speaks For Itself

Data must be understood in a particular context. You need to understand your data and what it says (or does not say!) based on your hypotheses.

  • What question(s) does your data address?
  • What transformations make the representation of the data as salient as possible?
  • What scales or channels are most appropriate?

Some Caveats

  • There is no recipe for effective visualization. Everything depends on your data and the story you want to tell.
  • This also means that defaults from data visualization packages are usually bad.
  • Principles are largely based on Western (American/European) norms and may not translate perfectly.
  • Many of these guidelines are based on average outcomes, so there is likely to be a lot of individual variation.

Some Data Viz Principles

Gestalt Principles

The Gestalt school of psychology identified several principles of perception.

Core idea: Humans are very good at finding structure.

As a result, you need to evaluate the totality of a visual field, not just each component.

Gestalt Principles

  • Proximity
  • Similarity
  • Connection
  • Common Fate
  • Continuity
  • Closure

Illustration of Gestalt principles

Illustration of several Gestalt principles. Adapted from Healy (2018).

Impact of Continuity

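# load distribution, LaTeX-label, and plotting packages
using Distributions, LaTeXStrings, Plots

# create trend for data
x = rand(Uniform(0, 20), 10)
y = 5 .+ 2 .* x
# sample and add noise
ε = rand(Normal(0, 5), 10)
y .+= ε

# left panel: unconnected points
p1 = scatter(x, y, xlabel=L"$x$", ylabel=L"$y$", color=:blue, markersize=8)
# right panel: the same points connected in order of x, keeping each (x, y) pair together
idx = sortperm(x)
p2 = plot(x[idx], y[idx], xlabel=L"$x$", ylabel=L"$y$", color=:blue, marker=:circle, markersize=8)

plot(p1, p2, layout=(1,2), size=(1100, 500))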

Figure 1: Illustrating impact of connected points.

Vertical Scales

  • Worry about the vertical scale
    • Are you exaggerating small changes? Downplaying meaningful changes?
    • “Adjust the vertical scale to reflect meaningful changes” and “the vertical scale should always include zero” are conflicting pieces of advice.

Principle of Proportional Ink

On the left: 2014 quantity is ~1.1x 2010 value, but due to vertical scale the 2014 bar uses 2.7x the ink.

Can mislead viewers into thinking the difference is larger than it truly is!
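
The arithmetic behind this can be sketched as follows. The slide gives only the ratios (about 1.1x in value, 2.7x in ink), so the values `100` and `110` below are placeholders chosen to match the 1.1x ratio, and the function names are made up for this example. The sketch solves for the axis baseline at which a 1.1x difference in value becomes a 2.7x difference in bar height.

```julia
# Hypothetical values: only their ratio (1.1x) comes from the slide.
v2010 = 100.0
v2014 = 110.0

# Ratio of bar heights (ink) when the vertical axis starts at `baseline`.
ink_ratio(lo, hi, baseline) = (hi - baseline) / (lo - baseline)

# Baseline at which the taller bar uses `r` times the ink of the shorter bar,
# found by solving (hi - b) / (lo - b) = r for b.
baseline_for_ratio(lo, hi, r) = (r * lo - hi) / (r - 1)

value_ratio = v2014 / v2010                   # ratio of the underlying values
honest_ratio = ink_ratio(v2010, v2014, 0.0)   # zero baseline: ink matches values
b = baseline_for_ratio(v2010, v2014, 2.7)     # truncated axis that yields 2.7x the ink

println("value ratio: ", value_ratio)
println("ink ratio with zero baseline: ", honest_ratio)
println("baseline giving 2.7x ink: ", round(b, digits=1))
```

With a zero baseline the ink ratio equals the value ratio. The distortion comes entirely from where the axis starts.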

Illustration of Non-Proportional Ink

What Makes A Good Color Scheme?

  • Perceptual uniformity: preserve a mapping between changes in perceived colors and changes in attribute values.
  • Color-blind safe!

Color Schemes

Good news: Most plotting libraries include a wide variety of perceptually uniform, color-blind safe color schemes.

Bad news: These are not usually the defaults (in particular, avoid “rainbow” color schemes).
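
One rough way to see the problem with rainbow schemes is to compute the relative luminance of the colors along a ramp. The sketch below uses the standard sRGB linearization and Rec. 709 channel weights. The five-color "rainbow" is a simplified stand-in made of fully saturated hues, not any particular library's colormap, and luminance is only a proxy for perceived lightness.

```julia
# Linearize one sRGB channel (0-1) using the sRGB transfer function.
linearize(c) = c <= 0.04045 ? c / 12.92 : ((c + 0.055) / 1.055)^2.4

# Relative luminance of an sRGB color, using the Rec. 709 channel weights.
luminance(r, g, b) = 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)

# Simplified "rainbow" ramp: blue, cyan, green, yellow, red (illustrative only).
rainbow = [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
# Gray ramp with evenly spaced sRGB values, for comparison.
gray = [(v, v, v) for v in 0.0:0.25:1.0]

lum_rainbow = [luminance(c...) for c in rainbow]
lum_gray = [luminance(c...) for c in gray]

println("rainbow luminance: ", round.(lum_rainbow, digits=3))
println("gray luminance:    ", round.(lum_gray, digits=3))
println("rainbow monotone?  ", issorted(lum_rainbow) || issorted(lum_rainbow, rev=true))
println("gray monotone?     ", issorted(lum_gray))
```

The rainbow ramp's luminance rises and falls as the data value increases, so equal steps in the data do not look like equal steps in color.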

Good site: Paul Tol’s color schemes.

Sequential Color Schemes

Sequential schemes change in intensity from low to high as the value changes.

Divergent Color Schemes

Divergent schemes intensify in two directions from a zero or mean value.
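
For a divergent scheme to read correctly, the center of the color scale should sit at the reference value, with the same scale on both sides. A minimal sketch of that mapping is below. The data values and the helper name `diverging_position` are made up for illustration.

```julia
# Map a value to [-1, 1] around `center`, using the same scale on both sides.
function diverging_position(v, center, maxdev)
    return clamp((v - center) / maxdev, -1.0, 1.0)
end

anomalies = [-0.5, 0.2, 1.0, 3.0]             # hypothetical data centered on zero
center = 0.0
maxdev = maximum(abs.(anomalies .- center))   # symmetric limits around the center

positions = diverging_position.(anomalies, center, maxdev)
println(positions)
```

In Plots.jl, a similar effect can usually be obtained by passing symmetric color limits, e.g. `clims=(-maxdev, maxdev)`, together with a diverging gradient. If the limits were instead taken from the data range (`-0.5` to `3.0` here), zero would no longer map to the neutral middle color.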

Unordered Color Schemes

Unordered schemes are appropriate for categorical data.

Avoid Using Two Vertical Axes

Two axis plot

Source: Datawrapper.de

Avoid Using Two Vertical Axes

Two axis plot

Illustration of how rescaling axes manipulates impression

Source: Datawrapper.de
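
The manipulation works because each axis has its own arbitrary limits, so the relative position of the two series on the page is a design choice rather than a property of the data. The sketch below uses made-up values and axis limits to show the same data point appearing below or above the other series depending only on the right-hand axis.

```julia
# Fraction of the plot height at which value `v` is drawn for axis limits (lo, hi).
screen_position(v, lo, hi) = (v - lo) / (hi - lo)

# Hypothetical final values of two series measured in different units.
a_last = 50.0    # plotted against the left axis
b_last = 3.0     # plotted against the right axis

a_pos = screen_position(a_last, 0.0, 100.0)       # left axis fixed at 0 to 100
b_pos_wide = screen_position(b_last, 0.0, 10.0)   # right axis 0 to 10
b_pos_tight = screen_position(b_last, 2.0, 3.2)   # right axis 2 to 3.2

println("A is drawn at height ", a_pos)
println("B with right axis 0 to 10:  ", b_pos_wide, " (below A)")
println("B with right axis 2 to 3.2: ", round(b_pos_tight, digits=2), " (above A)")
```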

Example: Spurious Correlations

Spurious Correlation 5901
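
Two series that merely trend in the same direction will be highly correlated even if they are otherwise unrelated. The sketch below uses two made-up series constructed so that their period-to-period changes are uncorrelated. Comparing changes instead of levels is one common check, shown here as an illustration and not as something prescribed by the lecture.

```julia
using Statistics

# Two hypothetical series that both trend upward, constructed so that
# their period-to-period changes are unrelated.
a = [0.0, 1.5, 2.0, 3.5, 4.0, 5.5, 6.0, 7.5, 8.0]
b = [0.0, 2.5, 5.0, 6.5, 8.0, 10.5, 13.0, 14.5, 16.0]

r_levels = cor(a, b)                # dominated by the shared upward trend
r_changes = cor(diff(a), diff(b))   # correlation of period-to-period changes

println("correlation of levels:  ", round(r_levels, digits=3))
println("correlation of changes: ", round(r_changes, digits=3))
```

Plotting such series on two independently scaled vertical axes makes the shared trend look like a tight relationship.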

Encode Information With Marks and Channels

Marks: Geometric primitives

  • Points
  • Segments
  • Paths
  • Polygons

Channels: Mark appearance

  • Color (Hue/Saturation/Luminance)
  • Position (1D/2D/3D)
  • Size (Length/Area/Volume)
  • Angle

Ordered vs. Categorical Attributes

The channels available depend on the type of attribute:

  • Ordered attributes can be
    • Ordinal: Ranking, no meaning to distance;
    • Quantitative: Measure of magnitude which supports arithmetic comparison;
  • Categorical attributes are unordered.
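
The attribute types above connect to the color scheme families described earlier: ordered attributes call for sequential or divergent schemes, and categorical attributes for unordered ones. A minimal sketch of that decision is below. The function `scheme_family` and its `has_midpoint` keyword are hypothetical names for this example, not part of any plotting library.

```julia
# Hypothetical helper: suggest a color scheme family for an attribute type.
function scheme_family(attribute::Symbol; has_midpoint::Bool=false)
    if attribute == :categorical
        return :unordered                  # no ordering to convey
    elseif attribute in (:ordinal, :quantitative)
        # ordered data: divergent only if there is a meaningful zero or mean value
        return has_midpoint ? :divergent : :sequential
    else
        throw(ArgumentError("unknown attribute type: $attribute"))
    end
end

println(scheme_family(:categorical))
println(scheme_family(:quantitative))
println(scheme_family(:quantitative, has_midpoint=true))
```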

Channel Effectiveness: Ordered Data

Channels for ordered data, arranged top-to-bottom from more to less effective (channels in the right column are less effective than those in the left). Modified from Healy (2018) after Munzner (2014).

Channel Effectiveness: Categorical Data

Channels for categorical data, arranged top-to-bottom from more to less effective. Modified from Healy (2018) after Munzner (2014).

Preattentive Popout

Try to make your key features “pop out” to the viewer during the pre-attentive scan.

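Searching for the blue circle becomes harder as the number of points grows and when the circle differs from the other points in shape rather than color. Adapted from Healy (2018).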

# load distribution and plotting packages
using Distributions, Plots

# uniform distribution over the unit square
dist = Distributions.Product(Uniform.([0, 0], [1, 1]))

# draw npt points: one blue circle (the target) among distractors of the given color and shape
function popout_panel(npt; color, shape, title)
    pts = Tuple.(eachcol(rand(dist, npt)))
    blueidx = rand(1:npt)
    p = scatter(pts[eachindex(pts) .!= blueidx], color=color, markershape=shape, xticks=false, yticks=false, legend=false, markersize=5, title=title, framestyle=:box)
    scatter!(p, pts[blueidx:blueidx], color=:blue, markershape=:circle, markersize=5)
    return p
end

# target differs from the distractors in color only
p1 = popout_panel(20, color=:red, shape=:circle, title="Color Only, N=20")
p2 = popout_panel(100, color=:red, shape=:circle, title="Color Only, N=100")
# target differs from the distractors in shape only
p3 = popout_panel(20, color=:blue, shape=:utriangle, title="Shape Only, N=20")
p4 = popout_panel(100, color=:blue, shape=:utriangle, title="Shape Only, N=100")

plot(p1, p2, p3, p4, layout=(2, 2), size=(800, 500))

Visualization Examples

Thoughts On This Plot?

First Street Foundation Return Period Trends

Source: Shu et al. (2023)

How About This One?

Trump Polling Average vs. Employment in Swing States

Source: Joe Weisenthal

Or This One?

Impacts of Climate Mitigation on Air Pollution

Source: Huang et al. (2023)

Or…

Relationship Between Climate Variables

Source: Errickson et al. (2021)

Last One!

Modeled Flood Risk vs. Perception

Source: Bakkensen & Barrage (2022)

Key Points

Data Visualization

  • No “cookbook”: be thoughtful and honest with your data.
  • Defaults from software packages (scales, colors, etc.) are usually bad.
  • Think carefully about whether you are creating artifacts that might mislead others, even unintentionally.

Upcoming Schedule

Next Classes

Next Week: Probability Models and Linear Regression

Assessments

Homework 1 Due Friday, 2/6.

Exercises: This week’s set involves some simple computations.

Reading: Franconeri et al. (2024)

Quiz 1: Friday, 2/13.

References


Bakkensen, L. A., & Barrage, L. (2022). Going Underwater? Flood Risk Belief Heterogeneity and Coastal Home Price Dynamics. The Review of Financial Studies, 35, 3666–3709. https://doi.org/10.1093/rfs/hhab122
Errickson, F. C., Keller, K., Collins, W. D., Srikrishnan, V., & Anthoff, D. (2021). Equity is more important for the social cost of methane than climate uncertainty. Nature, 592, 564–570. https://doi.org/10.1038/s41586-021-03386-6
Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press.
Huang, X., Srikrishnan, V., Lamontagne, J., Keller, K., & Peng, W. (2023). Effects of global climate mitigation on regional air quality and health. Nature Sustainability, 1–13. https://doi.org/10.1038/s41893-023-01133-5
Munzner, T. (2014). Visualization analysis and design. CRC Press.
Shu, E., Hauer, M., & Porter, J. (2023, November). Future population exposure to flood risk: A decomposition approach across Shared-Socioeconomic pathways (SSPs). Research Square. https://doi.org/10.21203/rs.3.rs-3628132/v1