# Samples of Thoughts

## about data, statistics and everything in between

I recently stumbled across this data set about visa costs. It is a collection of visa costs for all countries and for different kinds of visas (tourist, business, work, transit, and some others). Each row describes the visa relation between a source country (the country whose citizens apply for the visa) and a target country (the country issuing the visa), together with the cost of each visa type.

Since I had a bit of free time on my hands, I decided to do some “plotcrastinating”: play around with the data and try out some new visualizations.

## Travelling the world

As a German myself, I enjoy the privilege of having access to a great number of visas, many of them free. But how do other countries fare in terms of the number of countries their citizens can travel to? And how many of these can they visit without paying for a visa?

Each point represents one or multiple countries that get the same number of tourist visas. E.g. to the right, the field is led by the USA with 162 visas, closely followed by the grey dot representing Belgium, Finland, and France.

Both Africa and Asia have distinctly bimodal distributions: Mauritius and Seychelles, both popular tourist destinations, score much higher than all other African countries. In Asia, countries like Singapore, South Korea and Japan get more than 150 tourist visas while most other countries in Asia score only around or below 100.

These numbers are, however, only lower bounds. The data collection process is described in this working paper and for some countries no visa information could be found online. A visa relation might still exist but it is fair to say that if it can’t be found online, it is probably more effort to obtain it.

#### A technical note

This kind of plot is one of my favorite plots and I use it (or a variant of it) pretty much all the time. At its core, it’s just geom_point():

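    library(dplyr)
    library(ggplot2)

    # count the rows (visa relations) per source country, one point per country
    p <- d %>%
      count(source, source_continent) %>%
      ggplot(aes(x = source_continent, y = n, col = source_continent)) +
      geom_point() +
      coord_flip()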

There’s a great blog post by Cedric Scherer that walks through all the steps needed to prettify this plot, in particular how to include the arrows and annotations. I made a few adaptations: instead of geom_point(), I use geom_quasirandom() from {ggbeeswarm}. It packs points close together while trying to avoid overplotting. It doesn’t solve the overplotting problem completely though, so I like to add a thin border to the points so they don’t look like a big, weirdly shaped blob. The trick is to use a point shape that has both fill and color parameters. The only point shapes that have both are shapes 21 to 25:

    library(ggbeeswarm)

    # shape 21 has a fill and a border color, so each point gets a white outline
    p <- d %>%
      count(source, source_continent) %>%
      ggplot(aes(x = source_continent, y = n)) +
      geom_quasirandom(aes(fill = source_continent),
                       color = "white", shape = 21) +
      coord_flip()

## Free travel (visas) for everyone!

The existence of a visa relation doesn’t tell us how much effort it takes to apply for the visa or how difficult it is to have it granted. For an estimate of how many countries one can visit without much hassle, we can restrict the data to free tourist visas. I first wondered if there might be some countries that don’t get any free tourist visas, but this does not seem to be the case. At least in this data set, every country gets at least 11 free visas. (Remember, this is a lower bound.)
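A minimal sketch of that restriction, using a tiny made-up data frame. The column name tourist_visa (the visa cost, with NA where no information was found) is an assumption for illustration and not necessarily the name used in the actual data set.

```r
library(dplyr)

# Toy data; `tourist_visa` is a hypothetical column holding the visa cost
# (NA = no information found online).
d_toy <- tibble(
  source       = c("A", "A", "A", "B", "B", "C"),
  target       = c("B", "C", "D", "A", "C", "A"),
  tourist_visa = c(0, 0, 25, 0, NA, 60)
)

# Count free tourist visas per source country. Summing a logical keeps
# countries with zero free visas in the result; a plain filter() + count()
# would silently drop them.
free_visas <- d_toy %>%
  group_by(source) %>%
  summarise(n_free = sum(tourist_visa == 0, na.rm = TRUE))

# Smallest number of free visas any country gets
min(free_visas$n_free)
```

Counting this way matters for the question of whether any country gets no free visas at all: such a country should show up with a zero instead of disappearing from the table.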

I am going to use the same plot style as above but make a small modification. One problem with the previous plot is that the countries Luxembourg and China are both represented by the same amount of ink. To better represent the actual number of people affected by a visa policy, it is better to use the population for the bubble sizes:
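A minimal sketch of that modification on made-up numbers. The population column and the toy values are assumptions for illustration; the only real change compared to the snippet above is mapping size inside aes().

```r
library(dplyr)
library(ggplot2)
library(ggbeeswarm)

# Toy data; `population` (in millions) is a hypothetical column.
counts_toy <- tibble(
  source           = c("A", "B", "C", "D"),
  source_continent = c("Europe", "Europe", "Asia", "Asia"),
  n                = c(119, 110, 30, 95),
  population       = c(0.6, 83, 1400, 5.7)
)

p <- counts_toy %>%
  ggplot(aes(x = source_continent, y = n)) +
  geom_quasirandom(aes(fill = source_continent, size = population),
                   color = "white", shape = 21) +
  scale_size_area(max_size = 15) +  # bubble area proportional to population
  coord_flip()
```

scale_size_area() is used here so that the area of a bubble, not its radius, scales with the population, and a population of zero would map to a size of zero.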

The general distribution mostly stays the same, just shifted to the left. At the lower end is Iraq (again) with only 11 free visas. The US has been overtaken by a whole group of European countries, led by Finland, Germany, and Sweden, all three of which get 119 free visas. The US is now behind Canada, Singapore, Japan, and Korea, and on the same level as Brazil with 106 free visas.

The bubble sizes make it much clearer that the lower field contains some of the most populous countries. We can see from the grey line (the median weighted by population) that one half of the world population gets fewer than 30 visas for free, while a good chunk of the other half gets around 100 free visas.
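For readers unfamiliar with a population-weighted median, here is a small sketch of one common definition: sort the countries by their number of free visas and take the first value at which the cumulative population share reaches one half. The function name weighted_median and the numbers are made up for illustration, and the plot may have been built with a different implementation.

```r
# Population-weighted median: the smallest value at which the cumulative
# population share reaches one half.
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

# Made-up numbers: free visas and population (in millions) for five countries
n_free     <- c(11, 25, 60, 100, 119)
population <- c(40, 1400, 100, 330, 83)

weighted_median(n_free, population)  # 25: the populous country dominates
median(n_free)                       # 60: every country counts equally
```

The gap between the two values is the effect the bubble sizes make visible: a few very populous countries with few free visas pull the weighted median down.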

## Exchange of Ideas

While free travel visas are nice to visit new cities and beaches, it is easier to form deeper connections with land and people through a longer stay, such as by studying in the country.

Obviously, some countries are such popular destinations for studying that higher visa costs wouldn’t deter international students; think of the US or Australia. However, I was interested to see what student visa policies look like in the rest of the world. As the data set forms a directed network, I’m going to plot the student visa data as a network. And since the nodes are countries, I’ll plot the network on top of a world map.

#### Another technical note

As this seems a rather complex plot, I was pleasantly surprised by how simple {ggraph} makes it.

First, we build the graph using {tidygraph} by extracting the countries as nodes from our data set. The data set itself provides the edges.

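    library(tidygraph)

    # one node per source country, with its coordinates
    nodes <- d %>%
      group_by(country = source) %>%
      summarise(lat = unique(source_lat),
                long = unique(source_long))

    # each row of the data set is a directed edge from source to target
    graph <- tbl_graph(edges = d %>% select(source, target),
                       nodes = nodes,
                       directed = TRUE)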

To build the plot, we first specify our layout manually by providing the coordinates for each node.

    g <- graph %>%
      ggraph(layout = "manual", x = nodes$long, y = nodes$lat)

Next, we’ll need the map, which is provided by the {maps} package via map_data("world"):

    country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                                   data = map_data("world"))

To get the full plot, we then simply add up the different layers:

    g +
      country_shapes +
      geom_edge_arc() +
      geom_node_point()

After adding aesthetics and some fine-tuning, the final plot looks like this:

To make this visualization less crowded, I omitted countries with fewer than a million people. The edges are colored by the continent of the visiting country, while the nodes are sized by their number of incoming edges, i.e. the number of free student visas the country gives out.
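A rough sketch of how the filtering and the node sizes could be computed with {tidygraph}, shown on a toy graph. The population column, the toy countries, and the name n_incoming are assumptions for illustration.

```r
library(dplyr)
library(tidygraph)

# Toy nodes and edges; `population` is a hypothetical column.
nodes_toy <- tibble(
  country    = c("A", "B", "C", "D"),
  population = c(5e6, 8e7, 4e5, 2e7)
)
edges_toy <- tibble(
  source = c("A", "B", "D", "C"),
  target = c("B", "A", "B", "A")
)

graph_toy <- tbl_graph(nodes = nodes_toy, edges = edges_toy,
                       directed = TRUE, node_key = "country") %>%
  activate(nodes) %>%
  # dropping a node also drops the edges attached to it
  filter(population >= 1e6) %>%
  # in-degree = number of incoming edges, i.e. free visas given out
  mutate(n_incoming = centrality_degree(mode = "in"))

as_tibble(graph_toy)
```

The new column can then be mapped to the node size in the plot, for example with geom_node_point(aes(size = n_incoming)).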

Again, it’s important to keep in mind that this data is not complete and the student visa data has more missing values than tourist visa data. However, I think it is still reasonable to assume that if no visa information is easily available, it means more effort for the student and it is thus less likely to be a common student destination. Of course, some of the most popular destinations have fees in place and thus these relations don’t show up here: the US has no incoming edges and Australia has only one coming from New Zealand.

The densest part of the network is centered at Europe: the EU makes it very easy for its citizens to study anywhere within its member countries. But it looks like it’s also very generous in giving free student visas to countries outside of the EU: most edges of all other continents seem to be directed towards Europe. This could just be because there are more countries in Europe but the same observation holds if we merge all EU countries into a single node:
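One way such a merge could be done on the edge list before building the graph, sketched on toy data. The member list is deliberately shortened, and both the toy edges and the choice to collapse parallel edges are assumptions for illustration.

```r
library(dplyr)

eu_members <- c("Germany", "France")  # shortened, illustrative list

edges_toy <- tibble(
  source = c("Germany", "France", "Brazil", "Brazil", "Kenya"),
  target = c("France", "Germany", "Germany", "France", "Brazil")
)

edges_merged <- edges_toy %>%
  # relabel every EU member as the single node "EU"
  mutate(across(c(source, target),
                ~ if_else(.x %in% eu_members, "EU", .x))) %>%
  filter(source != target) %>%  # drop edges that became EU -> EU loops
  distinct(source, target)      # collapse parallel edges into one

edges_merged
```

Collapsing parallel edges means that a country with free student visas to several EU members is counted only once, which is what removes the advantage Europe gets from simply having many countries.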

I found it curious that both North and South America have almost no free visa relations inside their own continent. Especially for South America, I expected more free visa connections among the countries.

Having visa relations between two countries doesn’t necessarily mean that people use this visa though. For example, the country offering the most free student visas in Africa is Benin which, according to Wikipedia, has one of the lowest literacy rates in the world. Benin managed to more than double their university enrollment, so their visa policy might be part of their education strategy. Still, it’s unlikely that generous visa policies by themselves lead to more international students.

## Summary

The more I learn about {ggplot2}, the more impressed I am with what is possible. On the other hand, the more new tricks I learn, the more tempted I am to spend more time on it. Anyway, I think the results were still worth the time invested and hopefully, next time, it takes less time to tweak colors and fonts.

Full code.