2025-02-24-Python-Soc-Sci

Reproducible analysis: Python for social scientists (24-25 February 2025; 9.30 to 14.30) https://softwaresaved.github.io/2025-02-24-Python-SocSci-online/

We are always happy to hear about your experience! Please fill in our feedback form; it will help us better prepare future workshops: https://forms.office.com/e/s8Cwa5inKp

Preparation details

Zoom: https://zoom.us/download
Jupyter Notebook, through Anaconda Navigator: https://www.anaconda.com/download/success
As an alternative to Jupyter, there is the Colaboratory service run by Google: https://colab.research.google.com/
Python lesson: https://swcarpentry.github.io/python-novice-gapminder/
Training data: https://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip

#Technical Solution 1 (Python 3 kernel missing)
-In Anaconda Prompt type and press enter:
conda install ipykernel

-If you encounter issues, this is a longer process:
conda activate base
-Then:
conda install ipykernel
-Then:
python -m ipykernel install --user --name=base --display-name "Python 3 (base)"

-Alternatively, reinstall jupyter (it should fix most of the issues)
conda install -n base jupyter
----------------------------------------------------------------------------

Attendance list (day 1 - please fill in when you join):

1) Andrzej Romaniuk (host)
2) Yat Wing Wingo Ng
3) Bartlomiej Chybowski (helper)
4) Rajith Lakshman
5) Helen Packwood
6) Paul Spence
7) Elena Tacchino
8) Rowan Hart
9) Pippa Thomson
10) Caroline Garcia Forlim
11) Miguel Carrero
12) Sahana Suraj
13) Rachel Gibson
14) Ilkay Holt
15) Rebecca Harris
16) Fiona Maguire
17) Ahmed Kamala
18) Taofeeq Badmus
19) Diego Chillón Pino (helper)
20) Ilia Afanasev
21) Tianqing Guo
22) Nuria Hermida
23) Marion Lieutaud
24) Danai Korre (Instructor)

Attendance list (day 2 - please fill in when you join):

1) Andrzej Romaniuk (host)
2) Max
3) Helen Packwood
4) Ilia Afanasev
5) Taofeeq Badmus
6) Rowan Hart
7) Bartlomiej Chybowski (helper)
8) Pippa Thomson
9) Rebecca Harris
10) Sahana Suraj
11) Miguel Carrero
12) Ilkay Holt
13) Rajith Lakshman
14) Caroline Garcia Forlim
15) Elena Tacchino
16) Ahmed Kamala
17) Tianqing Guo
18) Danai Korre (Instructor)
19) Fiona Maguire
20) Diego Chillón Pino (helper)
21) Yat Wing Wingo Ng
22)
23)

If you could be a character in any movie, what character and what movie would it be?

Essential shortcuts

Run cells: ctrl + Enter (Cmd + Enter on Mac works too)
Run cells and select below: shift + Enter
Run cells and add cell below: alt + Enter
Change cell type to code: esc + Y
Change cell type to markdown: esc + M
Add cell below: esc + B

Markdown syntax cheatsheet
https://www.markdownguide.org/cheat-sheet/

Python built in functions cheatsheet
https://www.pythoncheatsheet.org/cheatsheet/built-in-functions

Python Libraries students can make use of to aid their studies
https://docs.pyclubs.org/python-across-all-disciplines/disciplines/history

The Python Graph Gallery
https://python-graph-gallery.com/

Built-in functions and help

# What we put in a function is called an argument
# Functions can be built to include no arguments, one argument, or multiple arguments
print('before')
print()
print('after')

# Each function returns something. If there is nothing to return, we get None as the result
result = print('example')
print('result of print is', result)
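The same behaviour appears with methods that change an object in place. A minimal sketch, assuming a made-up list called scores: list.sort() returns None, while the built-in sorted() returns a new list.

```python
# Hypothetical values, used only to illustrate return values
scores = [3, 1, 2]

in_place = scores.sort()      # sorts the list itself and returns None
print('result of sort is', in_place)
print('scores is now', scores)

new_list = sorted([3, 1, 2])  # returns a new sorted list instead
print('result of sorted is', new_list)
```

Assigning the result of an in-place method to a variable is a common way to end up with None by accident.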

# Commonly used built-in functions for numbers are max(), min() and round(); max() and min() also work on strings
print(max(1, 2, 3))
print(min('a', 'A', '0'))

# Functions accept specific kinds of data, and some raise an error when given invalid data, e.g. mismatched argument types
# (the line below fails with a TypeError because Python cannot compare an integer with a string)
print(max(1, 'a'))
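If you want to see the error without stopping the notebook, one option is to catch it. This is only a sketch: the variable names and the string '7' are made up for illustration.

```python
# Catch the TypeError so the rest of the cell can keep running
try:
    print(max(1, 'a'))
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Converting both values to the same type makes the comparison valid
largest_number = max(1, int('7'))
print(largest_number)
```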

# Sometimes you do not need to provide some arguments, as they may already have a default value
# For example, round() has two arguments: the value you want to round (required), and the number of decimal places
# to round to (optional; if left out, the result has no decimal places)
round(3.712)
# The additional argument is given after the required one
round(3.712, 1)
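Optional arguments can usually be given by name as well as by position. A small sketch using round(), whose second parameter is called ndigits (the variable name value is made up):

```python
value = 3.712

print(round(value))              # one argument: rounds to a whole number
print(round(value, 1))           # second argument given by position
print(round(value, ndigits=2))   # second argument given by name
```

Naming the argument makes the call easier to read when a function has several optional arguments.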

# There is also another type of function, attached to the end of an object with a dot
# This is called a method
# Some repeat what standard functions do, but with this design you do not necessarily need to put arguments in ()
# Methods can aid readability, and you can chain them together in a sequence, each after a .
# Below, you can see how it works, using .__len__() as an alternative to len() and .swapcase() to swap lowercase to uppercase and vice versa
my_string = 'Hello world!'
print(len(my_string))
print(my_string.swapcase())
print(my_string.__len__())

# Methods can be chained together, and are executed in the sequence they are written
print(my_string.isupper()) # Not all the letters are uppercase
print(my_string.upper()) # This capitalizes all the letters

print(my_string.upper().isupper()) # Now all the letters are uppercase
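A chain is easier to follow when you compare it with the same steps written one at a time. The sketch below cleans a made-up survey answer (raw_answer is a hypothetical value) both ways:

```python
raw_answer = '  Strongly AGREE '   # made-up survey response

# One step at a time
step_1 = raw_answer.strip()        # remove surrounding spaces
step_2 = step_1.lower()            # convert to lowercase
step_3 = step_2.replace(' ', '_')  # replace inner spaces with underscores

# The same steps chained, executed left to right
chained = raw_answer.strip().lower().replace(' ', '_')
print(step_3)
print(chained)
```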

# The help() function is a good starting point if you want to know more about a specific function
help(round)

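# A syntax error is the most common kind of error: Python cannot even read the code
# (the three examples below are deliberately broken)
# Forgot to close the quote marks around the string.
name = 'Feng

# An extra '=' in the assignment.
age = = 52

# Forgot to close the parenthesis.
print("hello world"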

# A runtime error happens while the code is executing (here a NameError, because 'aege' was never defined)
age = 53
remaining = 100 - aege # mis-spelled 'age'
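The difference between the two kinds of error can be seen in one cell. In this sketch the broken assignment is kept as text and handed to the built-in compile(), which only parses it; the variable names are made up for illustration.

```python
broken_source = "age = = 52"   # the same mistake as above, kept as text

# A syntax error is found while Python reads the code, before anything runs
try:
    compile(broken_source, '<example>', 'exec')
except SyntaxError as err:
    syntax_problem = type(err).__name__
    print('found while parsing:', syntax_problem)

# A runtime error is only found when the faulty line is actually executed
try:
    age = 53
    remaining = 100 - aege     # mis-spelled 'age'
except NameError as err:
    runtime_problem = type(err).__name__
    print('found while running:', runtime_problem, '-', err)
```

Note that age = 53 still ran successfully before the NameError was raised on the next line.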

Exercise 1

Predict what each of the print statements in the program below will print.
Does max(len(rich), poor) run or produce an error message? If it runs, does its result make any sense?

easy_string = "abc"
print(max(easy_string))
rich = "gold"
poor = "tin"
print(max(rich, poor))
print(max(len(rich), len(poor)))

Exercise 2

Why is it that max and min do not return None when they are called with no arguments?

Exercise 3

If Python starts counting from zero, and len returns the number of characters in a string, what index expression will get the last character in the string name? (Note: we will see a simpler way to do this in a later episode.)

Libraries

# Libraries are collections of files, mostly containing new functions for specific tasks or for extending the environment you are coding in
# Libraries are loaded by using import
import math

# Now you can use things from this library by typing math, then a ., then the name, e.g. the value of pi here
print('pi is', math.pi)
# You have to mention the library name each time
print('cos(pi) is', math.cos(math.pi))

# If you are unsure what a library contains, and how it can help you, you can check it via help() function
help(math)

# That's a lot
# But what if we only want specific functions?
# You can use from to point to a library, and import to pick the functions you want
# Notice you do not need to mention the library name this time!

from math import cos, pi
print('cos(pi) is', cos(pi))

# Given that some libraries have quite long names, we can use "as" to set an alias that we will use when calling functions from them

import math as m
print('cos(pi) is', m.cos(m.pi))
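All three import styles give access to the same objects; only the name you type changes. A quick check, as a sketch:

```python
import math
import math as m
from math import cos, pi

print(m is math)          # the alias points at the same module
print(cos is math.cos)    # the directly imported function is the same function
print(math.cos(math.pi), m.cos(m.pi), cos(pi))
```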

Ex 1: Exploring the Math Module

What function from the math module can you use to calculate a square root without using sqrt?
Since the library contains this function, why does sqrt exist?

Ex 2: Jigsaw Puzzle (Parson’s Problem) Programming Example

Rearrange the following statements so that a random DNA base is printed and its index in the string. Not all statements may be needed. Feel free to use/add intermediate variables.

bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)

Ex 3: Importing With Aliases

Fill in the blanks so that the program below prints 90.0.
Rewrite the program so that it uses import without as.
Which form do you find easier to read?

import math as m

angle = ____.degrees(___.pi / 2)
print(____)

Reading tabular data into dataframes

# The library for working with data frames (i.e. tables!) is pandas
import pandas as pd

# Here we import data from a csv file
# Remember to put the downloaded data folder, extracted, into your working directory (on Windows, usually C:/Users/NAME)
data_oceania = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data_oceania)

# We can use the values of one column (here 'country') as row labels by passing index_col
data_oceania_country = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data_oceania_country)

# we can use .info() to check the data frame
data_oceania_country.info()

# print out columns
print(data_oceania_country.columns)

# .describe() is a useful method when you need quick summary statistics
print(data_oceania_country.describe())
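If the data files are not in place yet, the same calls can be tried on a tiny table typed directly into the notebook. This is only a sketch: the two countries and all numbers are made up, and io.StringIO is used to make the text behave like a file.

```python
import io
import pandas as pd

# Made-up numbers in the same layout as the Gapminder files (not real data)
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Country A,1000.0,1500.0
Country B,2000.0,2600.0
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col='country')
print(toy)
print(toy.index.tolist())      # row labels come from the 'country' column
print(toy.columns.tolist())    # 'country' is no longer an ordinary column
print(toy.describe().loc['mean'])
```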

Ex. 1: Reading Other Data

Read the data in gapminder_gdp_americas.csv (which should be in the same directory as gapminder_gdp_oceania.csv) into a variable called data_americas and display its summary statistics.

Ex. 2: Inspecting Data

After reading the data for the Americas, use help(data_americas.head) and help(data_americas.tail) to find out what DataFrame.head and DataFrame.tail do.

What method call will display the first three rows of this data?
What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)

Ex. 3 : Writing Data

As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called processed.csv. You can use help to get information on how to use to_csv.

Pandas DataFrames

import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

# iloc[..., ...] is the basic way to query a data frame: it selects a value or values by their numerical position (row, column) in the 2D table
print(data.iloc[0, 0])

# However, you do not need to use numerical positions: row and column names (i.e. labels) also work, with loc[..., ...]
print(data.loc["Albania", "gdpPercap_1952"])

# We can also use :, which on its own means "all entries from this row or column"
print(data.loc["Albania", :])

print(data.loc[:, "gdpPercap_1952"])

# We can query multiple values at once; below we define a range (slice) of rows and columns we want to see
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

# We can chain other operations onto our query, e.g. to get the maximum or minimum of each selected column
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000? We can use the > comparison operator
print('\nWhere are values large?\n', subset > 10000)

# We can use boolean masking: values where the mask is False are replaced with NaN, hiding the ones we are not interested in
mask = subset > 10000
print(subset[mask])

# Here we do it to have summary statistics only for those above a specific value
print(subset[subset > 10000].describe())
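To see what masking does to the numbers, it helps to try it on a table small enough to check by hand. A minimal sketch with made-up values (toy is a hypothetical data frame, not the Gapminder data):

```python
import pandas as pd

# Made-up values, chosen so that one entry falls below the threshold
toy = pd.DataFrame(
    {'gdpPercap_1962': [8000.0, 12000.0], 'gdpPercap_1967': [11000.0, 15000.0]},
    index=['Country A', 'Country B'],
)

masked = toy[toy > 10000]
print(masked)           # the 8000.0 entry is shown as NaN
print(masked.count())   # NaN entries are not counted
print(masked.mean())    # ...and are skipped when averaging
```

This is why the summary statistics after masking describe only the values that passed the condition.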

# We can chain these steps together for analysis
# First, we create a mask marking, for each year, the countries with GDP per capita above that year's mean
mask_higher = data > data.mean()
# Then we calculate a wealth score: the fraction of years in which a country was above the mean
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
# And now we print the results
print(wealth_score)
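The wealth score can be checked by hand on a tiny table. This sketch uses three made-up countries and two years; .sum(axis=1) is used here as a shorter spelling of aggregate('sum', axis=1).

```python
import pandas as pd

# Made-up values; column means are 3000.0 (1952) and 4000.0 (1957)
toy = pd.DataFrame(
    {'gdpPercap_1952': [1000.0, 3000.0, 5000.0],
     'gdpPercap_1957': [5000.0, 2000.0, 5000.0]},
    index=['Country A', 'Country B', 'Country C'],
)

mask_higher = toy > toy.mean()                              # True where above that year's mean
wealth_score = mask_higher.sum(axis=1) / len(toy.columns)   # True counts as 1, False as 0
print(mask_higher)
print(wealth_score)
```

Country A is above the mean in one of two years (0.5), Country B in none (0.0), and Country C in both (1.0).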

Ex. 1: Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:

import pandas as pd
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the Per Capita GDP of Serbia in 2007.

Ex. 2:

Do the two statements below produce the same output?
Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

print(data_europe.iloc[0:2, 0:2])
print(data_europe.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

PLOTTING

# matplotlib is the most commonly used data visualisation library for plots in Python
import matplotlib.pyplot as plt

# Basic plotting is simple: pass the data to plt.plot() as arguments, then add further elements such as axis labels
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')

# We can also plot directly from pandas data frames, let's see
import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

# Extract year from last 4 characters of each column name
# The current column names are structured as 'gdpPercap_(year)',
# so we want to keep the (year) part only for clarity when plotting GDP vs. years
# To do this we use replace(), which swaps the text given in the first argument for the text in the second
# (here an empty string, so 'gdpPercap_' is simply removed)
# replace() works on strings, so we use the vectorized .str.replace() that pandas provides for column labels

years = data.columns.str.replace('gdpPercap_', '')

# Convert year values to integers, saving the results back as the data frame's column labels

data.columns = years.astype(int)
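The renaming step can be tried on its own, without loading any file. A minimal sketch on three column labels typed by hand (the variable names are made up):

```python
import pandas as pd

columns = pd.Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962'])

years_as_text = columns.str.replace('gdpPercap_', '')   # still strings
years_as_int = years_as_text.astype(int)                # now whole numbers
print(years_as_text.tolist())
print(years_as_int.tolist())
print(years_as_int.max() - years_as_int.min())          # arithmetic works on integers
```

Having the years as integers means the x axis of a plot is treated as a number line rather than as a set of text labels.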

data.loc['Australia'].plot()

# let's plot transposed data
data.T.plot()
plt.ylabel('GDP per capita')

# We can do multiple types of visualisation
plt.style.use('ggplot')            # the visual style we want (ggplot comes from R; this mimics that style in Python)
data.T.plot(kind='bar')            # kind= is where we define the type of plot
plt.ylabel('GDP per capita')       # here we define the y label to display

# Another example
years = data.columns
gdp_australia = data.loc['Australia']

# 'g--' is a green dashed line
plt.plot(years, gdp_australia, 'g--')

# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']

# Plot with differently-colored lines.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')

plt.scatter(gdp_australia, gdp_nz)

data.T.plot.scatter(x = 'Australia', y = 'New Zealand')
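The short format strings such as 'b-' and 'g--' combine a colour letter with a line style. The sketch below plots two made-up series and then reads the line styles back from the figure. The 'Agg' backend line is an assumption so that the code also runs outside Jupyter; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

years = [1952, 1957, 1962]
series_a = [100, 150, 210]   # made-up values
series_b = [90, 160, 200]    # made-up values

plt.figure()
plt.plot(years, series_a, 'b-', label='Series A')    # blue solid line
plt.plot(years, series_b, 'g--', label='Series B')   # green dashed line
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('Value')

# Inspect what was drawn on the current axes
lines = plt.gca().get_lines()
print(len(lines), [line.get_linestyle() for line in lines])
```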

Ex. 1:
Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)

# If you missed this part
import matplotlib.pyplot as plt
import pandas as pd

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')
data_europe.max().plot(label='max')
plt.legend(loc='best')
plt.xticks(rotation=90)

#Alternative (by participant)
data_europe.aggregate('min').plot(label='min')
data_europe.aggregate('max').plot(label='max')
plt.legend(loc='best')
plt.xticks(rotation=90)

Ex. 2:
Modify the example in the notes to create a scatter plot showing the relationship between the minimum and maximum GDP per capita among the countries in Asia for each year in the data set. What relationship do you see (if any)?

data_asia = pd.read_csv('data/gapminder_gdp_asia.csv', index_col='country')
data_asia.describe().T.plot(kind='scatter', x='min', y='max')

LISTS

# A list is essentially a one-dimensional structure (in contrast to a 2D data frame) containing multiple values
# We use [] and , to define what we want to include in the list
pressures = [0.273, 0.275, 0.277, 0.275, 0.276]
print('pressures:', pressures)
print('length:', len(pressures))

# You can fetch a specific value in a similar way to how you would fetch a character from a string
print('zeroth item of pressures:', pressures[0])
print('fourth item of pressures:', pressures[4])

# You can overwrite specific values on a list
pressures[0] = 0.265
print('pressures is now:', pressures)

# You can also add new elements using .append()
primes = [2, 3, 5]
print('primes is initially:', primes)
primes.append(7)
print('primes has become:', primes)

# Lists can also be combined; use .extend() for that, because .append() adds the whole list as a single nested element (see the last step below)
teen_primes = [11, 13, 17, 19]
middle_aged_primes = [37, 41, 43, 47]
print('primes is currently:', primes)
primes.extend(teen_primes)
print('primes has now become:', primes)
primes.append(middle_aged_primes)
print('primes has finally become:', primes)
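The difference between the two methods is easiest to see side by side. A small sketch with made-up lists (first and second are hypothetical names); the + operator is shown as a third option that builds a new list:

```python
first = [2, 3, 5]
second = [7, 11]

extended = first.copy()
extended.extend(second)    # adds each item of second individually

appended = first.copy()
appended.append(second)    # adds second itself as one nested item

combined = first + second  # builds a new list, leaving both inputs unchanged
print(extended, len(extended))
print(appended, len(appended))
print(combined)
```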

# also, we can use del to remove an entry from a list
primes = [2, 3, 5, 7, 9]
print('primes before removing last item:', primes)
del primes[4]
print('primes after removing last item:', primes)

# You can create a list with no values (useful if you want to declare it before populating it)
emptylist = []
len(emptylist)

# Lists can contain both numbers and strings
goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.']
print(goals)

# Strings can be indexed like lists
element = 'carbon'
print('zeroth character:', element[0])
print('third character:', element[3])
#... but cannot be modified that way: strings are immutable, so the next line raises a TypeError
element[0] = 'C'
# We also cannot index beyond the string's length: the next line raises an IndexError
print('99th element of element is:', element[99])
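Both errors have simple workarounds. As a sketch: build a new string instead of changing the old one, and compare the position against len() before indexing (the names capitalised and position are made up).

```python
element = 'carbon'

# Build a new string from a new first letter plus the rest of the old string
capitalised = 'C' + element[1:]
print(capitalised)
print(element)          # the original string is unchanged

# Check the position against the length before indexing
position = 99
if position < len(element):
    print(element[position])
else:
    print('position', position, 'is outside a string of length', len(element))
```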

Ex. 1: What does the following program print?

element = 'fluorine'
print(element[::2])
print(element[::-1])

If we write a slice as low:high:stride, what does stride do?
What expression would select all of the even-numbered items from a collection?

LOOPS

# A for loop essentially repeats a specific piece of code once for each item in a sequence
for number in [2, 3, 5]:
   print(number)

# The sequence can be defined beforehand as a list, which is useful if you have to iterate through a list of values
primes = [2, 3, 5]

for p in primes:
   squared = p ** 2
   cubed = p ** 3
   print(p, squared, cubed)
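A loop can also build up a result across repetitions by updating a variable defined before the loop. A minimal sketch, assuming we want the sum of the squares of the same primes (sum_of_squares is a made-up name):

```python
primes = [2, 3, 5]

sum_of_squares = 0                  # start value, set before the loop
for p in primes:
    sum_of_squares = sum_of_squares + p ** 2
    print('after', p, 'the running total is', sum_of_squares)
print('final total:', sum_of_squares)
```

Only the indented lines are repeated; the last print runs once, after the loop has finished.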

Ex. 1: Create a table showing the numbers of the lines that are executed when this program runs, and the values of the variables after each line is executed.

total = 0
for char in "tin":
    total = total + 1

1 total = 0
2 total = 0 char = 't'