*Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

 ----------------------------------------------------------------------------
* Useful links
	* Zoom Links- the password was given in the emails you received prior to the workshop.
		* Pre Workshop meeting:  https://us02web.zoom.us/j/81392842565
		* Workshop: https://us02web.zoom.us/j/81353680435
	* Workshop webpage: https://edcarp.github.io/2020-07-01-sfc-online/
	* Dataset and setup: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
	* Data Skills Workforce Development https://www.ed.ac.uk/bayes/about-us/our-work/education/workforce-development
	* Code of Conduct https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
	* Pre-workshop survey: https://carpentries.typeform.com/to/wi32rS?slug=2020-07-01-sfc-online
	* Post-workshop survey: https://carpentries.typeform.com/to/UgVdRQ?slug=2020-07-01-sfc-online
*Attendance list
Please write your name to confirm your attendance at the workshop:
	1. Gillian Buchan X
	2. Mitchelle Mnemo X
	3. Vladimir Cvetkovic X
	4. Alison Harvey X
	5. Gordon Renfrew X
	6. Niall McCandlish X
	7. Matt Quinn X
	8. Ian Watt X
	9. Nathan Goodfriend X
	10. Anna Shirokova
	11. Giammarco Nalin X
	12. Steve Ford X
	13. Lila X

*Operating system: 
Mac, Windows 10, Linux, Mac OSX,


*EDINBURGH CARPENTRIES
https://edcarp.github.io/ -- sign up to the mailing list (bottom of the homepage)
Get in touch with g.peru@epcc.ed.ac.uk 


*Timetable
*Time schedule Wednesday
		* 9:00-09:15 Intro 
		* 9:15-10:30 Data organisation with Spreadsheets 
		* 10:30-11:00 Coffee break 1
		* 11:00-11:45 Data organisation with Spreadsheets cont'd
		* 11:45-12:30 Data cleaning with OpenRefine 
		* 12:30-13:30 Lunch break 
		* 13:30-14:30 Data cleaning with OpenRefine cont'd
		* 14:30-14:45 Coffee break 2
		* 14:45-15:45 Python 
		* 15:45-16:00 Coffee break 3
		* 16:00-17:00 Python 
*Time schedule  Thursday
		* 9:00-10:30 Python 
		* 10:30-11:00 Coffee break 1
		* 11:00-12:30 Python 
		* 12:30-13:30 Lunch break 
		* 13:30-15:00 Python 
		* 15:00-15:30 Coffee break 2
		* 15:30-17:00 Python 
*Day 1 morning

*Lesson 1: Data organisation with spreadsheets
Setup page: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop

	* You will need any spreadsheet program, e.g. Excel, OpenOffice, Google Sheets, LibreOffice
	* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4  Unzip it to a location that you can easily find on your computer.

*Biggest spreadsheet fails
https://blogs.oracle.com/smb/10-of-the-costliest-spreadsheet-boo-boos-in-history
*CSV file format
.csv format: https://www.lifewire.com/csv-file-2622708 
CSV = comma-separated values format - a way to save spreadsheets in plain text files (each line is one row, and values within a row are separated by commas), e.g.:

Name,weight,height
Harry,85,190
Mia,60,165
Dennis,75,183
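
A minimal sketch of how the small table above could be read in Python, using only the standard-library csv module. The text is held in a string rather than a file so that the example is self-contained; the variable names are illustrative only.

```python
import csv
import io

# The example table from above, held in a string instead of a file.
csv_text = """Name,weight,height
Harry,85,190
Mia,60,165
Dennis,75,183
"""

# DictReader uses the first line as the column names.
# Every value is read as a string until we convert it ourselves.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["Name"], int(row["weight"]), int(row["height"]))
```

Because a CSV file is plain text, the same content can be opened in any text editor as well as in a spreadsheet program.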

Article discussing a demonstrative example:
https://www.bbc.co.uk/news/magazine-22223190 
*Formatting issues
Colour coding used in one field
Zero and blanks both used for empty fields
Multiple values in one cell 'oxen, cows, goats' etc
Data on different tabs - hard to compare
Using keys such as * and highlights to convey meaning
Inconsistent variables (mabatisloping/matabi_sloping)
*used as a key in 1 tab
character and numerical symbols, mixed.
variation in use of a space or a '_' 
Missing data
Date given for Tanzania collection, but not for Mozambique
In Tanzania data, does a blank livestock cell mean 0? If so, why is 0 written in some cases?

Brackets used sometimes for singular and not others.
Spelling mistakes in the data
Y and N and yes and no used interchangeably and not consistently ('only in summer')
negative number for rooms?
-99 rooms?
different text used for the same roof_type
Mixed data in the one cell
Some names have underscore instead of space
Mozambique tab: livestock column presents a bulk number for separate categories which are all slightly different, mixing numerical with strings/text; in the plots table water use is not coded in a consistent way. - Tanzania: dwelling, a comment in the rooms number, livestock is a mix of numerical and text values, look after cows again includes a comment that is hard to interpret
Mozambique key id missing in livestock
roof_type - use of underscores is variable
Do barns=cowsheds?
Tables in different countries start in different rows!! Grrrrr :(
F12/Tanzania - use of asterisks forces a string in cell not an integer
missing data - some rows blank/empty
Tanzania livestock data captured differently to Mozambique
Tanzania has a mix of both numerical and categorical data in the same column
tanzania mixing numerical data and text in one column
Negative values for room entries and plots 
Multiple tables on one tab
Negative numbers eg "-999"
Numbers with asterisks to notes
Yes AND no value with explanatory note
Text (yes / no ) in otherwise numeric column
Mix of Yes, No, Y, N, and numbers in single column
in Tanzania, * used twice but for multiple meanings
Mozambique - Plots - multiple variables with no formatting

The key columns do not represent the same type of records across the two workbooks
*Tidy data
*Metadata
ukdataservice.ac.uk 

* one observation per row
* one variable per column
* one value per cell (cells should not be empty)
* one dataset per table
Which is the separation character (i.e. space, comma, semicolon)?
Village location?
Text explanation of each column title
instance_id means?
Items owned and months_lack_food columns are still showing multiple entries in 1 field
What is classified as a room?
Definition of frequently for affect_conflicts
Would express months lacking food as a number, rather than naming them - if significant, have each month in a different column, if you need to know food poverty in Jan, for eg.
How to handle NULL values in specific columns
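
One possible way to approach the NULL question above, sketched with the standard library only. The sample rows are made up (only the column names echo the SAFI data), and the set of strings treated as "missing" is an assumption that would need to be agreed and recorded in the metadata.

```python
import csv
import io

# Hypothetical sample: column names echo the SAFI data, the values are made up.
sample = """key_id,rooms,months_lack_food
1,3,Jan;Feb
2,-99,NULL
3,,none
"""

# Assumption: these strings all mean "no value recorded".
MISSING_CODES = {"", "NULL", "-99", "-999"}

def clean(value):
    """Return None for a missing-value code, otherwise the value unchanged."""
    return None if value.strip() in MISSING_CODES else value

records = [
    {column: clean(value) for column, value in row.items()}
    for row in csv.DictReader(io.StringIO(sample))
]

for record in records:
    print(record)
```

In pandas, the na_values argument of pd.read_csv plays a similar role when loading a file.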

*Comment from Zoom Chat
Operating systems and tools now have default encodings they use; it is worth checking which one it is. It should be UTF-8 but does not have to be.
People used to use Latin-1 a while back, as an extension to ASCII for all latin characters. But the world is moving towards UTF-8.
“Mac OS Roman is a character encoding primarily used by the classic Mac OS to represent text.” !!! See: https://en.wikipedia.org/wiki/Mac_OS_Roman
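
A small sketch of why the encoding matters, using an example word chosen for illustration. The same text becomes different bytes in UTF-8 and Latin-1, and reading bytes with the wrong encoding garbles any non-ASCII characters.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # é takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # é takes one byte in Latin-1

print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding garbles the non-ASCII character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Decoding with the matching encoding restores the original text.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

When opening a file in Python, passing encoding="utf-8" to open() makes the choice explicit instead of relying on the system default.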

*Feedback for the SpreadSheet session
Can you please come up with one thing you liked about the morning session on spreadsheets and one thing you did not like or that could be improved. Please add these as bullet points.
*Thing I liked
	* format of date and entries (subtle mistakes), limit of digits for large numbers in integer format
	* creating tables using the Insert -> Table approach
	* I liked the info re the dates in Excel - something I always seem to have bother with.
	* Well paced
	* The closer look at data type/tidying was well worth it
*Thing that could be improved
	* the introduction could go faster
	* slow to get started
	* I was hoping for slightly more advanced level

*Lesson 2: Data cleaning with OpenRefine
	* Installation instructions: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
	* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4  Unzip it to a location that you can easily find on your computer.

Download OpenRefine

URL: http://openrefine.org/download.html 
*OpenRefine 3.3
http://127.0.0.1:3333/
The final release of OpenRefine 3.3, released on January 31, 2020. Please backup your workspace directory before installing and report any problems that you encounter. A change log is provided on the release page.
		* Windows kit, Download, unzip, and double-click on openrefine.exe. If you’re having issues with the above, try double-clicking on refine.bat instead.
		* Mac kit, Download, open, drag icon into the Applications folder and double click on it. If you encounter a security warning, see workaround.
		* Linux kit, Download, extract, then type ./refine to start.

Setup/ Install OpenRefine - short video tutorial:
https://vimeo.com/339788742 

https://figshare.com/articles/SAFI_Survey_Results/6262019

*Custom facet
value.replace("[", "").replace("]", "").replace("'", "")
value.split(";")
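
For comparison, a rough Python equivalent of the two GREL expressions above. The cell value is a hypothetical one written in the style of the SAFI items_owned column, and the extra trimming of spaces is an addition that is not part of the GREL.

```python
# Hypothetical cell value in the style of the SAFI items_owned column.
value = "['bicycle' ;  'television' ;  'solar_panel']"

# The same three replacements as the GREL expression above.
cleaned = value.replace("[", "").replace("]", "").replace("'", "")

# Split on ";" and trim the spaces left around each item
# (the trimming is an extra step, not part of the GREL above).
items = [item.strip() for item in cleaned.split(";")]

print(items)
```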
*Latitude exercise

Chirodzo
Chirodzo - same in both tests
Chirodzo
Chirodzo
Chirodzo - the date values were clear, but the GPS for several villages seemed too close to call (I might try again)

*Feedback for the OpenRefine session
Can you please come up with one thing you liked about the afternoon session on OpenRefine and one thing you did not like or that could be improved. Please add these as bullet points.

*Thing I liked
	* How to export data, the cleaning steps and whole projects
	* Very interesting and insightful. The amount of hours I could have saved over the years if I knew this existed is scary
	* Seems very powerful. I can think of lots of times where I could use this rather than Excel!
	* Great introduction, enough ground covered for 'initiation'
*Thing that could be improved
	* went a bit fast at the end
	* A few more individual exercises would be good just to cement what we had learned.
	* The above comment is a good point: it might be nice to have another data set with a set of exercises we need to go through, so a particular problem we have to solve after clean up.

*Day 1: Afternoon and Day 2 
*Lesson 3: Python
	* Installation instructions: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
	* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4 . Unzip it to a location that you can easily find on your computer.
	* Python lesson material: https://edcarp.github.io/2020-07-01-sfc-online/files/python-novice-gapminder-data.zip
	* Robert's slides, jupyter script files, raw input data file (gapminder): https://github.com/robertn01/Data-Carpentry-Social_Sciences-Python/tree/master 
		* https://github.com/robertn01/Data-Carpentry-Social_Sciences-Python


Direct download URLs:
Download & install Python 3.x - https://www.python.org/downloads/ 
Download & install Anaconda - https://www.anaconda.com/products/individual (see bottom of the page!)

Setup/ installation:
Jupyter Notebook
 https://www.youtube.com/watch?v=HW29067qVWk 
 
 Installing Anaconda on Windows:
 https://www.youtube.com/watch?v=uOwCiZKj2rg

Life saver - should be in Google Cloud: Google Colab
https://colab.research.google.com/notebooks/welcome.ipynb 

Data: 
	* https://edcarp.github.io/2020-07-01-sfc-online/files/python-novice-gapminder-data.zip


*different keyboard input modes:

Jupyter has two different keyboard input modes:
Command mode - binds the keyboard to notebook level actions. Indicated by a grey cell border with a blue left margin.
Edit mode - when you’re typing in a cell. Indicated by a green cell border.


From zoom chat:

Your display with Zoom may get a little crowded, here are some helpful shortcuts to switch panels on / off
*macOS:
Command(⌘)+U: Display/hide Participants panel
Command(⌘)+Shift+H: Show/hide In-Meeting Chat Panel
*Windows
Alt+H: Display/hide In-Meeting Chat panel
Alt+U:Display/hide Participants panel
*More shortcuts here:
https://support.zoom.us/hc/en-us/articles/205683899-Hot-Keys-and-Keyboard-Shortcuts-for-Zoom

*Errors in Spreadsheets:
https://blogs.oracle.com/smb/10-of-the-costliest-spreadsheet-boo-boos-in-history

Excel specifications and limits https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
*UK dataservice (check how they use metadata for their data)
https://ukdataservice.ac.uk/

Any new/ unanswered questions regarding Python/ Anaconda/ Jupyter Lab - please list below:


*Day-2: Python notes
*Exercise:fill in the blanks
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)

import pandas as pd
import matplotlib.pyplot as plt

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')
data_europe.max().plot(label='max') 
plt.legend(loc='best')
plt.xticks(rotation=90)

Lists & loop
my_list = [1, 32, 3, 4]
my_list = sorted(my_list)

for number in [1, 2, 4, 7]:
    print(number)
    print('This is another line')
print('This is a single line!')

for number in my_list:
    print(number)
    print('This is another line')
print('This is a single line!')

my_list_range = range(10)
for wahh in my_list_range:
    print(wahh)

m_l = range(100)
for num in m_l:
    square = num * num
    cube = num ** 3
    print(num, square, cube)
    
total = 0
for number in range(100):
    total = total + number
print(total)

Using a + sign does what you’d expect 
print("a" + "b") # ab 
print("a", "b") # a b
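
A short sketch of one case where + does not simply work: joining a string and a number. The variable names are illustrative only.

```python
count = 3
failed = False

# "+" only joins strings, so a number has to be converted first.
joined = "rooms: " + str(count)
print(joined)

# Mixing a string and a number with "+" raises a TypeError.
try:
    "rooms: " + count
except TypeError as error:
    print("TypeError:", error)
    failed = True
```

Passing the values to print() separated by commas avoids the conversion, because print() converts each argument to text itself.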

Conditionals/ if-else statements:

number = -1

if number > 0:
    print(number, 'is positive')
elif number == 0:
    print(number, 'is zero')
else:
    print(number, 'is negative')


list_of_numbers = [0, -1, -3, 0.0, 3.2, 10, -2]
for number in list_of_numbers:
    if number > 0:
        print(number, 'is positive')
    elif number == 0: # logical evaluation
        print(number, 'is zero')
    else:
        print(number, 'is negative')
        

import pandas as pd
filenames = ['raw_data/gapminder_gdp_africa.csv','raw_data/gapminder_gdp_asia.csv']
for filename in filenames:
    data = pd.read_csv(filename, index_col = 'country')
    print(filename, len(data),'different countries') 

import glob
import pandas as pd
filenames2 = glob.glob('raw_data/*gdp*.csv')
for filename in filenames2:
    data = pd.read_csv(filename, index_col = 'country')
    print(filename, len(data),'different countries')

combining what we learnt recently

import pandas as pd
import glob
import matplotlib.pyplot as plt

# read file names for continents
filenames = glob.glob('raw_data/gapminder_gdp_*.csv')

for file in filenames:
    # load the .csv into pandas
    data = pd.read_csv(file, index_col = 'country')
    # drop the non-numeric column (if present) before averaging
    if 'continent' in data.columns:
        del data['continent']
    mean = data.mean()
    # column names look like 'gdpPercap_1952'; remove the prefix to get the year
    years = data.columns.str.replace('gdpPercap_', '').astype(int)
    continents = file.replace('raw_data/gapminder_gdp_','').replace('.csv','')
    plt.plot(years, mean, label = continents)
plt.legend()
plt.xlabel('Year')
plt.ylabel('Average GDP')
plt.title('GDP in continents')

plt.savefig('gdp_in_continents.pdf')

Functions

def print_greet():
    print('Hello!')
    
def print_date(year, month, day):
    joined = str(day) + '/' +str(month) + '/' + str(year)
    print(joined)

print_date(1871, 3, 19)

def return_date(year, month, day):
    joined = str(day) + '/' +str(month) + '/' + str(year)
    return joined

def avg_values(values):
    if len(values) == 0:
        return None # no 'else' needed: the function exits as soon as it reaches a return
    return sum(values)/ len(values)
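
A brief usage sketch for the two returning functions above, with the definitions repeated so that the block runs on its own. The input values are chosen for illustration only.

```python
def return_date(year, month, day):
    """Build a day/month/year string and hand it back to the caller."""
    return str(day) + '/' + str(month) + '/' + str(year)

def avg_values(values):
    """Return the mean of a list, or None for an empty list."""
    if len(values) == 0:
        return None
    return sum(values) / len(values)

# A returned value can be stored and reused, unlike a printed one.
date = return_date(1871, 3, 19)
print(date)

print(avg_values([2, 4]))
print(avg_values([]))
```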
    

import pandas as pd

data = pd.read_csv('raw_data/gapminder_all.csv', index_col = 'country')

continents = ['Africa', 'Asia','Oceania', 'Europe', 'Americas']

for cont in continents:
    cont_data = data[data['continent'] == cont]
    cont_data.to_csv('raw_data/my_' + cont.lower() + '_data.csv')
    
Solution to the last exercise:
import pandas as pd

data = pd.read_csv("data/gapminder_all.csv", index_col='country')

# do this for all continents using a for loop
continents = ["Africa", "Asia", "Oceania", "Europe", "Americas"]

for continent in continents:
    continent_data = data[data['continent'] == continent]
    continent_data.to_csv("my_" + continent.lower() + "_data.csv")
# at the end, you should have my_americas_data.csv, my_africa_data.csv etc.
*Useful Links

https://python-graph-gallery.com/

====================================
*Use your new skills
(posted with Lucia's permission). I run a charity called Code The City. We run regular data hack weekends (now online) in which volunteers work with industry, charities or the cultural sector on set challenges. Our next event will be on 1st and 2nd August and will be on History and Culture. We will have a number of challenges which will spawn projects and teams. If you fancy getting involved, and using your new skills, please see https://codethecity.org/what-we-do/hack-weekends/code-the-city-20-history-and-culture/ 

And if you are interested in reading about previous events read this one https://codethecity.org/what-we-do/hack-weekends/code-the-city-19-history-data-innovation/  Thanks. Ian ( ian@codethecity.org )


*Thing I liked
	* Pace was better on Day 2 (for me too)
	* Very knowledgeable presenters
	* Covered many topics; I feel autonomous in starting to analyse my data with Python
	* Very informative. Day 2 pace seemed to be better, with good flow
*Thing that could be improved
	* More intro on notebooks, and when to use a notebook and when a straight script
	* difference between .py and .ipynb, better list of the needed software, more structured day 1
	* Bit more context of how Anaconda, JupyterLab etc fit together. Maybe a few more short individual exercises.