*Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------

*Useful links
* Zoom links - the password was given in the emails you received prior to the workshop.
    * Pre-workshop meeting: https://us02web.zoom.us/j/81392842565
    * Workshop: https://us02web.zoom.us/j/81353680435
* Workshop webpage: https://edcarp.github.io/2020-07-01-sfc-online/
* Dataset and setup: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
* Data Skills Workforce Development: https://www.ed.ac.uk/bayes/about-us/our-work/education/workforce-development
* Code of Conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
* Pre-workshop survey: https://carpentries.typeform.com/to/wi32rS?slug=2020-07-01-sfc-online
* Post-workshop survey: https://carpentries.typeform.com/to/UgVdRQ?slug=2020-07-01-sfc-online

*Attendance list
Please write your name to confirm your attendance at the workshop:
1. Gillian Buchan X
2. Mitchelle Mnemo X
3. Vladimir Cvetkovic X
4. Alison Harvey X
5. Gordon Renfrew X
6. Niall McCandlish X
7. Matt Quinn X
8. Ian Watt X
9. Nathan Goodfriend X
10. Anna Shirokova
11. Giammarco Nalin X
12. Steve Ford X
13. Lila X

*Operating system: Mac, Windows 10, Linux, Mac OSX

*EDINBURGH CARPENTRIES
https://edcarp.github.io/ -- sign up to the mailing list (bottom of the homepage); get in touch with g.peru@epcc.ed.ac.uk

*Timetable
*Time schedule Wednesday
* 9:00-09:15 Intro
* 9:15-10:30 Data organisation with Spreadsheets
* 10:30-11:00 Coffee break 1
* 11:00-11:45 Data organisation with Spreadsheets cont'd
* 11:45-12:30 Data cleaning with OpenRefine
* 12:30-13:30 Lunch break
* 13:30-14:30 Data cleaning with OpenRefine cont'd
* 14:30-14:45 Coffee break 2
* 14:45-15:45 Python
* 15:45-16:00 Coffee break 3
* 16:00-17:00 Python
*Time schedule Thursday
* 9:00-10:30 Python
* 10:30-11:00 Coffee break 1
* 11:00-12:30 Python
* 12:30-13:30 Lunch break
* 13:30-15:00 Python
* 15:00-15:30 Coffee break 2
* 15:30-17:00 Python

*Day 1 morning
*Lesson 1: Data organisation with spreadsheets
Setup page: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
* You will need any spreadsheet program, e.g. Excel, OpenOffice, Google Sheets, LibreOffice
* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4 Unzip it to a location that you can easily find on your computer.

*Biggest spreadsheet fails
https://blogs.oracle.com/smb/10-of-the-costliest-spreadsheet-boo-boos-in-history

*CSV file format
.csv format: https://www.lifewire.com/csv-file-2622708
CSV = comma-separated values format - a way to save spreadsheets in plain text files (each line is one row, and the values in each row are separated by commas)
, e.g.:
Name,weight,height
Harry,85,190
Mia,60,165
Dennis,75,183
Article discussing a demonstrative example: https://www.bbc.co.uk/news/magazine-22223190

*Formatting issues
* Colour coding used in one field
* Zero and blanks both used for empty fields
* Multiple values in one cell 'oxen, cows, goats' etc
* Data on different tabs - hard to compare
* Using keys such as * and highlights to convey meaning
* Inconsistent variables (mabatisloping/matabi_sloping)
* '*' used as a key in 1 tab
* character and numerical symbols, mixed.
* variation in use of a space or a '_'
* Missing data
* Date given for Tanzania collection, but not for Mozambique
* In Tanzania data, does livestock left blank = 0? If so, why is 0 written in some cases?
* Brackets used sometimes for singular and not others.
* Spelling mistakes in the data
* Y and N and yes and no used interchangeably and not consistently ('only in summer')
* negative number for rooms? -99 rooms?
* different text used for the same roof_type
* Mixed data in the one cell
* Some names have underscore instead of space
* Mozambique tab: livestock column presents bulk number for separate categories which are all slightly different, mixing numerical with strings/text; in plots table water use is not coded in a consistent way. - Tanzania: dwelling, a comment in the rooms number, livestock is a mix of numerical and text values, look after cows again includes comment hard to interpret
* Mozambique key id missing in livestock
* roof_type - use of underscores is variable
* Do barns=cowsheds?
* Tables in different countries start in different rows!! Grrrrr :(
* F12/Tanzania - use of asterisks forces a string in cell, not an integer
* missing data - some rows blank/empty
* Tanzania livestock data captured differently to Mozambique
* Tanzania has a mix of both numerical and categorical data in the same column
* Tanzania mixing numerical data and text in one column
* Negative values for room entries and plots
* Multiple tables on one tab
* Negative numbers eg "-999"
* Numbers with asterisks to notes
* Yes AND no value with explanatory note
* Text (yes / no) in otherwise numeric column
* Mix of Yes, No, Y, N, and numbers in single column
* in Tanzania, * used twice but for multiple meanings
* Mozambique - Plots - multiple variables with no formatting
* The key columns do not represent the same type of records across the two workbooks

*Tidy data
*Metadata ukdataservice.ac.uk
* one observation per row
* one variable per column
* one value per cell (cells should not be empty)
* one dataset per table
* Which is the separation character (i.e. space, comma, semicolon)?
* Village location?
* Text explanation of each column title
* What does instance_id mean?
* Items owned and months_lack_food columns are still showing multiple entries in 1 field
* What is classified as a room?
* Definition of frequently for affect_conflicts
* Would express months lacking food as a number, rather than naming them - if significant, have each month in a different column, if you need to know food poverty in Jan, for eg.
* How to handle NULL values in specific columns

*Comment from Zoom Chat
Operating systems and tools now have default encodings they use; it is worth checking which one it is. It should be UTF-8 but does not have to be. People used to use Latin-1 a while back, as an extension to ASCII covering Western European (Latin) characters. But the world is moving towards UTF-8.
“Mac OS Roman is a character encoding primarily used by the classic Mac OS to represent text.” !!!
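A minimal sketch of the encoding point from the chat (not from the workshop material; the sample word is made up for illustration). The same bytes turn into different text depending on which encoding is assumed when reading them. It uses only Python's built-in `str.encode` / `bytes.decode` with the standard codec names `utf-8`, `latin-1` and `mac_roman`.

```python
# Hypothetical sample text with one non-ASCII character.
text = "café"

# Bytes as they would be written to a file saved as UTF-8.
raw = text.encode("utf-8")

# Decode the same bytes under three different assumptions about the encoding.
decoded = {enc: raw.decode(enc) for enc in ["utf-8", "latin-1", "mac_roman"]}

for enc, result in decoded.items():
    print(enc, "->", result)
```

Only the utf-8 line shows the original word; the other two show garbled characters in place of the é. This is why it can be worth passing an explicit encoding when opening a text or CSV file (for example, `pandas.read_csv` has an `encoding` argument).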
See: https://en.wikipedia.org/wiki/Mac_OS_Roman

*Feedback for the SpreadSheet session
Can you please come up with one thing you liked about the morning session on spreadsheets and one thing you did not like or that could be improved. Please add these as bullet points.
*Thing I liked
* format of date and entries (subtle mistakes), limit of digits for large numbers in integer format
* creating tables using the Insert -> Table approach
* I liked the info re the dates in Excel - something I always seem to have bother with.
* Well paced
* The closer look at data type/tidying was well worth it
*Thing that could be improved
* the introduction could go faster
* slow to get started
* I was hoping for slightly more advanced level

*Lesson 2: Data cleaning with OpenRefine
* Installation instructions: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4 Unzip it to a location that you can easily find on your computer.
Download OpenRefine URL: http://openrefine.org/download.html

*OpenRefine 3.3
http://127.0.0.1:3333/
The final release of OpenRefine 3.3, released on January 31, 2020. Please backup your workspace directory before installing and report any problems that you encounter. A change log is provided on the release page.
* Windows kit, Download, unzip, and double-click on openrefine.exe. If you’re having issues with the above, try double-clicking on refine.bat instead.
* Mac kit, Download, open, drag icon into the Applications folder and double click on it. If you encounter a security warning, see workaround.
* Linux kit, Download, extract, then type ./refine to start.
Setup/ Install OpenRefine - short video tutorial: https://vimeo.com/339788742
https://figshare.com/articles/SAFI_Survey_Results/6262019

*Custom facet
value.replace("[", "").replace("]", "").replace("'", "")
value.split(";")

*Latitude exercise
Chirodzo
Chirodzo - same in both tests
Chirodzo
Chirodzo
Chirodzo - the date values were clear, but the GPS for several villages seemed too close to call (I might try again)

*Feedback for the OpenRefine session
Can you please come up with one thing you liked about the afternoon session on OpenRefine and one thing you did not like or that could be improved. Please add these as bullet points.
*Thing I liked
* How to export data, the cleaning steps and whole projects
* Very interesting and insightful. The amount of hours I could have saved over the years if I knew this existed is scary
* Seems very powerful. I can think of lots of times where I could use this rather than Excel!
* Great introduction, enough ground covered for 'initiation'
*Thing that could be improved
* went a bit fast at the end
* A few more individual exercises would be good just to cement what we had learned.
* The above comment is a good point: it might be nice to have another data set with a set of exercises we need to go through, so a particular problem we have to solve after clean up.

*Day 1: Afternoon and Day 2
*Lesson 3: Python
* Installation instructions: https://edcarp.github.io/2020-05-26-sfc-online/setup-python-workshop
* Download this data to your computer: https://ndownloader.figshare.com/articles/6262019/versions/4 . Unzip it to a location that you can easily find on your computer.
* Python lesson material: https://edcarp.github.io/2020-07-01-sfc-online/files/python-novice-gapminder-data.zip
* Robert's slides, jupyter script files, raw input data file (gapminder): https://github.com/robertn01/Data-Carpentry-Social_Sciences-Python/tree/master
* https://github.com/robertn01/Data-Carpentry-Social_Sciences-Python

Direct download URLs:
Download & install Python 3.x - https://www.python.org/downloads/
Download & install Anaconda - https://www.anaconda.com/products/individual (see bottom of the page!)

Setup/ installation: Jupyter Notebook
https://www.youtube.com/watch?v=HW29067qVWk 
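
After installing, one quick way to confirm that Jupyter can see the packages used later in this pad (pandas and matplotlib) is to run a short check in a notebook cell. This is only a sketch, not part of the official setup instructions; `check_packages` and `REQUIRED` are hypothetical names made up for this example.

```python
import importlib
import sys

# Packages used in the Python notes in this pad.
REQUIRED = ["pandas", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each package name to its version string,
    or None if the package cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Not every module defines __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


print("Python", sys.version.split()[0])
for name, version in check_packages(REQUIRED).items():
    print(name, version if version else "NOT INSTALLED")
```

If a package shows as NOT INSTALLED, the notebook is probably running in a different Python environment from the one Anaconda installed.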
 
Installing Anaconda on Windows: 
https://www.youtube.com/watch?v=uOwCiZKj2rg

Life saver - should be in Google Cloud: Google Colab https://colab.research.google.com/notebooks/welcome.ipynb

Data:
* https://edcarp.github.io/2020-07-01-sfc-online/files/python-novice-gapminder-data.zip

*Different keyboard input modes:
Jupyter has two different keyboard input modes:
* Command mode - binds the keyboard to notebook level actions. Indicated by a grey cell border with a blue left margin.
* Edit mode - when you’re typing in a cell. Indicated by a green cell border.

From zoom chat: Your display with Zoom may get a little crowded, here are some helpful shortcuts to switch panels on / off
*macOS:
Command(⌘)+U: Display/hide Participants panel
Command(⌘)+Shift+H: Show/hide In-Meeting Chat Panel
*Windows
Alt+H: Display/hide In-Meeting Chat panel
Alt+U: Display/hide Participants panel
*More shortcuts here: https://support.zoom.us/hc/en-us/articles/205683899-Hot-Keys-and-Keyboard-Shortcuts-for-Zoom

*Errors in Spreadsheets: https://blogs.oracle.com/smb/10-of-the-costliest-spreadsheet-boo-boos-in-history
Excel specifications and limits https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

*UK dataservice (check how they use metadata for their data) https://ukdataservice.ac.uk/

Any new/ unanswered questions regarding Python/ Anaconda/ Jupyter Lab - please list below:

*Day-2: Python notes
*Exercise: fill in the blanks
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)

Solution:
import pandas as pd
import matplotlib.pyplot as plt

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')
data_europe.max().plot(label='max')
plt.legend(loc='best')
plt.xticks(rotation=90)

Lists & loops
my_list = [1, 32, 3, 4]
my_list = sorted(my_list)

# indented lines run once per item; the unindented print runs once, after the loop
for number in [1, 2, 4, 7]:
    print(number)
    print('This is another line')
print('This is a single line!')

for number in my_list:
    print(number)
    print('This is another line')
print('This is a single line!')

my_list_range = range(10)
for wahh in my_list_range:
    print(wahh)

m_l = range(100)
for num in m_l:
    square = num * num
    cube = num ** 3
    print(num, square, cube)

# add up the numbers 0..99
total = 0
for number in range(100):
    total = total + number
print(total)

Using a + sign does what you’d expect
print("a" + "b")  # ab
print("a", "b")  # a b

Conditionals/ if-else statements:
number = -1
if number > 0:
    print(number, 'is positive')
elif number == 0:
    print(number, 'is zero')
else:
    print(number, 'is negative')

list_of_numbers = [0, -1, -3, 0.0, 3.2, 10, -2]
for number in list_of_numbers:
    if number > 0:
        print(number, 'is positive')
    elif number == 0:  # logical evaluation
        print(number, 'is zero')
    else:
        print(number, 'is negative')

import pandas as pd
filenames = ['raw_data/gapminder_gdp_africa.csv', 'raw_data/gapminder_gdp_asia.csv']
for filename in filenames:
    data = pd.read_csv(filename, index_col='country')
    print(filename, len(data), 'different countries')

import glob
import pandas as pd
filenames2 = glob.glob('raw_data/*gdp*.csv')
for filename in filenames2:
    data = pd.read_csv(filename, index_col='country')
    print(filename, len(data), 'different countries')

combining what we learnt recently
import pandas as pd
import glob
import matplotlib.pyplot as plt

# read file names for continents
filenames = glob.glob('raw_data/gapminder_gdp_*.csv')
for file in filenames:
    # load .csv into pandas
    data = pd.read_csv(file, index_col='country')
    # drop the text column (if present) before averaging
    if 'continent' in data.columns:
        del data['continent']
    mean = data.mean()
    years = data.columns.str.strip('gdpPercap_').astype(int)
    continents = file.replace('raw_data/gapminder_gdp_', '').replace('.csv', '')
    plt.plot(years, mean, label=continents)
plt.legend()
plt.xlabel('Year')
plt.ylabel('Average GDP')
plt.title('GDP in continents')
plt.savefig('gdp_in_continents.pdf')

Functions
def print_greet():
    print('Hello!')

def print_date(year, month, day):
    joined = str(day) + '/' + str(month) + '/' + str(year)
    print(joined)

print_date(1871, 3, 19)

def return_date(year, month, day):
    joined = str(day) + '/' + str(month) + '/' + str(year)
    return joined

def avg_values(values):
    if len(values) == 0:
        return None
    # no 'else' needed: once a function reaches 'return', it quits
    return sum(values) / len(values)

import pandas as pd
data = pd.read_csv('raw_data/gapminder_all.csv', index_col='country')
continents = ['Africa', 'Asia', 'Oceania', 'Europe', 'Americas']
for cont in continents:
    cont_data = data[data['continent'] == cont]
    cont_data.to_csv('raw_data/my_' + cont.lower() + '_data.csv')

Solution to the last exercise:
import pandas as pd
data = pd.read_csv("data/gapminder_all.csv", index_col='country')
# do this for all continents using a for loop
continents = ["Africa", "Asia", "Oceania", "Europe", "Americas"]
for continent in continents:
    continent_data = data[data['continent'] == continent]
    # lower() so the file names match the ones listed below
    continent_data.to_csv("my_" + continent.lower() + "_data.csv")
# at the end, you should have my_americas_data.csv, my_africa_data.csv etc.

*Useful Links
https://python-graph-gallery.com/

====================================

*Use your new skills (posted with Lucia's permission).
I run a charity called Code The City. We run regular data hack weekends (now online) in which volunteers work with industry, charities or the cultural sector on set challenges. Our next event will be on 1st and 2nd August and will be on History and Culture. We will have a number of challenges which will spawn projects and teams. If you fancy getting involved, and using your new skills, please see https://codethecity.org/what-we-do/hack-weekends/code-the-city-20-history-and-culture/
And if you are interested in reading about previous events read this one https://codethecity.org/what-we-do/hack-weekends/code-the-city-19-history-data-innovation/
Thanks. Ian ( ian@codethecity.org )

*Thing I liked
* Pace was better on Day 2 (for me too)
* Very knowledgeable presenters
* Covered many topics, I feel autonomous in starting to analyse my data with Python
* Very informative. Day 2 pace seemed to be better with good flow
*Thing that could be improved
* More intro on notebooks and when to use a notebook and when a straight script
* difference between .py and .ipynb, better list of the needed software, more structured day 1
* Bit more context of how Anaconda, JupyterLab etc fit together. Maybe a few more short individual exercises.