Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------

Welcome to Software Carpentry!

Links:

Workshop Webpage: https://uw-madison-datascience.github.io/2021-08-09-uwmadison-swc/
Pre-workshop Survey: https://carpentries.typeform.com/to/wi32rS?slug=2021-08-09-uwmadison-swc
Feedback: https://forms.gle/EJnp7kTSHgu73BfS6
Intro Slides: https://docs.google.com/presentation/d/1hL_JkVviKn8lGhT36U9tk0f85DOEWcnx_WENATuJEeo/edit?usp=sharing
Wrap-up Slides: https://docs.google.com/presentation/d/1_AI1PkdrITILKH3Os2ssCjUJHKm6cVkxWWz8BSRUTZg/edit?usp=sharing
Get one-on-one help from the data science facilitators by attending our office hours or coding-meetups: https://datascience.wisc.edu/hub/#dropin

Day 5 - Git and/or Workflows

Lesson: https://carpentries-incubator.github.io/swc-ext-python/

Sign-in:

Name (pronouns optional), Affiliation, what is one thing you've learned so far you'd like to implement in your work?

Sarah Stevens (she/her/hers), Data Science Hub, Instructor
Clare Michaud (she/her/hers), Data Science Hub,
Steven Warren (he/him) - Helper, iSchool MA Student,
Chandler Meyer (she/her/hers), Plant Breeding and Plant Genetics. Modifying my data set with python
Jeremiah Yee, biostatistics, glob
Joni Sedillo, medical genetics, using git
Katie Ziebarth, chemistry, Git
Chris Endemann (he/him/his), Data Science Hub, Helper
Chris Kirby - GLAS Education

Notes:

Vote below by adding a +1 to your preferred option
Review and learn about Collaborating in GitHub, then Workflows: +1,+1
Start with Workflows instead: +1,+1+1+1+1

initial gdp_plots.py:

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# display the plot
plt.show()

git add gdp_plots.py
git commit -m "First commit of analysis script"
git status
nano .gitignore

data/*.csv
*.ipynb

Ctrl+x
y
enter
git add .gitignore
git commit -m "adding ignore file"
git log --oneline

nano gdp_plots.py

import pandas
import matplotlib.pyplot as plt
# define filename before it is used by read_csv
filename='data/gapminder_gdp_oceania.csv'
data=pandas.read_csv(filename, index_col='country').T
ax=data.plot()
# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation=45)
plt.show()

Ctrl+x
y
enter

gdp_plots.py v2

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename='data/gapminder_gdp_oceania.csv'

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

# test your script
python gdp_plots.py

# add/commit
git add gdp_plots.py
git commit -m "improving plot format"

# want to be able to run our script using variable input values: python gdp_plots.py FILENAME
nano args_list.py

import sys
print('the argument list is:', sys.argv)

python args_list.py # prints out: the argument list is: ['args_list.py']
python args_list.py arg1 arg2 arg3 # prints name of script followed by all arguments
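
A small sketch of how the argument list splits into the script name and the remaining arguments. The helper name split_argv is hypothetical (not part of the lesson), and the list is typed in by hand so it can be tried without the command line:

```python
def split_argv(argv):
    """Split an argument list into (script name, list of remaining arguments)."""
    script_name = argv[0]   # sys.argv[0] is always the script itself
    arguments = argv[1:]    # everything the user typed after the script name
    return script_name, arguments

# hand-written stand-in for sys.argv after: python args_list.py arg1 arg2 arg3
example_argv = ['args_list.py', 'arg1', 'arg2', 'arg3']
name, args = split_argv(example_argv)
print('script:', name)
print('arguments:', args)
```

In the real script you would pass sys.argv instead of the hand-written list.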

# edit our original script to take inputs
nano gdp_plots.py

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# OLD: filename='data/gapminder_gdp_oceania.csv'
filename=sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot()
# set some plot attributes
ax.set_xlabel('Year')
ax.set_ylabel('GDP Per Capita')
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()

Ctrl+x
y
enter

# Run script on oceania gdp
python gdp_plots.py data/gapminder_gdp_oceania.csv

python gdp_plots.py data/gapminder_gdp_asia.csv

# Ctrl+A: move the cursor to the beginning of the current command line (handy after recalling a previous command with the up arrow)

git status
rm args_list.py # remove this file
git status
git diff gdp_plots.py
git commit -m "adding cmdline arguments" gdp_plots.py # file has to be tracked previously for this to work

# to change the name from master to main
git checkout -b main
git branch -d master

# make 2 branches
git branch py-multi-files # python branch
git branch sh-multi-files # git bash branch

git branch # see all branches

# switch to python branch
git checkout py-multi-files
git branch # current branch highlighted in green with asterisk next to it

# open our script
nano gdp_plots.py

# want to be able to run multiple files at once - use a for loop!
#filename=sys.argv[1] # comment this out

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot()
    # set some plot attributes
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP Per Capita')
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # display the plot
    plt.show()

python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
git add gdp_plots.py
git commit -m "allowing plot generation for multiple files at once"
nano gdp_plots.py

# change plt.show to....
split_name1=filename.split('.')[0] # data/gapminder_gdp_X
split_name2=split_name1.split('/')[1] # gapminder_gdp_X
save_name='figs/' + split_name2 + '.png'
plt.savefig(save_name)

Ctrl+x
y
enter
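
The two split calls above assume the path contains exactly one dot and one slash. As a sketch of an alternative (not what was typed in the workshop), pathlib from the standard library can pull out the file name without its extension. The helper name make_save_name and the out_dir parameter are hypothetical:

```python
from pathlib import Path

def make_save_name(filename, out_dir='figs'):
    """Build 'figs/<name>.png' from a data file path such as 'data/<name>.csv'."""
    stem = Path(filename).stem                 # file name without folder or extension
    save_path = Path(out_dir) / (stem + '.png')
    return save_path.as_posix()                # always use forward slashes

print(make_save_name('data/gapminder_gdp_oceania.csv'))
```

The result could be passed to plt.savefig in place of save_name.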

git add gdp_plots.py
git commit -m "saves fig for each plot as file"

# edit .gitignore to ignore any files in figs folder
nano .gitignore
figs/
Ctrl+x
y
enter

git add .gitignore
git commit -m "ignoring figures"
git log --oneline --graph --all --decorate

# create bash script
touch gdp_plots.sh
nano gdp_plots.sh

for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
    python gdp_plots.py "$filename"
done

# exit nano
Ctrl+x
y
enter

# run script
bash gdp_plots.sh

# Edit python script
nano gdp_plots.py

# save plot with unique filename
split_name1=filename.split('.')[0] # data/gapminder_gdp_X
split_name2=split_name1.split('/')[1]
save_name='figs/' + split_name2 + '.png'
plt.savefig(save_name)

git add gdp_plots.sh
git add gdp_plots.py
git status

git commit -m "wrote .sh script and updated python script to save figs to unique names"
echo "figs/" >> .gitignore # echo prints out whatever comes after it, >> will append output of echo onto file specified after ">>"
cat .gitignore
git add .gitignore
git commit -m "ignore figs folder"

# let's time our scripts and see which is faster
time bash gdp_plots.sh
git checkout py-multi-files # switch branch
time python gdp_plots.py data

# nano gdp_plots.py
# check for -a flag in arguments
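# (also add "import glob" to the imports at the top of the script)
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*.csv')
else:
    filenames=sys.argv[1:]

for filename in filenames:
    # loop body is unchanged from before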

# exit nano
Ctrl+x
y
enter
cat gdp_plots.py

python gdp_plots.py -a # doesn't work due to *gdp_americas.csv file being formatted differently than the others

# edit script
nano gdp_plots.py
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*[ae].csv')
else:
    filenames=sys.argv[1:]
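
To check which names a pattern like this would pick up without touching the disk, here is a small sketch using fnmatchcase from the standard library, which applies the same wildcard rules to plain strings. The list of filenames is an assumption based on the gapminder files mentioned in these notes, not a listing of the actual data folder:

```python
from fnmatch import fnmatchcase

# assumed file names, for illustration only
candidates = [
    'data/gapminder_all.csv',
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
    'data/gapminder_gdp_europe.csv',
    'data/gapminder_gdp_oceania.csv',
]

pattern = 'data/*gdp*[ae].csv'   # [ae] = the last character before .csv must be a or e

matched = [name for name in candidates if fnmatchcase(name, pattern)]
for name in matched:
    print(name)
```

The americas file ends in "s" before .csv, so it is skipped, and gapminder_all.csv is skipped because it has no "gdp" in its name.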

#
ls -l figs

# add/commit
git add gdp_plots.py
git commit -m "adding a flag to run script for all gdp datasets except americas"

python gdp_plots.py

cd ..

###
nano gdp_plots.py
if '-a' in sys.argv:
    filenames=glob.glob('data/*gdp*[ae].csv')
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data folder and files are located in current directory")
else:
    filenames=sys.argv[1:]

Version after correcting for silent errors:
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flag
# have been provided by the user
if len(sys.argv) == 1:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py < filenames >")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if '-a' in sys.argv:
        filenames = glob.glob('data/*gdp*[ae].csv')
        if filenames == []:
                # file list is empty (no files found)
                print("No files found in this folder.")
                print("Make sure the data folder and files are located")
                print("in the current directory")
else:
        filenames = sys.argv[1:]

for filename in filenames:

        # load data and transpose so that country names are
        # the columns and their gdp data becomes the rows
        data = pandas.read_csv(filename, index_col = 'country').T

        # create a plot of the transposed data
        ax = data.plot()

        # set some plot attributes
        ax.set_xlabel('Year')
        ax.set_ylabel('GDP Per Capita')
        # set the x locations and labels
        ax.set_xticks(range(len(data.index)))
        ax.set_xticklabels(data.index, rotation = 45)

        # save the plot
        split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
        split_name2 = split_name1.split('/')[1]
        save_name = 'figs/' + split_name2 + '.png'
        plt.savefig(save_name)

cd ..
python swc-gapminder/gdp_plots.py -a # error message is outputted!
cd swc-gapminder
git status
git add gdp_plots.py
git commit -m "handling case if no files are present in current directory"

### Let's start a new branch to begin refactoring or reorganizing our code
git checkout -b refactor
git branch # check branch
cat gdp_plots.py

# how should we reorganize our script into a set of fxns?

- one function that parses arguments
- one function for creating one plot
- one function that creates multiple plots
- one function that will call all possible functions--the "main" function

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

def main():
    """
    main function - does all the work
    """

# call main
main()

# create a new script
touch refactored_gdp_plot.py
nano refactored_gdp_plot.py
# paste template above into this new file

# try to paste original code into template where appropriate

Refactored script.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flag
    # have been provided by the user
    # (argv is sys.argv[1:], so the script name is already removed)
    if len(argv) == 0:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py < filenames >")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if '-a' in argv:
        filenames = glob.glob('data/*gdp*[ae].csv')
        if filenames == []:
            # file list is empty (no files found)
            print("No files found in this folder.")
            print("Make sure the data folder and files are located")
            print("in the current directory")
    else:
        filenames = argv
    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot()

    # set some plot attributes
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP Per Capita')
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """
    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    # generate plots
    create_plots(files_to_plot)

# call main
main()

git add gdp_plots.py
git commit -m "refactored code into functions"
git checkout main
git merge refactor

# jupyter notebook time
jupyter lab

import gdp_plots # produces an error

# back to command prompt
nano gdp_plots.py
# change main() to....
if __name__ == '__main__':
    main()

# when the file is only imported (e.g. import gdp_plots), main() will not run
Ctrl+x
y
enter
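
A minimal sketch of the same guard in a toy file, to show why importing no longer triggers main(). The main() here is a stand-in that returns a message instead of making plots:

```python
def main():
    """Stand-in for the real main(); returns a message instead of plotting."""
    return 'main() was called'

# __name__ is '__main__' only when the file is run directly
# (python some_script.py); when the file is imported, __name__ is the
# module name instead, so the call below is skipped.
if __name__ == '__main__':
    print(main())
```

Either way the function is still defined, so it can be called by hand after an import.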

gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")

# back to command prompt
git add gdp_plots.py
git commit -m "moving call to the main function"

Day 4 - Python pt 2

Lesson: http://swcarpentry.github.io/python-novice-gapminder/

Sign-in:

Name (pronouns optional), Affiliation, icebreaker: what's the best book, movie, or tv show you've seen/read recently?

Clare Michaud (she/her), Data Science Hub, Better Call Saul
Steven Warren (he/him), iSchool MA, Good Time is a good thriller that I watched on Netflix recently
Chris Endemann, Data Science Hub, The Hobbit
Sarah Stevens (she/her/hers), Data Science Hub, CODA
Anthony Boyd (he/him), Undergrad student, Nuclear Engineering, Peaky Blinders
Stephan Blanz, BME / WITNe, Manifest (TV Show)
Joni Sedillo (she/her), Postdoc medical genetics, Klara and the Sun (book)
Jeremiah Yee, biostatistics, White Lotus
Katie Ziebarth, chemistry

Notes:

Lists

myList=['a', 'b', 'c']
myList=[1, 2, 3]
myList[0] # index first element of myList
myList[0] = .265 # store .265 as first element of myList

primes = [2, 3, 5]
primes.append(7) # add 7 as last/4th element of list
teen_primes=[11, 13, 17, 19]
middle_aged_primes=[37, 41, 43, 47]
primes.extend(teen_primes) # extend list to include elements of teen_primes
primes.append(middle_aged_primes) # append whole list at a single index.
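
Putting the steps above together in one runnable sketch shows the difference between extend and append (same values as in the notes):

```python
primes = [2, 3, 5]
primes.append(7)                       # adds one element: 7

teen_primes = [11, 13, 17, 19]
middle_aged_primes = [37, 41, 43, 47]

primes.extend(teen_primes)             # adds each element of teen_primes
primes.append(middle_aged_primes)      # adds the whole list as ONE element

print(primes)
print(len(primes))                     # the nested list counts as a single item
print(primes[-1])                      # the last element is itself a list
```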

primes=[2, 3, 5, 7, 9]
del primes[4] # delete the last element of primes. 9 isn't a prime value

primes=[] # empty list
goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.'] # list is mix of integers and strings. you can mix object types in list

element='carbon'
print(element[0]) # 1st element
print(element[3]) # 4th element
element[0]='C' # this doesn't work. strings have different properties than lists
element[99] # string index out of range

Exercise

print('string to list:', list('tin')) # convert string to list
print('list to string:', ''.join(['g', 'o', 'l', 'd'])) # convert list into a string/word
list('some string') # ['s', 'o', 'm', 'e', ' ', 's', 't', 'r', 'i', 'n', 'g']
print('-'.join(['x', 'y', 'z'])) # x-y-z

element='fluorine'
print(element[::2]) # go from start of string to end (two colons)-- print every other letter (2)
print(element[::-1]) # print string in reverse order

### Stepping through a list
1. If we write a slice as low:high:stride, what does stride do?
- stride is the size of the step when moving from low to high elements
2. What expression would select all of the even-numbered items from a collection?
- myList[1::2]
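
A short sketch of low:high:stride on a made-up list (the letters are only for illustration):

```python
items = ['a', 'b', 'c', 'd', 'e', 'f']

every_other = items[::2]       # start at index 0, step by 2
even_numbered = items[1::2]    # start at index 1 (the 2nd item), step by 2
first_four_by_two = items[0:4:2]   # low=0, high=4 (not included), stride=2
backwards = items[::-1]        # negative stride walks from the end

print(every_other)
print(even_numbered)
print(first_four_by_two)
print(backwards)
```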

# Program A
old=list('gold')
new = old # simple assignment. new and old both reference the same object.
new[0]='D'
# both new and old are identical

# Program B
old=list('gold')
new=old[:] # assigning a slice. this method of assignment creates a new object implicitly
new[0] = 'D'
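
Programs A and B side by side, as a sketch that also checks whether the two names point at the same object (the _a and _b suffixes are added here just to keep the two programs apart):

```python
# Program A: simple assignment, both names refer to one list
old_a = list('gold')
new_a = old_a
new_a[0] = 'D'

# Program B: slicing makes a copy, so the names refer to two lists
old_b = list('gold')
new_b = old_b[:]
new_b[0] = 'D'

print('A:', old_a, new_a, 'same object?', old_a is new_a)
print('B:', old_b, new_b, 'same object?', old_b is new_b)
```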

For Loops

# iterate through the collection of numbers, [2, 3, 5]. As you iterate, 'number' is used to reference each element of list.
for number in [2, 3, 5]: # don't forget the colon!
    print(number) # body of loop (run for each element of list). Need to indent when adding code to the body of the loop.

# Sum the first 10 integers.
total=0
for number in range(10):
    total = total + (number + 1)
print(total)

# Exercise: Print out 'nit' using the skeleton code provided below
original = 'tin'
result = ''
for char in original:
    result = char + result
print(result)

Loop Exercises: Practice Accumulating
# Exercise 1)
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

# Exercise 2)
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = ____
for word in ["red", "green", "blue"]:
    lengths.____(____)
print(lengths)

# Exercise 3)
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)

# Exercise 4)
Create an acronym: Starting from the list ["red", "green", "blue"], create the acronym "RGB" using a for loop.
Hint: You may need to use a string method to properly format the acronym.
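
One possible way to fill in the four exercises above, as a sketch rather than the official solutions. The variable name acronym is a choice made here:

```python
words = ["red", "green", "blue"]

# Exercise 1: accumulate a number
total = 0
for word in words:
    total = total + len(word)

# Exercise 2: accumulate a list
lengths = []
for word in words:
    lengths.append(len(word))

# Exercise 3: accumulate a string
result = ''
for word in words:
    result = result + word

# Exercise 4: first letter of each word, upper-cased with a string method
acronym = ''
for word in words:
    acronym = acronym + word[0].upper()

print(total, lengths, result, acronym)
```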

## Conditionals

mass=3.54
if mass > 3.0:
    print(mass, 'is large')

mass=2.07
if mass > 3.0:
    print(mass, 'is large') # nothing is printed, since 2.07 is not > 3.0

if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20: # can use and/or to check for combinations of certain conditions. Use parentheses just like you would in math.
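
A runnable sketch of that combined condition. The mass and velocity values are made up for illustration and are not from the workshop notes:

```python
# illustrative values only
mass = [3.54, 2.07, 9.22, 1.86, 1.71]
velocity = [10.0, 20.0, 30.0, 25.0, 20.0]

selected = []
for i in range(len(mass)):
    # parentheses make the "or" get evaluated before the "and"
    if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20:
        selected.append(i)

print('indices that pass both checks:', selected)
```

Without the parentheses, "and" would bind more tightly than "or", so the condition would mean something different.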

# Looping over datasets

import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

# How to find files that match a certain pattern
import glob
print('all csv files in data directory:', glob.glob('data/*.csv')) # find and print all .csv files in data folder
# * matching zero or more characters
# ? matches exactly one character
print('all jpg files:', glob.glob('*.jpg'))

for filename in glob.glob('data/gapminder_*.csv'):
    data=pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

for filename in glob.glob('data/*as*.csv'):
    data=pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents) < 50:
        print(filename, len(contents))

Exercise! Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format
    # 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`),
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4]
    # numeric_only=True skips text columns such as country
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
plt.legend()
plt.xlabel('Year')
plt.ylabel('Mean GDP per Capita')
plt.xticks(rotation=90)
plt.show()
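
The region-name step from the exercise can be tried on its own, without reading any data. A small sketch, with the hypothetical helper name region_from_filename:

```python
def region_from_filename(filename):
    """Return <region> from a path shaped like data/gapminder_gdp_<region>.csv."""
    last_part = filename.split('_')[-1]   # e.g. 'africa.csv'
    return last_part[:-4]                 # drop the 4 characters of '.csv'

for name in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_oceania.csv']:
    print(name, '->', region_from_filename(name))
```

This only works when the region is the last underscore-separated piece and the extension is exactly four characters long.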

Functions

def print_greeting():
    print('Hello!')

print_greeting()

def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
print_date(1871, 3, 19) # 1871/3/19
print_date(month=3, day=19, year=1871) # 1871/3/19 (named arguments can be given in any order)

def average(values):
    if len(values)==0:
        return None
    return sum(values) / len(values)

a=average([1, 3, 4])
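
A runnable sketch of average showing both paths through the function, the normal case and the empty-list case:

```python
def average(values):
    """Return the mean of a list of numbers, or None for an empty list."""
    if len(values) == 0:
        return None          # avoids dividing by zero
    return sum(values) / len(values)

a = average([1, 3, 4])
empty_result = average([])

print('average of [1, 3, 4]:', a)
print('average of []:', empty_result)
```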

# Use 3 single (or double) quotes to write a docstring as the first statement of your fxn. Useful for adding info on the purpose of your fxn; help(fxn) will display it.

# local vs. global variables
pressure=103.9

def adjust(t):
    '''Takes input t and returns temperature as output'''
    temperature=t*1.43/pressure
    return temperature

print('adjusted:', adjust(0.9))
print('temperature after the call:', temperature) # NameError: temperature only exists inside adjust()
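
The last print fails because temperature only exists while adjust is running. A sketch that catches the error so the whole thing can run from top to bottom (the try/except and the message variable are additions for illustration):

```python
pressure = 103.9              # global: visible inside and outside functions

def adjust(t):
    '''Takes input t and returns temperature as output'''
    temperature = t * 1.43 / pressure   # local: exists only inside adjust
    return temperature

adjusted = adjust(0.9)
print('adjusted:', adjusted)

message = None
try:
    print('temperature after the call:', temperature)
except NameError as err:
    message = str(err)
    print('NameError:', message)
```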

Day 3 - Python pt 1

Lesson: http://swcarpentry.github.io/python-novice-gapminder/

Sign-in:

Name (pronouns optional), Affiliation, ICEBREAKER: something fun you did over the weekend

Clare Michaud (she/her), Data Science Hub, spent a lot of time in various east side parks yesterday - helper/host
Heather Shimon (she/her), Science & Engineering Libraries, bike ride through the arboretum
Chris Endemann (he/him/his), Data Science Hub, went to "sessions at McPike park" on Friday
Stephan Blanz, Department of Biomedical Engineering, spent time with friends around a bonfire
Scott Prater (he/him/his), UW Digital Collections Center, went to Van Gogh Experience show in Milwaukee
Sarah Stevens (she/her/hers), Data Science Hub, went to the memorial union terrace for the first (and second) time in awhile! - Helper
Anthony Boyd (he/him), Undergrad student, Nuclear Engineering,
Jeremiah Yee, Biostatistics, tiled a shower <- wow! is this fun? I missed the fun part. Not really lol
Joni Sedillo (she/her), Medical genetics postdoc, took my daughter and dog to the park
Katie Ziebarth; chemistry grad student; went to Green Bay
Steven Warren (he/him) - helper, iSchool MA student, had a picnic with friends at Tenney Park

Notes:

Enter markdown mode: Esc+m
Run current cell: Shift+Enter or Ctrl+Enter

Headings
- One hashtag: level 1 heading
- Two hashtags: level 2 heading
- etc.

Make a new cell below: Esc (for command mode), then press b
Make a new cell above: Esc (for command mode), then press a

Printing
- Print variable contents to screen: print()
- combine additional strings in print output using commas: print(first_name, 'is', my_age, 'years old')

How to index different parts of a string variable
- print(my_name[0]) # first index is always "0" in python - prints out first letter of my_name
- print(first_name[0:5]) # the second index (5) says you'll print up to, but not including, the 5th index
- print(first_name[6]) # gets last element of string, but only if the string is exactly 7 characters long
- print(first_name[-1]) # gets last element of string

Variable Types
- Variable types control what kinds of commands can be run on the variable (e.g. can't subtract strings)
- Print type of variable: print(type(my_variable))

full_name = "Stephan" + " " + "Blanz" # combine these three strings into single string stored in full_name
separator = '=' * 10 # repeat '=' 10 times
print(len(separator)) # print length of separator
print(my_name[len(my_name)-1]) # print last element of my_name

Casting variables
number_two = int('2') # cast string of '2' as an integer

Division and remainder operations
- use // for "floor division", e.g. print(5 // 3) yields 1 (a single / gives regular division: 5 / 3 is about 1.67)
- use % to get remainder after floor division, e.g. print(5 % 3) yields 2
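
The three operators side by side, as a quick sketch:

```python
true_quotient = 5 / 3      # regular division, always gives a float
floor_quotient = 5 // 3    # floor division, rounds down to a whole number
remainder = 5 % 3          # what is left over after floor division

print(true_quotient, floor_quotient, remainder)

# the two whole-number results fit back together into the original number
print(floor_quotient * 3 + remainder)
```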

Imaginary/real components
complex = 6 + 2j
print(complex.real) # print real component of complex number
print(complex.imag) # print imaginary component of complex number

Max/min/round
print(max(1, 2, 3)) # prints max of list of numbers (3)
print(min('a', 'A', '0', '1')) # all inputs are characters. characters are compared by their character codes, where digits come before letters. prints 0
round(3.712) # rounds to 4
round(3.712, 1) # round to the first decimal place

Get help on a function
- help(round)
- press and hold shift + tab after spelling out a given function to see info about the function

Object Methods
my_string = "Hello world"
print(len(my_string))
print(my_string.swapcase()) # can run this on any string

print(my_string.isupper()) # can run this on any string, checks if entire string is upper case or not
print(my_string.upper()) # can run this on any string, changes string to be entirely upper case
print(my_string.upper().isupper()) # methods can be chained: convert to upper case, then check (prints True)

Challenge:
easy_string = "abc"
rich = "gold"
poor = "tin"
number_two_str = '2'

Think about what will happen when you run the following. Then give it a try.
print(max(easy_string))
print(max(rich, poor, number_two_str))
print(min(rich, poor, number_two_str))
print(max(len(rich), len(poor)))
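
A sketch for checking your predictions on the challenge. Strings are compared character by character, so 't' beats 'g', and the digit '2' sorts before any letter:

```python
easy_string = "abc"
rich = "gold"
poor = "tin"
number_two_str = '2'

largest_char = max(easy_string)                    # largest character in 'abc'
largest_word = max(rich, poor, number_two_str)     # compares the strings themselves
smallest_word = min(rich, poor, number_two_str)
longest_length = max(len(rich), len(poor))         # compares two integers

print(largest_char, largest_word, smallest_word, longest_length)
```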

Import a library - do this at the top of your script to maintain a clean script
import math
help(math) # get help on math library

print(math.pi)
print(math.cos(math.pi)) # get cosine of pi

# Import only a handful of fxns - warning: this can get messy if you're importing fxns that are used by multiple libraries
from math import cos, pi # get only two fxns from math library
print(cos(pi))

# can give your imported package a nickname to reference throughout your script
import math as m
m.cos(m.pi)

base = "TATTAGCTTA"
print(type(base))

Using random
import random # do this at top of script
rand_int = random.randrange(0, len(base)) # random integer from 0 up to, but not including, len(base)
print("Random int:", rand_int, base[rand_int]) # print index value and element at that index

bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)
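
One way to put the pieces above in a working order, as a sketch. The name rand_index is a choice made here, and math is left out because nothing in the snippet uses it:

```python
import random

bases = "ACTTGCTTGAC"

n_bases = len(bases)                    # must be defined before it is used
rand_index = random.randrange(n_bases)  # integer from 0 up to n_bases - 1

print("random base ", bases[rand_index], "base index", rand_index)
```

The printed base changes from run to run because the index is random.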

Pandas dataframes!
import pandas as pd # most python users refer to pandas as "pd"
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv')
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column
data # view dataframe
data.info() # get some info on dataframe
data.columns # print out list of all columns # notice that country is no longer technically a column of dataframe--it is an indexing variable
data.T # transpose dataframe
data.describe() # print some summary stats (e.g. min/max, mean) on dataframe

data.iloc[rowInd,colInd] # index specific rows/columns based on integer indices
data.loc["Albania", "gdpPercap_1952"] # index specific columns/rows based on labels

data.loc["Albania", :]
subset = data.loc['Croatia':'Finland', 'gdpPercap_1967':'gdpPercap_1997'] # from Croatia to Finland, from 1967 to 1997

mask = subset > 12000
subset[mask] # shows all elements in dataframe where gdp was > 12000; everything else is NaN
subset_greater12k = subset[mask]
subset_greater12k.describe() # summary stats on this subset

mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns) # axis=1 means sum over columns instead of rows
wealth_score

data.groupby(wealth_score).sum()

# Exercise - how to get gdp data from Austria, 1952-1957
data.loc['Austria', 'gdpPercap_1952' : 'gdpPercap_1957']
data.iloc[1, 0:2]

New notebook
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column

time = [0, 1, 2, 3]
position = [0, 100, 150, 200]
plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
plt.show() # if not using jupyter notebook, need this to actually display the plot

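years = data.columns.str.strip('gdpPercap_') # remove gdpPercap_ from beginning of each column name (strip removes these characters from both ends; safe here because the year digits are not among them)
data.columns = years.astype(int) # replace dataframe columns with new column names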
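
A plain-Python sketch of what strip does with that argument, and why it can surprise you on other names. The column name gdpPercap_data and the helper remove_prefix are made up for illustration:

```python
prefix = 'gdpPercap_'

# strip treats its argument as a SET of characters to remove from both ends
year_text = 'gdpPercap_1952'.strip(prefix)
surprise = 'gdpPercap_data'.strip(prefix)   # 'd' and 'a' are in the set too

def remove_prefix(text, prefix):
    """Remove prefix from the start of text only, if it is there."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

safe = remove_prefix('gdpPercap_data', prefix)

print(year_text, int(year_text))
print(repr(surprise), repr(safe))
```

For the gapminder year columns both approaches give the same answer.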

data.loc['Australia'].plot() # plot gdp per year of Australia

# to plot across columns instead of rows, transpose the dataframe
data.T.plot() # plot gdp for each country
plt.xlabel('Year')
plt.ylabel('GDP per capita')

# use the "ggplot" style -- a plot style that mimics the look of R's ggplot2
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.xlabel('Year')
plt.ylabel('GDP per capita')

years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--') # make a green dashed line on plot
gdp_nz = data.loc['New Zealand']
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g--', label='New Zealand')
plt.legend(loc='upper left') # add a legend and explicitly control location of legend
plt.scatter(gdp_australia, gdp_nz)

# Let's start with a fresh dataframe
data = pd.read_csv(r'fullpath/gapminder_gdp_oceania.csv', index_col='country') # country is now your index column
data.head()
data.max().plot()

data_all = pd.read_csv(r'fullpath/gapminder_all.csv', index_col='country') # country is now your index column
data_all.head(10) # look at first 10 rows
data_all.plot(kind='scatter', x='gdpPercap_2007', y='lifeExp_2007', s=data_all['pop_2007']/1e6)

Day 2 - git/GitHub

Lesson: https://carpentries-incubator.github.io/git-novice-branch-pr/

Sign-in:

Name (pronouns optional), Affiliation, one project you'd like to be able to use version control (git) on and why

Sarah Stevens (she/her/hers), Data Science Hub, was trying to think of a project but I think I do use git on most of them already, yay! - Helper
Chris Endemann (he/him/his), Data Science Hub - Helper, I like to use Git on all of my research projects. Most recently I have used it on a project involving estimating brain connectivity patterns.
Clare Michaud (she/her/hers), Data Science Hub - Helper/Host, one project that I will be using git on is new Carpentry workshop material
Chris Kirby (he/him/his) - I would like to use git on our LENNS Project
Una Baker (she/her/hers), postdoc in nuclear engineering. My team uses git for multiple projects and I always forget which order to execute the commands!
Jeremiah Yee (he/him/his) postdoc in biostatistics. Using git to make sure using right version
Casey Schacher (she / her), Science and Engineering Librarian, Instructor
Anthony Boyd (he/him), Undergrad student, Nuclear Engineering, I would like to use git for research project
Andrey Vega. Plant Breeding and Plant Genetics.
Steven Warren (he/him), iSchool MA student - Helper, I'm interested in using Git for building/hosting new workshop/website content.
Chandler Meyer(She/her), Plant Breeding and Plant Genetics. I would like to use it for a sequencing project or other projects
Amanda DeWitt (she/her), Center for Health Disparities Research. I would like to use a version control on the data processing programs used in our upcoming data collection/cleaning
Scott Wildman (he/him), Academic Planning and Institutional Research - Helper
Joni Sedillo (she/her), Post Doc, I work in a medical genetics lab studying epigenetics

Notes

Git Configuration
- set name: git config --global user.name "yourName"
- set email: git config --global user.email "emailAddress"
- control git console color scheme: git config --global color.ui "auto"

- (Windows only): git config --global core.autocrlf true
- (Mac/Linux only): git config --global core.autocrlf input

- set editor: git config --global core.editor "nano -w"
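To check what a git config command actually stored, you can read the value back. A minimal sketch, assuming a scratch config file (via --file) instead of --global so your real settings are not touched; the name and email are placeholders:

```shell
# Scratch config file (hypothetical); --global would edit ~/.gitconfig instead
cfg="$(mktemp)"

git config --file "$cfg" user.name "Vlad Dracula"
git config --file "$cfg" user.email "vlad@tran.sylvan.ia"
git config --file "$cfg" core.editor "nano -w"

# Read one value back, then list everything that was set
git config --file "$cfg" user.name
git config --file "$cfg" --list
```

For your real settings, git config --global --list gives the same kind of listing.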

Git Help
git config -h
git config --help

Creating a repository ("repo")
- check current directory: pwd
cd Desktop
ls
mkdir gitWorkshop # creates the folder if you don't have it already
cd gitWorkshop
pwd
- create a subdirectory: mkdir planets
cd planets
- initialize repo: git init
- Note: The planets subdirectory is now a Git repository. Git will now track any changes in this directory.
ls
- See hidden git-related files: ls -a
- Check status: git status
- Note: you don't want to nest repos. You already are tracking the contents of your subdirectory--no need to track additional folders inside the 'planets' subdirectory.
- git status shows you that you are currently on the "master" branch--more on this later
pwd

Tracking a new file
- Open up a plain text file using nano: nano mars.txt
- Enter some text into the blank file: "Cold and dry, but everything is my favorite color."
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt

- check that Git sees the new file as untracked: git status
shows that you have a new file--mars.txt
- add the file: git add mars.txt
- You will get a warning, but it's not a concern to us. Unix/windows have different line endings. It's saying that it will convert windows line endings to unix line endings.
git status: shows that file is added, but not yet committed

- Commit the file and describe what is being committed/changed in your repo: git commit -m "Start notes on Mars"
- There is now a permanent copy of this version of the file stored in a hidden .git folder in your planets folder/repo
- check that file is committed: git status
- To see a log of all commits, type: git log
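The whole add/commit cycle above can be replayed in a throwaway folder. A minimal sketch, assuming a temporary directory instead of Desktop/gitWorkshop and a placeholder identity set for this repo only:

```shell
# Work in a throwaway directory so nothing on your Desktop is touched
cd "$(mktemp -d)"
mkdir planets
cd planets
git init -q

# Identity for this repo only (placeholder values); normally set once with --global
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git status --short          # "?? mars.txt" means untracked

git add mars.txt
git status --short          # "A  mars.txt" means staged, not yet committed

git commit -q -m "Start notes on Mars"
git status --short          # no output: nothing left to commit
git log --oneline           # one line per commit
```

git status --short is just a compact form of git status; the long form shown in the workshop carries the same information.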

- Edit your original file: nano mars.txt
- type: "The two moons may be a problem for Wolfman."
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt
- see if Git notices the change: git status
- Need to re-add and re-commit the file. But before that...
git diff
- shows the difference between the last committed version and the current version of the file

Challenge time! Try to add and commit your most recent change to your repository.
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
- Note: best to commit files/changes one at a time so that your message can reflect a specific change. Also so that you can go back to very specific versions of files, if need be.
git status

- Let's add a third line to our file: nano mars.txt
- Enter some new text on a new line: "But the Mummy will appreciate the lack of humidity."
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt
- See difference between current version and last committed version: git diff
git add mars.txt
git diff
- Nothing shows up as a change. This is because you've added the latest change.
git diff --staged
- The above command shows the difference between the files in staging area (files that are added), and files that are already committed
git commit -m "Discuss concerns about Mars' climate for Mummy."
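Why git diff goes quiet after git add is easier to see side by side. A minimal sketch in a scratch repo (temporary directory and placeholder identity are assumptions, not part of the workshop setup):

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Vlad Dracula"          # placeholder identity for this scratch repo
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

# Make a change but do not stage it yet
echo "But the Mummy will appreciate the lack of humidity." >> mars.txt
git diff                 # shows the new line: working copy vs staging area

git add mars.txt
git diff                 # prints nothing: working copy and staging area now match
git diff --staged        # shows the new line: staging area vs last commit
```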

What happens if we only change specific words in a file?
nano mars.txt
- up/down and left/right arrows to move across and within lines
- add a couple of words in the file
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt
git diff
- to see which specific words are changed, enter the following command: git diff --color-words mars.txt
git status
git add mars.txt
git commit -m "Adding descriptors for monsters."
git status
- Check out all current commits: git log
- Most recent commit listed first
- How can we shorten the log display?
- View most recent commit: git log -1
- View the two most recent commits: git log -2
- get a condensed view of log: git log --oneline

- Add another line of text
nano mars.txt
"An ill-considered change" (or any text you want)
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt
git diff HEAD mars.txt
- look at difference between current version and HEAD~1 (back 1 from most recent commit): git diff HEAD~1 mars.txt
git diff HEAD~2 mars.txt
- get a condensed view of log: git log --oneline
- Look at differences and commit message: git show HEAD~2 mars.txt
git log --oneline
- You can use the commit's unique identifiers to look at differences from specific commits
git diff UniqueCommitIdentifier mars.txt

How can we restore a previous version?
- let's first edit the file again
nano mars.txt
- delete everything! Replace with one line of text if you'd like, but it's not critical.
- Ctrl+x or Command+x to exit nano editor
- press "y"
- enter
- Check that it worked: cat mars.txt
- everything is deleted :(
git status
- Restore file to most recent version committed in repo: git checkout HEAD mars.txt
- see if the file is restored: cat mars.txt
- Review log: git log --oneline
- Restore an earlier version of the file: git checkout HEAD~2 mars.txt
cat mars.txt
git status
git checkout HEAD mars.txt
cat mars.txt
- HEAD (most recently committed version of file) is restored!
git status
- Incorrectly detach the HEAD: git checkout HEAD~1
- Fix the detached HEAD: git checkout master
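The restore steps above, replayed in a scratch repo. This is only a sketch: the temporary directory and placeholder identity are assumptions, and echo stands in for editing in nano.

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Vlad Dracula"          # placeholder identity for this scratch repo
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"
echo "The two moons may be a problem for Wolfman." >> mars.txt
git add mars.txt
git commit -q -m "Add concerns about effects of Mars' moons on Wolfman"

# Simulate the accident: overwrite the file
echo "oops" > mars.txt

git checkout HEAD mars.txt      # back to the latest committed version (two lines)
cat mars.txt

git checkout HEAD~1 mars.txt    # the version from one commit earlier (one line)
cat mars.txt

git checkout HEAD mars.txt      # and forward again to the latest version
```

Keeping the file name at the end is what avoids the detached HEAD: without it, git checkout HEAD~1 moves the whole repository to the older commit rather than restoring one file.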

Challenge time!
nano venus.txt
- Add text: "Venus is beautiful and full of love"
cat venus.txt
git add venus.txt
nano venus.txt
- Add "Venus is too hot to be suitable as a base" as text
git commit -m "comment on Venus as an unsuitable base"
git checkout HEAD
git log --oneline venus.txt
git checkout HEAD venus.txt
cat venus.txt
- venus.txt only has one line: git add ran before the 2nd line was written, so the commit contains only the first line, and `git checkout HEAD venus.txt` then restored that committed version. Take-home: a commit records the staged version of a file, not any edits made after staging. To stage the newer version, run `git add venus.txt` again (`git reset venus.txt` takes the file back out of the staging area if you want to start over).
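The same effect can be checked without nano. A minimal sketch in a scratch repo (temporary directory and placeholder identity are assumptions); it stops before the git checkout step so both versions stay visible:

```shell
cd "$(mktemp -d)"
git init -q
git config user.name "Vlad Dracula"          # placeholder identity for this scratch repo
git config user.email "vlad@tran.sylvan.ia"

echo "Venus is beautiful and full of love" > venus.txt
git add venus.txt                            # stages the one-line version

echo "Venus is too hot to be suitable as a base" >> venus.txt   # edit AFTER staging
git commit -q -m "comment on Venus as an unsuitable base"

git show HEAD:venus.txt                      # committed version: first line only
git status --short                           # " M venus.txt": second line is still uncommitted
```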

What are branches?
Sometimes we want to keep our main work "safe" from experimental changes.
Branches can be merged into the main/master branch (after you've tested it).

# to see what branches are available for a repo
git branch
# we only have one so far, shows an asterisk next to the branch you are on

Dracula is running an analysis in python and in bash to see which one is faster - will have two experimental branches (one for python, one for bash) and then will merge the one that is fastest with the master branch

git branch pythondev # create a new branch called pythondev
git branch # check again and we are still on the master branch but have a new branch called pythondev
# need to move to the pythondev branch
git checkout pythondev # switching to the pythondev branch
touch analysis.py # creating the analysis script - pretending we worked on it a lot
git add analysis.py
git commit -m "wrote and tested python analysis script"
git log --oneline # we can see we have the new commit in the pythondev branch
git checkout master # going back to the master branch to confirm the new file and commit don't exist there

git branch # check that we are on the master branch
git log --oneline # see that we don't have the python analysis commit
ls # check that the python analysis script doesn't exist

# now we want to do this same experiment for the bash analysis
git checkout -b bashdev # this both creates the branch and moves to it in one step, above we did this in two steps for the pythondev branch
git branch # confirm we are on the bashdev branch
ls # can see that we don't have the python analysis script because we made the branch from master, so it only has the commits and files that the master branch had when we created the new branch
touch analysis.sh # created bash analysis script, imagining we worked a lot on it
git add analysis.sh
git commit -m "wrote and tested bash analysis script"
git status # checking we committed it

# turns out the python script is faster and we only want to keep the python script
git checkout master # switch back to the master branch
git branch # double check that we are on the master branch
git merge pythondev # merging the pythondev branch (files and commits) into the branch we are currently on (the master branch)
ls # see that analysis.py is now in the master branch
git log --oneline # see that we have the commit from the pythondev branch

# clean up the experimental branches - since we don't need them anymore and if we find them later we might get confused
git branch -d pythondev # deleting the pythondev
git branch -d bashdev # trying to delete the bashdev branch but..
# we get an error because these commits are not in any other branches...we still want to delete it so we can do..
git branch -D bashdev # and it deletes the bashdev branch
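The branch walkthrough above, condensed into one run. A minimal sketch, assuming a scratch repo in a temporary directory, a placeholder identity, and a first branch explicitly named master:

```shell
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Vlad Dracula"          # placeholder identity for this scratch repo
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

git checkout -q -b pythondev                 # create the branch and switch to it
touch analysis.py
git add analysis.py
git commit -q -m "wrote and tested python analysis script"

git checkout -q master
git checkout -q -b bashdev                   # branches off master, so no analysis.py here
touch analysis.sh
git add analysis.sh
git commit -q -m "wrote and tested bash analysis script"

git checkout -q master
git merge -q pythondev                       # master gains analysis.py and its commit
git branch -d pythondev                      # allowed: its commits are now in master
git branch -d bashdev || echo "refused: bashdev is not merged"
git branch -D bashdev                        # force-delete the unmerged branch
git branch                                   # only master is left
```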

# conflicts - how to handle issues with different edits to the files in separate branches
# create a conflict on purpose so we can learn to resolve it
git branch marsTemp
nano mars.txt # add a line about day length
git add mars.txt
git commit mars.txt # adding this line to the master
git checkout marsTemp
# The new change (day length) isn't in the marsTemp branch as it was added AFTER marsTemp was created
nano mars.txt # add a line about temperature (Yeti)
git add mars.txt
git commit mars.txt # adding this to the marsTemp branch
git checkout master
git merge marsTemp
# CONFLICT - some changes are made to mars.txt that help us identify where the conflict is
# Fix this by editing the mars.txt file and removing (or otherwise editing) the extra text - be sure to remove the identifier lines ("<<<<<<<", ">>>>>>>", "=======")
cat mars.txt # just to make sure we have what we want after editing.
# But this hasn't completed the merge -> commit
git commit -m "Merged changes from marsTemp"
git checkout marsTemp # going back to marsTemp branch
nano mars.txt # opening up the file
# in nano you can see that it only has the marsTemp changes not the master changes - the merge is not bi-directional
# adding line to end of this branch
The polar caps will probably be Yeti's home
# out of nano
git add mars.txt
git commit -m "wrote a note about Yeti's home"
git checkout master # switching back to the master branch
git log --oneline #looking at the log and see that we don't have the last commit in marsTemp
git merge marsTemp # merging in the latest changes from marsTemp
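A conflict can be produced and resolved without opening an editor. A sketch under assumptions: scratch repo in a temporary directory, placeholder identity, made-up note text, and the conflict is resolved by keeping both lines (grep -v strips the marker lines, where in the workshop we edited by hand in nano):

```shell
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Vlad Dracula"          # placeholder identity for this scratch repo
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"

git branch marsTemp

echo "Day length note (added on master)" >> mars.txt
git add mars.txt
git commit -q -m "Add note on day length"

git checkout -q marsTemp
echo "Temperature note (added on marsTemp)" >> mars.txt
git add mars.txt
git commit -q -m "Add note on temperature"

git checkout -q master
git merge marsTemp || echo "merge stopped: conflict in mars.txt"
grep -n -e '^<<<<<<<' -e '^=======' -e '^>>>>>>>' mars.txt   # where the markers are

# Resolve: keep both notes, drop the three marker lines
grep -v -e '^<<<<<<<' -e '^=======' -e '^>>>>>>>' mars.txt > resolved.txt
mv resolved.txt mars.txt
git add mars.txt
git commit -q -m "Merged changes from marsTemp"
```

Both branches added a different line at the same spot in the file, which is why Git cannot pick one on its own.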

#challenge
Create and switch to a new branch, change something in that branch, and then merge it back into the master branch

#github!!! - repository hosting services

# Create a github account - https://github.com/

In github...

click new repository button
repository name - planets
description - can leave blank or put something like "practice repo for learning git"
keep public (you can make it private but this repo doesn't need to be private) - casey made hers private
don't initialize a readme, gitignore or license - because we are importing an existing repo
click create repository
choose the ssh authentication option and then copy the address next to it

back in the bash shell
# connecting the local planets repo to the github one
git remote add origin git@github.com:USERNAME/planets.git
# check that you setup the remote connection
git remote -v # should show two origin connections (fetch and push)
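Adding a remote only records an address, so it can be tried offline. A minimal sketch, assuming a scratch repo in a temporary directory; USERNAME is the same placeholder as above:

```shell
cd "$(mktemp -d)"
git init -q

# USERNAME is a placeholder for your GitHub user name
git remote add origin git@github.com:USERNAME/planets.git

git remote -v                 # two lines for origin: one (fetch), one (push)
git remote get-url origin     # just the address
```

Nothing is contacted until you run git push or git fetch, which is when the SSH setup below starts to matter.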

# setting up authentication for github
# check for ssh keys
ls -al ~/.ssh
# creating a new key
ssh-keygen -t ed25519 -C "YOUREMAIL4GITHUB"
#asks where to save the key, press enter to use the default location
# asks if you'd like a passphrase - press enter to leave it blank (you won't be asked for one later), or type a passphrase if you'd like one
ssh -T git@github.com # will show permission denied (the key isn't on GitHub yet)
cat ~/.ssh/id_ed25519.pub # then copy your public key from the screen - we will need it to put it into github

Back in github...

Click on your icon in the upper right hand corner, and choose settings option
From the menu list on the left hand side, scroll down and choose "SSH and GPG keys"
Click New SSH Key button
Name it with the name of your computer e.g. "Casey's Work Laptop"
Paste the key you copied from the cmdline into the Key section
Click Add SSH Key
Enter your password for github

Back in command line / unix shell

ssh -T git@github.com #confirming that the ssh connection is setup
# should get a success message
git push origin master

WE WILL PICK THIS BACK UP ON THE EXTRA DAY NEXT FRIDAY - AT LEAST FOR A BIT

Day 1 - Unix Shell

Lesson: http://swcarpentry.github.io/shell-novice/

Sign-in:

Name (pronouns optional), Affiliation, a sentence or two describing your research/work

Sarah Stevens (she/her/hers), Data Science Hub, I help researchers learn the computational/data science skills they need to do their work. - Instructor
Karl Broman (he/him/his), Biostatistics & Medical Informatics, applied statistics, mostly in genetics - Helper
Chris Endemann (he/him/his), Data Science Hub - Helper
Scott Prater (he/him/his), UW Digital Collections Center, Digital Library Architect - Helper
Heather Shimon (she/her), Science & Engineering Libraries - Helper
Clare Michaud (she/her), Data Science Hub - Helper
Steven Warren (he/him), iSchool MA student - Helper
Una Baker (she/hers), Engineering Physics, Postdoc in Nuclear Engineering
Chandler Meyer (she/hers), Graduate student, Plant Breeding and Plant Genetics, genetic engineering and gene function research
Anthony Boyd (he/him), Undergrad student, Nuclear Engineering
Tariq Alauddin, Research Analyst
Joni Sedillo (she/her), Post Doc, I work in a medical genetics lab studying epigenetics
Chris Kirby (he/him) I have borrowed a laptop from Emily hence the screen name. I work for GLAS Education we do outreach for Yerkes Observatory programs.

Notes:

Download data: http://swcarpentry.github.io/shell-novice/data/shell-lesson-data.zip

Unzip/extract all data and save to desktop
Rename folder to data-shell if needed

What happens if you type a command that doesn't exist?
ks
error message - ks: command not found

Commands in the Unix Shell
ls - list the contents
ls -F displays a slash after the paths that are directories

pwd - print working directory, tells you what folder you are working in

Help
ls --help - help for a command on Windows (Git Bash)
man ls - help on Mac (man stands for manual)

Challenge: Exploring More ls Flags
1. You can also use two options at the same time. What does the command ls do when used with the -l option? What about if you use both the -l and the -h option?

Some of its output is about properties that we do not cover in this lesson (such as file permissions and ownership), but the rest should be useful nevertheless.

ls -l "l" stands for long form, additional information like file size and the time of its last modification
ls -lh gives file sizes in easier-to-read units (e.g. K, M), "h" stands for human readable

2. By default, ls lists the contents of a directory in alphabetical order by name. The command ls -t lists items by time of last change instead of alphabetically. The command ls -r lists the contents of a directory in reverse order. Which file is displayed last when you combine the -t and -r flags? Hint: You may need to use the -l flag to see the last changed dates.

--
More commands and flags
ls -t lists by last modified date order
ls -tr adding the "r" flag reverses the order
ls -ltr lists long form by time in reverse order
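To see the -t and -r flags without waiting for real files to age, you can set modification times by hand. A minimal sketch; the file names and dates are made up:

```shell
cd "$(mktemp -d)"

# Three empty files with made-up modification times (touch -t takes YYYYMMDDhhmm)
touch -t 202101010900 oldest.txt
touch -t 202106150900 middle.txt
touch -t 202108090900 newest.txt

ls -t        # most recently changed first
ls -tr       # -r reverses: oldest first, most recently changed last
ls -ltr      # same order, long form with sizes and dates
```

This also answers the challenge: with -t and -r combined, the file displayed last is the most recently changed one.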

ls Desktop - to print contents of desktop
ls Desktop/data-shell

prints: creatures/ data/ molecules/ north-pacific-gyre/ notes.txt pizza.cfg solar.pdf writing/

to move to a different directory
cd - change directory
cd Desktop

Reminder - pwd to see what directory you are in

/ is your first folder at the beginning of your file system

cd data-shell

cd data

How to move back up one folder?

cd .. to go back to data-shell

cd .. to go back to desktop

can move more than one directory at a time
cd data-shell/data

Challenge: absolute vs relative paths
Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?

1. cd .           No: . stands for the current directory.
2. cd /           No: / stands for the root directory.
3. cd /home/amanda   No: Amanda’s home directory is /Users/amanda.
4. cd ../..       No: this command goes up two levels, i.e. ends in /Users.
5. cd ~           Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
6. cd home        No: this command would navigate into a directory home in the current directory if it exists.
7. cd ~/data/..   Yes: unnecessarily complicated, but correct.
8. cd             Yes: shortcut to go back to the user’s home directory.
9. cd ..          Yes: goes up one level.

shortcut: cd - takes you to the last folder that you were in

Challenge: Relative Path Resolution
Using the filesystem diagram below, if pwd displays /Users/thing, what will ls -F ../backup display?

1. ../backup: No such file or directory - No: there is a directory backup in /Users.
2. 2012-12-01 2013-01-08 2013-01-27 - No: this is the content of Users/thing/backup, but with .., we asked for one level further up.
3. 2012-12-01/ 2013-01-08/ 2013-01-27/ - No: see previous explanation.
4. original/ pnas_final/ pnas_sub/ - Yes: ../backup/ refers to /Users/backup/.

--
cd ..
ls
look inside folder north-pacific-gyre

ls nor +Tab for tab complete
Tab again to see inside folder inside of north-pacific-gyre

mkdir - "make directory"; create a new folder
mkdir thesis

Avoid spaces in file names, can use underscore or dashes - be consistent
Also avoid other special characters - $&()"

cd thesis

nano draft.txt to create file (nano is a text editor)

^ in nano's menu means Ctrl (e.g. ^X is Ctrl+X)

It's not "publish or perish" any more,
It's "share and thrive".

Ctrl X to save and exit - Y for yes, press enter to keep the same name

cat - concatenate, or show file contents

cat draft.txt

mv - move a file to a new path and can use to rename
mv thesis/draft.txt thesis/quotes.txt
cat thesis/quotes.txt

mv thesis/quotes.txt .

Challenge: Moving Files to a new folder
After running the following commands, Jamie realizes that she put the files sucrose.dat and maltose.dat into the wrong folder. The files should have been placed in the raw folder.

Fill in the blanks to move these files to the raw/ folder (i.e. the one she forgot to put them in)
$ mv sucrose.dat maltose.dat ____/____

Solution: $ mv sucrose.dat maltose.dat ../raw

Without running cd analyzed first:
mv analyzed/maltose.dat analyzed/sucrose.dat raw
or
mv analyzed/maltose.dat analyzed/sucrose.dat ./raw

Challenge: renaming files
Suppose that you created a plain-text file in your current directory to contain a list of the statistical tests you will need to do to analyze your data, and named it: statstics.txt

Solution: mv statstics.txt statistics.txt

--
Deleting files - be careful! Very hard to get back.
rm
example: rm quotes.txt
rm -r thesis to remove a directory
rm -i to check in with you before deleting

cd molecules
Using wildcards for accessing multiple files at once: * ?
* to access all files that end in .pdb
ls *.pdb

touch my_notes.txt
can scroll back using up arrow for previous commands
scroll to ls *.pdb

? wildcard for one character
ls p???ane.pdb

Challenge: wildcards
When run in the molecules directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb

1. ls *t*ane.pdb
shows all files whose names contain zero or more characters (*) followed by the letter t, then zero or more characters (*) followed by ane.pdb. This gives ethane.pdb methane.pdb octane.pdb pentane.pdb.
2. ls *t?ne.*
shows all files whose names start with zero or more characters (*) followed by the letter t, then a single character (?), then ne. followed by zero or more characters (*). This will give us octane.pdb and pentane.pdb but doesn’t match anything which ends in thane.pdb.
3. ls *t??ne.pdb - Correct
fixes the problems of option 2 by matching two characters (??) between t and ne. This is the solution
4. ls ethane.* only shows files starting with ethane.
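The four patterns can be checked against empty stand-in files. A minimal sketch, assuming six files named like the ones in the molecules directory; echo is used instead of ls so each match list prints on one line:

```shell
cd "$(mktemp -d)"

# Empty stand-ins; only the names matter for wildcard matching
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

echo *t*ane.pdb     # ethane.pdb methane.pdb octane.pdb pentane.pdb
echo *t?ne.*        # octane.pdb pentane.pdb
echo *t??ne.pdb     # ethane.pdb methane.pdb
echo ethane.*       # ethane.pdb
```

The shell expands the wildcard before the command runs, so echo and ls receive exactly the same list of names.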

--

Running multiple commands at a time - pipes and filters

wc - word count - it counts the number of lines, words, and characters in files
wc cubane.pdb
20 156 1158 cubane.pdb

wc *.pdb - count all of the files ending with .pdb with total counts

wc -l *.pdb to show only the line count

to send output to a file instead of the screen, use > followed by the file name
wc -l *.pdb > lengths.txt

cat lengths.txt to see what is in file

to sort by number in lengths.txt
sort -n lengths.txt > sorted-lengths.txt

cat sorted-lengths.txt

head - gives you the top of the file
head -n 1 sorted-lengths.txt

-n 1 gives the top line

echo - prints whatever is typed after it

echo hello > testfile01.txt
overwrites the file content in testfile01.txt
echo hello >> testfile02.txt
echo hello >> testfile02.txt
appends both hellos into the same file (hello is included twice)

> overwrites
>> appends
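A quick way to convince yourself of the difference, as a minimal sketch in a temporary directory (file names as in the notes above):

```shell
cd "$(mktemp -d)"

echo hello > testfile01.txt
echo hello > testfile01.txt     # the second > replaces the contents
cat testfile01.txt              # one hello

echo hello >> testfile02.txt
echo hello >> testfile02.txt    # the second >> adds to the end
cat testfile02.txt              # two hellos
```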

combine with pipe |
sort -n lengths.txt | head -n 1
wc -l *.pdb | sort -n | head -n 1
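It can help to build the pipeline one stage at a time and look at the output after each |. A minimal sketch with three made-up files of known length:

```shell
cd "$(mktemp -d)"

# Stand-in files with 1, 3 and 2 lines
printf 'a\n' > short.txt
printf 'a\nb\nc\n' > long.txt
printf 'a\nb\n' > medium.txt

wc -l *.txt                          # one count per file, plus a total line
wc -l *.txt | sort -n                # smallest count first; the total ends up last
wc -l *.txt | sort -n | head -n 1    # the file with the fewest lines: short.txt
```

No intermediate files such as lengths.txt are needed; each command reads the previous command's output directly.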

Challenge: Piping Commands Together
In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?

1. wc -l * > sort -n > head -n 3
2. wc -l * | sort -n | head -n 1-3
3. wc -l * | head -n 3 | sort -n
4. wc -l * | sort -n | head -n 3 - Correct

--
cd ../north-pacific-gyre/2012-07-03/
wc -l *.txt | sort -n | head -n 5        to see if there are any files that are too small
wc -l *.txt | sort -n | tail -n 5           to see if there are any files that are too big
ls *Z.txt              to check for files that end with Z
ls *[AB].txt          to check for files that end with A or B

cd ../../creatures
head unicorn.dat

Loops
for filename in *.dat
> do
>    head -n 2 $filename | tail -n 1
> done

printed:
CLASSIFICATION: basiliscus vulgaris
CLASSIFICATION: bos hominus
CLASSIFICATION: equus monoceros

for x in basilisk.dat minotaur.dat unicorn.dat
> do
>      head -n 2 $x | tail -n 1
> done

One line version
for x in basilisk.dat minotaur.dat unicorn.dat; do head -n 2 $x | tail -n 1; done
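The > at the start of the loop lines above is the shell's continuation prompt, not something you type. Written as it would appear in a script, the loop looks like the sketch below; the two .dat files are made-up stand-ins with just the first two lines filled in:

```shell
cd "$(mktemp -d)"

# Stand-in data files: only the first two lines matter for this loop
printf 'COMMON NAME: basilisk\nCLASSIFICATION: basiliscus vulgaris\n' > basilisk.dat
printf 'COMMON NAME: unicorn\nCLASSIFICATION: equus monoceros\n' > unicorn.dat

for filename in *.dat
do
    echo "$filename"                      # show which file we are on
    head -n 2 "$filename" | tail -n 1     # print only its second line
done
```

Quoting "$filename" is optional for these names but keeps the loop working if a file name ever contains a space.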

history - to see history of commands

Shell Scripts can be useful in creating scripts for commands that you run for multiple datasets: https://swcarpentry.github.io/shell-novice/06-script/index.html

Last lesson is for finding things

Use grep to select lines from text files that match simple patterns.
Use find to find files and directories whose names match simple patterns.
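We did not get to this lesson, so here is only a small taste. A minimal sketch; the directory, files and text are made up for illustration:

```shell
cd "$(mktemp -d)"
mkdir writing
printf 'It is not publish or perish any more\nIt is share and thrive\n' > writing/quotes.txt
touch writing/notes.md

grep thrive writing/quotes.txt        # print the lines that contain "thrive"
grep -n share writing/quotes.txt      # -n adds the line number: 2:It is share and thrive
find . -name '*.txt'                  # files below the current folder ending in .txt
find . -type d                        # directories only
```

grep looks inside files; find looks at file and directory names.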