CS6603 AI, Ethics, and Society

Module 1: Data, Individuals, and Society

Notes by Taichi Nakatani (tnakatani3@gatech.edu)

Lesson 1 - Introduction

China's Social Credit system

  • Punishes those not loyal to Communist party. No due process.
  • AI for monitoring
  • Profiling via video (age, gender)
  • Glasses to recognize those on wanted list.

What is big data

Big Data: Process of applying computing power to aggregate large, complex sets of information.

Targeted Messaging

  • Big data allows orgs to target specific demographics
  • The more data you allow to be collected the more they can target your interests.

What's the problem

  • Organizations aren't interested in you personally, but as a collective "you" with similar traits and behaviors.
  • These assumptions are based on historical data, which means it can embed our historical biases.

Example: Criminal detection with headshots

  • Chinese scientists claimed they could distinguish criminals from headshots with 90% accuracy, and that the algorithms are free from the biases that cloud human judgment.
  • Data: 1,800 photos of Chinese men aged 18-55. 1,100 were photos of non-criminals scraped from the web; the others were pictures of criminals provided by police.
    • Behavioral bias: In mugshots, people are not smiling and not happy; in non-mugshot photos, people typically smile. This injects a behavioral bias linking not smiling to criminality.
  • Physiognomy: The practice of using people's outer appearance to infer inner character.

Example: AI guessing sexual orientation

  • AI "learned" that gay men had larger foreheads than straight men, and vice versa for lesbians.
  • Data: Scraped dating website. 35,000 facial images of ~14,000 ppl, straight/gay evenly distributed.
  • Eval: Compared ML accuracy to human annotation via Amazon Mechanical Turk.

AI & Unintended Consequences

  • Most companies cover up embedded biases in their algorithms by blocking certain outputs rather than retraining the models to remove these biases.
  • Search results for images of certain positions (doctors) returned a specific gender (men). Reflects societal bias.
  • Bias is fed back by user's behaviors (preferring to click male doctors instead of female doctor images)
  • Unintended consequences caused by using historical data for future prediction (criminality based on historical data).

Lesson 3 - Relationship between Ethics and Law

Ethics: Principles that distinguish what is morally right/wrong. No governing authority to sanction it.

Law: System of rules established by government to maintain stability and justice. Defines legal rights and provides means of enforcing them.

Lesson 4 - Data Collection

Lesson 5 - Fairness and Bias

Algorithmic fairness: how can we ensure that our algorithms act in ways that are fair?

  • Accountability: How to supervise/audit AI which have large impact
  • Transparency: Why does an algo behave a certain way, explainability.
  • AI safety: How to make AI without unintended negative consequences.

Why fairness is hard

Bank loan problem: If sensitive attribute (e.g. postal code) is correlated with other attributes, AI will find those correlations.

  • Easy to predict the class if you have lots of other information (e.g. home address, spending patterns)
  • More sophisticated approaches are necessary.

Principles for Quantifying Fairness

Group Fairness: Assessing fairness by using statistical parity. Require the same percentage of group A and group B to receive loans (in the bank loan context).

  • Formula - Given two groups, both groups should receive the same % of loans.
    P(loan | no repay, A) == P(loan | no repay, B)
    P(no loan | would repay, A) == P(no loan | would repay, B)
  • Problem: What if groups A and B have different probabilities of repaying?
  • Should bank take a loss for the sake of group fairness?
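
A minimal sketch (with made-up toy arrays; the rate helper is hypothetical) of checking the two conditions above from recorded decisions and outcomes:

import numpy as np

group = np.array(['A', 'A', 'A', 'B', 'B', 'B'])   # protected attribute
would_repay = np.array([1, 1, 0, 1, 0, 0])         # ground truth
got_loan = np.array([1, 0, 1, 1, 0, 0])            # model decision

def rate(decision_value, repay_value, g):
    # P(decision | repayment outcome, group)
    mask = (group == g) & (would_repay == repay_value)
    return np.mean(got_loan[mask] == decision_value)

for g in ('A', 'B'):
    print(g,
          "P(loan | no repay) =", rate(1, 0, g),
          "P(no loan | would repay) =", rate(0, 1, g))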

Individual Fairness: Assess fairness by whether similar people (background) experience similar outcomes.

  • These measures compare the protected group against the unprotected group.
  • Risk difference (UK law): To ensure fairness, risk difference (ie "absolute risk reduction") should be minimal.
  • Risk ratio (EU court of justice): Proportion of protected/unprotected group is the focus.
  • Problem: Consistency might result in everyone being treated equally badly.

indivfair

What is bias?

  • moralmachine.mit.edu - addresses trolley problem (end one life to save five?)

Module 2: BS of Big Data & Stats 101

Lesson 6: Overview

Statistics: The science of collecting, organizing, presenting, analyzing and interpreting data to assist in making effective decisions.

Goal of module:

  • How to identify bad statistics.
  • How to use data to train your algorithms in more unbiased and fair ways.

Brief history of stats: Graunt's "Natural and Political Observation Made upon the Bills of Mortality"

  • Used London bills of mortality to estimate city's population in ~1660.
  • Stats: There were ~3 deaths for every 88 people. Since London bills cited 13,200 deaths, Graunt estimated London population to be about 387,200 (13,200 * 88 / 3).
  • Issues: Higher-income neighborhoods would have lower death rates and poorer neighborhoods higher ones. Doesn't account for the homeless. Assumes the death-to-population ratio is the same in every neighborhood.

How to mislead through poor sampling

Definitions:

  • sample: data collected
  • population: the body from which the data is collected

Example: Analyzing change in high school students' interest in computing.

  • Biased sampling: Sampling from a specific subgroup (e.g. high-income high school students in Georgia) and extrapolating the findings to the whole population.
  • Problem: Most analysis doesn't provide deep enough info on the sample population.

How to mislead through poor analysis

Definitions:

  • Data analysis: Process of gathering, modeling, transforming data with the goal of highlighting useful info, suggesting conclusions, and supporting decision making.
  • Problem: Scientists have a propensity to throw in all the data and see what works. This magnifies biases in the data.

Example: Lying with graphs

  • In the graphs below, the right chart's y-axis doesn't include zero. This warps the distance between y-values and misleads the viewer into thinking unemployment rates are falling significantly.

charts

Example: Unemployment data

  • Sample: 60,000 households.
  • Formula: Unemployment Rate = # unemployed / # labor force
  • Problem: The questionnaire doesn't count those who haven't looked for a job in over 30 days as part of the labor force, so they aren't counted in the unemployment rate.

Household Survey vs Establishment Survey

  • Household survey asks if you're working. Establishment asks how many ppl are on payroll.
  • Household survey includes agricultural workers, self-employed, and private household workers. Establishment doesn't.
  • Household survey counts people on unpaid leave as employed. Establishment doesn't.
  • Household survey only counts people age 16 and over; the establishment survey has no age restriction.
  • Establishment survey often "double counts" jobs (e.g. person works 2 jobs, employee quits one job and is employed at another in same payroll period)
  • Leads to a delta between the household and establishment numbers. Establishment numbers are also often revised.

payroll

How to mislead through interpretation

tl;dr - don't trust graphs

  • Bar chart axes should include zero.
  • Don't invert the y-axis.

Reference: https://www.callingbullshit.org/tools/tools_misleading_axes.html

Lesson 7: Python and Stats 101

Defining Data Analytics

  • Descriptive Analytics: Methods of organizing, summarizing, presenting data (freq table, histogram, mean, variance)
  • Inferential Analytics: Methods used to draw conclusions about a population using sample statistics.

Diff between big data & data analytics

  • Big data focuses on handling non-traditional "big" data.
  • Data analytics focuses on gaining meaningful insight regardless of the size of the data.

AI / ML / DL

  • AI: machines imitating intelligent human behavior
  • ML: the process by which a computer continuously improves its own performance by incorporating new data into an existing statistical model
  • DL: artificial neural networks learn from large amounts of data.

All about the data

Data:

  • Facts and figures, collected and analyzed.
  • Can have quantitative / qualitative values
  • Can be continuous / categorical
  • Ordinal/rank - in order but not necessarily equal (e.g. Likert scale)
  • Cross-sectional - collected at the same time.
  • Time-series - data collected over several time periods.

Lesson 8: Descriptive Statistics

Types of Descriptive Statistics

Descriptive stats: Methods of organizing, summarizing, and presenting data in an informative way (freq table, histogram, mean, variance)

Inferential Analytics: Methods used to determine something about a population on the basis of a sample (ML/AI for big data)

  • Population: Entire set of indiv or objects of interest or the measurements obtained from all individuals or objects of interest.
  • Sample: Portion, or part of the population of interest.

Types of Studies

  • Experimental Study: One variable is manipulated, and second variable is observed and measured to determine effect of treatment variable. Measurements are compared to see if there are differences between conditions. (Facebook's emotion contagion experiment)
  • Correlation Study: Determine whether there is a relationship between two variables and to describe the relationship. A correlational study simply observes the two variables as they exist naturally.
  • Quasi-Experimental: Compares groups based on a variable that differentiates the groups (e.g. male/female)

Sampling & Sampling Error

Sampling Error: Discrepancy between a sample statistic and its population parameter. Can lead to sampling bias.

Median, Mean, and Mode

When to use median vs mean:

  • Mean is best for symmetric distributions
  • Median is less sensitive to outliers than the mean. Better measure for highly skewed distributions (e.g. family income, housing prices, etc)

Using mean vs median for your messaging

Headline should be the one on the bottom if the math was done correctly.

guns

Mode

  • Most frequently occurring number (score, measurement)
  • Value that is observed most frequently.
  • Value is undefined for sequences with no duplicates.
  • Example: The average number of tickets purchased per person for a GT football game is almost always going to be accurately reflected by the mode.
  • Lying with mode: Any survey that rates on a broad scale can be manipulated to emphasize the mode.
    • If you survey 100 ppl on a scale of 1-10 about their feelings on a subject, and more people rate it "10" than any other number, then even if only one more person gave a 10 rating than gave a 1 rating, 10 is the mode "average".

How to mislead with averages

Example: Manipulating average income of a neighborhood

  • A real estate agent can manipulate the average income with perfect honesty and "truthfulness", telling different people that the average income in the neighborhood is:
  1. $150,000 - mean of the incomes of all the families in the neighborhood. One home is a giant, expensive mansion.
  2. $35,000 - median income.
  3. $10,000 - mode of neighborhood
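
A quick check of these three "averages" using Python's statistics module, with a hypothetical set of neighborhood incomes chosen to reproduce the numbers above:

import statistics

# hypothetical incomes: mostly modest homes plus one expensive mansion
incomes = [10_000, 10_000, 10_000, 35_000, 50_000, 60_000, 875_000]

print(statistics.mean(incomes))    # 150000 -> the "mean" pitch
print(statistics.median(incomes))  # 35000  -> the median income
print(statistics.mode(incomes))    # 10000  -> the mode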

misleadavg

Frequency Distribution

Definition: Tallies number of times a data point occurs.

Cumulative frequency distribution: "Running total" of frequencies.

  • Tells the total number of data items at different stages in the data set.
  • How to lie with frequency distribution: Showing "cumulative" frequency distribution vs basic distribution.
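
A minimal sketch of a frequency and cumulative frequency distribution, using a hypothetical list of data points:

from collections import Counter
from itertools import accumulate

# hypothetical data points (e.g. tickets purchased per person)
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

freq = Counter(data)                                     # frequency distribution
values = sorted(freq)
cumulative = list(accumulate(freq[v] for v in values))   # running total

for v, c in zip(values, cumulative):
    print(v, freq[v], c)   # value, frequency, cumulative frequency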

Ref: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch10/5214862-eng.htm

Variability

Definition: Measures the amount of "scatter" in a dataset. Shows how well the avg characterizes data as a whole.

# Both have same mean (50) but different stdev (20 vs 10)
a = [30, 50, 70]
b = [40, 50, 60]
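
A quick verification of the claim above (the 20 vs 10 figures assume the sample standard deviation, ddof=1):

import numpy as np

a = [30, 50, 70]
b = [40, 50, 60]

# same mean, different spread
print(np.mean(a), np.std(a, ddof=1))   # 50.0 20.0
print(np.mean(b), np.std(b, ddof=1))   # 50.0 10.0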

Examples: range, variance, stdev, interquartile range, coefficient of variation.

range

quartiles

Ref: https://junkcharts.typepad.com/junk_charts/boxplot/

Inferential Statistics: Sampling Bias

Definition: Drawing inferences about an individual based on data drawn from a larger group of similar individuals.

Examples: Credit card / loans, hiring.

Chain of reasoning for inferential stats

chain

  • Not all samples will lead to good prediction about an entire population.

Case Studies

The Institute decides to get rid of the Chick-fil-A Express in the student center. After a survey of all the faculty, it was overwhelmingly decided that Chick-fil-A would be replaced with a To-Go Fogo de Chão Brazilian Steakhouse.

  • Who is the population? A: Faculty (NOT students)
  • A sample weighted more towards a certain group will result in inaccurate conclusions being drawn about the population.

Simpson's Paradox

Definition: A trend appears in several different groups of data but disappears or reverses when these groups are combined.

Example (ref: https://blog.revolutionanalytics.com/2013/07/a-great-example-of-simpsons-paradox.html)

  • Since 2000, the median US wage has risen about 1%, adjusted for inflation.
  • But over the same period, within every education subgroup, the median wage is lower now than it was in 2000:
    • high school dropouts,
    • high school graduates with no college education,
    • people with some college education, and
    • people with Bachelor’s or higher degrees
  • WHY? Changing educational profile of the workforce
    • There are now more college graduates (with higher-paying jobs), and wages for college grads have fallen at a slower rate (-1.2%) than wages for those with less education (-7.9% for high school dropouts). The shift of the workforce toward the higher-paid, college-educated group swamps the within-group wage declines.

Example (How statistics can be misleading - Mark Liddell): https://www.youtube.com/watch?v=sxYrzzy3cq8&t=26s

  • How can Hospital A, with lower survival rates for both healthy and unhealthy patients, have a better overall survival rate than Hospital B, which has higher survival rates for both?
  • WHY? Relative proportion of healthy/unhealthy patients in each sample.
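
A toy numeric illustration of the hospital example (the counts are hypothetical, chosen only to reproduce the paradox):

# (survived, total) per subgroup, hypothetical numbers
hospitals = {
    'A': {'healthy': (870, 900), 'unhealthy': (40, 100)},
    'B': {'healthy': (98, 100),  'unhealthy': (400, 900)},
}

for name, groups in hospitals.items():
    for cond, (s, n) in groups.items():
        print(name, cond, round(s / n, 3))      # A is lower in both subgroups
    total_s = sum(s for s, n in groups.values())
    total_n = sum(n for s, n in groups.values())
    print(name, 'overall', round(total_s / total_n, 3))   # yet A is higher overall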

Biased Sampling

Statistical definition of bias

  • An estimator is unbiased if the mean of means == the true mean (the bias is zero)
  • "Mean of means" is called the "expected value" of the estimator.

Types of Sampling Bias

  1. Area Bias: Bias by sampling in specific area not representative of the population (e.g. sampling only from East End in Pittsburgh vs all neighborhoods).
  2. Selection Bias: Sampling method is done in a way that proper randomization is not achieved, ergo the sample isn't representative of the population (e.g. cherry picking the sample to confirm your hypothesis).
  3. Self-selection Bias: Participants' decision to join a study may be correlated with traits that affect the study. Individuals select themselves into a group, causing a biased sample with nonprobability sampling. E.g. survey under-21s about drinking; respondents are probably those that don't drink.
  4. Leading Question Bias: Giving participants a clue as to the desired answer. (e.g. "Don't you think that...?" suggests agreement)
  5. Social Desirability Bias: Participants persuaded to answer in a socially acceptable manner (e.g. "Do you brush your teeth in the morning?" in a group of people).

Biased Sampling Example

samplingmethods

  1. (a) - high variability and high bias
  2. (b) - low variabilty and low bias
  3. (c) - high variability and low bias
  4. (d) - low variability and high bias
  • Good sampling method has both low bias and low variability
  • Graph (b) is theoretically the best, but it suggests the distribution is gaussian (it might not be)

Types of Randomized Sampling

Simple random sampling: randomly sample from population

Systematic Sampling

  • Given data that is sequentially numbered, choose every nth piece of data.
  • Cons: There may be bias in the ordering of the sequence (e.g. how students are listed).

Stratified random sampling: Data is divided into subgroups (strata)

  • Based on specific characteristics (age, education level)
  • Use random sampling within each stratum.

Pros and Cons of each:

sampling

Cluster random sampling: Split the population into similar parts, or clusters.

  • Each cluster should be mini version of entire population.
  • Select one or few clusters at random and select simple random sample from each cluster.
  • If each cluster fairly represents the full population, cluster random sampling will give us an unbiased sample.
  • Pro: Useful when difficult and costly to develop complete list of population members (e.g. all items sold at grocery store)

Non-probability Sampling: Participants are chosen/choose themselves so that chance of being selected is not known.

  • No one has figured out how to select a representative sample of internet users.

Inferential Statistics: Causation vs Correlation

Correlation tells us two variables are related.

Types of relationship reflected in correlation:

  • X causes Y or Y causes X (causal relationship)
  • X and Y are caused by a third variable Z (spurious relationship)

Important: Correlation doesn't imply causation.

Correlation coefficient summarizes the association between 2 variables.

  • 1.0 is perfect positive relationship, 0.0 is no relationship, -1.0 is perfect negative

Correlation vs Causation Examples

"Correlation between worker's education levels and wages is strongly positive"

Issues:

  • Recall: Correlation tells us two variables are related but doesn't tell us why
  • Causation: Education improves skills and skilled workers get better paying jobs
  • Rebuttal: Individuals are born with an innate talent A that is relevant for success in education as well as for success on the job.

Examples of "spurious correlations": www.tylervigen.com

spurious

Relationships

Relationships between two variables are often influenced by other unknown variables.

  • Common response: Variable Z (unknown) affects X and Y. Change in an unknown variable is causing change in both our explanatory variable and our response variable.
  • Confounding: Variable Z (unknown) or X affects Y. Either the change in our explanatory variable is causing changes in the response variable, or a change in an unknown variable is causing changes in the response variable.

Measuring Linear Correlation in Python

Linear correlation coefficient: a measure of the strength and direction of a linear association between two random variables (also called the Pearson product-moment correlation coefficient)

from scipy import stats

stats.pearsonr(X, Y)
  • The linear correlation coefficient quantifies the strengths and directions of movements in two random variables
  • Correlations of -1 or +1 imply an exact linear relationship
  • Positive correlations imply that as x increases, so does y.
  • Negative correlations imply that as x increases, y decreases.
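
A small usage example of pearsonr on toy data (the x/y arrays are made up):

import numpy as np
from scipy import stats

# toy data: y is a noisy increasing function of x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p_value = stats.pearsonr(x, y)
print(round(r, 3))   # close to +1: strong positive linear association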

Inferential Statistics: Confidence

Empirical Rule

Definition: For a normal distribution, almost all of the data will fall within 3 standard deviations from the mean. Assumes that the data follows a gaussian distribution.

bell

Example: IQ

  • IQ scores are normally distributed with a mean of 100 and a stdev of 15.
  • 68% of IQ scores (85 to 115) fall within ±1 standard deviation of the mean.
  • 95% of IQ scores (70 to 130) fall within ±2 standard deviations of the mean.
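
A quick check of these percentages with scipy's normal distribution:

from scipy.stats import norm

mean, sd = 100, 15

# fraction of IQ scores within ±1 and ±2 standard deviations of the mean
print(round(norm.cdf(115, mean, sd) - norm.cdf(85, mean, sd), 3))   # ~0.683
print(round(norm.cdf(130, mean, sd) - norm.cdf(70, mean, sd), 3))   # ~0.954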

Population Proportions and Margin of Error (MoE)

sample

Margin of Error:

  • How confident we are is usually expressed as a percentage
  • Going back to Empirical Rule, we saw that 95% of area of a normal curve lies within +-2 stdev of the mean.
  • This means that we are 95% certain that the population proportion is within ±2 stdev of the sample proportion. ±2 stdev is our margin of error
  • Percentage margin of error depends on sample size.
import math

# At the 95% level of confidence, with n = sample size
margin_of_error = 1 / math.sqrt(n)

# Example: for n = 1000, the MoE is about ±3% (1 / math.sqrt(1000) ≈ 0.032)

Example: Surveys

Company X surveys customers and finds that 50% of the respondents say its customer service is "very good". The confidence level is cited as 95% ± 3% MoE.

  • This means if the survey is conducted 1000 times, the percentage of those who respond "very good" will range between 47 and 53 percent 95 percent of the time.

Confidence Interval

  • We can estimate population proportion using a confidence interval.
  • If we build the MoE around the true value, it will capture 95% of all the samples.
  • If we build the MoE around the sample statistic, it would have a 95% chance of capturing the true value.

Example:

moe

  • If we run this sample many times, 95% of the time the proportion of those who spent over $5 will be within a ±0.2 MoE interval from the sampled proportion (e.g. 0.4).

Sample Size and MoE

  • MoE estimates how accurately the results of the poll reflect the true value, ie. the population.
  • As sample size increases, MoE decreases.
  • The MoE decreases with the square root of the sample size (diminishing returns), so consider the cost/benefit of collecting a larger sample.

MoE Table:

| Sample Size | % MoE   |
|:------------|:--------|
| 25          | ±20%    |
| 64          | ±12.5%  |
| 100         | ±10%    |
| 256         | ±6.25%  |
| 400         | ±5%     |
| 625         | ±4%     |
| 1111        | ±3%     |
| 1600        | ±2.5%   |
| 2500        | ±2%     |
| 10000       | ±1%     |
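
The table can be reproduced with the MoE = 1/sqrt(n) rule from above:

import math

# MoE = 1 / sqrt(n) at the 95% confidence level
for n in [25, 64, 100, 256, 400, 625, 1111, 1600, 2500, 10000]:
    print(n, f"±{100 / math.sqrt(n):.2f}%")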

Applications of MoE: Examples

Example 1:

  • A company claims 30% of ppl who eat their product really like it. CI is cited as 95%.
  • In June, an independent survey was conducted with 625 randomly selected members to verify this claim.
  • Result of survey was that 125 liked the product.
  • Q: Would you say, at a 5% level of significance, that the company was correct in stating that 30% of people liked their product?

Answer:

  • Proportion of people who liked the product in sample: 125/625 = 0.2, 20%
  • MoE for n=625 at 95% confidence is 1/math.sqrt(625) = 0.04 (±4%)
  • The MoE range is 0.26 to 0.34 (30% ± 4%)
  • Conclusion: No, 20% is not within the MoE range of 26-34%

Example 2:

  • In a survey I want a MoE to be ±5% at 95% level of confidence. What sample size must I pick in order to achieve this?
  • Answer: Sample size should be 400 to get MoE 5% at 95% level of confidence:
    # Derive the sample size from a MoE of 0.05
    # moe = 1/math.sqrt(n)  =>  n = 1/moe**2
    moe = 0.05
    n = 1 / moe**2
    print(n)  # 400.0

Example 3:

  • Company claims that 10% of candies it produces are green.
  • Students found that in a large sample of 500 M&Ms, 60 were green.
  • Q: Assuming company claim is true, would 60/500 proportion be unusually high or low proportion of green M&Ms?

Answer:

  • Proportion of green M&Ms in the sample: 60/500 = 0.12 (12%)
  • MoE for n=500 is 1/math.sqrt(500) ≈ 0.045 (±4.5%)
  • The MoE range around the sample proportion is 0.075 to 0.165 (7.5-16.5%).
  • Conclusion: The claimed 10% proportion falls within the MoE range for this sample size, so 60/500 is not an unusually high or low proportion.
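
A small helper (hypothetical function name) that applies the course's 1/sqrt(n) approximation to both Example 1 and Example 3:

import math

# check whether a claimed proportion is consistent with an observed sample
# proportion, at the 95% confidence level
def within_moe(claimed, observed, n):
    moe = 1 / math.sqrt(n)
    return abs(claimed - observed) <= moe

print(within_moe(0.30, 0.20, 625))   # Example 1: False (claim not supported)
print(within_moe(0.10, 0.12, 500))   # Example 3: True (claim consistent)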

Module 3: AI/ML Techniques

Goal: Understand and apply basic AI/ML techniques to data scenarios, with a focus on instituting "fair" practices when designing decision-making systems based on big data.

Lesson 12: Word Embeddings

Word Embeddings (NLP)

  • Word embeddings transform human language meaningfully into a numerical form. This allows algorithms to understand the nuances implicitly encoded into our language.

Bias in word embeddings

  • NLP products, if trained on toxic data, will generate biased/toxic output (e.g. Microsoft's Tay chatbot).
  • Data can be "sanitized" to prevent biased/toxic outputs.

Word Similarity & Relatedness

A note on word vs semantic similarity:

  • Semantic similarity: Metric defined over a set of terms, and its distance/similarity is based on the likeliness of their meaning.
  • Word similarity: Metric comparison of a word's syntactical representation or string format.

Two prevailing uses of similarity:

  1. Using a dictionary (e.g. WordNet)
  2. Learning similarity statistics using a large corpus of data.

Vector Space Models

  • Vectorization: Process of converting text to numbers. This conversion helps to measure similarity between words.
  • Vector space models are models representing text as a vector of identifiers in which similar words are mapped to points in geometric space.

cosine

Representations

Document Occurrence: Assign identifiers corresponding to the count of words in each document (from a cluster of docs) in which the word occurs.

documents

  • 12 documents (menus), "chocolate" is mentioned 7 times in 5 docs.
  • A vector space model may find a relation between the docs, e.g. that they are dessert menus. This can be used to score the probability that menus mentioning "chocolate" are dessert menus.

Word Context: Quantify co-occurrence of terms in a corpus by constructing a co-occurrence matrix to capture the number of times a term appears in the context of another term.

Example: Create a word co-occurrence table between "chocolate is the best dessert in the world", "GT is the best university in the world" and "The world runs on chocolate".

wordcontext

Example: Comparing a tiny corpus of sports documents. Document occurrence finds that "losangeles" + "dodgers" and "atlanta" + "falcons" co-occur. Word co-occurrence shows a different viewpoint.

dococcurrence

wordoccurrence

Cosine similarity & word analogy

  • Cosine similarity estimates how similar two words are.
  • IMPORTANT: Similarity measures are highly dependent on which vector representation is selected to represent the words found in your corpus.

FORMULA:

cosine

  • Given two vectors a and b, the cosine similarity is defined as the dot-product of the two vectors divided by their length.
  • The formula measures the cosine of the angle between two vectors projected in a multi-dimensional space.
  • The smaller the angle, the more similar the words are. 1 = related, 0 = unrelated, -1 = related but opposite
# Similarity = (A.B) / (||A||.||B||) 

import numpy as np
from numpy.linalg import norm
from itertools import permutations

# toy vectors using atlanta, falcons, los angeles and dodgers
atlanta = ('atlanta', np.array([1, 1, 0, 0]))
falcons = ('falcons', np.array([1, 1, 0, 0]))
los_angeles = ('los angeles', np.array([0, 0, 1, 1]))
dodgers = ('dodgers', np.array([0, 0, 1, 1]))

# compute cosine similarity
def cos_sim(x, y):
    return np.dot(x, y) / (norm(x) * norm(y))

# compute cosine similarities among toy vectors
for p1, p2 in list(permutations([atlanta, falcons, los_angeles, dodgers], 2)):
    cosine = cos_sim(p1[1], p2[1])
    print(f"Similarity({p1[0]}, {p2[0]}): {round(cosine, 2)}")

Results: The cosine similarity shows that atlanta/falcons and los angeles/dodgers are similar to each other, while cross-city pairs are not.

Similarity(atlanta, falcons): 1.0
Similarity(atlanta, los angeles): 0.0
Similarity(atlanta, dodgers): 0.0
Similarity(falcons, atlanta): 1.0
Similarity(falcons, los angeles): 0.0
Similarity(falcons, dodgers): 0.0
Similarity(los angeles, atlanta): 0.0
Similarity(los angeles, falcons): 0.0
Similarity(los angeles, dodgers): 1.0
Similarity(dodgers, atlanta): 0.0
Similarity(dodgers, falcons): 0.0
Similarity(dodgers, los angeles): 1.0

Word Analogy Task

  • Task: "a is to b as c is to ???"
  • Formula: Find the word vector that is most similar to the result vector of vec_c + vec_b - vec_a
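
A minimal sketch of the analogy task using toy 3-d vectors (the embeddings are made up; a real model would use learned vectors such as the GoogleNews ones below):

import numpy as np
from numpy.linalg import norm

# hypothetical embeddings illustrating "man is to king as woman is to ???"
vectors = {
    'man':   np.array([1.0, 0.0, 0.0]),
    'woman': np.array([1.0, 1.0, 0.0]),
    'king':  np.array([1.0, 0.0, 1.0]),
    'queen': np.array([1.0, 1.0, 1.0]),
}

def cos_sim(x, y):
    return np.dot(x, y) / (norm(x) * norm(y))

# vec_c + vec_b - vec_a with a = man, b = king, c = woman
target = vectors['woman'] + vectors['king'] - vectors['man']

query = {'man', 'king', 'woman'}
best = max((w for w in vectors if w not in query),
           key=lambda w: cos_sim(vectors[w], target))
print(best)   # 'queen'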

Examples: From http://bionlp-www.utu.fi/wv_demo/ - English GoogleNews Model

analogies

Word Embeddings (Word2Vec)

  • Stores each word as a point in a multidimensional space, represented by a vector with a fixed number of dimensions (generally 300).
  • Dimensions are projections along different axes.
  • Assumption: Similar words have similar angles
  • Unsupervised, built just by reading large corpus of data
  • Example: "Chocolate" might be represented as [1, 0, 1, 1, 0, 2]

vecspace

Vector Space Models for Word Embeddings

Context prediction models (Skipgram, W2V): Predict the context of a given word by learning probabilities of co-occurrence from a corpus.

  • In theory, words that share similar contexts tend to have similar meanings.
  • Thus, instead of counting co-occurrence we should be able to generate word vectors that can predict the context of a word based on its surrounding words by learning from a corpus of data.

Word2Vec

w2v

2 Types of Word2Vec:

  1. Continuous Bag of Words (CBOW): Neural network trained to predict which word fits in a gap in a sentence. Example: "the student ___ the exam"; the model is optimized to fill the gap with the word that has the highest probability.
  2. Skipgram: Starts with a single word embedding and tries to predict the surrounding words.
    • W2V uses words a few positions away from each center word to predict similarities between every word and its context words.
    • Pairs of center word / context word are called skip grams
    • Example: "the student passed the exam". Center word = "passed", context words = ["the", "student", "the", "exam"].

cbowskipgram

skipgram

Word2Vec Params

Important parameters

  • Window size: Can affect the result of the vector space model
  • Iterations (epochs): The number of training passes over the corpus can also affect the resulting vectors
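
A hedged sketch of how these parameters appear when training Word2Vec with gensim (assuming gensim 4.x; parameter names differ in older versions, and the toy corpus is far too small for meaningful vectors):

from gensim.models import Word2Vec

# tiny toy corpus; a real model needs a large corpus
sentences = [
    ["the", "student", "passed", "the", "exam"],
    ["the", "student", "failed", "the", "exam"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    epochs=10,        # number of iterations over the corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,
)

print(model.wv.most_similar("exam", topn=2))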

Lesson 13: Bias in Word Embeddings

Bias in Word Embeddings

Why does this happen?


Lesson 14: Facial Recognition

Facial Recognition Algorithms

Steps:

  1. Face detection: two-class classification. The first step of any automatic face recognition system
    • Positioning
    • Rotation and pose
    • Occlusion (hidden face)
    • Resolution
    • Single image or sequence of images (video)
  2. Segmentation based on the detected face, then normalization (translation/scaling/rotation).
    • Multi-class classification (one person vs all others)
  3. Face Identification - tell which person it is.
  4. Face verification - verify whether the person is who they claim to be.

Method:

  • Face recognition algorithms "measure" nodal points on the face (distance between the eyes, length of the nose, angle of the jaw)
  • Features: upper ridges of eye, nose shape, mouth size, position of features relative to each other.
  • Face space: A theory from psychology that defines a multidimensional space in which recognizable faces are stored. Faces are represented in this space according to invariant features of the face itself.

facespace

  • Appearance-based methods train a classifier, typically using supervised learning. Deep neural nets are the most common method as of 2022.

Human biases

  • "Own Race" bias (Meissner and Brigham) - 2x more likely to identify own race than other race
  • "Own gender" bias
  • "Own age" bias

Deep Neural Network for Facial Recognition

  1. Facebook DeepFace - largest facial dataset (in 2014), trained on 4MM images belonging to more than 4000 identities.
  2. Microsoft Celeb Dataset - 10MM images of 100,000 individuals. Scraped from images with Creative Commons licenses
  3. Duke MTMC Dataset - captured students between lectures; 2MM frames of 2,000 students
  4. Stanford Brainwash Dataset - 10,000 images, 82,000 annotated heads.

The last three datasets were taken down in 2019. All were Creative Commons licensed but were used by foreign surveillance and defense organizations.

Other datasets: MegaFace Dataset - a face recognition training set of 4.7MM faces, sourced from Flickr.

Emotions: Facial Recognition

Facial recognition algorithms are used to gauge a person's emotions. Uses include driver attention monitoring, gauging movie audience reactions, and healthcare applications.

These AIs were originally built upon Ekman's studies (the claim that emotional expressions are universal)

Procedure:

  1. Extract facial features
  2. Feed features into a classifier (NNs)
  3. Classify image/features to one of the pre-selected emotion categories (6 universal emotions + neutral).

Facial Action Units

fau

Case Study: TSA's Screening of Passengers by Observation Techniques (SPOT) Program

  • Deploys over 3,000 behavioral detection officers in an effort to identify passengers who may pose a risk to aviation security.
  • Criticized for racial profiling.

Lesson 15: Bias in Facial Recognition

In the wild, facial identification becomes problematic because of:

  • resolution
  • facial pose
  • illumination
  • occlusion

Results in:

  1. Facial feature points not found
  2. Higher errors
  3. Not enough data or feature points to analyze

Error rates for face recognition:

  1. False positives - matching a wrong person to an image
  2. False negatives - not matching the right person to an image
  • No standards exist for "acceptable" error rates. Depends on the facial recognition system used and its application.

Bias in the Data

biasdata

Why Bias Occurs in the Data

Training sets are hard to get. Need to buy/scrape/obtain more samples from underrepresented classes. A grey area arises with regard to scraping.

Lesson 16: Predictive Algorithms Pt 1

Evaluation Metrics


Module 4: Bias Mitigation Applications

Lesson 19: Fairness and Bias

Algorithmic Fairness - mitigate the effects of unwarranted bias/discrimination from AI/ML algorithms. Focus is on mathematical formalisms and algorithmic approaches to fairness.

Examples of types of algorithmic bias:

Addressing Source of Bias

bias

Problem: Biased data is stored in protected attributes. Solution: Remove the protected class attributes. But other features that correlate with the protected class may still redundantly encode it.

Addressing Fairness Measures

Problem: There are issues with error-rate imbalances such that different groups have different outcomes. Solution: Only outcomes matter; make sure groups are in line with predetermined "fairness" metrics.

Issues:

  • There are many definitions for fairness
  • Many of the definitions conflict

Principles for quantifying fairness

  • predictions for ppl with similar non-protected attributes should be similar
  • differences should be mostly explainable by non-protected attributes

Two basic frameworks for measuring fairness:

  • Fairness at individual: consistency or individual fairness
  • Fairness at group: statistical parity

Fairness in Loan Scoring Models

redline

Max Profit Model - Setting different thresholds for the two groups in order to maximize profit. Split into privileged vs unprivileged groups.

Profits computed on 4 components:

  1. Max profit if you grant loan to someone with high probability of paying it back
  2. Lose profit if you grant a loan to someone with low probability of paying it back.
  3. Neutral if you deny a loan to someone with a low probability of paying it back.
  4. (IMPORTANT) Lose some profit if you deny a loan to someone with a higher probability of paying it back (Opportunity costs)

Set different thresholds for the two groups and give the most loans to those with the highest probability of paying them back. This at least gives some loans to the underprivileged group, rather than denying them altogether.

Blinding Model - Class features and all "proxy" information removed.

  • The model is still unfair without the sensitive data. Biases are still encoded by proxies in the dataset.
  • The privileged group will have generally higher thresholds on all decision features.
  • Results in less profit in the case study than the max profit model.

Demographic Parity - All groups have same percentage approved.

  • Leads to bias against the privileged group.
  • Makes less profit than the max profit model, but more than the blinding model.

Equal Opportunity - Same percentage of "credit-worthy" candidates, ie. true positives, in both groups.

  • Best of all one-threshold models, but still doesn't do better than Max Profit model.

Other Group Fairness Metrics

  1. Statistical Parity Difference - calculate delta between unprivileged positives and privileged positives.
  2. Disparate Impact - calculate ratio between unprivileged positives vs privileged positives.
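
A minimal sketch of computing both metrics from predicted labels and group membership (the arrays are hypothetical):

import numpy as np

# hypothetical model decisions (1 = favorable outcome) and group membership
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
privileged = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = privileged group

p_priv = y_pred[privileged == 1].mean()     # positive rate, privileged group
p_unpriv = y_pred[privileged == 0].mean()   # positive rate, unprivileged group

print("Statistical parity difference:", p_unpriv - p_priv)   # ideally 0
print("Disparate impact:", p_unpriv / p_priv)                # ideally 1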

otherfairness

Other Biases in the Algorithm Process

biascycle

biasmitigation

3 phases of bias mitigation steps

  1. Preprocessing algorithms - modify training data
  2. In-processing algorithms - modify learning algorithm
  3. Post-processing algorithms - modify prediction labels

Lesson 20: Fairness and Bias Assessment Tools

AI Fairness 360

360

360algos

Preprocessing

preprocess
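
A rough sketch of how AIF360's Reweighing preprocessing step is typically invoked (the DataFrame is a toy example, and exact class/argument names may vary across aif360 versions):

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# toy data frame: 'sex' is the protected attribute, 'label' the outcome
df = pd.DataFrame({
    "sex":   [1, 1, 1, 0, 0, 0],
    "score": [0.9, 0.7, 0.8, 0.6, 0.4, 0.5],
    "label": [1, 1, 0, 1, 0, 0],
})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])

priv, unpriv = [{"sex": 1}], [{"sex": 0}]
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=unpriv,
                                  privileged_groups=priv)
print(metric.statistical_parity_difference(), metric.disparate_impact())

# Reweighing assigns instance weights so the transformed training data
# satisfies statistical parity before a model is trained
rw = Reweighing(unprivileged_groups=unpriv, privileged_groups=priv)
dataset_transf = rw.fit_transform(dataset)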

What-if Tool

https://pair-code.github.io/what-if-tool/

  • No-code method of exploring ML models

Other tools

othertools

Lesson 21: AI/ML Techniques for Bias Mitigation

Fair Classifiers

  • Can't simply drop protected attributes because other features are correlated with them.

Race/Sex Discrimination on different algorithms


Fairness-aware Algo Trade-offs

Determining thresholds for accuracy vs fairness must take into consideration legal requirements, ethics, and gaining trust.

When false positives are better than false negatives:

  • Image privacy: a false negative (something that needs to be blurred is not blurred) exposes private content, so erring toward over-blurring (false positives) is preferable.

When false negatives are better than false positives:

  • Spam filtering: a false positive (an important email flagged as spam) means you never read the email, so letting some spam through (false negatives) is preferable.

Bias consideration with regards to task: Example with gender.

  • Gender discrimination is illegal with loan applications.
  • Gender-specific medical diagnosis is desirable.

Lesson 22: AIES Wrap-up