--------------------------------------------------------------------------- -- Practical SQL: A Beginner's Guide to Storytelling with Data, 2nd Edition -- by Anthony DeBarros -- Try It Yourself Questions and Answers ---------------------------------------------------------------------------- ---------------------------------------------------------------------------- -- Chapter 1: Setting Up Your Coding Environment ---------------------------------------------------------------------------- -- There are no Try It Yourself exercises in this chapter! ---------------------------------------------------------------------------- -- Chapter 2: Creating Your First Database and Table ---------------------------------------------------------------------------- -- 1. Imagine you're building a database to catalog all the animals at your -- local zoo. You want one table for tracking all the kinds of animals and -- another table to track the specifics on each animal. Write CREATE TABLE -- statements for each table that include some of the columns you need. Why did -- you include the columns you chose? -- Answer (yours will vary): -- The first table will hold the animal types and their conservation status: CREATE TABLE animal_types ( animal_type_id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY, common_name text NOT NULL, scientific_name text NOT NULL, conservation_status text NOT NULL, CONSTRAINT common_name_unique UNIQUE (common_name) ); -- It's OK if your answer doesn't have all the keywords in the example above. Those -- keywords reference concepts you'll learn in later chapters, including table -- constraints, primary keys and and IDENTITY columns. What's important is that you -- considered the individual data items you would want to track. -- The second table will hold data on individual animals. Note that the -- column animal_type_id references the column of the same name in the -- table animal types. This is a foreign key, which you will learn about in -- Chapter 8. CREATE TABLE menagerie ( menagerie_id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY, common_name text REFERENCES animal_types (common_name), date_acquired date NOT NULL, gender text, acquired_from text, name text, notes text ); -- 2. Now create INSERT statements to load sample data into the tables. -- How can you view the data via the pgAdmin tool? -- Answer (again, yours will vary): INSERT INTO animal_types (common_name, scientific_name, conservation_status) VALUES ('Bengal Tiger', 'Panthera tigris tigris', 'Endangered'), ('Arctic Wolf', 'Canis lupus arctos', 'Least Concern'); -- data source: https://www.worldwildlife.org/species/directory?direction=desc&sort=extinction_status INSERT INTO menagerie (common_name, date_acquired, gender, acquired_from, name, notes) VALUES ('Bengal Tiger', '3/12/1996', 'F', 'Dhaka Zoo', 'Ariel', 'Healthy coat at last exam.'), ('Arctic Wolf', '9/30/2000', 'F', 'National Zoo', 'Freddy', 'Strong appetite.'); -- To view data via pgAdmin, in the object browser, right-click Tables and -- select Refresh. Then right-click the table name and select -- View/Edit Data > All Rows -- 2b. Create an additional INSERT statement for one of your tables. On purpose, -- leave out one of the required commas separating the entries in the VALUES -- clause of the query. What is the error message? Does it help you find the -- error in the code? -- Answer: In this case, the error message points to the missing comma. INSERT INTO animal_types (common_name, scientific_name, conservation_status) VALUES ('Javan Rhino', 'Rhinoceros sondaicus' 'Critically Endangered'); ---------------------------------------------------------------------------- -- Chapter 3: Beginning Data Exploration with SELECT ---------------------------------------------------------------------------- -- 1. The school district superintendent asks for a list of teachers in each -- school. Write a query that lists the schools in alphabetical order along -- with teachers ordered by last name A-Z. -- Answer: SELECT school, first_name, last_name FROM teachers ORDER BY school, last_name; -- 2. Write a query that finds the one teacher whose first name starts -- with the letter 'S' and who earns more than $40,000. -- Answer: SELECT first_name, last_name, school, salary FROM teachers WHERE first_name LIKE 'S%' -- remember that LIKE is case-sensitive! AND salary > 40000; -- 3. Rank teachers hired since Jan. 1, 2010, ordered by highest paid to lowest. -- Answer: SELECT last_name, first_name, school, hire_date, salary FROM teachers WHERE hire_date >= '2010-01-01' ORDER BY salary DESC; ---------------------------------------------------------------------------- -- Chapter 4: Understanding Data Types ---------------------------------------------------------------------------- -- 1. Your company delivers fruit and vegetables to local grocery stores, and -- you need to track the mileage driven by each driver each day to a tenth -- of a mile. Assuming no driver would ever travel more than 999 miles in -- a day, what would be an appropriate data type for the mileage column in your -- table. Why? -- Answer: numeric(4,1) -- numeric(4,1) provides four digits total (the precision) and one digit after -- the decimal (the scale). That would allow you to store a value as large -- as 999.9. -- In practice, you may want to consider that the assumption on maximum miles -- in a day could conceivably exceed 999.9 and go with the larger numeric(5,1). -- 2. In the table listing each driver in your company, what are appropriate -- data types for the drivers’ first and last names? Why is it a good idea to -- separate first and last names into two columns rather than having one -- larger name column? -- Answer: varchar(50) -- or text -- 50 characters is a reasonable length for names, and using either varchar(50) -- or text ensures you will not waste space when names are shorter. Using text will -- ensure that if you run into the exceptionally rare circumstance of a name longer -- than 50 characters, you'll be covered. Also, separating first and last names -- into their own columns will let you later sort on each independently. -- 3. Assume you have a text column that includes strings formatted as dates. -- One of the strings is written as '4//2017'. What will happen when you try -- to convert that string to the timestamp data type? -- Answer: Attempting to convert a string of text that does not conform to -- accepted date/time formats will result in an error. You can see this with -- the below example, which tries to cast the string as a timestamp. SELECT CAST('4//2021' AS timestamp with time zone); ---------------------------------------------------------------------------- -- Chapter 5: Importing and Exporting Data ---------------------------------------------------------------------------- -- 1. Write a WITH statement to include with COPY to handle the import of an -- imaginary text file that has a first couple of rows that look like this: id:movie:actor 50:#Mission: Impossible#:Tom Cruise -- Answer: The WITH statement will need the options seen here: COPY actors FROM 'C:\YourDirectory\movies.txt' WITH (FORMAT CSV, HEADER, DELIMITER ':', QUOTE '#'); -- If you'd like to try actually importing this data, save the data in a file -- called movies.txt and create the actors table below. You can then run the COPY -- statement. CREATE TABLE actors ( id integer, movie text, actor text ); -- Note: You may never encounter a file that uses a colon as a delimiter and -- pound sign for quoting, but anything is possible! -- 2. Using the table us_counties_pop_est_2019 you created and filled in this -- chapter, export to a CSV file the 20 counties in the United States that had -- the most births. Make sure you export only each county’s name, state, and -- number of births. (Hint: births are totaled for each county in the column -- births_2019.) -- Answer: COPY ( SELECT county_name, state_name, births_2019 FROM us_counties_pop_est_2019 ORDER BY births_2019 DESC LIMIT 20 ) TO 'C:\YourDirectory\us_counties_births_export.txt' WITH (FORMAT CSV, HEADER); -- Note: This COPY statement uses a SELECT statement to limit the output to -- only the desired columns and rows. -- 3. Imagine you're importing a file that contains a column with these values: -- 17519.668 -- 20084.461 -- 18976.335 -- Will a column in your target table with data type numeric(3,8) work for these -- values? Why or why not? -- Answer: -- No, it won't. In fact, you won't even be able to create a column with that -- data type because the precision must be larger than the scale. The correct -- type for the example data is numeric(8,3). ---------------------------------------------------------------------------- -- Chapter 6: Basic Math and Stats with SQL ---------------------------------------------------------------------------- -- 1. Write a SQL statement for calculating the area of a circle whose radius is -- 5 inches. Do you need parentheses in your calculation? Why or why not? -- Answer: -- (The formula for the area of a circle is: pi * radius squared.) SELECT 3.14 * 5 ^ 2; -- The result is an area of 78.5 square inches. -- Note: You do not need parentheses because exponents and roots take precedence -- over multiplication. However, you could include parentheses for clarity. This -- statement produces the same result: SELECT 3.14 * (5 ^ 2); -- 2. Using the 2019 Census county estimates data, calculate a ratio of births to -- deaths for each county in New York state. Which region of the state generally -- saw a higher ratio of births to deaths in 2019? -- Answer: SELECT county_name, state_name, births_2019 AS births, deaths_2019 AS DEATHS, births_2019::numeric / deaths_2019 AS birth_death_ratio FROM us_counties_pop_est_2019 WHERE state_name = 'New York' ORDER BY birth_death_ratio DESC; -- Generally, counties in and around New York City had the highest ratio of births -- to deaths in the 2019 estimates. One exception to the trend is Jefferson County, -- which is upstate on the U.S./Canadian border. -- 3. Was the 2019 median county population estimate higher in California or New York? -- Answer: -- California had a median county population estimate of 187,029 in 2019, almost double -- that of New York, at 86,687. Here are two solutions: -- First, you can find the median for each state one at a time: SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY pop_est_2019) FROM us_counties_pop_est_2019 WHERE state_name = 'New York'; SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY pop_est_2019) FROM us_counties_pop_est_2019 WHERE state_name = 'California'; -- Or both in one query (credit: https://github.com/Kennith-eng) SELECT state_name, percentile_cont(0.5) WITHIN GROUP (ORDER BY pop_est_2019) AS median FROM us_counties_pop_est_2019 WHERE state_name IN ('New York', 'California') GROUP BY state_name; -- Finally, this query shows the median for each state: SELECT state_name, percentile_cont(0.5) WITHIN GROUP (ORDER BY pop_est_2019) AS median FROM us_counties_pop_est_2019 GROUP BY state_name; ---------------------------------------------------------------------------- -- Chapter 7: Joining Tables in a Relational Database ---------------------------------------------------------------------------- -- 1. According to the Census population estimates, which county had the -- greatest percentage loss of population between 2010 and 2019? Try -- an Internet search to find out what happened. (Hint: The loss is related -- to a particular type of facility.) -- Answer: -- Concho County, Texas, lost 33 percent of its population from 2010 to -- 2019, the result of the closure of Eden Detention Center. -- https://www.texasstandard.org/stories/after-edens-prison-closes-what-comes-next-for-this-small-texas-town/ -- Simply use ASC in the ORDER BY clause to re-order the results, like this: SELECT c2019.county_name, c2019.state_name, c2019.pop_est_2019 AS pop_2019, c2010.estimates_base_2010 AS pop_2010, c2019.pop_est_2019 - c2010.estimates_base_2010 AS raw_change, round( (c2019.pop_est_2019::numeric - c2010.estimates_base_2010) / c2010.estimates_base_2010 * 100, 1 ) AS pct_change FROM us_counties_pop_est_2019 AS c2019 JOIN us_counties_pop_est_2010 AS c2010 ON c2019.state_fips = c2010.state_fips AND c2019.county_fips = c2010.county_fips ORDER BY pct_change ASC; -- 2. Apply the concepts you learned about UNION to create query -- results that merge queries of the Census county population estimates -- for 2010 and 2019. Your results should include a column called year -- that specifies the year of the estimate for each row in the results. -- Answer: -- Note that you pass a string for the year in both queries. SELECT '2010' AS year, state_fips, county_fips, county_name, state_name, estimates_base_2010 AS estimate FROM us_counties_pop_est_2010 UNION SELECT '2019' AS year, state_fips, county_fips, county_name, state_name, pop_est_2019 AS estimate FROM us_counties_pop_est_2019 ORDER BY state_fips, county_fips, year; -- 3. Using the percentile_cont() function from Chapter 6, -- determine the median of the percent change in estimated county -- population between 2010 and 2019. -- Answer: -0.5% SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY round( (c2019.pop_est_2019::numeric - c2010.estimates_base_2010) / c2010.estimates_base_2010 * 100, 1 )) AS percentile_50th FROM us_counties_pop_est_2019 AS c2019 JOIN us_counties_pop_est_2010 AS c2010 ON c2019.state_fips = c2010.state_fips AND c2019.county_fips = c2010.county_fips; ---------------------------------------------------------------------------- -- Chapter 8: Table Design That Works for You ---------------------------------------------------------------------------- -- Consider the following two tables from a database you’re making to keep -- track of your vinyl LP collection. Start by reviewing these CREATE TABLE -- statements. -- The albums table includes information specific to the overall collection -- of songs on the disc. The songs table catalogs each track on the album. -- Each song has a title and a column for its composers, who might be -- different than the album artist. CREATE TABLE albums ( album_id bigserial, catalog_code varchar(100), title text, artist text, release_date date, genre varchar(40), description text ); CREATE TABLE songs ( song_id bigserial, title text, composers text, album_id bigint ); -- Use the tables to answer these questions: -- 1. Modify these CREATE TABLE statements to include primary and foreign keys -- plus additional constraints on both tables. Explain why you made your -- choices. -- Answer (yours may vary slightly): CREATE TABLE albums ( album_id bigint GENERATED ALWAYS AS IDENTITY, catalog_code text NOT NULL, title text NOT NULL, artist text NOT NULL, release_date date, genre text, description text, CONSTRAINT album_id_key PRIMARY KEY (album_id), CONSTRAINT release_date_check CHECK (release_date > '1/1/1925') ); CREATE TABLE songs ( song_id bigint GENERATED ALWAYS AS IDENTITY, title text NOT NULL, composer text NOT NULL, album_id bigint REFERENCES albums (album_id), CONSTRAINT song_id_key PRIMARY KEY (song_id) ); -- Answers: -- a) Both tables get a primary key using surrogate key id values that are -- auto-generated via IDENTITY. -- b) The songs table references albums via a foreign key constraint. -- c) In both tables, the title and artist/composer columns cannot be empty, which -- is specified via a NOT NULL constraint. We assume that every album and -- song should at minimum have that information. -- d) In albums, the release_date column has a CHECK constraint -- because it would be likely impossible for us to own an LP made before 1925. -- 2. Instead of using album_id as a surrogate key for your primary key, are -- there any columns in albums that could be useful as a natural key? What would -- you have to know to decide? -- Answer: -- We could consider the catalog_code. We would have to answer yes to -- these questions: -- 1. Is it going to be unique across all albums released by all companies? -- 2. Will an album always have a catelog code? -- 3. To speed up queries, which columns are good candidates for indexes? -- Answer: -- Primary key columns get indexes by default, but we should add an index -- to the album_id foreign key column in the songs table because we'll use -- it in table joins. It's likely that we'll query these tables to search -- by titles and artists, so those columns in both tables should get indexes -- too. The release_date in albums also is a candidate if we expect -- to perform many queries that include date ranges.