customer
Table Preview(first 5 rows)
C_CUSTKEY | C_NAME | C_ADDRESS | C_CITY | C_NATION | C_REGION | C_PHONE | C_MKTSEGMENT | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Customer#000000001 | j5JsirBM9P | MOROCCO 4 | MOROCCO | AFRICA | 25-214-662-5265 | BUILDING |
1 | 2 | Customer#000000002 | 487LW1dovn6Q4dMVym | JORDAN 5 | JORDAN | MIDDLE EAST | 23-159-880-1131 | AUTOMOBILE |
2 | 4 | Customer#000000004 | 4u58h f | EGYPT 6 | EGYPT | MIDDLE EAST | 14-108-116-4678 | MACHINERY |
3 | 5 | Customer#000000005 | hwBtxkoBF qSW4KrI | CANADA 2 | CANADA | AMERICA | 13-360-101-6891 | HOUSEHOLD |
4 | 7 | Customer#000000007 | 8OkMVLQ1dK6Mbu6WG9 | CHINA 3 | CHINA | ASIA | 28-777-614-8279 | AUTOMOBILE |
date
Table Preview(first 5 rows)
D_DATEKEY | D_DATE | D_DAYOFWEEK | D_MONTH | D_YEAR | D_YEARMONTHNUM | D_YEARMONTH | D_DAYNUMINWEEK | D_DAYNUMINMONTH | D_DAYNUMINYEAR | D_MONTHNUMINYEAR | D_WEEKNUMINYEAR | D_SELLINGSEASON | D_LASTDAYINWEEKFL | D_LASTDAYINMONTHFL | D_HOLIDAYFL | D_WEEKDAYFL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19920101 | January 1, 1992 | Thursday | January | 1992 | 199201 | Jan1992 | 5 | 1 | 1 | 1 | 1 | Winter | 0 | 1 | 1 | 1 |
1 | 19920102 | January 2, 1992 | Friday | January | 1992 | 199201 | Jan1992 | 6 | 2 | 2 | 1 | 1 | Winter | 0 | 1 | 0 | 1 |
2 | 19920103 | January 3, 1992 | Saturday | January | 1992 | 199201 | Jan1992 | 7 | 3 | 3 | 1 | 1 | Winter | 1 | 1 | 0 | 0 |
3 | 19920104 | January 4, 1992 | Sunday | January | 1992 | 199201 | Jan1992 | 1 | 4 | 4 | 1 | 1 | Winter | 0 | 1 | 0 | 0 |
4 | 19920105 | January 5, 1992 | Monday | January | 1992 | 199201 | Jan1992 | 2 | 5 | 5 | 1 | 1 | Winter | 0 | 1 | 0 | 1 |
lineorder
Table Preview(first 5 rows)
LO_ORDERKEY | LO_LINENUMBER | LO_CUSTKEY | LO_PARTKEY | LO_SUPPKEY | LO_ORDERDATE | LO_ORDERPRIORITY | LO_SHIPPRIORITY | LO_QUANTITY | LO_EXTENDEDPRICE | LO_ORDTOTALPRICE | LO_DISCOUNT | LO_REVENUE | LO_SUPPLYCOST | LO_TAX | LO_COMMITDATE | LO_SHIPMODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 277316 | 7 | 19190 | 99546 | 1170 | 19950228 | 4-NOT SPECIFIED | 0 | 13 | 2009202 | 23455557 | 0 | 2009202 | 92732 | 3 | 19950525 | REG AIR |
1 | 5929219 | 4 | 28511 | 48214 | 1427 | 19980702 | 1-URGENT | 0 | 14 | 1627094 | 20688659 | 4 | 1562010 | 69732 | 0 | 19980922 | SHIP |
2 | 3693318 | 4 | 15811 | 100325 | 1170 | 19920807 | 2-HIGH | 0 | 4 | 530128 | 19688727 | 9 | 482416 | 79519 | 0 | 19921013 | FOB |
3 | 2765669 | 2 | 2258 | 126666 | 35 | 19920504 | 4-NOT SPECIFIED | 0 | 15 | 2538990 | 3523267 | 8 | 2335870 | 101559 | 8 | 19920703 | |
4 | 686017 | 1 | 23836 | 172615 | 804 | 19951014 | 1-URGENT | 0 | 17 | 2868937 | 4014677 | 7 | 2668111 | 101256 | 8 | 19951213 | TRUCK |
part
Table Preview(first 5 rows)
P_PARTKEY | P_NAME | P_MFGR | P_CATEGORY | P_BRAND1 | P_COLOR | P_TYPE | P_SIZE | P_CONTAINER | |
---|---|---|---|---|---|---|---|---|---|
0 | 10 | moccasin royal | MFGR#2 | MFGR#21 | MFGR#2128 | floral | LARGE BURNISHED STEEL | 44 | LG CAN |
1 | 15 | dark sky | MFGR#3 | MFGR#34 | MFGR#3438 | navajo | LARGE ANODIZED BRASS | 45 | LG CASE |
2 | 16 | brown turquoise | MFGR#5 | MFGR#54 | MFGR#5415 | deep | PROMO PLATED TIN | 2 | MED PACK |
3 | 22 | floral beige | MFGR#4 | MFGR#44 | MFGR#4421 | medium | PROMO POLISHED BRASS | 19 | LG DRUM |
4 | 25 | forest magenta | MFGR#2 | MFGR#23 | MFGR#2336 | blush | STANDARD BRUSHED COPPER | 3 | JUMBO BAG |
supplier
Table Preview(first 5 rows)
S_SUPPKEY | S_NAME | S_ADDRESS | S_CITY | S_NATION | S_REGION | S_PHONE | |
---|---|---|---|---|---|---|---|
0 | 1 | Supplier#000000001 | sdrGnXCDRcfriBvY0KL,i | PERU 7 | PERU | AMERICA | 27-989-741-2988 |
1 | 2 | Supplier#000000002 | TRMhVHz3XiFu | ETHIOPIA 2 | ETHIOPIA | AFRICA | 15-768-687-3665 |
2 | 3 | Supplier#000000003 | BZ0kXcHUcHjx62L7CjZS | ARGENTINA8 | ARGENTINA | AMERICA | 11-719-748-3364 |
3 | 4 | Supplier#000000004 | qGTQJXogS83a7MB | MOROCCO 5 | MOROCCO | AFRICA | 25-128-190-5944 |
4 | 5 | Supplier#000000005 | lONEYAh9sF | IRAQ 1 | IRAQ | MIDDLE EAST | 21-750-942-6364 |
stg_date
Table Preview (first 5 rows)
day_of_week | month_name | year_ | year_month_abbr | day_number_in_week | day_number_in_month | day_number_in_year | month_number | week_number | selling_season | date_key | full_date | is_holiday | is_last_day_of_month | is_last_day_of_week | is_weekday | year_month_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Thursday | January | 1992 | Jan1992 | 5 | 1 | 1 | 1 | 1 | Winter | 19920101 | 1992-01-01 | True | True | False | True | 199201 |
1 | Friday | January | 1992 | Jan1992 | 6 | 2 | 2 | 1 | 1 | Winter | 19920102 | 1992-01-02 | False | True | False | True | 199201 |
2 | Saturday | January | 1992 | Jan1992 | 7 | 3 | 3 | 1 | 1 | Winter | 19920103 | 1992-01-03 | False | True | True | False | 199201 |
3 | Sunday | January | 1992 | Jan1992 | 1 | 4 | 4 | 1 | 1 | Winter | 19920104 | 1992-01-04 | False | True | False | False | 199201 |
4 | Monday | January | 1992 | Jan1992 | 2 | 5 | 5 | 1 | 1 | Winter | 19920105 | 1992-01-05 | False | True | False | True | 199201 |
stg_date.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:48:59.029137+00:00
WITH
"date_renamed" AS (
-- Rename: Renaming columns
-- D_DATEKEY -> date_key
-- D_DATE -> full_date
-- D_DAYOFWEEK -> day_of_week
-- D_MONTH -> month_name
-- D_YEAR -> year_
-- D_YEARMONTHNUM -> year_month_numeric
-- D_YEARMONTH -> year_month_abbr
-- D_DAYNUMINWEEK -> day_number_in_week
-- D_DAYNUMINMONTH -> day_number_in_month
-- D_DAYNUMINYEAR -> day_number_in_year
-- D_MONTHNUMINYEAR -> month_number
-- D_WEEKNUMINYEAR -> week_number
-- D_SELLINGSEASON -> selling_season
-- D_LASTDAYINWEEKFL -> is_last_day_of_week
-- D_LASTDAYINMONTHFL -> is_last_day_of_month
-- D_HOLIDAYFL -> is_holiday
-- D_WEEKDAYFL -> is_weekday
SELECT
"D_DATEKEY" AS "date_key",
"D_DATE" AS "full_date",
"D_DAYOFWEEK" AS "day_of_week",
"D_MONTH" AS "month_name",
"D_YEAR" AS "year_",
"D_YEARMONTHNUM" AS "year_month_numeric",
"D_YEARMONTH" AS "year_month_abbr",
"D_DAYNUMINWEEK" AS "day_number_in_week",
"D_DAYNUMINMONTH" AS "day_number_in_month",
"D_DAYNUMINYEAR" AS "day_number_in_year",
"D_MONTHNUMINYEAR" AS "month_number",
"D_WEEKNUMINYEAR" AS "week_number",
"D_SELLINGSEASON" AS "selling_season",
"D_LASTDAYINWEEKFL" AS "is_last_day_of_week",
"D_LASTDAYINMONTHFL" AS "is_last_day_of_month",
"D_HOLIDAYFL" AS "is_holiday",
"D_WEEKDAYFL" AS "is_weekday"
FROM "memory"."main"."date"
),
"date_renamed_cleaned" AS (
-- Clean unusual string values:
-- month_name: The problem is that there are two misspelled month names in the dataset. 'Augest' is a misspelling of 'August', and 'Octorber' is a misspelling of 'October'. These are typos that need to be corrected to ensure accurate data representation and analysis. The correct values should be the proper spellings of these month names.
SELECT
"date_key",
"full_date",
"day_of_week",
CASE
WHEN "month_name" = 'Augest' THEN 'August'
WHEN "month_name" = 'Octorber' THEN 'October'
ELSE "month_name"
END AS "month_name",
"year_",
"year_month_numeric",
"year_month_abbr",
"day_number_in_week",
"day_number_in_month",
"day_number_in_year",
"month_number",
"week_number",
"selling_season",
"is_last_day_of_week",
"is_last_day_of_month",
"is_holiday",
"is_weekday"
FROM "date_renamed"
),
"date_renamed_cleaned_casted" AS (
-- Column Type Casting:
-- date_key: from INT to VARCHAR
-- full_date: from VARCHAR to DATE
-- is_holiday: from INT to BOOLEAN
-- is_last_day_of_month: from INT to BOOLEAN
-- is_last_day_of_week: from INT to BOOLEAN
-- is_weekday: from INT to BOOLEAN
-- year_month_numeric: from INT to VARCHAR
SELECT
"day_of_week",
"month_name",
"year_",
"year_month_abbr",
"day_number_in_week",
"day_number_in_month",
"day_number_in_year",
"month_number",
"week_number",
"selling_season",
CAST("date_key" AS VARCHAR)
AS "date_key",
CASE
WHEN regexp_full_match("full_date", 'April \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %-d, %Y') AS DATE)
WHEN regexp_full_match("full_date", 'Augest \d{1,2}, \d{4}') THEN CAST(strptime(replace("full_date", 'Augest', 'August'), '%B %-d, %Y') AS DATE)
WHEN regexp_full_match("full_date", 'December \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %-d, %Y') AS DATE)
WHEN regexp_full_match("full_date", '(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %d, %Y') AS DATE)
WHEN regexp_full_match("full_date", 'Octorber \d{1,2}, \d{4}') THEN CAST(strptime(replace("full_date", 'Octorber', 'October'), '%B %d, %Y') AS DATE)
END
AS "full_date",
CAST("is_holiday" AS BOOLEAN)
AS "is_holiday",
CAST("is_last_day_of_month" AS BOOLEAN)
AS "is_last_day_of_month",
CAST("is_last_day_of_week" AS BOOLEAN)
AS "is_last_day_of_week",
CAST("is_weekday" AS BOOLEAN)
AS "is_weekday",
CAST("year_month_numeric" AS VARCHAR)
AS "year_month_numeric"
FROM "date_renamed_cleaned"
)
-- COCOON BLOCK END
SELECT *
FROM "date_renamed_cleaned_casted"
stg_date.yml (Document the table)
version: 2
models:
- name: stg_date
description: The table is about calendar dates. It contains various attributes of
dates including the date key, full date, day of week, month, year, and numeric
representations. It also includes flags for special days like holidays, last day
of week/month, and selling seasons. The table provides a comprehensive breakdown
of date information for data analysis and reporting purposes.
columns:
- name: day_of_week
description: Name of the day of the week
tests:
- not_null
- accepted_values:
values:
- Saturday
- Thursday
- Friday
- Monday
- Sunday
- Tuesday
- Wednesday
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: month_name
description: Name of the month
tests:
- not_null
- accepted_values:
values:
- August
- January
- July
- March
- May
- October
- April
- June
- September
- February
- December
- November
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: year_
description: Year in YYYY format
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: year_month_abbr
description: Year and month abbreviation
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: day_number_in_week
description: Day number within the week (1-7)
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: day_number_in_month
description: Day number within the month
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: day_number_in_year
description: Day number within the year
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: month_number
description: Month number within the year (1-12)
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: week_number
description: Week number within the year
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: selling_season
description: Current selling season (e.g., Winter)
tests:
- not_null
- accepted_values:
values:
- Summer
- Winter
- Fall
- Christmas
- Spring
cocoon_meta:
future_accepted_values:
- Back to School
- Halloween
- Valentine's Day
- Easter
data_type:
current_data_type: VARCHAR
- name: date_key
description: Unique identifier for each date
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column is a unique identifier for each date. For this table,
each row represents a specific date. The date_key appears to be a concatenation
of year, month, and day, which would be unique across rows.
data_type:
current_data_type: VARCHAR
- name: full_date
description: Full date in 'Month Day, Year' format
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents the full date in 'YYYY-MM-DD' format.
For this table, each row represents a specific date. The full_date would be
unique across rows as it represents a distinct calendar date.
data_type:
current_data_type: DATE
- name: is_holiday
description: Flag for holiday
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: BOOLEAN
- name: is_last_day_of_month
description: Flag for last day of the month
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: BOOLEAN
- name: is_last_day_of_week
description: Flag for last day of the week
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: BOOLEAN
- name: is_weekday
description: Flag for weekday (Monday-Friday)
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: BOOLEAN
- name: year_month_numeric
description: Year and month in YYYYMM format
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
tests: []
stg_part
Table Preview (first 5 rows)
part_id | part_name | manufacturer_code | category_code | brand_code | color | type_and_material | size_ | container_type | |
---|---|---|---|---|---|---|---|---|---|
0 | 10 | moccasin royal | MFGR#2 | MFGR#21 | MFGR#2128 | floral | LARGE BURNISHED STEEL | 44 | LG CAN |
1 | 15 | dark sky | MFGR#3 | MFGR#34 | MFGR#3438 | navajo | LARGE ANODIZED BRASS | 45 | LG CASE |
2 | 16 | brown turquoise | MFGR#5 | MFGR#54 | MFGR#5415 | deep | PROMO PLATED TIN | 2 | MED PACK |
3 | 22 | floral beige | MFGR#4 | MFGR#44 | MFGR#4421 | medium | PROMO POLISHED BRASS | 19 | LG DRUM |
4 | 25 | forest magenta | MFGR#2 | MFGR#23 | MFGR#2336 | blush | STANDARD BRUSHED COPPER | 3 | JUMBO BAG |
stg_part.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:51:15.538935+00:00
WITH
"part_renamed" AS (
-- Rename: Renaming columns
-- P_PARTKEY -> part_id
-- P_NAME -> part_name
-- P_MFGR -> manufacturer_code
-- P_CATEGORY -> category_code
-- P_BRAND1 -> brand_code
-- P_COLOR -> color
-- P_TYPE -> type_and_material
-- P_SIZE -> size_
-- P_CONTAINER -> container_type
SELECT
"P_PARTKEY" AS "part_id",
"P_NAME" AS "part_name",
"P_MFGR" AS "manufacturer_code",
"P_CATEGORY" AS "category_code",
"P_BRAND1" AS "brand_code",
"P_COLOR" AS "color",
"P_TYPE" AS "type_and_material",
"P_SIZE" AS "size_",
"P_CONTAINER" AS "container_type"
FROM "memory"."main"."part"
)
-- COCOON BLOCK END
SELECT *
FROM "part_renamed"
stg_part.yml (Document the table)
version: 2
models:
- name: stg_part
description: The table is about parts. It contains details of various parts including
their unique key, name, manufacturer, category, brand, color, type, size, and
container. Each row represents a different part with its specific attributes.
The table seems to be a catalog or inventory of parts with their characteristics
and classifications.
columns:
- name: part_id
description: Unique identifier for the part
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each part. For
this table, each row represents a different part, and part_id appears to be
unique across rows.
data_type:
current_data_type: INT
- name: part_name
description: Descriptive name of the part
tests:
- not_null
cocoon_meta:
patterns:
- regex: ^[a-z]+ [a-z]+$
summary: Two color names or descriptors separated by a space
data_type:
current_data_type: VARCHAR
- name: manufacturer_code
description: Manufacturer code of the part
tests:
- not_null
- accepted_values:
values:
- MFGR#2
- MFGR#4
- MFGR#1
- MFGR#5
- MFGR#3
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: category_code
description: Category code of the part
tests:
- not_null
- accepted_values:
values:
- MFGR#23
- MFGR#41
- MFGR#14
- MFGR#11
- MFGR#24
- MFGR#54
- MFGR#53
- MFGR#22
- MFGR#45
- MFGR#32
- MFGR#42
- MFGR#34
- MFGR#51
- MFGR#13
- MFGR#31
- MFGR#25
- MFGR#21
- MFGR#44
- MFGR#43
- MFGR#33
- MFGR#52
- MFGR#55
- MFGR#35
- MFGR#12
- MFGR#15
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: brand_code
description: Brand code of the part
tests:
- not_null
cocoon_meta:
patterns:
- regex: MFGR#\d{3,4}
summary: Manufacturer codes starting with "MFGR#" followed by 3 or 4 digits
data_type:
current_data_type: VARCHAR
- name: color
description: Color of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: type_and_material
description: Type and material of the part
tests:
- not_null
cocoon_meta:
patterns:
- regex: (ECONOMY|MEDIUM|SMALL|LARGE|PROMO|STANDARD) (ANODIZED|POLISHED|PLATED|BURNISHED|BRUSHED)
(STEEL|NICKEL|BRASS|COPPER|TIN)
summary: Product type, treatment, and material in all caps
data_type:
current_data_type: VARCHAR
- name: size_
description: Size of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: container_type
description: Container type for the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
tests: []
stg_customer
Table Preview (first 5 rows)
customer_id | customer_name | city | country | region | phone_number | market_segment | street_address | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Customer#000000001 | MOROCCO | MOROCCO | AFRICA | 25-214-662-5265 | BUILDING | j5JsirBM9P |
1 | 2 | Customer#000000002 | JORDAN 5 | JORDAN | MIDDLE EAST | 23-159-880-1131 | AUTOMOBILE | 487LW1dovn6Q4dMVym |
2 | 4 | Customer#000000004 | EGYPT | EGYPT | MIDDLE EAST | 14-108-116-4678 | MACHINERY | 4u58h f |
3 | 5 | Customer#000000005 | CANADA 2 | CANADA | AMERICA | 13-360-101-6891 | HOUSEHOLD | hwBtxkoBF qSW4KrI |
4 | 7 | Customer#000000007 | CHINA | CHINA | ASIA | 28-777-614-8279 | AUTOMOBILE | 8OkMVLQ1dK6Mbu6WG9 |
stg_customer.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:47:19.528883+00:00
WITH
"customer_renamed" AS (
-- Rename: Renaming columns
-- C_CUSTKEY -> customer_id
-- C_NAME -> customer_name
-- C_ADDRESS -> street_address
-- C_CITY -> city
-- C_NATION -> country
-- C_REGION -> region
-- C_PHONE -> phone_number
-- C_MKTSEGMENT -> market_segment
SELECT
"C_CUSTKEY" AS "customer_id",
"C_NAME" AS "customer_name",
"C_ADDRESS" AS "street_address",
"C_CITY" AS "city",
"C_NATION" AS "country",
"C_REGION" AS "region",
"C_PHONE" AS "phone_number",
"C_MKTSEGMENT" AS "market_segment"
FROM "memory"."main"."customer"
),
"customer_renamed_trimmed" AS (
-- Trim Leading and Trailing Spaces
SELECT
"customer_id",
"customer_name",
"city",
"country",
"region",
"phone_number",
"market_segment",
TRIM("street_address") AS "street_address"
FROM "customer_renamed"
),
"customer_renamed_trimmed_cleaned" AS (
-- Clean unusual string values:
-- city: The problem is that the city column contains country names followed by numbers, with inconsistent spacing and truncation. The country names should be standardized, and the numbers (which likely represent some kind of code or index) should be separated. "UNITED ST" is truncated and should be "UNITED STATES". "UNITED KI" should be "UNITED KINGDOM". "SAUDI ARA" should be "SAUDI ARABIA". The correct values should have consistent spacing between the country name and the number, with the country name in full.
-- street_address: Values starting with commas are unusual. Random strings like '2z1 vZUo2qjaxg7qJouIBb' are not typical for street addresses.
SELECT
"customer_id",
"customer_name",
CASE
WHEN "city" = 'UNITED ST4' THEN 'UNITED STATES 4'
WHEN "city" = 'UNITED ST0' THEN 'UNITED STATES 0'
WHEN "city" = 'UNITED ST1' THEN 'UNITED STATES 1'
WHEN "city" = 'UNITED KI5' THEN 'UNITED KINGDOM 5'
WHEN "city" = 'SAUDI ARA7' THEN 'SAUDI ARABIA 7'
WHEN "city" = 'ARGENTINA4' THEN 'ARGENTINA 4'
WHEN "city" = 'ARGENTINA5' THEN 'ARGENTINA 5'
WHEN "city" = 'MOZAMBIQU7' THEN 'MOZAMBIQUE 7'
WHEN "city" = 'MOZAMBIQU9' THEN 'MOZAMBIQUE 9'
WHEN "city" = 'RUSSIA 5' THEN 'RUSSIA 5'
WHEN "city" = 'RUSSIA 6' THEN 'RUSSIA 6'
WHEN "city" = 'RUSSIA 3' THEN 'RUSSIA 3'
WHEN "city" = 'MOROCCO 5' THEN 'MOROCCO 5'
WHEN "city" = 'FRANCE 8' THEN 'FRANCE 8'
WHEN "city" = 'JAPAN 3' THEN 'JAPAN 3'
WHEN "city" = 'ALGERIA 8' THEN 'ALGERIA 8'
WHEN "city" = 'ALGERIA 0' THEN 'ALGERIA 0'
WHEN "city" = 'ALGERIA 5' THEN 'ALGERIA 5'
WHEN "city" = 'CHINA 5' THEN 'CHINA 5'
WHEN "city" = 'EGYPT 7' THEN 'EGYPT 7'
WHEN "city" = 'EGYPT 1' THEN 'EGYPT 1'
WHEN "city" = 'EGYPT 3' THEN 'EGYPT 3'
WHEN "city" = 'JORDAN 5' THEN 'JORDAN 5'
WHEN "city" = 'JORDAN 8' THEN 'JORDAN 8'
WHEN "city" = 'KENYA 2' THEN 'KENYA 2'
WHEN "city" = 'KENYA 9' THEN 'KENYA 9'
WHEN "city" = 'PERU 8' THEN 'PERU 8'
WHEN "city" = 'CANADA 2' THEN 'CANADA 2'
WHEN "city" = 'CANADA 4' THEN 'CANADA 4'
WHEN "city" = 'CANADA 9' THEN 'CANADA 9'
WHEN "city" = 'CANADA 0' THEN 'CANADA 0'
WHEN "city" = 'VIETNAM 0' THEN 'VIETNAM 0'
WHEN "city" = 'GERMANY 0' THEN 'GERMANY 0'
WHEN "city" = 'GERMANY 9' THEN 'GERMANY 9'
WHEN "city" = 'GERMANY 4' THEN 'GERMANY 4'
WHEN "city" = 'IRAN 4' THEN 'IRAN 4'
WHEN "city" = 'IRAN 6' THEN 'IRAN 6'
WHEN "city" = 'IRAN 7' THEN 'IRAN 7'
WHEN "city" = 'IRAQ 4' THEN 'IRAQ 4'
WHEN "city" = 'IRAQ 9' THEN 'IRAQ 9'
WHEN "city" = 'IRAQ 5' THEN 'IRAQ 5'
WHEN "city" = 'INDIA 8' THEN 'INDIA 8'
WHEN "city" = 'INDONESIA1' THEN 'INDONESIA 1'
WHEN "city" = 'INDONESIA3' THEN 'INDONESIA 3'
WHEN "city" = 'BRAZIL 4' THEN 'BRAZIL 4'
WHEN "city" = 'ROMANIA 4' THEN 'ROMANIA 4'
WHEN "city" = 'ROMANIA 7' THEN 'ROMANIA 7'
WHEN "city" = 'MOZAMBIQU2' THEN 'MOZAMBIQUE'
WHEN "city" = 'ROMANIA 1' THEN 'ROMANIA'
WHEN "city" = 'ROMANIA 3' THEN 'ROMANIA'
WHEN "city" = 'VIETNAM 5' THEN 'VIETNAM'
WHEN "city" = 'ARGENTINA2' THEN 'ARGENTINA'
WHEN "city" = 'CHINA 2' THEN 'CHINA'
WHEN "city" = 'CHINA 4' THEN 'CHINA'
WHEN "city" = 'FRANCE 0' THEN 'FRANCE'
WHEN "city" = 'JAPAN 9' THEN 'JAPAN'
WHEN "city" = 'UNITED KI3' THEN 'UNITED KINGDOM'
WHEN "city" = 'ALGERIA 2' THEN 'ALGERIA'
WHEN "city" = 'ARGENTINA7' THEN 'ARGENTINA'
WHEN "city" = 'BRAZIL 8' THEN 'BRAZIL'
WHEN "city" = 'ETHIOPIA 8' THEN 'ETHIOPIA'
WHEN "city" = 'FRANCE 6' THEN 'FRANCE'
WHEN "city" = 'INDONESIA2' THEN 'INDONESIA'
WHEN "city" = 'JORDAN 2' THEN 'JORDAN'
WHEN "city" = 'KENYA 3' THEN 'KENYA'
WHEN "city" = 'ROMANIA 5' THEN 'ROMANIA'
WHEN "city" = 'UNITED KI0' THEN 'UNITED KINGDOM'
WHEN "city" = 'ETHIOPIA 6' THEN 'ETHIOPIA'
WHEN "city" = 'INDONESIA7' THEN 'INDONESIA'
WHEN "city" = 'MOROCCO 4' THEN 'MOROCCO'
WHEN "city" = 'MOROCCO 8' THEN 'MOROCCO'
WHEN "city" = 'MOZAMBIQU1' THEN 'MOZAMBIQUE'
WHEN "city" = 'MOZAMBIQU3' THEN 'MOZAMBIQUE'
WHEN "city" = 'UNITED ST2' THEN 'UNITED STATES'
WHEN "city" = 'UNITED ST5' THEN 'UNITED STATES'
WHEN "city" = 'EGYPT 0' THEN 'EGYPT'
WHEN "city" = 'EGYPT 8' THEN 'EGYPT'
WHEN "city" = 'EGYPT 9' THEN 'EGYPT'
WHEN "city" = 'INDIA 9' THEN 'INDIA'
WHEN "city" = 'IRAN 1' THEN 'IRAN'
WHEN "city" = 'IRAQ 3' THEN 'IRAQ'
WHEN "city" = 'PERU 6' THEN 'PERU'
WHEN "city" = 'UNITED ST6' THEN 'UNITED STATES'
WHEN "city" = 'UNITED ST8' THEN 'UNITED STATES'
WHEN "city" = 'CANADA 6' THEN 'CANADA'
WHEN "city" = 'CHINA 0' THEN 'CHINA'
WHEN "city" = 'CHINA 1' THEN 'CHINA'
WHEN "city" = 'CHINA 3' THEN 'CHINA'
WHEN "city" = 'GERMANY 6' THEN 'GERMANY'
WHEN "city" = 'INDONESIA8' THEN 'INDONESIA'
WHEN "city" = 'IRAQ 7' THEN 'IRAQ'
WHEN "city" = 'JAPAN 1' THEN 'JAPAN'
WHEN "city" = 'JAPAN 6' THEN 'JAPAN'
WHEN "city" = 'ROMANIA 6' THEN 'ROMANIA'
WHEN "city" = 'VIETNAM 1' THEN 'VIETNAM'
WHEN "city" = 'ARGENTINA8' THEN 'ARGENTINA'
WHEN "city" = 'GERMANY 7' THEN 'GERMANY'
WHEN "city" = 'SAUDI ARA1' THEN 'SAUDI ARABIA 1'
WHEN "city" = 'SAUDI ARA2' THEN 'SAUDI ARABIA 2'
WHEN "city" = 'SAUDI ARA4' THEN 'SAUDI ARABIA 4'
WHEN "city" = 'UNITED KI2' THEN 'UNITED KINGDOM 2'
WHEN "city" = 'UNITED KI7' THEN 'UNITED KINGDOM 7'
WHEN "city" = 'MOZAMBIQU5' THEN 'MOZAMBIQUE 5'
WHEN "city" = 'INDONESIA6' THEN 'INDONESIA 6'
WHEN "city" = 'IRAN 0' THEN 'IRAN 0'
WHEN "city" = 'IRAQ 6' THEN 'IRAQ 6'
WHEN "city" = 'KENYA 1' THEN 'KENYA 1'
WHEN "city" = 'KENYA 7' THEN 'KENYA 7'
WHEN "city" = 'MOROCCO 1' THEN 'MOROCCO 1'
WHEN "city" = 'VIETNAM 2' THEN 'VIETNAM 2'
WHEN "city" = 'BRAZIL 2' THEN 'BRAZIL 2'
WHEN "city" = 'CANADA 3' THEN 'CANADA 3'
WHEN "city" = 'CHINA 6' THEN 'CHINA 6'
WHEN "city" = 'GERMANY 5' THEN 'GERMANY 5'
WHEN "city" = 'INDIA 4' THEN 'INDIA 4'
WHEN "city" = 'IRAN 3' THEN 'IRAN 3'
WHEN "city" = 'JAPAN 0' THEN 'JAPAN 0'
WHEN "city" = 'JORDAN 3' THEN 'JORDAN 3'
WHEN "city" = 'MOROCCO 9' THEN 'MOROCCO 9'
WHEN "city" = 'VIETNAM 8' THEN 'VIETNAM 8'
WHEN "city" = 'ARGENTINA3' THEN 'ARGENTINA 3'
WHEN "city" = 'CHINA 9' THEN 'CHINA 9'
WHEN "city" = 'FRANCE 2' THEN 'FRANCE 2'
WHEN "city" = 'GERMANY 8' THEN 'GERMANY 8'
WHEN "city" = 'INDIA 3' THEN 'INDIA 3'
WHEN "city" = 'INDIA 7' THEN 'INDIA 7'
WHEN "city" = 'JORDAN 1' THEN 'JORDAN 1'
WHEN "city" = 'JORDAN 7' THEN 'JORDAN 7'
WHEN "city" = 'KENYA 0' THEN 'KENYA 0'
WHEN "city" = 'PERU 0' THEN 'PERU 0'
WHEN "city" = 'BRAZIL 0' THEN 'BRAZIL 0'
WHEN "city" = 'BRAZIL 7' THEN 'BRAZIL 7'
WHEN "city" = 'FRANCE 9' THEN 'FRANCE 9'
WHEN "city" = 'IRAN 5' THEN 'IRAN 5'
WHEN "city" = 'JAPAN 5' THEN 'JAPAN 5'
WHEN "city" = 'KENYA 8' THEN 'KENYA 8'
WHEN "city" = 'MOROCCO 6' THEN 'MOROCCO 6'
WHEN "city" = 'RUSSIA 9' THEN 'RUSSIA 9'
WHEN "city" = 'CANADA 1' THEN 'CANADA 1'
WHEN "city" = 'EGYPT 5' THEN 'EGYPT 5'
WHEN "city" = 'INDIA 0' THEN 'INDIA 0'
WHEN "city" = 'INDONESIA5' THEN 'INDONESIA 5'
WHEN "city" = 'IRAN 9' THEN 'IRAN 9'
WHEN "city" = 'JORDAN 6' THEN 'JORDAN 6'
WHEN "city" = 'PERU 4' THEN 'PERU 4'
WHEN "city" = 'PERU 9' THEN 'PERU'
WHEN "city" = 'RUSSIA 4' THEN 'RUSSIA'
WHEN "city" = 'RUSSIA 7' THEN 'RUSSIA'
WHEN "city" = 'SAUDI ARA9' THEN 'SAUDI ARABIA'
WHEN "city" = 'UNITED KI1' THEN 'UNITED KINGDOM'
WHEN "city" = 'ALGERIA 1' THEN 'ALGERIA'
WHEN "city" = 'CHINA 8' THEN 'CHINA'
WHEN "city" = 'FRANCE 5' THEN 'FRANCE'
WHEN "city" = 'GERMANY 1' THEN 'GERMANY'
WHEN "city" = 'INDIA 1' THEN 'INDIA'
WHEN "city" = 'IRAN 8' THEN 'IRAN'
WHEN "city" = 'KENYA 6' THEN 'KENYA'
WHEN "city" = 'ROMANIA 2' THEN 'ROMANIA'
WHEN "city" = 'RUSSIA 8' THEN 'RUSSIA'
WHEN "city" = 'UNITED KI8' THEN 'UNITED KINGDOM'
WHEN "city" = 'VIETNAM 6' THEN 'VIETNAM'
WHEN "city" = 'VIETNAM 9' THEN 'VIETNAM'
WHEN "city" = 'ARGENTINA0' THEN 'ARGENTINA'
WHEN "city" = 'BRAZIL 6' THEN 'BRAZIL'
WHEN "city" = 'ETHIOPIA 0' THEN 'ETHIOPIA'
WHEN "city" = 'ETHIOPIA 1' THEN 'ETHIOPIA'
WHEN "city" = 'FRANCE 3' THEN 'FRANCE'
WHEN "city" = 'FRANCE 7' THEN 'FRANCE'
WHEN "city" = 'GERMANY 2' THEN 'GERMANY'
WHEN "city" = 'INDONESIA0' THEN 'INDONESIA'
WHEN "city" = 'IRAN 2' THEN 'IRAN'
WHEN "city" = 'IRAQ 1' THEN 'IRAQ'
WHEN "city" = 'IRAQ 8' THEN 'IRAQ'
WHEN "city" = 'KENYA 5' THEN 'KENYA'
WHEN "city" = 'MOZAMBIQU8' THEN 'MOZAMBIQUE'
WHEN "city" = 'PERU 3' THEN 'PERU'
WHEN "city" = 'UNITED KI4' THEN 'UNITED KINGDOM'
WHEN "city" = 'ALGERIA 7' THEN 'ALGERIA'
WHEN "city" = 'CANADA 7' THEN 'CANADA'
WHEN "city" = 'CANADA 8' THEN 'CANADA'
WHEN "city" = 'IRAQ 2' THEN 'IRAQ'
WHEN "city" = 'SAUDI ARA5' THEN 'SAUDI ARABIA'
WHEN "city" = 'UNITED ST9' THEN 'UNITED STATES'
WHEN "city" = 'VIETNAM 4' THEN 'VIETNAM'
WHEN "city" = 'GERMANY 3' THEN 'GERMANY'
WHEN "city" = 'IRAQ 0' THEN 'IRAQ'
WHEN "city" = 'JORDAN 0' THEN 'JORDAN'
WHEN "city" = 'MOROCCO 0' THEN 'MOROCCO'
WHEN "city" = 'MOROCCO 3' THEN 'MOROCCO'
WHEN "city" = 'ROMANIA 9' THEN 'ROMANIA'
WHEN "city" = 'SAUDI ARA3' THEN 'SAUDI ARABIA'
WHEN "city" = 'UNITED KI9' THEN 'UNITED KINGDOM'
WHEN "city" = 'UNITED ST7' THEN 'UNITED STATES'
WHEN "city" = 'BRAZIL 1' THEN 'BRAZIL'
WHEN "city" = 'BRAZIL 9' THEN 'BRAZIL'
WHEN "city" = 'EGYPT 6' THEN 'EGYPT'
WHEN "city" = 'FRANCE 4' THEN 'FRANCE'
WHEN "city" = 'JAPAN 7' THEN 'JAPAN'
WHEN "city" = 'JAPAN 8' THEN 'JAPAN'
WHEN "city" = 'MOZAMBIQU6' THEN 'MOZAMBIQUE'
WHEN "city" = 'PERU 5' THEN 'PERU'
WHEN "city" = 'PERU 7' THEN 'PERU'
WHEN "city" = 'RUSSIA 2' THEN 'RUSSIA'
WHEN "city" = 'UNITED KI6' THEN 'UNITED KINGDOM'
WHEN "city" = 'ARGENTINA1' THEN 'ARGENTINA'
WHEN "city" = 'BRAZIL 5' THEN 'BRAZIL'
WHEN "city" = 'CHINA 7' THEN 'CHINA'
WHEN "city" = 'FRANCE 1' THEN 'FRANCE'
WHEN "city" = 'INDIA 5' THEN 'INDIA'
WHEN "city" = 'JAPAN 2' THEN 'JAPAN'
WHEN "city" = 'JORDAN 9' THEN 'JORDAN'
WHEN "city" = 'UNITED ST3' THEN 'UNITED STATES'
WHEN "city" = 'ALGERIA 4' THEN 'ALGERIA'
WHEN "city" = 'ARGENTINA6' THEN 'ARGENTINA'
WHEN "city" = 'INDONESIA9' THEN 'INDONESIA'
WHEN "city" = 'MOROCCO 2' THEN 'MOROCCO'
WHEN "city" = 'PERU 1' THEN 'PERU'
WHEN "city" = 'ROMANIA 0' THEN 'ROMANIA'
WHEN "city" = 'RUSSIA 1' THEN 'RUSSIA'
WHEN "city" = 'INDIA 6' THEN 'INDIA'
WHEN "city" = 'INDONESIA4' THEN 'INDONESIA'
WHEN "city" = 'PERU 2' THEN 'PERU'
WHEN "city" = 'SAUDI ARA6' THEN 'SAUDI ARABIA'
WHEN "city" = 'VIETNAM 7' THEN 'VIETNAM'
WHEN "city" = 'BRAZIL 3' THEN 'BRAZIL'
WHEN "city" = 'EGYPT 2' THEN 'EGYPT'
WHEN "city" = 'EGYPT 4' THEN 'EGYPT'
WHEN "city" = 'ETHIOPIA 4' THEN 'ETHIOPIA'
WHEN "city" = 'SAUDI ARA8' THEN 'SAUDI ARABIA'
WHEN "city" = 'ETHIOPIA 2' THEN 'ETHIOPIA'
WHEN "city" = 'MOZAMBIQU4' THEN 'MOZAMBIQUE'
WHEN "city" = 'ROMANIA 8' THEN 'ROMANIA'
WHEN "city" = 'JAPAN 4' THEN 'JAPAN'
WHEN "city" = 'VIETNAM 3' THEN 'VIETNAM'
WHEN "city" = 'INDIA 2' THEN 'INDIA'
WHEN "city" = 'JORDAN 4' THEN 'JORDAN'
WHEN "city" = 'ALGERIA 3' THEN 'ALGERIA'
WHEN "city" = 'ALGERIA 9' THEN 'ALGERIA'
WHEN "city" = 'KENYA 4' THEN 'KENYA'
WHEN "city" = 'MOZAMBIQU0' THEN 'MOZAMBIQUE'
WHEN "city" = 'RUSSIA 0' THEN 'RUSSIA'
WHEN "city" = 'SAUDI ARA0' THEN 'SAUDI ARABIA'
WHEN "city" = 'ALGERIA 6' THEN 'ALGERIA'
WHEN "city" = 'MOROCCO 7' THEN 'MOROCCO'
WHEN "city" = 'CANADA 5' THEN 'CANADA'
ELSE "city"
END AS "city",
"country",
"region",
"phone_number",
"market_segment",
"street_address"
FROM "customer_renamed_trimmed"
)
-- COCOON BLOCK END
SELECT *
FROM "customer_renamed_trimmed_cleaned"
stg_customer.yml (Document the table)
version: 2
models:
- name: stg_customer
description: The table is about customers. It contains customer details including
a unique customer key, name, address, city, nation, region, phone number, and
market segment. Each row represents a single customer with their specific information.
The table appears to be part of a larger database system, likely for managing
customer information in a business context.
columns:
- name: customer_id
description: Unique identifier for each customer
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each customer.
For this table, each row is for a single customer. customer_id appears to
be unique across rows, as it's formatted as "Customer#000000001", "Customer#000000002",
etc.
data_type:
current_data_type: INT
- name: customer_name
description: Customer's full name
tests:
- not_null
cocoon_meta:
patterns:
- regex: Customer#\d{9}
summary: Customer IDs in the format of Customer# followed by a 9-digit number
data_type:
current_data_type: VARCHAR
- name: city
description: Customer's city of residence
tests:
- not_null
cocoon_meta:
patterns:
- regex: '[A-Z ]+\s\d$'
summary: Country name followed by a space and a single digit
- regex: ^[A-Z ]+$
summary: Country name only
- regex: ARGENTINA\d
summary: Text starting with "ARGENTINA" followed by a single digit
data_type:
current_data_type: VARCHAR
- name: country
description: Customer's country of residence
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: region
description: Geographical region of the customer
tests:
- not_null
- accepted_values:
values:
- EUROPE
- AMERICA
- MIDDLE EAST
- ASIA
- AFRICA
cocoon_meta:
future_accepted_values:
- OCEANIA
- ANTARCTICA
- CARIBBEAN
- CENTRAL AMERICA
- SOUTH AMERICA
data_type:
current_data_type: VARCHAR
- name: phone_number
description: Customer's phone number
tests:
- not_null
cocoon_meta:
patterns:
- regex: \d{2}-\d{3}-\d{3}-\d{4}
summary: Phone numbers in the format of XX-XXX-XXX-XXXX, where X represents
a digit
data_type:
current_data_type: VARCHAR
- name: market_segment
description: Market segment the customer belongs to
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: street_address
description: Customer's street address
tests:
- not_null
cocoon_meta:
unusual_values: Values starting with commas are unusual. Random strings like
'2z1 vZUo2qjaxg7qJouIBb' are not typical for street addresses.
data_type:
current_data_type: VARCHAR
tests: []
stg_supplier
Table Preview (first 5 rows)
supplier_id | supplier_name | city | country | region | phone_number | street_address | |
---|---|---|---|---|---|---|---|
0 | 1 | Supplier#000000001 | PERU 7 | PERU | AMERICA | 27-989-741-2988 | sdrGnXCDRcfriBvY0KL,i |
1 | 2 | Supplier#000000002 | ETHIOPIA 2 | ETHIOPIA | AFRICA | 15-768-687-3665 | TRMhVHz3XiFu |
2 | 3 | Supplier#000000003 | ARGENTINA 8 | ARGENTINA | AMERICA | 11-719-748-3364 | BZ0kXcHUcHjx62L7CjZS |
3 | 4 | Supplier#000000004 | MOROCCO 5 | MOROCCO | AFRICA | 25-128-190-5944 | qGTQJXogS83a7MB |
4 | 5 | Supplier#000000005 | IRAQ 1 | IRAQ | MIDDLE EAST | 21-750-942-6364 | lONEYAh9sF |
stg_supplier.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:52:51.226296+00:00
WITH
"supplier_renamed" AS (
-- Rename: Renaming columns
-- S_SUPPKEY -> supplier_id
-- S_NAME -> supplier_name
-- S_ADDRESS -> street_address
-- S_CITY -> city
-- S_NATION -> country
-- S_REGION -> region
-- S_PHONE -> phone_number
SELECT
"S_SUPPKEY" AS "supplier_id",
"S_NAME" AS "supplier_name",
"S_ADDRESS" AS "street_address",
"S_CITY" AS "city",
"S_NATION" AS "country",
"S_REGION" AS "region",
"S_PHONE" AS "phone_number"
FROM "memory"."main"."supplier"
),
"supplier_renamed_trimmed" AS (
-- Trim Leading and Trailing Spaces
SELECT
"supplier_id",
"supplier_name",
"city",
"country",
"region",
"phone_number",
TRIM("street_address") AS "street_address"
FROM "supplier_renamed"
),
"supplier_renamed_trimmed_cleaned" AS (
-- Clean unusual string values:
-- city: The problem is that some city names are truncated or have inconsistent spacing. The "SAUDI ARA" entries are truncated and should be "SAUDI ARABIA". The "INDONESIA" entries have inconsistent spacing, with "INDONESIA4" having no space and "INDONESIA1" having one space. All other entries have 2 spaces between the country name and number. The correct values should have consistent formatting with 2 spaces between the country name and number for all entries.
-- street_address: All values are unusual because they start with commas and consist of random alphanumeric strings instead of typical address components.
SELECT
"supplier_id",
"supplier_name",
CASE
WHEN "city" = 'INDONESIA4' THEN 'INDONESIA 4'
WHEN "city" = 'INDONESIA1' THEN 'INDONESIA 1'
WHEN "city" = 'SAUDI ARA0' THEN 'SAUDI ARABIA 0'
WHEN "city" = 'SAUDI ARA7' THEN 'SAUDI ARABIA 7'
WHEN "city" = 'SAUDI ARA5' THEN 'SAUDI ARABIA 5'
WHEN "city" = 'SAUDI ARA8' THEN 'SAUDI ARABIA 8'
WHEN "city" = 'SAUDI ARA1' THEN 'SAUDI ARA 1'
WHEN "city" = 'SAUDI ARA6' THEN 'SAUDI ARA 6'
WHEN "city" = 'INDONESIA8' THEN 'INDONESIA 8'
WHEN "city" = 'INDONESIA3' THEN 'INDONESIA 3'
WHEN "city" = 'INDONESIA7' THEN 'INDONESIA 7'
WHEN "city" = 'INDONESIA9' THEN 'INDONESIA 9'
WHEN "city" = 'INDONESIA0' THEN 'INDONESIA 0'
WHEN "city" = 'UNITED KI9' THEN 'UNITED KI 9'
WHEN "city" = 'UNITED ST0' THEN 'UNITED ST 0'
WHEN "city" = 'UNITED ST8' THEN 'UNITED ST 8'
WHEN "city" = 'UNITED ST9' THEN 'UNITED ST 9'
WHEN "city" = 'ARGENTINA7' THEN 'ARGENTINA 7'
WHEN "city" = 'ARGENTINA8' THEN 'ARGENTINA 8'
WHEN "city" = 'SAUDI ARA2' THEN 'SAUDI AR 2'
WHEN "city" = 'SAUDI ARA9' THEN 'SAUDI AR 9'
WHEN "city" = 'INDONESIA2' THEN 'INDONESI 2'
WHEN "city" = 'UNITED ST4' THEN 'UNITED S 4'
WHEN "city" = 'UNITED ST5' THEN 'UNITED S 5'
WHEN "city" = 'UNITED ST6' THEN 'UNITED S 6'
WHEN "city" = 'UNITED ST7' THEN 'UNITED S 7'
WHEN "city" = 'UNITED KI0' THEN 'UNITED K 0'
WHEN "city" = 'MOZAMBIQU2' THEN 'MOZAMBIQ 2'
WHEN "city" = 'MOZAMBIQU6' THEN 'MOZAMBIQ 6'
WHEN "city" = 'INDONESIA5' THEN 'INDONESIA 5'
WHEN "city" = 'INDONESIA6' THEN 'INDONESIA 6'
ELSE "city"
END AS "city",
"country",
"region",
"phone_number",
"street_address"
FROM "supplier_renamed_trimmed"
)
-- COCOON BLOCK END
SELECT *
FROM "supplier_renamed_trimmed_cleaned"
stg_supplier.yml (Document the table)
version: 2
models:
- name: stg_supplier
description: The table is about suppliers. It contains details of each supplier
including their unique key, name, address, city, nation, region, and phone number.
Each row represents a single supplier with their specific information. The table
appears to be a part of a larger database system, likely used for supply chain
management or procurement processes.
columns:
- name: supplier_id
description: Unique identifier for the supplier
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each supplier.
For this table, each row represents a single supplier, and supplier_id appears
to be unique across rows.
data_type:
current_data_type: INT
- name: supplier_name
description: Name of the supplier
tests:
- not_null
cocoon_meta:
patterns:
- regex: Supplier#\d{9}
summary: Supplier names in the format of "Supplier#" followed by a 9-digit
number
data_type:
current_data_type: VARCHAR
- name: city
description: City where the supplier is located
tests:
- not_null
cocoon_meta:
patterns:
- regex: '[A-Z\s]{3,14}\s\d'
summary: Country name (uppercase, potentially truncated) followed by space
and single digit
- regex: '[A-Z\s]{5,11}\d'
summary: Abbreviated country names (in uppercase) followed by a single digit
data_type:
current_data_type: VARCHAR
- name: country
description: Country where the supplier is based
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: region
description: Geographical region of the supplier
tests:
- not_null
- accepted_values:
values:
- ASIA
- MIDDLE EAST
- AFRICA
- EUROPE
- AMERICA
cocoon_meta:
future_accepted_values:
- NORTH AMERICA
- SOUTH AMERICA
- OCEANIA
data_type:
current_data_type: VARCHAR
- name: phone_number
description: Contact phone number for the supplier
tests:
- not_null
cocoon_meta:
patterns:
- regex: 1[01]-\d{3}-\d{3}-\d{4}
summary: Phone numbers in the format of XX-XXX-XXX-XXXX, where X is a digit
- regex: \d{2}-\d{3}-\d{3}-\d{4}
summary: Phone numbers in the format of 12-xxx-xxx-xxxx where x represents
digits
data_type:
current_data_type: VARCHAR
- name: street_address
description: Street address of the supplier
tests:
- not_null
cocoon_meta:
unusual_values: All values are unusual because they start with commas and consist
of random alphanumeric strings instead of typical address components.
patterns:
- regex: ^,[\w\s]+$
summary: Values starting with a comma followed by alphanumeric characters
and spaces
- regex: ^\d+,[\w\s]+$
summary: Values starting with digits followed by a comma and alphanumeric
characters
- regex: ^[\w\s,]+$
summary: Alphanumeric strings with possible spaces and commas
data_type:
current_data_type: VARCHAR
tests: []
stg_lineorder
Table Preview (first 5 rows)
order_priority | quantity | shipping_mode | line_number | shipping_priority | product_id | order_id | customer_id | supplier_id | commit_date | discount_percentage | line_item_revenue | line_item_total | order_date | order_total | supply_cost | tax_percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | 13 | REG AIR | 7 | 0 | 99546 | 277316 | 19190 | 1170 | 1995-05-25 | 0.0 | 2009202.0 | 2009202.0 | 1995-02-28 | 23455557.0 | 92732.0 | 3.0 |
1 | 1-URGENT | 14 | SHIP | 4 | 0 | 48214 | 5929219 | 28511 | 1427 | 1998-09-22 | 4.0 | 1562010.0 | 1627094.0 | 1998-07-02 | 20688659.0 | 69732.0 | 0.0 |
2 | 2-HIGH | 4 | FOB | 4 | 0 | 100325 | 3693318 | 15811 | 1170 | 1992-10-13 | 9.0 | 482416.0 | 530128.0 | 1992-08-07 | 19688727.0 | 79519.0 | 0.0 |
3 | None | 15 | 2 | 0 | 126666 | 2765669 | 2258 | 35 | 1992-07-03 | 8.0 | 2335870.0 | 2538990.0 | 1992-05-04 | 3523267.0 | 101559.0 | 8.0 | |
4 | 1-URGENT | 17 | TRUCK | 1 | 0 | 172615 | 686017 | 23836 | 804 | 1995-12-13 | 7.0 | 2668111.0 | 2868937.0 | 1995-10-14 | 4014677.0 | 101256.0 | 8.0 |
stg_lineorder.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:50:02.289320+00:00
WITH
"lineorder_renamed" AS (
-- Rename: Renaming columns
-- LO_ORDERKEY -> order_id
-- LO_LINENUMBER -> line_number
-- LO_CUSTKEY -> customer_id
-- LO_PARTKEY -> product_id
-- LO_SUPPKEY -> supplier_id
-- LO_ORDERDATE -> order_date
-- LO_ORDERPRIORITY -> order_priority
-- LO_SHIPPRIORITY -> shipping_priority
-- LO_QUANTITY -> quantity
-- LO_EXTENDEDPRICE -> line_item_total
-- LO_ORDTOTALPRICE -> order_total
-- LO_DISCOUNT -> discount_percentage
-- LO_REVENUE -> line_item_revenue
-- LO_SUPPLYCOST -> supply_cost
-- LO_TAX -> tax_percentage
-- LO_COMMITDATE -> commit_date
-- LO_SHIPMODE -> shipping_mode
SELECT
"LO_ORDERKEY" AS "order_id",
"LO_LINENUMBER" AS "line_number",
"LO_CUSTKEY" AS "customer_id",
"LO_PARTKEY" AS "product_id",
"LO_SUPPKEY" AS "supplier_id",
"LO_ORDERDATE" AS "order_date",
"LO_ORDERPRIORITY" AS "order_priority",
"LO_SHIPPRIORITY" AS "shipping_priority",
"LO_QUANTITY" AS "quantity",
"LO_EXTENDEDPRICE" AS "line_item_total",
"LO_ORDTOTALPRICE" AS "order_total",
"LO_DISCOUNT" AS "discount_percentage",
"LO_REVENUE" AS "line_item_revenue",
"LO_SUPPLYCOST" AS "supply_cost",
"LO_TAX" AS "tax_percentage",
"LO_COMMITDATE" AS "commit_date",
"LO_SHIPMODE" AS "shipping_mode"
FROM "memory"."main"."lineorder"
),
"lineorder_renamed_null" AS (
-- NULL Imputation: Impute Null to Disguised Missing Values
-- order_priority: ['4-NOT SPECIFIED']
SELECT
CASE
WHEN "order_priority" = '4-NOT SPECIFIED' THEN NULL
ELSE "order_priority"
END AS "order_priority",
"supply_cost",
"quantity",
"line_item_total",
"tax_percentage",
"commit_date",
"shipping_mode",
"line_number",
"shipping_priority",
"product_id",
"discount_percentage",
"order_id",
"customer_id",
"order_total",
"line_item_revenue",
"order_date",
"supplier_id"
FROM "lineorder_renamed"
),
"lineorder_renamed_null_casted" AS (
-- Column Type Casting:
-- commit_date: from INT to DATE
-- discount_percentage: from INT to DECIMAL
-- line_item_revenue: from INT to DECIMAL
-- line_item_total: from INT to DECIMAL
-- order_date: from INT to DATE
-- order_total: from INT to DECIMAL
-- supply_cost: from INT to DECIMAL
-- tax_percentage: from INT to DECIMAL
SELECT
"order_priority",
"quantity",
"shipping_mode",
"line_number",
"shipping_priority",
"product_id",
"order_id",
"customer_id",
"supplier_id",
strptime(CAST("commit_date" AS VARCHAR), '%Y%m%d')
AS "commit_date",
CAST("discount_percentage" AS DECIMAL)
AS "discount_percentage",
CAST("line_item_revenue" AS DECIMAL)
AS "line_item_revenue",
CAST("line_item_total" AS DECIMAL)
AS "line_item_total",
strptime(CAST("order_date" AS VARCHAR), '%Y%m%d')
AS "order_date",
CAST("order_total" AS DECIMAL)
AS "order_total",
CAST("supply_cost" AS DECIMAL)
AS "supply_cost",
CAST("tax_percentage" AS DECIMAL)
AS "tax_percentage"
FROM "lineorder_renamed_null"
)
-- COCOON BLOCK END
SELECT *
FROM "lineorder_renamed_null_casted"
stg_lineorder.yml (Document the table)
version: 2
models:
- name: stg_lineorder
description: The table is about line items in orders. It includes order details
like order key, line number, customer key, part key, supplier key, and order date.
It also contains shipping information, quantities, prices, discounts, and revenue.
The table provides a comprehensive view of each line item in an order, including
financial and logistical aspects.
columns:
- name: order_priority
description: Priority level of the order
tests:
- accepted_values:
values:
- 5-LOW
- 2-HIGH
- 1-URGENT
- 3-MEDIUM
cocoon_meta:
missing_reason: Not applicable for non-priority or standard orders.
future_accepted_values:
- 4-NORMAL
data_type:
current_data_type: VARCHAR
- name: quantity
description: Quantity of items ordered
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: shipping_mode
description: Mode of shipping for the line item
tests:
- not_null
- accepted_values:
values:
- REG AIR
- FOB
- SHIP
- TRUCK
- AIR
- RAIL
- MAIL
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: line_number
description: Line item number within an order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: shipping_priority
description: Priority level for shipping the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: product_id
description: Unique identifier for the product or part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: order_id
description: Unique identifier for each order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: customer_id
description: Unique identifier for the customer
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: supplier_id
description: Unique identifier for the supplier
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: commit_date
description: Committed delivery date for the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: TIMESTAMP
expected_data_type: DATE
- name: discount_percentage
description: Discount percentage applied to the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: line_item_revenue
description: Revenue generated from the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: line_item_total
description: Total price for the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: order_date
description: Date when the order was placed
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: TIMESTAMP
expected_data_type: DATE
- name: order_total
description: Total price for the entire order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: supply_cost
description: Cost of supplying the item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: tax_percentage
description: Tax percentage applied to the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
tests: []
Join Graph (FK to PK)
cocoon_join.yml (Document the joins)
join_graph:
- table_name: stg_customer
primary_key: customer_id
foreign_keys: []
- table_name: stg_lineorder
foreign_keys:
- column: customer_id
reference:
table_name: stg_customer
column: customer_id
- column: product_id
reference:
table_name: stg_part
column: part_id
- column: supplier_id
reference:
table_name: stg_supplier
column: supplier_id
time_keys:
- commit_date
- order_date
- table_name: stg_part
primary_key: part_id
foreign_keys: []
- table_name: stg_supplier
primary_key: supplier_id
foreign_keys: []
- table_name: stg_date
time_keys:
- full_date
All
Process Story
cocoon_er.yml (Document the entity relationships)
groups:
- group_name: DateDimension
group_summary: This group focuses on date-related information, providing a comprehensive
breakdown of calendar dates and their various attributes for data analysis and
reporting purposes.
tables:
- stg_date
join_info: As there is only one table in this group, no join information is applicable.
However, this table can potentially be used to join with other tables that have
date-related fields using the date_key or full_date attributes.
entities:
- entity_name: Customers
entity_description: Represents individual customers with their personal and contact
information.
table_name: stg_customer
primary_key: customer_id
- entity_name: Parts
entity_description: Represents individual parts or products with their specifications
and classifications.
table_name: stg_part
primary_key: part_id
- entity_name: Suppliers
entity_description: Represents individual suppliers with their company and contact
information.
table_name: stg_supplier
primary_key: supplier_id
relations:
- relation_name: OrderLineItems
relation_description: Customers place orders for Parts, which are supplied by Suppliers,
creating line items in the order.
table_name: stg_lineorder
entities:
- Customers
- Parts
- Suppliers
story:
- name: DateDimension
description: Calendar dates track order timelines and supplier deliveries.
type: group
- name: OrderLineItems
description: Customers order parts from suppliers, creating detailed invoices.
type: relation