Arrays - extract address from a text in pyspark?

2024-03-11 11:00:05
I have some descriptions, and a tag for each token in every description. The tag specifies the type of the token. I want to extract the address from each description, that is, all tokens tagged <street>, <city>, or <state>. How can I achieve this in PySpark?

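+-------------------------------------------+--------------------------------------------------------------------------+
|description                                |tags                                                                      |
+-------------------------------------------+--------------------------------------------------------------------------+
|"aci*credit one bank, n"                   |<vendor_name> <vendor_name> <vendor_name> <vendor_name>                   |
|odot dmv2u 503-9455400 or 06/30            |<vendor_name> <vendor_name> <phone_number> <state> <trans_date>           |
|# 7-eleven 41066 5050 hunter rd ooltewah tn|<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>|
+-------------------------------------------+--------------------------------------------------------------------------+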

The output I am looking for, one line per row, is:

NULL
OR 
5050 hunter rd ooltewah tn

Any token whose tag is not an address tag should not be included.

Solution:

Check out this solution:

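from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Reuse the active session, or create one when running outside a notebook or shell.
spark = SparkSession.builder.getOrCreate()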

df = spark.createDataFrame([
    ('"aci*credit one bank, n"', '<vendor_name> <vendor_name> <vendor_name> <vendor_name>'),
    ('odot dmv2u 503-9455400 or 06/30', '<vendor_name> <vendor_name> <phone_number> <state> <trans_date>'),
    ('# 7-eleven 41066 5050 hunter rd ooltewah tn', '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
], ['description', 'tags'])

address_tags = ['<state>', '<street>', '<city>']
address_tags_concatenated = '"' + '","'.join(address_tags) + '"'
df = (
    df
    # Pair each token with its tag by position.
    # Can't use maps because there are duplicate tag values.
    .withColumn(
        'content_zip',
        f.arrays_zip(
            f.split(f.col('description'), ' ').alias('description'),
            f.split(f.col('tags'), ' ').alias('tag'),
        ),
    )
    # Keep only the (token, tag) pairs whose tag is an address tag.
    .withColumn('content_zip_filtered', f.expr(f'filter(content_zip, x -> x.tag in ({address_tags_concatenated}))'))
    # Join the surviving tokens back into one space-separated string.
    .select(f.concat_ws(" ", f.col('content_zip_filtered.description')).alias('address'))
)

df.show(truncate=False)

And the output:

+--------------------------+
|address                   |
+--------------------------+
|                          |
|or                        |
|5050 hunter rd ooltewah tn|
+--------------------------+
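
Note that the first row comes back as an empty string, not the NULL the question asked for. As a rough illustration of the per-row logic (split, pair by position, filter, join), here is a plain-Python sketch that returns None when no address token is found. The helper name extract_address and the constant ADDRESS_TAGS are made up for this example and are not part of PySpark; it only mirrors what the Spark expressions do on a single row.

```python
from typing import Optional

# Same tag set as address_tags in the answer; the name is hypothetical.
ADDRESS_TAGS = {"<street>", "<city>", "<state>"}


def extract_address(description: str, tags: str) -> Optional[str]:
    """Return the address tokens joined by spaces, or None if there are none.

    Illustrative helper only; it is not a PySpark function.
    """
    tokens = description.split(" ")
    token_tags = tags.split(" ")
    # zip() pairs tokens and tags by position and stops at the shorter list.
    kept = [tok for tok, tag in zip(tokens, token_tags) if tag in ADDRESS_TAGS]
    return " ".join(kept) if kept else None


rows = [
    ('"aci*credit one bank, n"',
     '<vendor_name> <vendor_name> <vendor_name> <vendor_name>'),
    ('odot dmv2u 503-9455400 or 06/30',
     '<vendor_name> <vendor_name> <phone_number> <state> <trans_date>'),
    ('# 7-eleven 41066 5050 hunter rd ooltewah tn',
     '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>'),
]

for description, tags in rows:
    print(extract_address(description, tags))
```

In the Spark version, one way to get the same NULL behaviour would be to wrap the final column in f.when(f.col('address') != '', f.col('address')), since when() without otherwise() yields NULL for rows that do not match.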